Software backward compatibility, undocumented APIs and the importance of history

Software morphs. Market realities, changes in technology, new features, the removal of something no longer needed, refactoring: all are valid reasons for initiating a change. However, change is a very costly affair where large software is concerned.

LARGE SOFTWARE AND THE IMPORTANCE OF KNOWING ITS HISTORY

The software I am employed to work on is humongous. I am talking millions of SLOC. How huge is that? Two top Google results[1, 2] reveal the word-counts, yes, the word-counts and not the line-counts, of some of the most acclaimed large novels:

311,596 – The Fountainhead – Ayn Rand
316,059 – Middlemarch – George Eliot
349,736 – Anna Karenina – Leo Tolstoy
364,153 – The Brothers Karamazov – Fyodor Dostoyevsky
365,712 – Lonesome Dove – McMurtry, Larry
418,053 – Gone with the Wind – Margaret Mitchell
455,125 – The Lord of the Rings – J. R. R. Tolkien
561,996 – Atlas Shrugged – Ayn Rand
587,287 – War and Peace – Leo Tolstoy
591,554 – A Suitable Boy – Vikram Seth

Each line may contain, say, 10 words, which puts the line-count of the above novels between approximately 30,000 and 60,000. I know comparing a novel with software is not exactly logical, but it is a funny comparison nevertheless.

I have all these beasts (except the one crossed out) at home. Reading them is an undertaking in itself. The software I talk about does not have the emotional or philosophical underpinnings that these novels contain. No meandering yet arresting plot. The software I talk about is pure logic. Yes, certain logical blocks may meander, leading to performance hits, but they can be located in the source code and culled.

In large software, you cannot ignore history. No. Every line that makes no sense to you may have a historical reason for its existence. The reason may be valid or otherwise. It is the job of the person changing the line to make sure that he is not awakening an ancient beast. I have seen code written to work around a .NET bug where the developer forgot to add a comment near the code block, perhaps because he added one to the change-set while checking in to RTC Jazz or whatever the source control system was. Or his comment may not make much sense because he had poor writing skills or was plain lazy. Or he might have forgotten to add a comment anywhere. There is not much one can do in such a situation other than:

  • Scan the history of the file(s) in the version control system.
  • Do your basic research and ask the developer. If he has left the company, ask other experts. Doing research reduces the burden on others, and they become more willing to help.
  • If nobody seems to have any idea, perform the change in your private branch and make sure that nothing unintended occurs – proper regression testing.

WHY MICROSOFT OFFICE FORMATS ARE SO COMPLICATED AND OTHER THINGS

Microsoft Office too is an extremely large and complex piece of software. LibreOffice has a hard time keeping up with it. All moving targets are difficult to shoot at. Imagine shooting at a moving target blindfolded and you will start to appreciate how hard the folks at LibreOffice work. Joel Spolsky explains:

A normal programmer would conclude that Office’s binary file formats:

  • are deliberately obfuscated
  • are the product of a demented Borg mind
  • were created by insanely bad programmers
  • and are impossible to read or create correctly.

The reasons he mentions for the complexity are:

  • They were designed to use libraries
  • They were not designed with interoperability in mind

    The assumption, and a fairly reasonable one at the time, was that the Word file format only had to be read and written by Word

  • They have to reflect all the complexity of the applications
  • They have to reflect the history of the applications

    A lot of the complexities in these file formats reflect features that are old, complicated, unloved, and rarely used. They’re still in the file format for backwards compatibility, and because it doesn’t cost anything for Microsoft to leave the code around. But if you really want to do a thorough and complete job of parsing and writing these file formats, you have to redo all that work that some intern did at Microsoft 15 years ago.

So, LibreOffice suffers because interoperability was not a concern for Microsoft. This happens in all fields, because you cannot design anything that covers all bases; I think we humans are simply not capable of it. Maintaining backward compatibility, meanwhile, makes it difficult to shed the historical baggage.

All this adds to the complexity. Then there are dark forces that cannot be ignored:

“Embrace, extend, and extinguish”,[1] also known as “Embrace, extend, and exterminate”,[2] is a phrase that the U.S. Department of Justice found[3] was used internally by Microsoft[4] to describe its strategy for entering product categories involving widely used standards, extending those standards with proprietary capabilities, and then using those differences to disadvantage its competitors.

UNDOCUMENTED API AND BACKWARD COMPATIBILITY

An application developer should stay away from what are popularly known as Undocumented APIs:

First of all: undocumented API is a wrong term. API means Application Programming Interface. But the function calls that usually get the title undocumented API are not intended to be used to program against. A better name would be undocumented program internal interface or short undocumented interface. That’s the term I will use here in this article.

These are APIs that are supposed to be used only by Microsoft and can be removed anytime. However, some of these are very powerful APIs and power corrupts.

Raymond Chen also describes how difficult it is to get rid of undocumented APIs once they get used in some popular software:

Suppose you’re the IT manager of some company. Your company uses Program X for its word processor and you find that Program X is incompatible with Windows XP for whatever reason. Would you upgrade?

Charles Petzold also explains why switching standards is so hard:

The computer industry has always been subject to upheavals in standards, with new standards replacing old standards and everyone hoping for backward compatibility.

THE COST OF SOFTWARE CHANGE

Eric Lippert explains the prohibitive cost of fixing a bug or adding a new feature:

That initial five minutes of dev time translates into many person-weeks of work and enormous costs, all to save one person a few minutes of whipping up a one-off VB6 control that does what they want. Sorry, but that makes no business sense whatsoever.

IBM’s research shows us the following:

(Figure 3: IBM System Science Institute, Relative Cost of Fixing Defects)

When fixing a bug or adding a feature can be so costly on complex software systems, carrying historical baggage makes more sense than switching standards.

CONCLUSION

When trying to fix a bug, enhance software or add a new feature:

  • Do not forget history.
  • Do not forget the cost involved.
  • Do not forget that we all are humans and mistakes are unavoidable. Be humble and move on.

Software Engineering – Drawing inspiration from Nature’s bag of tricks

I tried to acquaint the beginner C# programmer with the mysterious concept of delegates in a previous article of mine. If you are a beginner, make sure you read that first. Also, make sure that you read everything linked in all my articles.

My articles get auto-published on CodeProject (RSS magic), and that article was liked by many people there. There is no such thing as “feedback overdose”, so, not content with the CodeProject feedback, I posted the aforementioned article on Reddit too. I got a few comments there as well.

This article is basically a commentary on the comments I received on Reddit. The comments were rudimentary, nothing special, but I thought it would be best to go ahead and dissect certain ones so that the knowledge encapsulated in them could be revealed.

When it comes to programming (or learning in general), I know being content is detrimental. There are people willing to learn, and in the process willing to force those who can teach to go beyond their comfort zone. There are people who can teach and are willing to go the extra mile to help the community. Contentment would have limited both groups (learners and teachers) to mediocrity.

LANGUAGE DESIGN CRITICISM AND ALL THAT CAN BE LEARNED FROM IT

Of the comments I got on Reddit, the one that I think is the most significant started off like this:

Delegates is one of those features I’m not sure should’ve been added to C# at all. On one hand they help you write code faster, but they are so error prone I’m not sure they’re worth it at all.

This is an opinion. An opinion shared by many, and most significantly by the Java designers and many Java developers. There is nothing wrong with this opinion. The only thing is that it differs from the opinion of C# designers and many C# developers.

If you are a beginner, you may get lost in the syntax jungle or drown in the sea of jargon even before you think of your first solo voyage. Software Engineering can only make sense if one builds mental models. The aforementioned comment is significant because it lets me explain how important it is to understand Software Engineering by relating it to the world around us.

ORIENT YOUR BRAIN TO OBJECT-ORIENTED PROGRAMMING (OOP)

EXAMPLE 1

Who do you think invented the first camera? No, it was not a human. For the lack of a better term, and more importantly for the lack of knowledge, we conceal that particular detail in the term “Mother Nature”. This allows us to investigate how “Mother Nature” works while not bothering about why she works the way she does. What happens if we do not do this?

Well, let us conjure up a situation. Let us say, you want to apply some lubricant to your bike’s chain. The job is very simple. The algorithm is:

  1. Find the type of chain you have on your bike
  2. Find the best lubricant for that type of chain (most suitable compound)
  3. Purchase the lubricant and apply it on the chain

You do not go about understanding how the lubricant is manufactured, where the materials were sourced from, the quantum mechanical properties of the atoms involved in the lubricant, et cetera. You will end up spending the rest of your life finding the lubricant (reminds me of Buridan’s donkey) while your bike gets eaten by the elements. This is where abstraction helps us. It gives us a good enough starting point. This technique is employed in all fields of human endeavor and is quite popular too. It can be found in Mathematics, in Computer Science et cetera. In OOP, we use abstraction to hide away details. You never need to know the implementation details of, say, delegates. All you need to know is how to use the concept to achieve your goal.
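To put the analogy in code, here is a minimal sketch (IChainLubricant and WaxBasedLubricant are hypothetical names invented purely for this illustration) of how abstraction hides the details we just agreed to ignore:

using System;

// The caller programs against the interface and never
// sees the manufacturing details hidden behind it.
interface IChainLubricant
{
    void ApplyTo(string chainType);
}

class WaxBasedLubricant : IChainLubricant
{
    public void ApplyTo(string chainType)
    {
        // Sourcing, chemistry and manufacturing details would be buried here.
        Console.WriteLine("Applying wax-based lubricant to a " + chainType + " chain.");
    }
}

class Program
{
    static void Main()
    {
        IChainLubricant lubricant = new WaxBasedLubricant();
        lubricant.ApplyTo("single-speed"); // all the caller ever needs to know
    }
}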

Now, let us get back to our original example. We do not know the true nature of the designer of the first camera. Also, I do not know what the first version of the camera on earth looked like, but the first camera design, after forking and a few versions, looks like the figure on the left. The original eye design was forked to create the human eye. The development continued, and what we see today is the current state of evolution.

There is more detail to the development/evolution. The existence of various types of human eyes, different from each other in various ways (eye color, for example), is clear evidence of a complex development process. This means various development/evolution branches of the human eye design exist and are under continuous development. The design is bound to change in the future. New features may be added; some may be removed. The changes may be governed by variables completely out of our control. Probably, these variables are completely out of our genes' control too. Nevertheless, changes will occur. Our eyes have a well-defined set of features that work in the environment we inhabit. This set of features is mutable. Changes in the environment will force changes in the design of our eyes.


The original camera, or eye, was again forked to create the fly's eye. This design too has been under development for a very long time now, yet it is entirely different and has a very different set of features. It too is a slave of the environment and has to undergo continuous change to remain relevant. That is how it has been, that is how it is and that is how it will always be. It is the environment that forces changes in the design. The designer is forced to adapt according to the environment's dictate or perish.

Can you see a connection between the real world and Software Engineering? The environment we live in and the market we expose our software to are equally harsh. They follow the same principles, the most important of which is:

Survival of the fittest

As you can see from the above pictures, even nature could not settle on one design and has been experimenting. The original eye design was further forked to create black-and-white eyes, eyes that can see in the dark, eyes that can see underwater, and so on and so forth. Thus, there is no one design that could be termed the best. It all depends on what the use case is.

WHAT TO LEARN NEXT?

  • Software Development Lifecycle
  • Software Configuration Management
  • Build and Release Management
  • Branching schemes in source control tools
  • What is abstraction? (in and outside the domain of OOP)

EXAMPLE 2

Another very interesting subject is that of DNA Replication. Cell replication involves DNA replication.

DNA is made up of two strands and each strand of the original DNA molecule serves as a template for the production of the complementary strand, a process referred to as semiconservative replication. Cellular proofreading and error-checking mechanisms ensure near perfect fidelity for DNA replication.

Does this not sound like parsing, syntax checking, lexical analysis et cetera done during the compilation of a programming language?

Although the algorithm is very complex, the output is very reliable. The DNA syntax and semantics can be considered immutable during the normal DNA replication process.

The rate of semiconservative DNA replication in a living cell was first measured as the rate of phage T4 DNA strand elongation in phage-infected E. coli.[4] During the period of exponential DNA increase at 37°C, the rate of strand elongation was 749 nucleotides per second. The mutation rate per base pair per round of replication during phage T4 DNA synthesis is 2.4 × 10^−8.[5] Thus semiconservative DNA replication is both rapid and accurate.

(Figure: semiconservative DNA replication)

However, this reliability of output and the immutability of DNA syntax and semantics are very much dependent on certain factors. For example, gamma radiation is well known to mutate DNA. There are other agents too, biological ones included, that are capable of changing the semantics of DNA.

Meiosis is another form of cell division. This is a very special cell division process. The inputs in this case are the paternal and maternal DNAs. The new cell formed has a new DNA inherited from both parents. In this case, however, the syntax of the DNA is the thing that is of primary concern; semantics are not that important.

If the child DNA is syntactically correct, it is allowed to exist. There are mechanisms in cells that test the syntax. It is only when you look at the output that you can determine whether everything was proper.

The picture on the left is an example of meiosis going through because the resulting syntax was proper, while something went wrong semantically. What went wrong could only be determined after the birth of the specimen.

Not all such semantic deviations from the normal are bad. In certain situations these prove to be a boon for the species. Just take a look at the Grizzly bear and Polar bear ancestry, or the dog and wolf relation. These bears are genetically so similar that they can mate, yet they are very different. The same is the case with dogs and wolves.

This scenario is very similar to runtime errors in computer programs. The program compiles fine because it is syntactically correct. However, the compiler is not well equipped to check the semantics.

Logical errors in the program slip through and can only be caught during testing or, worse, in the field. Let me show you a code sample that illustrates my point:

using System;

class Program
{
    static void Main(string[] args)
    {
        int input = 0;
        // TryParse leaves input at 0 when parsing fails -- and 0 is also a valid input
        int.TryParse(Console.ReadLine(), out input);
        int answer = 15 / input;
        Console.WriteLine(answer.ToString());
        Console.ReadLine();
    }
}

Yes, I know there are various ways to improve the code above so that it works as expected, but that is exactly what I want to say. The code compiles although it has huge gaps. It is only when you run it that you become aware of how wide and deep the gaps are 🙂

The following happens when you run the code and provide zero as input:

(Screenshot: unhandled System.DivideByZeroException)
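One of those many possible improvements, as a minimal sketch: validate the input before dividing. What the program should do on bad input is a design choice; printing a message is just one option:

using System;

class Program
{
    static void Main(string[] args)
    {
        int input;
        // Accept only parseable, non-zero input before dividing.
        if (int.TryParse(Console.ReadLine(), out input) && input != 0)
        {
            Console.WriteLine(15 / input);
        }
        else
        {
            Console.WriteLine("Please enter a non-zero integer.");
        }
        Console.ReadLine();
    }
}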

WHAT TO LEARN NEXT?

  • Various types of Programming languages and their compilation processes
  • Compilers, interpreters
  • Statically typed vs dynamically typed languages
  • Parsers, lexers, regular expression engines
  • Compile-time errors vs runtime errors

EXAMPLE 3

This time, an example from the man-made universe:

(Images: an inline four-cylinder engine and a V6 engine)

The first IC engine is an inline four-cylinder and the second is a V6. Now, answer the following:

  • Which one is better?
  • Looking at the engines, can you guess which engine was designed by a smarter engineer?
  • Which of these engines would a smart user choose?

If you try to answer these questions, you will start understanding that there are many design choices available. This is a phenomenon that can be seen in all fields of human endeavor. Weigh the pros and cons and decide which suits the job at hand. And improvise. Some more thought and you will start seeing the common theme in all these examples. “Mother Nature” and humans think alike. Or, more correctly, we humans learn from the world around us. The same thought processes and the same design compromises appear in all fields of human endeavor and in nature…

WHAT TO LEARN NEXT?

  • Design patterns

CONFUSING delegate BEHAVIOR

The commenter went on and posted a code snippet that was supposed to show the major drawback of delegates as implemented in C#. He asked:

Just by reading this code, what do you think will happen? An exception? Print 5? Print 6?

private Duplicator Hook;

void Main()
{
    Hook += DuplicatorImpl1;
    Hook += DuplicatorImpl2;
    Console.WriteLine(Hook(2));
}

static int DuplicatorImpl1(int number)
{
    Console.WriteLine("Impl1");
    return number + number + 1;
}

static int DuplicatorImpl2(int number)
{
    Console.WriteLine("Impl2");
    return number * 3;
}

public delegate int Duplicator(int number);

This code has a confusing output:

Impl1
Impl2
6

The confusing output is not a bug in the .NET Framework. It is an implementation detail: when a multicast delegate with a non-void return type is invoked, every method in its invocation list runs, but only the return value of the last one is handed back to the caller. A detail like that can only be arrived at if one is willing to go deeper into the concept. You can criticize the implementation, saying that it is not intuitive or whatever, but you cannot say that the above code and its output are the reason delegates are bad. The following is the correction I made to the above code that fixed the issue and provided the desired output:

using System;

class Program
{
    private static Duplicator Hook;

    static void Main(string[] args)
    {
        Hook += DuplicatorImpl1;
        Hook += DuplicatorImpl2;
        // Invoke each subscriber individually so that no return value is lost
        foreach (Duplicator hook in Hook.GetInvocationList())
        {
            Console.WriteLine(hook(2));
        }
        Console.Read();
    }

    static int DuplicatorImpl1(int number)
    {
        Console.WriteLine("Impl1");
        return number + number + 1;
    }

    static int DuplicatorImpl2(int number)
    {
        Console.WriteLine("Impl2");
        return number * 3;
    }

    public delegate int Duplicator(int number);
}

What is needed is a call to GetInvocationList(). Now, the output is exactly what we desired:

Impl1
5
Impl2
6

The important thing to understand here is that programming languages differ from each other not just because they have differing syntax; rather, they differ because their designers were influenced by different schools of thought. Software Engineering is not just about technical terms, keywords, syntax, IDEs et cetera. It is as rich and as vast as any other subject.

Language Design (a part of Software Engineering) is a delicate art of balancing purism and pragmatism. It is influenced by many subjects and it influences many. Programming has strong philosophical underpinnings. To ignore it is to clip your own wings.

Is it any wonder that Sun Microsystems and Microsoft went into a full-fledged virtual war over their differences in taste?

GOING DEEPER INTO delegate TERRITORY

The comment thread further evolved and many new thoughts were shared:

(Screenshot: the Reddit comment thread)

What is going on here? We had delegates. Delegates worked just fine. People could (and still can) do a lot of stuff with delegates. Delegates are the backbone of the event pattern in C#. They were more than enough. Why did things like Action and Func come into existence? How do they benefit the user?

Using a plain old delegate involved the following steps:

  1. Declare the delegate
  2. Instantiate the delegate and assign it to the appropriate function
  3. Call the delegate
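A minimal sketch of those three steps (Doubler and DoubleIt are illustrative names of mine, not from the comment thread):

using System;

class Program
{
    // Step 1: declare the delegate type
    delegate int Doubler(int number);

    static int DoubleIt(int number)
    {
        return number * 2;
    }

    static void Main()
    {
        // Step 2: instantiate it, pointing it at a method with a matching signature
        Doubler doubler = DoubleIt;

        // Step 3: call it
        Console.WriteLine(doubler(21)); // prints 42
    }
}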

Every single person in the C# universe has to do all of the above. There is a lot of verbosity here. This verbosity is also known as boilerplate code:

In computer programming, boilerplate code or boilerplate is the sections of code that have to be included in many places with little or no alteration. It is often used when referring to languages that are considered verbose, i.e. the programmer must write a lot of code to do minimal jobs.

Action, literally, is a “thing done or an act”. The Action delegate, thus, encapsulates a method that takes no parameters and does not return a value. Generic versions of Action are also available. If you have no idea what Generics are, fret not: for now, just look at how a List is used in C# and follow the same pattern, but do not forget to learn Generics as soon as possible. They are wonderful (maybe I will write about them later). The Func delegate, on the other hand, is analogous to mathematical functions in that it encapsulates methods that take one or more input parameters and return a value. The most basic form, Func&lt;TResult&gt;, is the exception: it encapsulates methods that take no parameters at all. All other versions take one or more input parameters and return a value. In the simplest of terms, Action and Func are basically syntactic sugar:

In computer science, syntactic sugar is syntax within a programming language that is designed to make things easier to read or to express. It makes the language “sweeter” for human use: things can be expressed more clearly, more concisely, or in an alternative style that some may prefer.

Syntactic sugar is what you apply when boilerplate code starts making your code bland. It may be provided as a feature of the library or framework you use, or you may yourself, with some experience under your belt, go ahead and manufacture various flavors of syntactic sugar.
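As a quick sketch of that sweetness: the three-step Doubler example from earlier collapses into single lines with Func, and Action covers the void-returning case:

using System;

class Program
{
    static void Main()
    {
        // Func<int, int>: takes an int, returns an int -- no delegate declaration needed
        Func<int, int> doubler = number => number * 2;
        Console.WriteLine(doubler(21)); // prints 42

        // Action<string>: takes a string, returns nothing
        Action<string> greet = name => Console.WriteLine("Hello, " + name);
        greet("world");
    }
}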

CONCLUSION

Everything we do is based on what we have learnt from the world around us. A Software Engineer works in a domain that is more abstract than most fields of human endeavor. There is almost nothing tangible when we start working on a project. Nature's repository of tricks is what we can bank upon.

Make mental models by observing the world around you. Have models ready before you start making your way through the syntax jungle.

Read Biology texts (just look at how painkillers work), Mathematics, Mechanics…

Project Euler problem – Multiples of 3 and 5; Small yet insightful (Part 1)

PERSPECTIVE

Programming may not come naturally to everybody, but a little perspective can change things.

Familiarity with the most basic C# programs is assumed. No object-oriented concepts are needed to understand Part 1. If you are absolutely new to C#, it would be beneficial to go through, at least, the following. That is all I ask:

Hello World in C#

This post is for those new to programming and looking for a little perspective. If you can already make C# dance to your tunes, do not fret. You can still read and help me improve this post.


It is always wonderful to see a small and simple problem, when properly handled, reveal so much beauty that it is almost impossible to believe.

There was a time when programming made no sense to me. I had no idea how to do it. That is when I found a very simple problem and I was enlightened:

Find if a given number is prime or not

Now, there is not much substance in this problem. Or so it seems, till you poke it and a legion of mathematical beasts is unleashed to challenge your intellect. Number theorists of all ages have struggled to tame these beasts. They still roam about in the wilderness of Mathematics.

THE PROBLEM

This Friday, I was looking for some fun, so I created an account on Project Euler. The first question that was thrown at me was:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23. Find the sum of all the multiples of 3 or 5 below 1000.

THE SOLUTIONS

This problem seems simple. It is, if we have great computing power at our disposal. It is not, if we want to properly tame it.

THE NAIVE SOLUTION


using System;

namespace ProjectEuler_Question_1
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Sum of all numbers below 1000 that are multiples of 3 or 5 is: ");
            Console.WriteLine(FindSumOfMultiplesOf3And5().ToString());
            Console.Read();
        }

        static double FindSumOfMultiplesOf3And5()
        {
            double sum = 0;
            for (double i = 0; i < 1000; i++)
            {
                /* % is the modulus operator */
                if ((i % 3 == 0) || (i % 5 == 0))
                {
                    sum = sum + i;
                }
            }
            return sum;
        }
    }
}

This is a solution. It works. The major elements of any computer program are present:

  1. Data – do some operation on numbers below 1000
  2. Data structure – the data types used; double, in the code above, to store numbers
  3. Algorithm – the loop, the modulus operator %, the summation


Why is this naive, then?
It is naive because of:

Inefficient use of hardware resources:

  • On a Windows 7 virtual machine with one CPU core and 8 GB RAM allotted, the results are given below:
    Limit used        Time taken
    1000              sub-millisecond range
    100000            sub-millisecond range
    10000000          140.62 milliseconds
    1000000000        13972.65 milliseconds ≈ 14 seconds
    100000000000      1599374.20 milliseconds ≈ 26 minutes

Look at the CPU madness as witnessed by KSysGuard (KDE System Monitor) on my openSUSE host running on i7-4770 with 32 GB RAM:

(Screenshot: KSysGuard showing CPU activity during the run)

  • On an i5-4670 physical machine with 4 CPU cores and 32 GB RAM, the results are given below:
    Limit used        Time taken
    100000000000      1407631.13 milliseconds ≈ 23.5 minutes

There are a lot of variables that can affect the execution time. One of the most important is the hardware. A Pentium 4 PC would have made me wait for a few hours. So, do not assume that your algorithm will always run on great hardware. Design algorithms in such a way that they may run on low-end hardware too.

There is still one question here:

Why would we bother about large inputs when the problem specifically speaks about 1000 as the upper limit?

This is a very important question. The answer is quite long and will be given in another part.

Inefficient use of intellectual resources:

If we had infinite computing power, data structures and algorithms would not matter at all, but we do not. So, it is imperative that we invest some time in finding a better algorithm for the job. This process can be slow and evolutionary in nature. It can even be sudden, but we have to make it happen.

There is another issue that we have not discussed:

How would we go about solving this problem if we had no computer at our disposal?

Just imagine the pain one would have to go through to find all the multiples of 3 or 5 below 100000000000 by hand.

BETTER SOLUTION

A small investment of time can help us in breaking the problem into smaller pieces. These smaller pieces can be solved without any electronic computing power.

Realizing that we are dealing with a couple of Arithmetic Progressions can help us create better strategies. So, we are dealing with:

  1. Arithmetic Progression with 3 as the first number and a common difference of 3
  2. Arithmetic Progression with 5 as the first number and a common difference of 5
  3. Arithmetic Progression with 15 as the first number and a common difference of 15

Where did that Point 3 come from?

This needs some detailed explanation. Let us build a Truth Table for Logical OR to see where Point 3 comes from:

    Multiple of 3 (A)    Multiple of 5 (B)    A OR B
    T                    F                    T
    F                    T                    T
    T                    T                    T
    F                    F                    F

Pictorially, logical OR can be represented as a Venn diagram:

(Figure: Venn diagram of multiples of 3 and multiples of 5)

The Naive Solution used logical OR to filter the input:

if ((i % 3 == 0) || (i % 5 == 0))
{
    sum = sum + i;
}

What all this means is that:

An input can be a “Multiple of 3” OR a “Multiple of 5” OR both. In all these cases the input will be sieved into the conditional block. This happens ONLY once for any input. Thus, the sum is guaranteed to be the sum of unique numbers.

Now we have to take a look at another truth table, this time for logical AND:

    Multiple of 3 (A)    Multiple of 5 (B)    A AND B
    T                    F                    F
    F                    T                    F
    T                    T                    T
    F                    F                    F

When we break our problem into two Arithmetic Progressions, find their sums and add those sums together to get the final result, a small error creeps in. Imagine that we have two sets of numbers – one containing all the multiples of 3 and another containing all the multiples of 5. So, {3, 6, 9, 12, 15, …} is one set and {5, 10, 15, 20, 25, …} is another.

If we sum the sets individually, we have to do this:

S3 = 3 + 6 + 9 + 12 + 15 + ...

and

S5 = 5 + 10 + 15 + 20 + 25 + ...

As we can see, there are numbers that are in both the sets.

S3 and S5 are each sums of unique numbers, but when we compute S3 + S5, certain numbers get added twice. 15, 30, 45 et cetera are examples of such numbers. These numbers form an Arithmetic Progression with 15 as the first number and a common difference of 15.

We may expect the answer to the problem to be:

S(3 or 5) = S3 + S5

The truth, however, is that this naive addition overshoots the answer we expect:

S3 + S5 > S(3 or 5)

So, to get rid of the error, we have to do the following:

(Figure: Venn diagram with the intersection of the two sets highlighted)

The above picture shows the filtering we need to do. We do that filtering implicitly with the following calculation:

S(3 or 5) = S3 + S5 - S15

Now what we expect will be equal to the actual output:

S(3 or 5) actual == S(3 or 5) expected
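A quick sanity check with 20 as the limit: S3 = 3 + 6 + 9 + 12 + 15 + 18 = 63, S5 = 5 + 10 + 15 = 30 and S15 = 15, so S(3 or 5) = 63 + 30 - 15 = 78, which is exactly the sum of the unique multiples 3, 5, 6, 9, 10, 12, 15 and 18.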

Armed with so much knowledge, we are now in a position to attack the problem in a really profitable way. The code with the new algorithm is:

using System;

namespace ConsoleApplication5
{
    class Program
    {
        static double limit = 1000;

        static void Main(string[] args)
        {
            double a = FindSumOfArithmeticProgression(3);
            double b = FindSumOfArithmeticProgression(5);
            double c = FindSumOfArithmeticProgression(15);
            double sum = (a + b - c);
            Console.WriteLine(sum.ToString("N"));
            Console.Read();
        }

        static double FindSumOfArithmeticProgression(double multiplesOf)
        {
            /* We will use the following formula to calculate the sum of the progression:
             * n(a + l)/2,
             * where a = first element, l = last element and n = total number of elements.
             * To find n, we use:
             * a + (n - 1)d = l => n = 1 + (l - a)/d
             */
            double a = multiplesOf, d = multiplesOf, l = 0;

            /* Find l */
            l = FindLastElement(multiplesOf);

            /* Find n */
            double n = 1 + (l - a) / d;

            /* Find sum */
            double sum = n * (a + l) / 2;
            return sum;
        }

        private static double FindLastElement(double multiplesOf)
        {
            /* The last multiple below the limit lies in the window [limit - multiplesOf, limit),
             * which contains exactly one multiple.
             * In our case, that window is [997, 1000) or [995, 1000).
             */
            double rangeInWhichToFindTheLastMultiple = limit - multiplesOf;
            double l = 0;

            for (double i = rangeInWhichToFindTheLastMultiple; i < limit; i++)
            {
                if (i % multiplesOf == 0)
                {
                    l = i;
                    break;
                }
            }
            return l;
        }
    }
}

This is fast beyond belief. No input is big enough to make it sweat:

    Limit used        Time taken
    1000              sub-millisecond range
    100000            sub-millisecond range
    10000000          sub-millisecond range
    1000000000        sub-millisecond range
    100000000000      sub-millisecond range

From the 26 minutes taken by the old naive algorithm to the sub-millisecond range of our new, shiny algorithm for an input as large as 100000000000 – the improvement is unbelievable.

Now that we have brought the execution time down, we are faced with another issue. Yes, we are not done yet – far from it. The new problem is the output: it is not precise at all.

23,333,333,331,666,700,000.00 while it should have been 23,333,333,331,666,666,668.00

What happened? Where did all the good digits go?

Why is the output not precise?

This is because we have not chosen the correct data structure. The data type double, chosen for the purpose of holding the various numbers, has its limits. It is a 64-bit data type:

    Type      Approximate range                    Precision       .NET Framework type
    double    ±5.0 × 10^−324 to ±1.7 × 10^308      15-16 digits    System.Double

Read the System.Double documentation thoroughly to understand what limitations we are dealing with.
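A quick sketch that demonstrates this ceiling (at the magnitude of 10^17, adjacent double values are 16 apart, so adding 1 changes nothing):

using System;

class Program
{
    static void Main()
    {
        double big = 1e17;                 // exactly representable, but its neighbors are 16 apart
        Console.WriteLine(big + 1 == big); // True: the +1 is lost to rounding
    }
}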

FROM double TO decimal

We have to get a better data type if we wish to solve the problem for very large numbers. Let us look at decimal:

    Type      Approximate range                                   Precision                  .NET Framework type
    decimal   (-7.9 × 10^28 to 7.9 × 10^28) / (10^0 to 10^28)     28-29 significant digits   System.Decimal

Decimal loses steam when the input reaches 10^15.

For anything less than that, decimal will work. Decimal is a 128-bit data type. Read the System.Decimal documentation for details.

Let us compare the output for double and decimal with 10^14 as the input:

    Output when using double                     Output when using decimal
    2,333,333,333,333,320,000,000,000,000.00    2,333,333,333,333,316,666,666,666,668.00

Decimal can handle a really large input, but what if we have an input that is greater than 10^14?

FROM decimal TO BigInteger

This is where we go to the data type that is the biggest of them all – The BigInteger.

BigInteger can hold arbitrarily large numbers, with one thing to keep in mind:

Because the BigInteger type is immutable (see Mutability and the BigInteger Structure) and because it has no upper or lower bounds, an OutOfMemoryException can be thrown for any operation that causes a BigInteger value to grow too large

Now that BigInteger has arrived to rescue us, there is another issue that crops up:

The biggest integer literal that can be defined in C# is a ulong, with a maximum value of 18,446,744,073,709,551,615:

    Type     Range                               Size                      .NET Framework type
    ulong    0 to 18,446,744,073,709,551,615     Unsigned 64-bit integer   System.UInt64

So any literal larger than 18,446,744,073,709,551,615 will make the compiler writhe in pain. It lets us know by spewing out the following error:

error CS1021: Integral constant is too large

BigInteger has us covered even in such situations: turning the very large input into a string and using the BigInteger.Parse function overcomes the issue.
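A tiny sketch of that workaround (the full program below side-steps the issue differently, building its limit with BigInteger.Pow):

using System;
using System.Numerics;

class Program
{
    static void Main()
    {
        // This 21-digit value would not compile as a literal (error CS1021),
        // but parsing the same digits from a string works fine.
        BigInteger huge = BigInteger.Parse("100000000000000000000");
        Console.WriteLine(huge * 2);
    }
}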


using System;
using System.Numerics;

namespace ConsoleApplication5
{
    class Program
    {
        static BigInteger limit = BigInteger.Pow(10, 80);

        static void Main(string[] args)
        {
            BigInteger a = FindSumOfArithmeticProgression(3);
            BigInteger b = FindSumOfArithmeticProgression(5);
            BigInteger c = FindSumOfArithmeticProgression(15);
            BigInteger sum = (a + b - c);
            Console.WriteLine(sum.ToString("N"));
            Console.Read();
        }

        static BigInteger FindSumOfArithmeticProgression(BigInteger multiplesOf)
        {
            /* We will use the following formula to calculate the sum of the progression:
             * n(a + l)/2,
             * where a = first element, l = last element and n = total number of elements.
             * To find n, we use:
             * a + (n - 1)d = l => n = 1 + (l - a)/d
             */
            BigInteger a = multiplesOf, d = multiplesOf, l = 0;

            /* Find l */
            l = FindLastElement(multiplesOf);

            /* Find n */
            BigInteger n = 1 + (l - a) / d;

            /* Find sum */
            BigInteger sum = n * (a + l) / 2;
            return sum;
        }

        private static BigInteger FindLastElement(BigInteger multiplesOf)
        {
            /* The last multiple below the limit lies in the window [limit - multiplesOf, limit),
             * which contains exactly one multiple.
             */
            BigInteger rangeInWhichToFindTheLastMultiple = limit - multiplesOf;
            BigInteger l = 0;

            for (BigInteger i = rangeInWhichToFindTheLastMultiple; i < limit; i++)
            {
                if (i % multiplesOf == 0)
                {
                    l = i;
                    break;
                }
            }
            return l;
        }
    }
}

The above code uses an arbitrarily large limit (10^80, built with BigInteger.Pow) to show how BigInteger can solve our issues; for inputs that arrive as text, BigInteger.Parse, as sketched earlier, works equally well.

CONCLUDING PART 1:

I have covered a lot in this part:

  • The basic structure of a program; data structures, algorithms etcetera
  • How execution time depends on hardware (there are other factors not covered)
  • How choice of data structure can affect the program and its output
  • How to choose proper data types for desired results

This is a work in progress and changes will be made. The next few parts will cover some advanced stuff.
There is a lot more to do to make the program beautifully generic. All that will be done in the next few posts. We have only breached the topmost layer of the beautiful world of Software Engineering, or more precisely the Art of Software Engineering 🙂

When we had no computers, we had no programming problem either. When we had a few computers, we had a mild programming problem. Confronted with machines a million times as powerful, we are faced with a gigantic programming problem. – Edsger Dijkstra

Graphics Interchange Format (GIF) – A great tool to kill bugs fast

Problems are the reason we invent and innovate incessantly. Problems are as stubborn as cockroaches. But we humans are at the top of the food chain not without a reason. We reason.

Not all problems can be solved by the lone ranger. In fact, there are very few problems that can be solved without borrowing ideas from history. This is true in all fields. Software Engineering is no exception. Collaboration is the key.

HOW TO FILE A BUG PROPERLY

Providing a detailed sequence of steps, irrespective of the medium used, helps the collaborator understand the issue at hand unambiguously and reproduce it consistently. A bug report that does not have the following is a bad, incomplete, waste-of-time bug report:

  1. Environment description and logs (this is unavoidable) – the necessary machine hardware details, Operating System details, framework details (.NET, Java, Python), application version, build number etcetera.
  2. Detailed steps-to-reproduce.
  3. Exported mail chains (mail clients allow exporting mails, which can then be attached) – a lot of communication happens via mail, and mails are a treasure trove of information. Why not use them properly?
  4. Screenshots and static images (JPEG, PNG etc.) – read further to see a better option.

There are many other artifacts that can be attached and the reader should decide whether attaching something would help in getting the issue fixed. The above four items are the most important ones.

Recently, I was asked to fix a bug. The problem was that I was not able to reproduce it. The person reporting the bug was a Hardware Engineer calibrating hardware using software written by me. I captured some screenshots showing the sequence of steps I executed to reproduce the issue, pasted them into a Word document and mailed him the document. This was far from an ideal solution and I did not like it one bit. Static images pasted on multiple pages of a Word document were pathetic, to say the least. This intellectual pain forced me to take a look at other options:

  1. Screencast
  2. Merging individual screenshots to form some sort of motion picture

Screencast is a very handy option and the tools available are really great. But they can be heavy – on the PC and on the wallet. Their output too can be quite heavy, and attaching it to bug reports may not be a good idea. Thus, they are too complex for my tasks. Merging individual screenshots to form a silent motion picture seemed a good idea. That is when the Graphics Interchange Format (GIF) came to my mind:

Graphics Interchange Format (GIF) images can help in killing software bugs fast. A picture is worth more than a thousand words. A GIF, then, is worth more than a thousand pictures.

All I needed was a tool to capture a GIF of my workstation as I went about the business of reproducing the issue.

ENTER LICEcap

Cockos Incorporated produces this excellent tool that can capture GIFs. LICEcap is a very small open source tool. Yes, free and open source, and small beyond belief (< 1 MB). The guys behind this tool are quite famous:

The employees of Cockos have created and launched many fine products and technologies prior to joining Cockos, including Gnutella, K-Meleon, Kaillera, NSIS, SHOUTcast, and Winamp.

So, trust is not an issue at all. I went ahead and installed the tool. It works in a very simple and intuitive way. No one needs a lesson on how to use it; the moment one opens the application, everything is crystal clear. It surely lacks advanced features, but it does the job of capturing the screen very well.

THEN I FOUND ScreenToGif

This too is simple and intuitive, but it has some advanced features that LICEcap lacks, the most significant being the ability to edit the recorded GIF. A list of features, as borrowed from the tool's website:

  • Record your screen and save directly to a gif looped animation.
  • Pause and continue to record.
  • Move the window around to record what you want.
  • You can add Text, Subtitles and Title Frames.
  • Edit the frames, add filters, revert, make yoyo style, change frame delay, add border.
  • Export frames.
  • Crop and Resize.
  • You can work even while the program is recording.
  • Remove frames that you don’t want.
  • Select a folder to save the file automatically or select one before encoding.
  • Add the system cursor to your recording.
  • Very small sized, portable and multilanguage executable.
  • Start/Pause and stop your recording using your F keys.
  • Multi language: Portuguese, Spanish, Romanian, Russian, Swedish, Greek, French, Simplified Chinese, Italian, Vietnamese and Tamil.
  • GreenScreen unchanged pixels to save kilobytes.
  • You can apply actions/filters to selected frames.
  • Fullscreen Recording.
  • Snapshot Mode.
  • Drag and Drop to add frames in the editor.

These tools can help us in our fight against bugs. Further research: