## The problem with the previous version of Intel's library benchmark

This short post is an update for the previous blog entry. Most likely,
the Intel compiler does not produce the code that is 100x faster, at least, not under normal circumstances. The updated benchmark is posted on the GitHub. Note that the explanation below is just my guess, I will be happy to hear an alternative version.

It looks like the Intel compiler is super-clever and can dynamically adjust accuracy of computation. Consider the following code:

    float sum = 0;    for (int j = 0; j < rep; ++j) {        for (int i = 0; i < N*4; i+=4) {            sum += exp(data[i]);            sum += exp(data[i+1]);            sum += exp(data[i+2]);            sum += exp(data[i+3]);        }    }

Here the sum becomes huge very quickly. Thus, the result of calling exp becomes very small compared to sum. It appears to me that the code built with the Intel compiler does detect this situation. Probably, at run-time. After this happens, the function exp is computed using a very low-accuracy algorithm or not computed at all. As a result, when I ran this benchmark on a pre-historic Intel Core Duo 2GHz, I was still able to crunch billions of exp per second, which was clearly impossible. Consider now the following, updated, benchmark code:

    float sum = 0;    for (int j = 0; j < rep; ++j) {        for (int i = 0; i < N*4; i+=4) {            sum += exp(data[i]);            sum += exp(data[i+1]);            sum += exp(data[i+2]);            sum += exp(data[i+3]);        }       sum /= float(N*4); // Don't allow sum become huge!    }

Note line 9. It prevents sum from becoming huge. Now, we are getting more reasonable performance figures. In particular, for single-precision values, i.e., floats,
the Intel compiler produces a code that is only 10x faster compared to code produced by the GNU compiler. It is a large difference, but it is probably due to using SIMD
extensions for Intel.

## How fast are our math libraries?

The GNU C++ compiler produces efficient code for multiple platforms. The Intel compiler is specialized for Intel processors and produces even more efficient, but Intel-specific code. Yet, as some of us know, one does not get more than 10-20% improvement in most cases by switching from the GNU C++ compiler to the Intel compiler.
(The Intel compiler is free for non-commercial uses.)

There is at least one exception: programs that rely
heavily on a math library. It is not surprising that users of the Intel math library often enjoy almost a two-fold speedup over the GNU C++ library when they explicitly employ highly vectorized, linear algebra, and statistical functions. What is really amazing is that ordinary mathematical functions such as exp, cos, and sin can be computed 5-10 times faster by the Intel math library, apparently, without sacrificing accuracy.

Intel claims a 1-2 order-magnitude speedup for all standard math functions. To confirm this, I wrote simple benchmarks. They run on a modern Core i7 (3.4 GHz) processor in a single thread. The code (available from GitHub) generates random numbers that are used as arguments for various math functions. The intent of this is to represent plausible argument values. Intel also reports performance results for “working” argument intervals and admits that it can be quite expensive to compute functions (e.g., trigonometric) accurately for large argument values.

For single-precision numbers (i.e., floats), the GNU library is capable of computing only 30-100 million mathematical functions per second, while the Intel math library completes 400-1500 million of operations per second. For instance, it can do 600 million exponentiations or 400 million computations of the sine function (with single-precision arguments). This is slower than Intel claims, but still an order of magnitude faster than the GNU library.

Are there any accuracy tradeoffs? The Intel library can work in the high accuracy mode, in which, as Intel claims, the functions should have an error of at most 1 ULP (unit in the last place). Roughly speaking, the computed value may diverge from the exact one only in the last digit of mantissa. For the GNU math library, the errors are known to be as high as 2 ULP (e.g., for the cosine function with a double-precision argument). In the lowest accuracy mode, additional order-of-magnitude speedups are possible. It appears that the Intel math library should be superior to the GNU library in all respects. Note, however, that I did not verify Intel accuracy claims and I appreciate any links in regard to this topic. In my experiments, to ensure that the library works in the high-accuracy mode, I make a special call:

vmlSetMode(VML_HA);

I came across the problem of math-library efficiency, because I needed to perform many exponentiations to integer powers. There is a well-known approach called exponentiation by squaring and I hoped that the GNU math library implemented it efficiently. For example, to raise x to the power of 5, you first compute the square of x using one multiplication, and another multiplication to compute x4 (by squaring x2). Finally, you can multiply x4 by x and return the result. The total number of multiplications is three. Note that the naive algorithm would need four multiplications.

The function pow is overloaded, which means that there are several versions that serve arguments of different types. I wanted to ensure that the correct, i.e., the efficient version was called. Therefore, I told the compiler that the second argument is an unsigned (i.e., non-negative) integer as follows:

y = pow(x, (unsigned)5);

Big mistake! For the reason, which I cannot fully understand, this makes both compilers (Intel and GNU) choose an inefficient algorithm. As my benchmark shows (module testpow), it takes a modern Core i7 (3.4 GHz) processor almost 200 CPU cycles to compute x5. This is ridiculously slow, if you take into account that one multiplication can be done in one cycle (or less if we use SSE).

So, the following handcrafted implementation outperforms the standard pow by an order of magnitude (if the second argument is explicitly cast to unsigned):

float PowOptimPosExp0(float Base, unsigned Exp) {    if (!Exp) return 1;    float res = Base;    --Exp;     while (Exp) {        if (Exp & 1) res *= Base;         Base *= Base;        Exp >>= 1;    };     return res;};

If we remove the explicit cast to unsigned, the code is rather fast even with the GNU math library:

int IntegerDegree=5;pow(x, IntegerDegree);

Yet, my handcrafted function is still 20-50% faster than the GNU pow.

Turns out that it is also faster than the Intel's version. Can we make it even faster? One obvious reason for inefficiency are branches. Modern CPUs are gigantic conveyor belts that split a single operation into a sequence of dozens (if not hundreds) of micro operations. Branches may require the CPU to restart the conveyor, which is costly. In our case, it is beneficial to use only the forward branches. Each forward branch represents a single value of an integer exponent and contains the complete code to compute the function value. This code "knows" the exponent value and, thus, no additional branches are needed:

float PowOptimPosExp1(float Base, unsigned Exp) {    if (Exp == 0) return 1;    if (Exp == 1) return Base;    if (Exp == 2) return Base * Base;    if (Exp == 3) return Base * Base * Base;    if (Exp == 4) {        Base *= Base;        return Base * Base;    }     float res = Base;     if (Exp == 5) {        Base *= Base;        return res * Base * Base;    }     if (Exp == 6) {        Base *= Base;        res = Base;        Base *= Base;        return res * Base;    }     if (Exp == 7) {        Base *= Base;        res *= Base;        Base *= Base;        return res * Base;    }     if (Exp == 8) {        Base *= Base;        Base *= Base;        Base *= Base;        return Base;    }   …}

As a result, for the degrees 2-16, the Intel library performs 150-250 operations per second, while the customized version is capable of making 600-1200 exponentiations per second.

Acknowledgements: I thank Nathan Kurz and Daniel Lemire for the discussion and valuable links; Anna Belova for editing the entry.

## Early life of dynamic programming (Part I)

Many software developers and computer scientists are familiar with a concept of dynamic programming. Despite its arcane and intimidating name, dynamic programming is a rather simple technique to solve complex recursively defined problems. It works by reducing a problem to a bunch of overlapping subproblems each of which can be further processed recursively. The existence of overlapping subproblems is what differentiates dynamic programming from other recursive approaches such as divide-and-conquer. An ostensibly drab mathematical issue, dynamic programming has a remarkable history.

The approach originated from discrete-time optimization problems studied by R. Bellman in 1950s and was later extended to a wider range of tasks, not necessarily related to optimization. One classic example is the Fibonacci numbers F= Fn-1 + Fn-2. It is straightforward to compute the Fibonacci numbers one by one in the order of increasing n, as well as to memorize the results on the go. In that, computation of Fn-2 is a shared subproblem whose solution is required to obtain both Fn-1 and Fn. Clearly this is just a math trick unrelated to programming. How come was it named so strangely?

There is a dramatic, but likely untrue, explanation given by R. Bellman in his autobiography:

"The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was secretary of Defense, and he actually had a pathological fear and hatred of the word ‘research’. I'm not using the term lightly; I'm using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term ‘research’ in his presence. You can imagine how he felt, then, about the term ‘mathematical’ …  Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation.

What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word ‘programming’. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying—I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely ‘dynamic’, in the classical physical sense … Thus, I thought ‘dynamic programming’ was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."

This anecdote, though, is easy to disprove. There is published evidence that the term dynamic programming was coined in 1952 (or earlier), whereas Wilson became the secretary of defense in 1953. Wilson held a degree in electrical engineering from Carnegie Mellon. Before 1953, he was a CEO of a major techology company General Motors, while at early stages of his career he supervised development of various electrical equipment. It is, therefore, hard to believe that this man could trully hate the word "research". (The observation on the date mismatch was originally made by Russell & Norvig in their book on artificial intelligence.)

Furthermore, linear programming (which also has programming in its name), appears in the papers of G. Dantzig before 1950. A confusing term "linear programming", as Dantzig explained in his book, was based on the military definition of the word "program", which simply means planning and logistics. In mathematics, this term was adopted to denote optimization problems and gave rise to several names such as integer, convex, and non-linear programming.

It should now be clear that the birth of dynamic programming was far less dramatic: R. Bellman simply picked up the standard terminology and embellished it with the adjective "dynamic", to highlight the temporal nature of the problem. There was nothing unusual in the choice of the word "dynamic" either: The notion of dynamic(al) system (a system with time-dependent states) comes from physics and was widely used already in the 19th century.

Dynamic programming is very important to computational biology and approximate string searching. Both domains use string similarity functions that are variants of the Levenshtein distance. The Levenshtein distance was formally published in 1966 (1965 in Russian). Yet, it took almost 10 years for the community to fully realize how dynamic programming could be used to compute the string similarity functions. This is an interesting story and it is a topic of my next post.

Edited by Anna Belova