log in | about 

Efficient Intersections of compressed posting lists thanks to SIMD instructions

I co-authored a a paper on efficient intersection of compressed posting lists. I believe our team obtained good improvements over standard approaches using single instruction multiple data (SIMD) commands available on most modern processors. Here I want to briefly explain what was done and why it was hard to achieve these good improvements.

Parallelization of the algorithms is hard, but parallelization using a single CPU core is even harder. One of the first algorithms that exploited parallel capabilities of a single CPU were bit-parallel algorithms. For instance, if you carry out a bitwise AND between two 32-bit words, you essentially perform 32 boolean ANDs between 32 pairs of variables. Bitwise logical operations are, perhaps, the first SIMD instructions implemented by CPU designers. And thanks to Wikipedia, we know that the first bit-parallel algorithms were proposed more than 40 years ago.

More advanced CPUs may provide a rich set of SIMD instructions (far beyond bit-parallel logical operations). In particular, on x86 CPUs, you can efficiently carry out four additions over four pairs of single-precision floating numbers using a single SIMD instruction. Such instructions operate on small vectors and allow us to significantly improve performance of algorithms that process several streams of data in identical fashion, e.g., matrix multiplication algorithms. Somewhat surprisingly, SIMD operations can accelerate a lightweight compression of integers. By essentially splitting the data set into four parts, one can decompress billions of integers per second.

Unfortunately, CPUs provide only a limited set of control flow operations for vectorized data. In particular, x86-compatible CPUs have an instruction _mm_cmpeq_epi32 that checks element-wise equality of two integer vectors (each of which has four elements). However, this is not sufficient to parallelize an algorithm crucially relying on control flow instructions. Imagine that you compared four pairs of integers using _mm_cmpeq_epi32 and obtained the result in the form of a four-element bitmask (element i is 1 if and only if the the i-th pair have equal numbers). How do you extract matching integers? One clearly needs a gather-scatter instruction, but it is not supported by commodity x86 CPUs.

One devious workaround is to use a shuffle operation. To this end, one needs precomputed tables of shuffle masks. For example, if the comparison results can be interpreted as number 15, you retrieve the shuffle mask 15 from the shuffle-mask table and perform the shuffle operation. Accessing memory is not fast, even if data is in L1 cache. As an example, consider a problem of extracting integer values from sets represented as bitmaps. One well-known solution to this problem involves memoization. However, it is about five times slower than other approaches. If you do not believe, check the performance/code of the function bitscan4 in this sample code.

Now consider a C++ implementation of the classic textbook algorithm to carry out intersections of two sorted lists:

  1. while (first1!=last1 && first2!=last2)
  2. {
  3. if (*first1<*first2) ++first1;
  4. else if (*first2<*first1) ++first2;
  5. else {
  6. *result = *first1;
  7. ++result; ++first1; ++first2;
  8. }
  9. }

The algorithm has a complexity of $O(n+m)$ and is optimal in many cases ($m$, $n$ are list sizes) . However, it requires a lot of branches, which are often mispredicted. As a result, this algorithm is painfully slow: We estimate its performance to be about 300-400 million integers per second (the number of integers is computed as the sum of list sizes). Clearly, if you have an algorithm that can read a compressed posting list at the speed of 4 billion integers per second, it is not especially useful when the intersection algorithm is so tardy.

Schlegel et al proposed a neat way to vectorize this text-book algorithm. Their approach relies on all-against-all comparison SIMD instruction _mm_cmpestrm. Again, there is no gather-scatter instruction on x86 CPU. So, result extraction relies on the shuffle operation and requires reading of precomputed shuffle masks. Furthermore, even the counting version of this algorithm, which only evaluates the size of the intersection (without saving a result), is not especially efficient. We estimate that performance of this counting version is around 2 billion integers per second. It is much faster than the classic text-book algorithm, but is still an impediment, if the decompression algorithms has the speed of 4 billion integers per second.

More efficient (well-known) intersection algorithms exploit a differential in posting list size: Posting lists have different lengths and the differences are quite substantial. So, one can iterate over a shorter list and check if an element from the shorter list is present in the second list. For instance, the check can be done using a binary search (or a similar in spirit approach). Binary search relies on branching, thus, SIMD instructions should be useless here, right?

I would think so. Yet, Nathan Kurz figured out that it does not have to be the case here. He made a clever observation: When we are carrying out a binary search, eventually the search interval becomes so small that searching this interval sequentially becomes feasible. And such a sequential search can be accelerated using the instruction _mm_cmpeq_epi32 (that compares 4 pairs of integers at a time). This sequential scan can be carried out in a branchless fashion: Comparison results (one result is a bit mask) can be easily aggregated using the SIMD instruction _mm_or_si128. Furthermore, there is no extraction problem here. We iterate over the short list one integer at a time. Whenever, the integer is found in the second list, we memorize this common integer (which we already have in a scalar variable/register). Otherwise, we simply proceed to check the next one.

Nathan and Daniel (I wish I could claim credit here) turned this idea into an efficient intersection algorithm that beats the scalar checking version by a good margin. The performance can be further improved by storing some of the posting lists as bitmaps (the idea proposed by Culpepper and Moffat). As a result of these combined improvements, we achieve a sub-millisecond query-processing time for Gov2. For ClueWeb09 the query-processing time can be less than 2 milliseconds. Yet, I omit further details here and refer the reader to our joint paper.

In conclusion, our approach is based on a simple idea that was hard to come by. It was somewhat counter-intuitive that this simple method would work better than the approach due Schlegel et al, which employs the all-against-all comparison SIMD instruction. In fact, I think that the latter is quite useful when we intersect or merge lists of comparable sizes. However, this is probably not a typical scenario for a textual search engine.

Main memory is similar to a hard drive

Today there was a first class of the large-scale machine learning course. If you run things at scale, you need methods with (near) linear run-time, i.e., $O(n)$ or $O(n \log(n))$. Not all big-O-es are created equal and constants do matter. Our instructor wanted to illustrate this statement. To this end, he used a well-known chart due to Peter Norvig and Jeff Dean, which highlights latencies of different memory types. As everybody remembers, computer memories are organized in a hierarchy. On top of this hierarchy is the fastest, but most expensive memory, whose size is small. At the bottom, there is a very cheap memory, e.g., a rotational or solid-state drive (SSD), which works slowly. Thus, reading $O(n)$ bytes from the hard drive is not the same as reading $O(n)$ bytes from the main memory: There is an orders-of-magnitude difference.

It is a rather well-known fact that the disk-based memory (even in the case of the SSD) is very slow. In fact, in many cases, it takes shorter time to read compressed data from the disk (and subsequently decompress it) rather than to read the uncompressed data. Another well-known fact is that the gap between the fastest volatile memory (CPU registers or L1 cache) and the slowest volatile memory is huge: A truly random read from main memory can cost you hundreds of CPU cycles. How many exactly?

The numbers in the Norvig-Dean's chart are a bit outdated, so I wrote a simple application to test memory speeds. I have a Core i7-4700MQ processor with the top frequency of 3.4Ghz. My computer has 16 GB of DDR3 memory. By running (use 12 GB of memory):

./test_mem_latency 12 0

I obtain that a random-access takes about 62 nanoseconds or approximately 200 CPU cycles. In contrast, it takes 0.3 nanoseconds to read one byte sequentially. These measurements are in line with tests made by other folks.

You can see that main memory is similar to the rotational hard drive in terms of performance: There is an orders-of-magnitude difference between the sequential and random-access reading speed. Furthermore, even though it is far less known, compressed data can also be read from memory faster than uncompressed data! For commodity servers, memory bandwidth rarely exceeds 10-20 GB/sec, and it is not hard to exhaust it if you have dozens of cores. A bit more surprising is the fact that you can read compressed data faster than uncompressed even if you employ only a single CPU core!

Main memory now resembles a hard drive. Thus, one benefits from using optimization techniques previously tailored to hard drives. One obvious optimization technique is caching. To this end, memory is divided into chunks. When we access a chunk that is not in the cache, a cache miss happens. This instructs a CPU to load this chunk into the cache.

There is also a division of memory into pages. Nowadays, programs access pages using logical addresses rather physical ones. Clearly, the CPU needs to translate logical addresses to physical addresses through a special translation table. Even if place such a table into a "fast" L1/L2 cache, a translation operation will be costly. This is why the CPU additionally employs a Tranlslation Lookaside Buffer. This buffer is a fast, but small cache. When the CPU is able to translate the page address using the TLB cache, we are lucky. Otherwise, an expensive TLB miss happens.

If we use huge pages, fewer translations (from logical to physical address) are required. We can provide efficient address translation for larger regions of memory, and, consequently, fewer TLB misses will happen. As was pointed out by Nathan Kurz, my memory benchmark can be easily modified to test the huge-page scenario. To do so, I allocate memory as follows (one needs Linux kernel 2.6.38 or later):

  1. if (useHugePage) { // As suggested by Nathan Kurz
  2. mem = reinterpret_cast<char *>(mmap(NULL, MemSize, PROT_READ | PROT_WRITE,
  4. madvise(mem, MemSize, MADV_HUGEPAGE);
  5. } else {
  6. mem = new char[MemSize];
  7. }

To enable the huge-page mode, one needs to run the benchmark with the following parameters (12 means that I will allocate 12GB)

./test_mem_latency 12 1

Now, I get a 1.5x improvement in the case of truly random access, and an amazing 5-fold improvement in the case of the semi-random (gapped) access.

Huge pages do come at a price. In particular, they may lead to a lower memory utilization (see also 4.3.2 of Ulrich Drepper paper). Yet, some memory inefficiencies can be tolerated (when speed is more important).

PS: I thank Nathan Kurz who proposed this optimization (including the code to allocate the memory) and memory speed-related references.

UPDATE: There was an inaccurate description of the paging mechanism. Thanks to Daniel Lemire who noticed this.

What you must know about alignment in 21st century

Folks who program for RISC-processors need to ensure proper alignment of certain data types. The Intel people, however, may not have such concerns. Neither they need to fear about crashing on unaligned data reads, nor unaligned reads have a substantial impact on performance. Unless, of course, we explicitly use special assembly store/load instructions that do require the data to be aligned. In this case, if we mess up, we get a deserved punishment for being fancy.

Simply speaking, folks writing in plain standard C++ should not worry about such things, right? Wrong, compilers are becoming increasingly smart and they may automatically unroll your loops. In that, they will generate Single Instruction Multiple Data assembly instructions. These instructions, which are used to carry out operations on small vectors, may require aligned data.

Consider the following code:

  1. double dist(const double*x,const double*y,size_t qty){
  2. double res=0;
  3. for (size_t i = 0; i < qty; i++)
  4. res += x[i] - y[i];
  5. return res;
  6. }

There is a loop and this loop can be unrolled. More specifically, a compiler may generate instructions to carry out 2 additions/subtractions if the CPU supports SSE extensions, or 4 additions/subtractions in the case of AVX extensions being available. This technique is commonly known as vectorization.

On the up side, vectorization can boost performance of your programs. On the down side, a compiler may use load/store instructions that assume the data to be aligned. In the above example, the compiler may assume that pointers x and y are aligned on an 8-byte boundary. But what if they are not? Your program will crash! Turns out this is a known issue, which was reported as a bug. Yet, the GCC folks closed the issue without fixing, because alignment behavior is, apparently, not enforced by the standard.

For those who want to learn more details, I created a reproducible example. If you type make, this will create two programs testalign_crash and testalign_nocrash (as well as corresponding assembly code). I used GNU C++ 4.7, but the example might not work in your specific case. This is why I also committed the assembly code that is produced in both cases: One where the program crashes and another when it does not. In the assembly of the crashing program there is an aligned read command vmovapd. If you replace all unaligned reads vmovapd with their unaligned "siblings" vmovupd, you will get the program that works without crashing (see the assembly code vectorized_manually_fixed.s) To compile the assembly file, just type:

gcc vectorized_manually_fixed.s -lstdc++

I think that one needs to understand the issue at least in some detail, unless she has spare 12 hours to debug weird program behavior. What is really "nice": An alignment-violation segfault looks exactly the same way as memory corruption segfaults. At least, I don't know how to tell the difference.

UPDATE1: It is also recommended to use -Wcast-align, though, I suspect this is not a bullet-proof recommendation.
UPDATE2: Also note that there is apparently no good reason for GCC folks to use aligned reads (performancewise it's about the same as unaligned ones). But with these kind of optimizations, a lot of old code could crash now.
UPDATE3: Following a hint by Nathan Kurz, I added handlers to catch segmentation fault and bus error signals. Now, you can see that the OS/CPU generate the segmentation fault signal, not the bus error signal. To do so, you need to uncomment the line:

  1. // Uncomment this line to see which signal handler is activated.
  2. //#ifdef CATCH_SIGBUS

Efficient exponentiation by square rooting

Recently, I was discussing the problem of efficient exponentiation to integer powers. As I learned, the GNU library (apparently aiming for maximum precision) cannot carry out this operation efficiently. One can do better by doing exponentiation by squaring. What if the power is not integer, but fractional? The GNU library is working super slowly in this case and I cannot get more than miserly 20 million of exponentiations per second even on a modern Core i7.

What if I do not care about exponentiating efficiently for all the powers, but only for some rational ones? Can I do better? For example, I want to play with similarity search in pesky non-metric Lp spaces (for p < 1) and I can select only convenient values of p such as 1/2, 1/4, or 1/8. If I multiply these values by 8, I obtain integer numbers. One can see, and it is sure a very old idea, that we can use exponentiation by square rooting here. For instance, to compute the power 1/8, we can apply the square root three times. For the power 1/2 + 1/4, we need to obtain the square root and memorize it. Then, we apply the square root two more times and multiply the result by the already computed square root.

This algorithm relies on the binary representation of the exponent and works for the same reason as exponentiation by squaring. But this should be a crazy idea, because computing a square root is a costly operation, right? No, it is probably not. In many cases, square rooting takes about 2-3 CPU cycles and can be two orders of magnitude faster than the function pow!

I wrote a simple test to verify my hypothesis. As usual, I compile using both GCC and Intel. To prevent Intel and GNU from "cheating", I sum up all computed distances and print the result in the end. This is a good-enough trick for the GNU compiler, but not for the Intel compiler. If the variable sum becomes very large, the Intel-generated code is smart enough to stop computing powers, which defeats the purpose of testing. This is why I additionally adjust the variable sum through multiplying it by small constants. Adjustment operations do introduce little overhead, but it is very small compared to the cost of exponentiation. And, of course, we are careful enough not to use division:

  1. for (int j = 0; j < rep; ++j) {
  2. for (int i = 0; i < N*4; i+=4) {
  3. sum += 0.01 * pow(data1[i], data2[i]);
  4. sum += 0.01 * pow(data1[i+1], data2[i+1]);
  5. sum += 0.01 * pow(data1[i+2], data2[i+2]);
  6. sum += 0.01 * pow(data1[i+3], data2[i+3]);
  7. }
  8. sum *= fract;
  9. }

Some benchmarking highlights:

  1. In my test, exponentiation by squaring either matches performance of the GNU pow or is substantially faster;
  2. When I plug the improved pow into the actual code for nearest neighbor search in pesky Lp (p<1) spaces, it provides more than a five-fold speed-up.
  3. Exponentiation by squaring can even be faster than the Intel's function pow for exponent with less than 5 binary digits (if you compile the code using the Intel's compiler).

Notes on precision. This version produces results, which are almost identical to the results from the GNU function. If we can tolerate an approximate version (with a relative error about 10-5, there are more efficient solutions).

Is Java memory efficient?

This was inspired by a real task of sorting 25 million URLs. The topic of memory efficiency in Java is old. Yet, running out of memory and making the garbage collector to consume 700% of CPU (in a futile attempt to find free memory) is never old. So, I decided to run my own tests: Can I sort 10 and 50 million URLs in the memory of my computer?

Test setup: 16 GB of memory, Intel Core i7 laptop CPU, 3.3 Ghz peak frequency. I use Oracle JVM 1.7, but similar results are obtained using OpenJDK 1.7. The maximum memory size for the JVM is set to 16 GB (through JVM flags). Java is built and run using Maven. Maven displays memory usage. In addition, the peak memory usage of both C++ and the Java program is obtained through a custom utility, which checks memory consumption 10 times per second. The code is available online.

Each URL is about 50 characters. I generate strings whose lengths are random and uniform numbers from 40 to 60. Then, I create an array where I save references to two-element objects. One element is a URL (String type). Another element is an ID (int). I am careful enough to create each string only once. Overall, there are two test cases with 10 and 50 million strings, respectively.

In addition, I wrote a C++ program that does the same thing: randomly generates objects containing short strings (URLs) and integer ids. Then it saves object references in a fixed-size array (more precisely, you use pointers in C++). In Java a character uses 16 bits (two bytes). Thus, in the C++ program I allocate strings that are twice as long as that in the Java program. I also explicitly delete all the memory before exiting the C++ program.

For 10 million strings, the data needs 0.96 GB to be stored (without overhead). The other statistics is:

  Peak memory usage (Gbs) Run-time (secs)
Java 3 18.4
C++ 1.68 2.7

For 50 million strings, the data needs about 5 GB to be stored (without overhead). The other statistics is:

  Peak memory usage (Gbs) Run-time (secs)
Java 10.5 120
C++ 8.42 11

As you can see, in this simple example, Java is a somewhat "greedier" than C++, but not terribly so. When the number of strings is smaller (10 million), the Java program uses thrice as many bytes as it needs to store data (without overhead). For the larger data set, the Java program the overhead coefficient is only 2.1. For C++, the overhead coefficient is roughly 1.7 in both cases. As advised by one of my readers, even smaller footprint in Java can be achieved by using specialized libraries. One good example is the FastUtil library. What such libraries typically do: they pack many strings/objects tightly in a byte array.

Note that the run-time of the Java program is considerable. As advised in the comments, it takes much shorter time to run, if we increase the amount of memory initially allocated by the garbage collector (option -Xms). However, in this case, it is harder to measure the amount of memory consumed by a Java program.


Subscribe to RSS - srchvrs's blog