Today was the first class of the large-scale machine learning course. If you run things at scale, you need methods with (near) linear run-time, i.e., $O(n)$ or $O(n \log(n))$. Not all big-Os are created equal, and constants do matter. To illustrate this point, our instructor used a well-known chart due to Peter Norvig and Jeff Dean, which highlights latencies of different memory types. As everybody remembers, computer memories are organized in a hierarchy. At the top of this hierarchy sits the fastest but most expensive memory, which is also the smallest. At the bottom, there is very cheap but slow memory, e.g., a rotational hard drive or a solid-state drive (SSD). Thus, reading $O(n)$ bytes from the hard drive is not the same as reading $O(n)$ bytes from main memory: There is an orders-of-magnitude difference.

It is a rather well-known fact that disk-based memory (even in the case of an SSD) is very slow. In fact, in many cases it takes less time to read compressed data from the disk (and subsequently decompress it) than to read the same data uncompressed. Another well-known fact is that the gap between the fastest volatile memory (CPU registers or L1 cache) and the slowest volatile memory is huge: A truly random read from main memory can cost you hundreds of CPU cycles. How many exactly?

The numbers in the Norvig-Dean chart are a bit outdated, so I wrote a simple application to test memory speeds. I have a Core i7-4700MQ processor with a top frequency of 3.4 GHz, and my computer has 16 GB of DDR3 memory. By running the following command (which uses 12 GB of memory):

./test_mem_latency 12 0

I find that a random access takes about 62 nanoseconds, or approximately 200 CPU cycles. In contrast, reading one byte sequentially takes only about 0.3 nanoseconds. These measurements are in line with tests made by other folks.
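For the curious, the core of such a latency test is simple: one chases indices through a random single-cycle permutation, so that every read depends on the previous one and the hardware prefetcher cannot help. The sketch below only illustrates the idea (array size and seed are arbitrary); it is not the actual code of test_mem_latency:

  #include <chrono>
  #include <cstddef>
  #include <cstdint>
  #include <iostream>
  #include <numeric>
  #include <random>
  #include <utility>
  #include <vector>

  int main() {
    // Much larger than any CPU cache: 2^27 64-bit indices, i.e., 1 GB.
    const size_t N = size_t(1) << 27;
    std::vector<uint64_t> next(N);

    // Sattolo's algorithm: a random permutation that forms a single cycle,
    // so the chase below never falls into a short loop that fits in cache.
    std::iota(next.begin(), next.end(), 0);
    std::mt19937_64 rng(12345);
    for (size_t i = N - 1; i > 0; --i) {
      std::uniform_int_distribution<size_t> d(0, i - 1);
      std::swap(next[i], next[d(rng)]);
    }

    uint64_t pos = 0;
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; ++i) pos = next[pos]; // dependent random reads
    auto stop = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    // Printing pos prevents the compiler from optimizing the loop away.
    std::cout << "avg random-read latency: " << ns / N << " ns (pos=" << pos << ")\n";
    return 0;
  }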

You can see that main memory is similar to a rotational hard drive in terms of performance: There is an orders-of-magnitude difference between the sequential and random-access reading speed. Furthermore, even though it is far less known, compressed data can also be read from memory faster than uncompressed data! For commodity servers, memory bandwidth rarely exceeds 10-20 GB/sec, and it is not hard to exhaust it if you have dozens of cores. A bit more surprising is the fact that you can read compressed data faster than uncompressed data even if you employ only a single CPU core!

Main memory now resembles a hard drive. Thus, one benefits from optimization techniques previously tailored to hard drives. One obvious technique is caching. To this end, memory is divided into chunks (cache lines). When we access data in a chunk that is not in the cache, a cache miss happens, and the CPU loads the whole chunk into the cache.
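To make this concrete, here is a tiny illustration (not from the original benchmark; the matrix size is arbitrary): summing a matrix row by row reads consecutive bytes and fully reuses every cached chunk, whereas summing it column by column touches a new chunk on almost every read and is typically several times slower:

  #include <chrono>
  #include <cstddef>
  #include <iostream>
  #include <vector>

  int main() {
    const size_t n = 8192;               // ~256 MB of ints, far bigger than the caches
    std::vector<int> a(n * n, 1);

    auto timeSum = [&](bool rowMajor) {
      long long sum = 0;
      auto start = std::chrono::steady_clock::now();
      for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
          sum += rowMajor ? a[i * n + j]  // consecutive addresses: cache-friendly
                          : a[j * n + i]; // stride of n ints: a cache miss almost every time
      auto stop = std::chrono::steady_clock::now();
      std::cout << (rowMajor ? "row-major:    " : "column-major: ")
                << std::chrono::duration<double>(stop - start).count()
                << " s (sum=" << sum << ")\n";
    };

    timeSum(true);
    timeSum(false);
    return 0;
  }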

There is also a division of memory into pages. Nowadays, programs access pages using logical addresses rather than physical ones. Clearly, the CPU needs to translate logical addresses into physical ones through a special translation table. Even if we place such a table into a "fast" L1/L2 cache, the translation operation is still costly. This is why the CPU additionally employs a Translation Lookaside Buffer (TLB), which is a fast but small cache. When the CPU is able to translate a page address using the TLB, we are lucky. Otherwise, an expensive TLB miss happens.
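On Linux, one can watch this happen directly: the perf tool exposes generic TLB counters (exact event names and their availability vary across CPUs and kernel versions, so treat this invocation only as a sketch):

perf stat -e dTLB-loads,dTLB-load-misses ./test_mem_latency 12 0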

If we use huge pages, fewer translations (from logical to physical addresses) are required: a single TLB entry covers a much larger region of memory, and, consequently, fewer TLB misses happen. As Nathan Kurz pointed out, my memory benchmark can easily be modified to test the huge-page scenario. To do so, I allocate memory as follows (this requires Linux kernel 2.6.38 or later):

  if (useHugePage) { // As suggested by Nathan Kurz
    // mmap + madvise(MADV_HUGEPAGE) asks the kernel to back this anonymous
    // region with transparent huge pages (declarations are in <sys/mman.h>)
    mem = reinterpret_cast<char *>(mmap(NULL, MemSize, PROT_READ | PROT_WRITE,
                                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    madvise(mem, MemSize, MADV_HUGEPAGE);
  } else {
    mem = new char[MemSize];
  }

To enable the huge-page mode, one runs the benchmark with the following parameters (12 again means that 12 GB will be allocated):

./test_mem_latency 12 1

Now, I get a 1.5x improvement in the case of truly random access, and an amazing 5-fold improvement in the case of semi-random (gapped) access.
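If you want to verify that the kernel really backs the allocation with transparent huge pages (the madvise call is only a hint, and it has no effect if transparent huge pages are disabled in /sys/kernel/mm/transparent_hugepage/enabled), one way is to watch the AnonHugePages counter while the benchmark is running:

grep AnonHugePages /proc/meminfo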

Huge pages do come at a price. In particular, they may lead to lower memory utilization (see also Section 4.3.2 of Ulrich Drepper's paper). Yet, some memory inefficiency can be tolerated when speed is more important.

PS: I thank Nathan Kurz, who proposed this optimization (including the code to allocate the memory) and provided the memory speed-related references.

UPDATE: There was an inaccurate description of the paging mechanism. Thanks to Daniel Lemire who noticed this.

Comments

Could you give a link to the course? I'm interested in taking it. Thanks




Sure, it's William Cohen's course. Even though it has a huge waiting list now (dozens of people), you may still have a chance:
http://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large...




There is a typo in the post, should be "lead to a *higher* memory utilization"

Otherwise, very nice post, thanks!




@Vadim,
Thank you for stopping by. I really doubt that it should be "higher". It can be, if you use memory in a clever way (I guess you can end up with less fragmented memory).

But it can apparently also be wasteful. For example, from the man page:

This feature is primarily aimed at applications that use large mappings of data and access large regions of that memory at a time (e.g., virtualization systems such as QEMU). It can very easily waste memory (e.g., a 2MB mapping that only ever accesses 1 byte will result in 2MB of wired memory instead of one 4KB page).




Nice reading!