 

This post was inspired by a real task of sorting 25 million URLs. The topic of memory efficiency in Java is old. Yet, running out of memory and making the garbage collector consume 700% of the CPU (in a futile attempt to find free memory) never gets old. So, I decided to run my own tests: can I sort 10 and 50 million URLs in the memory of my computer?

Test setup: 16 GB of memory, Intel Core i7 laptop CPU, 3.3 GHz peak frequency. I use the Oracle JVM 1.7, but similar results are obtained using OpenJDK 1.7. The maximum heap size for the JVM is set to 16 GB (via the -Xmx flag). Java is built and run using Maven, which displays memory usage. In addition, the peak memory usage of both the C++ and the Java programs is obtained through a custom utility that checks memory consumption 10 times per second. The code is available online.
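The polling utility itself is not shown in this post; below is a minimal sketch (in Java, for consistency with the rest of the post) of how such a monitor could look on Linux. It assumes it is given a PID on the command line and reads the VmHWM (peak resident set size) field from /proc/&lt;pid&gt;/status ten times per second. The actual utility used for the measurements may differ.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical peak-memory monitor (Linux only): polls /proc/<pid>/status
// ten times per second and reports the largest VmHWM (peak RSS) observed.
public class PeakMemWatcher {
    public static void main(String[] args) throws Exception {
        String pid = args[0];                     // PID of the process to watch
        long peakKb = 0;
        while (true) {
            List<String> lines;
            try {
                lines = Files.readAllLines(Paths.get("/proc/" + pid + "/status"));
            } catch (Exception e) {
                break;                            // process has exited
            }
            for (String line : lines) {
                if (line.startsWith("VmHWM:")) {  // peak resident set size, in kB
                    long kb = Long.parseLong(line.replaceAll("\\D+", ""));
                    peakKb = Math.max(peakKb, kb);
                }
            }
            Thread.sleep(100);                    // 10 samples per second
        }
        System.out.printf("Peak memory usage: %.2f GB%n", peakKb / (1024.0 * 1024.0));
    }
}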

Each URL is about 50 characters. I generate strings whose lengths are drawn uniformly at random from 40 to 60 characters. Then, I create an array where I save references to two-element objects. One element is a URL (String type). The other element is an ID (int). I am careful to create each string only once. Overall, there are two test cases with 10 and 50 million strings, respectively.
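For concreteness, here is a minimal sketch of the Java test (the actual benchmark code is available online and may differ in details): it fills an array with hypothetical UrlEntry objects, each holding a random 40-to-60-character "URL" string and an int ID, and then sorts the array by URL.

import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class SortUrls {
    // Hypothetical two-element object: a URL string and an integer ID.
    static final class UrlEntry {
        final String url;
        final int id;
        UrlEntry(String url, int id) { this.url = url; this.id = id; }
    }

    public static void main(String[] args) {
        final int n = 10_000_000;           // 10 or 50 million in the actual tests
        Random rnd = new Random(12345);
        UrlEntry[] entries = new UrlEntry[n];
        for (int i = 0; i < n; i++) {
            int len = 40 + rnd.nextInt(21); // uniform length in [40, 60]
            StringBuilder sb = new StringBuilder(len);
            for (int j = 0; j < len; j++) {
                sb.append((char) ('a' + rnd.nextInt(26)));
            }
            entries[i] = new UrlEntry(sb.toString(), i); // each string created once
        }
        long start = System.currentTimeMillis();
        Arrays.sort(entries, new Comparator<UrlEntry>() {
            public int compare(UrlEntry a, UrlEntry b) { return a.url.compareTo(b.url); }
        });
        System.out.println("Sorted " + n + " entries in "
                + (System.currentTimeMillis() - start) / 1000.0 + " sec");
    }
}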

In addition, I wrote a C++ program that does the same thing: it randomly generates objects containing short strings (URLs) and integer IDs, then saves object references (more precisely, pointers) in a fixed-size array. In Java a character uses 16 bits (two bytes). Thus, in the C++ program I allocate strings that are twice as long as those in the Java program. I also explicitly delete all the memory before exiting the C++ program.

For 10 million strings, the data needs 0.96 GB to be stored (without overhead). The other statistics are:

      Peak memory usage (GB)   Run-time (sec)
Java  3                        18.4
C++   1.68                     2.7

For 50 million strings, the data needs about 5 GB to be stored (without overhead). The other statistics are:

      Peak memory usage (GB)   Run-time (sec)
Java  10.5                     120
C++   8.42                     11

As you can see, in this simple example, Java is somewhat "greedier" than C++, but not terribly so. When the number of strings is smaller (10 million), the Java program uses three times as many bytes as are needed to store the raw data (without overhead). For the larger data set, the overhead coefficient of the Java program is only 2.1. For C++, the overhead coefficient is roughly 1.7 in both cases. As one of my readers advised, an even smaller footprint in Java can be achieved by using specialized libraries. One good example is the FastUtil library. What such libraries typically do is pack many strings/objects tightly into a byte array (see the sketch below).
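To illustrate the idea (this is not FastUtil's actual API, just a hand-rolled sketch), the packing approach stores all string bytes back to back in one large byte array and keeps only int offsets per entry, so each URL no longer carries a separate String object with its own header and char[] wrapper.

import java.nio.charset.StandardCharsets;

// Minimal sketch of the "pack strings into one byte array" technique
// (not FastUtil's API; just an illustration of the idea).
public class PackedUrls {
    private byte[] data = new byte[1024];  // all URL bytes, back to back
    private int[] offsets;                 // offsets[i] = start of URL i; offsets[i+1] = its end
    private int[] ids;                     // parallel array of integer IDs
    private int count = 0;
    private int used = 0;

    public PackedUrls(int capacity) {
        offsets = new int[capacity + 1];
        ids = new int[capacity];
    }

    public void add(String url, int id) {
        byte[] b = url.getBytes(StandardCharsets.UTF_8);
        while (used + b.length > data.length) {         // grow the byte buffer as needed
            data = java.util.Arrays.copyOf(data, data.length * 2);
        }
        System.arraycopy(b, 0, data, used, b.length);
        used += b.length;
        ids[count] = id;
        offsets[++count] = used;
    }

    public String get(int i) {                          // decode a URL on demand
        return new String(data, offsets[i], offsets[i + 1] - offsets[i], StandardCharsets.UTF_8);
    }
}

Sorting then operates on an array of indices into this structure, so the per-string object overhead (headers, char[] wrappers, padding) largely disappears.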

Note that the run-time of the Java program is considerable. As advised in the comments, it runs much faster if we increase the initial heap size allocated by the JVM (the -Xms option). However, in that case, it is harder to measure the amount of memory actually consumed by the Java program.

Comments

Hi Leo, maybe you could also try setting the -Xms flag; that would pre-allocate the memory for the Java process instead of growing it as required.

export MAVEN_OPTS="-Xmx16000m -Xms16000m"




Hi Elmer, this is a good point, thank you. Indeed, the run-time decreases sharply. I will update the post shortly. The only reason why I didn't set the minimum amount of memory: it makes it harder to estimate real memory usage.




You can use Runtime to find the memory currently in use. E.g.,

Runtime runtime = Runtime.getRuntime();
long max = runtime.maxMemory();      // maximum heap the JVM will attempt to use
long total = runtime.totalMemory();  // heap currently allocated by the JVM
long free = runtime.freeMemory();    // unused part of the allocated heap
long used = total - free;            // heap actually occupied by objects
long available = (max - total) + free;
System.out.println("There are " + used + " bytes in use and " + available + " bytes of unallocated space");




Hi Andrew,

Thanks a lot for this good advice. I am actually planning to carry out some more thorough tests some day. So, this will be useful.