log in | about 
 

Algorithms to merge sorted lists or arrays

I have written a rather thorough description of algorithms that one can use to merge sorted lists or arrays of integers. Feel free to vote for this description on Quora. Here I decided to duplicate my answer (slightly revised and improved).

The choice of the merging algorithm depends on (1) the distribution of data (2) the hardware that you use. In that, there are several major approaches or a combination thereof that can be used:

  1. Classic k-way intersection with the priority queue. I believe it's described in Knuth. All the lists should be sorted in advance. You read the smallest values from each list and put them into the queue. More specifically, you put the pair (value, list id). Then you extract the smallest value using the queue and output it. If it came from list K, you extract the smallest value from the list K and push the smallest pair (value, K) to the priority queue (while simultaneously removing it from list K) . And so on so forth.

    Priority queue is not especially fast, in particular, because working with a queue entails a lot of branching (can be slow on both CPUs and GPUs due to branch misprediction). Therefore, other approaches may be more efficient sometimes.

  2. Pairwise merge sort. It is a well-known algorithm, so I won't describe it here. However, if you merge two lists, where one is much shorter than other methods can do better.

    In particular, you can iterate through a shorter list and find an insertion point in the large list using an exponential search (a fancier and more efficient version of the binary search). We used this approach in the context of list intersection, but the same method works well for unions.

  3. Using bitmasks. If your lists are represented as bitmasks, merge is super fast. Extraction of the result can be a bit tricky. However, using modern CPU instructions, you can do it rather easily. Daniel Lemire covered the topic of bitmap decoding extensively. Alternatively, one can use hashing.

    Encoding the whole list as a bitmap can be wasteful. This is why, people use some hybrid approaches where only a part of the list is encoded as a bitmap. If you have a sorted list as an input, it can actually may make sense to convert it first to a bitmap and then carry out a union/intersection using the bitmap.

  4. Using the ScanCount algorithm. Imagine that the minimum number is zero and the maximum number is M. You can create a table with M+1 elements that are all set to zero initially. To carry out a merge, you have to iterate over lists that you merge. If, during the iteration you encounter the number X, you set the element X in the table to one (or increment it if you need to know the number of lists that contain the number). Finally, you iterate over the (M+1)-element table and check which elements are non-zero. Bonus: input lists do not have to be sorted!

    The table may have byte or bit elements. Zeroing table elements before merging can be done in several ways. One very simple approach is via the library function memset (it's memset in C/C++ may have different name in other languages). Though this seems to be naive, memset can zero about 10 billion integers per second for cache-resident data. See the test program here.
    ScanCount can be surprisingly efficient.

    To fit data into cache, you need to reuse the same small table M ~ 60-100K elements. In practice, of course your numbers will be larger than M. However, you can split your inputs and process each split separately.

  5. To conclude, I would mention that there are more advanced so-called adaptive algorithms, which I don't remember off the top of my head. Google something like "adaptive list intersection", or "adaptive list merging".



Branchless code that would leave you speechless (C++ streams are super expensive)

I was looking for a portable C++11 way to check if a file (with a given name) exists. No luck, unfortunately, this functionality will be available only in C++14. However, I found one workaround that left me speechless. It is not actually the solution itself, it is rather a comment that is so fascinating:

Solution:

  1. bool b = std::ifstream("filename").good();

Comment: Without the branch instructions (like if) it must perform faster as it needs to be called thousands of times.

Gosh! If you care about these kind of micro-optimizations, you should learn about the costs first. A branch misprediction is merely several wasted CPU cycles, but C++ streams are 2 orders of magnitude more costly. Not to mention all the overhead related to the filesystem calls!

Even for a filesystem-less stream, it takes a thousand CPU cycles to construct. Why is it so horribly slow? I have no idea, but such code should be avoided wherever performance is important (of course, we can use it otherwise).

UPDATE: For a good comment on the inefficiency of C++ streams (by Sergey Nepomnyachiy) see my Google+ account.



There is nothing wrong with "comprised of"!

Some people believe that the phrase "comprised of" is incorrect. The main argument here is that the only meaning of "comprise" is "include". In other words, "comprise" denotes a composition, but does not necessarily refer to all composite elements. In contrast, the phrases like "to be composed of" or "to be made up of" are followed by a complete list of all composite elements. Because "comprise" should be used only as a synonym for "include" and "contain", "to be comprised of " does not make any sense as do not make sense phrases "to be included of" and "to be contained of".

In truth, "comprise" has a second meaning "make up", i.e., "compose", and both major dictionaries, e.g., Oxford and Merriam-Webster acknowledge this fact (I also checked a printed edition of Merriam-Webster). Never mind, that "comprised of" was in use as early as in the 18th century. Some folks still cannot accept this.

The wrongness of "comprised of" was debunked multiple times, see e.g. entries [1] and [2] in the Language Log. What is a new spin here? Turns out that there is a guy Bryan Henderson who devoted his life to exterminating all the occurrences of "comprised of" in Wikipedia. Specializing in exactly this "grammatical error", he has made about 50 thousand Wikipedia edits! To support his point of view, he compiled a list of arguments. However, as I show below, none of these arguments is convincing while some of them are plainly fallacious.

1. Even dictionaries acknowledge this usage, though they all tell you it's disputed and typically discourage writers from using it. See for example Wiktionary.

This claim must be backed up by better evidence. While some of the dictionaries might tell you otherwise, two some of the most authoritative sources (Merriam-Webster and Oxford) do support the use of "comprised of" as a synonym for "composed of", i.e., they certainly do not discourage anyone. Wiktionary, in turn, seems to support its claim by citing century old editions of dictionaries. Excuse me, English evolved quite a bit since then.

2. It's completely unnecessary. There are many other ways to say what the writer means by "comprised of". It adds nothing to the language.

English is ambiguous and redundant: We have synonyms, paraphrases, etc. Let us get rid of them, they add nothing to the language! In fact, it is often considered a bad style to repeat the same expression in the same sentence or paragraph (unless it is a precise technical term). From this perspective, we should all welcome "comprised of" as yet another way to avoid dull repetition.

3. It's illogical for a word to mean two opposite things.

It is not clear what "opposite things" mean here: A better explanation is clearly needed. If Bryan hints that "include" and "to be comprised of" are antonyms, I cannot agree (and, probably, few people would). To me, these are synonyms or near synonyms. Furthermore, I could not find a source claiming that "include" is an antonym for "to be comprised of" (see, e.g., this page).

4. The etymology of the word does not support "comprised of". It comes from Latin words meaning to hold or grasp together. Other English words based on those same roots are "comprehensive" and "prehensile" (as in a monkey's prehensile tail: it can grab things). Comprise's French cousin also makes this clear.

It is not uncommon for a (borrowed) word to change the meaning over time. There are gazillions of other examples like this one. Why should we attack this particular deviation from the original etymology? Besides, English is not a Latin language and I see no good reason to stick to the true Latin meanings of borrowed words.

5. It's new. Many current Wikipedia readers learned to write at a time when no respectable dictionary endorsed "comprised of" in any way. It was barely ever used before 1970. Even now, style manuals frequently call out this particular usage as something not to do.

As noted here, the controversial use was very common in legal practice even before 1970. The Google Books search gives plenty of earlier examples (many not related to the legal domain) as well.

The mainstream editions may have neglected this specific use of "comprise" (as a synonym for "make up") until recently. However, nowadays, at least two respectable dictionaries fully endorse this without any reservations! I would reiterate that now this phrase is considered to be a part of the Standard English by respectable linguists and editors. One should be careful not to use "comprise of" (in the active voice), though, because this usage does not seem to be correct.

Ok, but what about journals and newspapers? They do not oppose it, either. Check, e.g., the New Yorker and the Washington Post.

6. It's imprecise. English has a variety of ways to say things the writer means by "comprised of". "Composed of", "consists of", and "comprises" are subtly different. In sentences I edit, it often takes careful thought to decide just which one of these things the article should say. Thus the sentence with "comprised of" isn't quite as expressive.

I agree that it is imprecise. Furthermore, the exact meaning of this phrase was even disputed legally! All natural languages including English are ambiguous. However, in most cases, people cope with ambiguity quite successfully. Thus, unless there is statistical evidence of significant confusion caused by this specific imprecision, it is best not mess with it!

7. Many writers use this phrase to aggrandize a sentence -- to intentionally make it longer and more sophisticated. In these, a simple "of", "is", or "have" often produces an easier-to-read sentence. (Example: "a team comprised of scientists" versus "a team of scientists").

There is nothing wrong with aggrandizing per se. It is a matter of style, I guess. The claim that exterminating "comprised of" makes things more readable seems to be a personal opinion, which is not backed up by hard numbers.

Bryan may believe that his hobby is harmless. Even if "comprised of" is correct, it is always possible to replace the phrase with an equivalent "composed of", right? Well, I don't think so. Brian is a software engineer and, as software engineers know full well, hasty edits eventually do break something. There is no way to avoid a mistake, no matter how careful you are! Therefore, one of the most important engineering principles is: Do not to fix things that are not broken! Arguably, the use of "comprised of" does not affect readability much (at least, it was not proven otherwise). However, there are tons of other Wikipedia articles that are indeed poorly written. Editing these pages will be a more valuable public service.



Teaching an old cat new dog tricks

They say you cannot teach an old dog new tricks, but what about cats? Let me introduce our Bengal cat Masyanya Markovna Bonus, whose patronymic name Markovna refers to the Markov stochastic memoriless process. Once fed, she quickly transitions to a new state (discarding the history simultaneously) and starts demanding the food again.

Masyanya Markovna is ten years old. By cat's standards this a near-retirement age. This is often the time when cats start developing health problems. For example, our poor kitty lost all her teeth recently. Yet, she stays mentally sharp and eager to learn.

We are not the first owners of Masyanya: she has been living with us for only three years. It was not until two years ago when my wife noticed how our ever hungry animal was reaching out for food. Jokingly she suggested we should teach her to do the ‘Stand!’ trick. We knew cats could be trained, but our cat was already fairly old. Before she joined our family, she was not trained to do any tricks. Neither did I have animal-training experience.

To our astonishment, the experiment was successful. Encouraged by this, we decided to try some new dog tricks. It was a slow process bearing some similarity to training a deep artificial neural network. In short, it can be easy to teach the kitty do one trick, but quite problematic to teach another one. Cats (somewhat similar to sophisticated machine learning algorithms) overtrain easily. As a result, no matter what the command is, the cat may want to perform the trick she learned first. However, once you are done with two tricks, you can do many more.

Another parallel to machine learning: Cats are sensitive to priors (at least this was our case). For example, they may perform the trick quite well in a familiar setting (e.g., the living room), but completely refuse to do anything in another setting. They react to the voice commands, but they are also very sensitive to the body language.

To substantiate our story, we post the video of the latest performance, where our awesome cat does several dog tricks:

What is a takeaway message? One is quite clear: Cats are highly trainable, even mature ones. But, perhaps, more importantly, if mature cats can learn new tricks, people should not be afraid to do the same. Even though lots of people may try to discourage you, uncovering and nurturing your talent should not stop in your 3rd or 4th decade.

This series of cat posts is co-authored with Anna Belova.



If data is new code, we need new design patterns

As the number of data-driven applications increases, data is becoming an important part of the code base. It is such a clear trend that some people even rushed to announce a demise of code. "Code is a commodity", claims Henry Verdier, and "Data is the new code". While this seems to be an exaggeration, our increasing dependence on data has consequences. In fact, as Sculley and colleagues argue in their recently published paper "Machine Learning: The High-Interest Credit Card of Technical Debt" (D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young), the cost of data dependencies outweighs the cost of code dependencies in projects heavily relying on data-driven (aka machine learning) approaches. Forget about the never ending "functional vs objected oriented" debate. Let us get straight on the issue of data dependencies first.

Unfortunately, it is not easy to do. Sculley and colleagues argue that in traditional software engineering, the number of interdependencies can be greatly reduced via encapsulation and modular design. This is possible because we write modules and functions to satisfy certain strict requirements. We know there are certain logical invariants and can check functionality via unit and integration tests. For certain applications, we can even formally verify correctness. However, as note the authors, "... it is difficult to enforce strict abstraction boundaries for machine learning systems by requiring these systems to adhere to specific intended behavior. Indeed, arguably the most important reason for using a machine learning system is precisely that the desired behavior cannot be effectively implemented in software logic without dependency on external data."

At the same time, it is not easy to decompose the data, because there are no truly independent features. It is hard to isolate an improvement (which features did contribute most?) and it is hard to debug problems. Overall, machine learning systems are much more fragile, because small local changes (e.g., in regularization parameters, convergence thresholds) may and often do have ripple effects. Sculley and colleagues call this phenomenon a CACE principle: Changing Anything Changes Everything.

Clearly, as the data-driven applications become even more common, there will be an established set of best practices and design patterns tailored specifically to management of data. Some of the emerging patterns are already discussed by Sculley and colleagues. Among other things, they recommend reducing the amount of glue code, removing little-impact features and experimental code paths.

There are a number of tools (in many languages) to identify code dependencies. Sculley and colleagues argue that data dependencies can be analyzed as well, in an automatic or semi-automatic manner. At the very least, one can catalog all the features used in the company. Different learning modules can report on the usage of the features to a central repository. When a version of the feature changes, or the feature becomes deprecated, it is possible to find all relevant consumers quickly. Such a feature management tool greatly reduces the risk of having a stealthy consumer, e.g., one that reads features from log files, whose behavior is adversarially affected by deprecation or change of certain input signals.

Machine learning is a powerful tool allowing us to quickly build complex systems based on previously observed data patterns instead of laboriously handcrafting the patterns manually. Yet, its performance hinges on the assumption that previously observed statistical properties of the data remain unchanged in the future. A situation, where this assumption is violated, is called a concept drift. As a result of the concept drift, performance of a predictive model deteriorates with time. The more sophisticated is the model, the more likely it is to suffer from this drift. In particular, an error rate of a simple linear model may become equivalent to that of a more sophisticated model!

Unfortunately, in the real world the main machine learning assumption does not hold. An example of the domain, where the concept drift is especially stark, is spam detection. The current anti-spam software is good, but it would not be good without constant retraining and introduction of new features. Again, Sculley and colleagues do discuss this problem and propose a couple of mitigation strategies.

To conclude, I again emphasize that data-driven applications are different from classic software projects. It is expected that new best practices and design patterns will evolve and mature in the future to deal with problems like data dependencies and the ever changing statistical properties of the external world. The paper "Machine Learning: The High-Interest Credit Card of Technical Debt" overviews some of the design practices already used successfully by Google folks. I would recommend reading this paper and following some of the references to everyone interested in building large interconnected machine learning systems.



Pages

Subscribe to RSS - blogs