
The first use of the plus operator in an online search engine

Remember the old plus operator that Google retired in 2011? In many search engines, including Google before October 2011, this operator indicates a mandatory (or important) keyword. Do you know when the plus operator was first used in an online search engine? I bet few would guess that it happened half a century ago, in the pre-Internet era:

In 1965, TEXTIR permitted users to do some search term weighting. By preceding a term with a plus sign, a searcher could direct TEXTIR to increase the score assigned of that word, and thus raise the score of the source document that contained that word.

How was online search possible before the Internet? One could use a phone line (and, apparently, a dial-up modem):

Queries were sent to SDC's Q-32 computer in Santa Monica via telephone from a Teletype Model 35 terminal ... In response, the system ... transmitted the texts of retrieved reports back by Teletype in relevance rank order.

Source: A History of Online Information Services 1963-1976 by C. P. Bourne and T.B. Hahn.

The online search service, one of the first of its kind, was developed and provided by the System Development Corporation (SDC). SDC is considered to be the first software company in the world.

It is not the ideas that are overrated, it is the implementation that is undervalued

I think that we, as a society, have come to an important realization: The notion of the Idea Person, who effortlessly produces a stream of ingenious ideas to be implemented by less intelligent underlings, needs to be deflated. At least, many of us do understand that good ideas are not born easily. On the contrary, a good idea is the result of a tedious selection process that involves experimentation, reading, backtracking, hard work, and exchange of knowledge. It is also not unusual for an idea to evolve substantially in the course of implementation. Yet, little or no credit goes to the Implementation Person.

As a result of the existing imbalance, some people have come to another extreme conclusion: Ideas are not valuable. Here, I have to disagree. Not all ideas are worthless. The problem is that it is hard to distinguish between a good and a bad idea until an implementation is attempted. Nevertheless, a good idea is an important ingredient of progress: Success is not possible without proper implementation, but it is not possible without good ideas either. As it was put by my co-author Anna, it is not the ideas that are overrated, it is the implementation that is undervalued.

Efficient grape counting in your vineyard via passive computer vision

Believe it or not, the USA is the largest consumer of wine: it guzzles more than 10% of all wine produced on the planet. However, it lags somewhat in production. It turns out that maximizing grape yield relies heavily on measurements during the growing season, in particular, on crop estimation. If certain areas are underperforming, it is often possible to fix the issue by, e.g., additional irrigation and fertilization.

Crop estimation is an expensive, labor-intensive process that was previously carried out only by humans. The Robotics Institute of Carnegie Mellon University (in collaboration with Cornell University and stakeholders) works on developing automated measuring techniques. At Carnegie Mellon University, the group is led by Stephen Nuske.

What is truly astonishing is that the proposed technology relies only on passive vision techniques, which are considered too unreliable to be used outside a lab. Unlike self-driving cars, which require expensive laser-powered sensing devices called LIDARs, the proposed technology uses only a camera. The camera resides on a small cart that drives at a speed of about 5 mph (if I remember correctly, there is also a flash to neutralize variability in lighting). While the cart is driving, the camera takes overlapping pictures of grape vines. The obtained images are processed to detect individual grapes and count them!

Although image recognition algorithms have reached a certain level of maturity, it is still challenging to detect individual grapes, because there are millions of potential locations to check in a single picture. This is especially hard when the grapes have not yet ripened (and consequently both leaves and grapes are green). However, the researchers from the Robotics Institute of Carnegie Mellon University can count grapes even in real time! To accomplish this complex task, they use a combination of a quick high-recall, low-precision filtering algorithm and a more accurate algorithm that removes false matches. The high-recall, low-precision algorithm is an ensemble of two relatively simple key-point detection algorithms. The approach is described in a series of publications. The overall accuracy seems to be pretty good, and the technology might be commercialized in the not-so-distant future.

To conclude, I would like to note that, in addition to grape counting in your vineyard, Stephen Nuske worked on several other cool projects, where passive vision was applied to real-world problems. These may be interesting to both practitioners and lab scientists specializing in computer vision.

Algorithms to merge sorted lists or arrays

I have written a rather thorough description of algorithms that one can use to merge sorted lists or arrays of integers. Feel free to vote for this description on Quora. Here I decided to duplicate my answer (slightly revised and improved).

The choice of the merging algorithm depends on (1) the distribution of the data and (2) the hardware that you use. There are several major approaches, which can also be combined:

  1. Classic k-way merge with a priority queue. I believe it's described in Knuth. All the lists should be sorted in advance. You read the smallest value from each list and put it into the queue. More specifically, you put the pair (value, list id). Then you extract the smallest pair from the queue and output its value. If it came from list K, you remove the next smallest value from list K and push the pair (value, K) to the priority queue. And so on and so forth.

    A priority queue is not especially fast, in particular, because working with a queue entails a lot of branching (which can be slow on both CPUs and GPUs due to branch misprediction). Therefore, other approaches may sometimes be more efficient.

  2. Pairwise merge sort. It is a well-known algorithm, so I won't describe it here. However, if you merge two lists where one is much shorter than the other, other methods can do better.

    In particular, you can iterate through the shorter list and find each insertion point in the longer list using an exponential search (a fancier and more efficient version of the binary search). We used this approach in the context of list intersection, but the same method works well for unions.

  3. Using bitmasks. If your lists are represented as bitmasks, merge is super fast. Extraction of the result can be a bit tricky. However, using modern CPU instructions, you can do it rather easily. Daniel Lemire covered the topic of bitmap decoding extensively. Alternatively, one can use hashing.

    Encoding the whole list as a bitmap can be wasteful. This is why people use hybrid approaches where only a part of the list is encoded as a bitmap. If you have a sorted list as an input, it may actually make sense to convert it first to a bitmap and then carry out a union/intersection using the bitmap.

  4. Using the ScanCount algorithm. Imagine that the minimum number is zero and the maximum number is M. You can create a table with M+1 elements that are all set to zero initially. To carry out a merge, you iterate over the lists that you merge. If, during the iteration, you encounter the number X, you set the element X in the table to one (or increment it if you need to know the number of lists that contain the number). Finally, you iterate over the (M+1)-element table and check which elements are non-zero. Bonus: input lists do not have to be sorted!

    The table may have byte or bit elements. Zeroing table elements before merging can be done in several ways. One very simple approach is via the library function memset (that is its name in C/C++; it may have a different name in other languages). Though this seems naive, memset can zero about 10 billion integers per second for cache-resident data. See the test program here. ScanCount can be surprisingly efficient.

    To fit data into cache, you need to reuse the same small table of M ~ 60-100K elements. In practice, of course, your numbers will be larger than M. However, you can split your inputs into ranges and process each split separately.

  5. To conclude, I would mention that there are more advanced so-called adaptive algorithms, which I don't remember off the top of my head. Google something like "adaptive list intersection", or "adaptive list merging".
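Approaches 1, 2, and 4 above can be sketched in C++ roughly as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the names List, kway_merge, gallop_merge, and scan_count_union are made up for this example, values are assumed to be 32-bit unsigned integers, the first two functions keep duplicates, and the ScanCount variant computes a duplicate-free union with a byte table that is zero-initialized by the vector constructor rather than by an explicit memset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using List = std::vector<std::uint32_t>;  // hypothetical alias for this sketch

// Approach 1: k-way merge of sorted lists with a min-heap of (value, list id).
List kway_merge(const std::vector<List>& lists) {
    using Entry = std::pair<std::uint32_t, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> pos(lists.size(), 0);  // next unread index per list
    for (std::size_t k = 0; k < lists.size(); ++k)
        if (!lists[k].empty()) heap.push({lists[k][0], k});
    List out;
    while (!heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        out.push_back(e.first);
        std::size_t k = e.second;
        // Refill the heap from the list that supplied the value just output.
        if (++pos[k] < lists[k].size()) heap.push({lists[k][pos[k]], k});
    }
    return out;
}

// Approach 2: merge a short sorted list into a long sorted list. Each
// insertion point is found by an exponential search that starts where
// the previous one ended.
List gallop_merge(const List& shorter, const List& longer) {
    const std::size_t n = longer.size();
    List out;
    out.reserve(shorter.size() + n);
    std::size_t lo = 0;  // longer[0, lo) has already been copied to out
    for (std::uint32_t x : shorter) {
        std::size_t bound = 1;
        // Double the step until we pass x or run off the end of the list.
        while (lo + bound < n && longer[lo + bound] < x) bound *= 2;
        auto first = longer.begin() + (lo + bound / 2);
        auto last = longer.begin() + std::min(lo + bound, n);
        auto it = std::lower_bound(first, last, x);  // binary search in the window
        out.insert(out.end(), longer.begin() + lo, it);
        out.push_back(x);
        lo = static_cast<std::size_t>(it - longer.begin());
    }
    out.insert(out.end(), longer.begin() + lo, longer.end());
    return out;
}

// Approach 4: ScanCount union of lists whose values lie in [0, max_value].
// Inputs need not be sorted; the output is sorted and duplicate-free.
List scan_count_union(const std::vector<List>& lists, std::uint32_t max_value) {
    std::vector<std::uint8_t> table(static_cast<std::size_t>(max_value) + 1, 0);
    for (const List& l : lists)
        for (std::uint32_t x : l) {
            assert(x <= max_value);  // values outside the table are not supported
            table[x] = 1;
        }
    List out;
    for (std::size_t v = 0; v < table.size(); ++v)
        if (table[v]) out.push_back(static_cast<std::uint32_t>(v));
    return out;
}
```

For example, kway_merge({{1, 4, 7}, {2, 4}, {3}}) yields 1, 2, 3, 4, 4, 7, while scan_count_union({{9, 1, 4}, {4, 0}}, 9) yields 0, 1, 4, 9. Which variant is fastest depends on the data distribution and the hardware, so it is worth measuring on your own inputs.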

Branchless code that would leave you speechless (C++ streams are super expensive)

I was looking for a portable C++11 way to check if a file (with a given name) exists. No luck: unfortunately, the standard library gained this functionality (std::filesystem) only in C++17. However, I found one workaround that left me speechless. It is not actually the solution itself, but rather a comment on it that is so fascinating:


    bool b = std::ifstream("filename").good();

Comment: Without the branch instructions (like if) it must perform faster as it needs to be called thousands of times.

Gosh! If you care about these kinds of micro-optimizations, you should learn about the costs first. A branch misprediction merely wastes several CPU cycles, but C++ streams are two orders of magnitude more costly. Not to mention all the overhead related to the filesystem calls!

Even for a filesystem-less stream, it takes a thousand CPU cycles to construct. Why is it so horribly slow? I have no idea, but such code should be avoided wherever performance is important (of course, we can use it otherwise).
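For comparison, here is a minimal C++11 sketch that avoids constructing a stream object by using the C standard I/O functions instead. The helper name file_readable is made up for this example, and it treats "can be opened for reading" as a proxy for "exists", which is an assumption: an existing file without read permission would be reported as missing. Whether this is actually faster in a given setting should be measured, since the filesystem call itself is likely to dominate.

```cpp
#include <cstdio>

// Hypothetical helper (not from the original post): returns true if the file
// can be opened for reading, which is used here as a proxy for "exists".
bool file_readable(const char* name) {
    std::FILE* f = std::fopen(name, "rb");
    if (f == nullptr) return false;  // yes, a branch; it is not the bottleneck
    std::fclose(f);
    return true;
}
```

A call such as file_readable("filename") can then replace the one-liner above wherever the check runs many times.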

UPDATE: For a good comment on the inefficiency of C++ streams (by Sergey Nepomnyachiy) see my Google+ account.

