Text retrieval can and should benefit from using generic k-NN search algorithms

We believe that text retrieval can and should benefit from using generic k-NN search algorithms. To support our conjecture, we carried out a bunch of experiments, published a paper, as well as related software. A high-level summary of the paper is given in the talk, whose text we also post online (just in case, slides are also available).

What is all this about? Why should one use k-NN search? In a classic filter-and-refine pipeline, you would usually get candidate result set filtered by TFxIDF. What if we replace TFxIDF with some expensive-to-compute but accurate similarity? Clearly, we will not be able to use text-based inverted files to answer queries efficiently. At the same time, a brute-force comparison of query against every document would be terribly slow. However, we can try answer queries using some distance-based approximate k-NN search algorithm. If such approach is sufficiently fast, we might get a practical tool to find documents that are not possible or hard to find using TFxIDF based retrieval.

I would not claim that we have fully achieved our objective, but we have probably made a good step towards achieving it. In fact, the phrase "Let's replace" in the title of the paper means only that we see such a replacement as an important goal.

An electric highway may be, indeed, nearer than we think

Frankly speaking, I have been a bit skeptical about electric cars coming to our highways in large numbers. So, when I first heard about Germans wanting to ban sales of new internal combustion engines by 2030, my first thought was that this Bundesrat initiative was absolutely nuts (for the record, Bundesrat decision does not yet have legislative power. First, the number of plug-in cars is still laughable and not all of these plug-ins are fully electric. Second, the current infrastructure does not support en-mass charging of electric vehicles. Tesla and Nissan have (I guess incompatible) superchargers here and there, but... Man when was the last time you drove 500 miles? Imagine it is 700 now because you need to drive through a supercharging station. Last, but not least, I am not sure that battery technology is ready. These are all valid concerns, but, after doing some basic research, I have come to a conclusion that the era of electric cars may be closer than we thought.

Perhaps, my primary concern was the cost of a battery. Battery is, probably, the most expensive part of the electric car. For example, in 2010 you would pay 750 dollars per kWh of a Li-Ion battery. For an all-purpose electric car, one would need a 100+ kWh battery back, which would cost a whooping $75,000 in 2010. However, somewhat miraculously, the cost of battery reduced 5X. Furthermore, GM expects a further 1.5x reduction by the end of 2021. Wow, this means that already in 2021, the cost of a good battery would be only$10,000! This is still a lot. However, you have to remember that an all-electric car is a simpler gadget, which needs a simple engine and a simpler transmission. So, without battery shortages potentially hiking the battery price (which is, of course, a serious unknown variable), electric cars will soon be quite affordable. Perhaps, even cheaper than gasoline cars, which are also more expensive to maintain! To sum up this paragraph, even Li-Ion batteries seem to be quite a viable option. Furthermore, one should not exclude potential alternative battery technologies kicking in by 2030-2040.

Another big concern is, of course, lack of infrastructure. However, infrastructure would not necessarily be all that costly. For most commuter cars, charging can happen at home. In addition, it seems that it is actually much simpler to build superchargers than gas stations (credits to my neighbor Alex for this observation)! For example, gas stations require an underground fuel tank, but superchargers only require a reliable connection to the grid. A good question is where all the additional electricity would come from? It is a valid question, because powering electric cars with coal is not a good idea. Due to losses in, e.g., electricity transmission, the overall efficiency of such a system is not all that impressive compared to a fuel-efficient (e.g., hybrid) vehicle. In other words, we would likely only increase the amount of emissions by powering electric vehicles by new coal powerplants. Natural gas would be a better option, yet, it has its own issues. However, I also have high hopes to renewables. In particular, the price of solar panel has decreased to a point where utility companies are starting to lose money (due to people heavily relying on solar panels). At the very least, it would be affordable to use solar or wind or a combination thereof to power your local commute.

In conclusion, I note that, while adoption of electric vehicles is a process full of uncertainties, the electric highway now seems to be closer than I originally thought. Maybe, not in 2030, but 2040-2050 does not look as an unrealistic date to me any more.

Accurate BM25 similarity for Lucene: a follow-up

A couple of months ago, I published a post on improving BM25 Lucene similarity by getting rid of lossy document length encoding. I demonstrated that for a community QA retrieval task, effectiveness of Lucene's BM25 ranking scheme can be quite a bit lower compared to the lossless BM25 implementation. However, I did not test using standard TREC collections. Now I am filling this gap. To summarize my results, the difference between two similarity implementations on standard collections is noticeably smaller compared to the difference on a community QA task. Yet, this difference still exists. One may think that community QA tasks are quirky and, perhaps, biased in some way. However, I tend to think that this discrepancy stems from the difference in the average query length: community QA queries are much longer than TREC-Web queries. For this reason, they may be more sensitive to inaccuracies in the ranking algorithm. In particular, Stack Overflow queries are the longest and this is the collection where the difference between two BM25 implementations is the largest. Note that this only a hypothesis: Additional experiments to refute/support this hypothesis are, of course, welcome. Below, I describe my experiments in more detail. The code is on GitHub.

For this set of experiments, I use subsets of two sizeable TREC collections: ClueWeb09 and ClueWeb12. Each of these subsets (called category B subsets) comprise about 50 million HTML documents. While the document collections are large, query (or topic) sets are quite modest. For ClueWeb09, I use 500 first topics (and respective relevance judgements) from the Million Query Track. I do not use any further topics, because their relevance judgments are too sparse (many queries have no judgments at all). For ClueWeb12 my original plan was to use a standard NIST TREC collection of queries. Unfortunately, it has merely 100 queries/topics. For this reason, I do not get anything even close to statistically significant differences. Plus, as we learn from our simulations, such small topic sets of queries are quite unreliable.

For these reasons, I use the derivative collection UQV100 created by Peter Bailey and colleageus. Bailey et al. took TREC Web topics (years 2012-2013) and created several query variants of each topic via crowdsourcing. For example, the topic raspberry pi generated variants such as: amazon raspberry pi, buy raspberry pi, cost of raspberry pi, and so on. Then, for each query variant Bailey et al. generated query responses and judged them. A tricky part here is that they have not released relevance judgements for specific queries. Instead, they have merged relevance judgements for queries within a single topic. I nevertheless assume that all generated queries for the same original topic share the same set of relevance judgements. Implementing this assumption requires duplication of relevance judgements (henceforth, QRELs). Specifically, each query within a topic receives the same set of QRELs (technically, this is done by my script scripts/merge_uqv100.py)

Evaluation results are in the table below. Unlike the the previous post I decide to use more standard IR metrics, namely, ERR@20 and NDCG@20. I also do not measure retrieval time for Web collections, because their indices do not fit into memory of my laptop. Timings for community QA data is given in the previous post.

NDCG@20 ERR@20
Lucene BM25 0.1245 0.0064
Accurate BM25 0.1305 0.0067
Accuracy gain 4.8% 5.4%
p-value 2e-16 6e-13
Stack Overflow (10K queries)
Lucene BM25 0.1118 0.0057
Accurate BM25 0.1200 0.0061
Accuracy gain 7.4% 7.9%
p-value 2e-16 2e-16
ClueWeb09/One Million Queries (500 queries)
Lucene BM25 0.2621 0.0826
Accurate BM25 0.2699 0.0860
Accuracy gain 3% 4.1%
p-value 0.014 0.037
ClueWeb12/UQV100 (6099 queries)
Lucene BM25 0.1604 0.1813
Accurate BM25 0.1638 0.1851
Accuracy gain 2.1% 2.1%
p-value 2e-16 7e-7

Comparison of IBM Watson and Deepmind approaches to QA

This is written in response to a Quora question. It asks about differences and similarities between the IBM Watson QA system and a deep neural network system described in the DeepMind paper "Teaching Machines to Read and Comprehend Hermann et al 2015".

First I note that we don’t know exactly how IBM Watson works. However, we can clearly see that two systems solve two different problems. While both systems search for an entity that is answer to a question (henceforth answer entity), in the case of the IBM Watson, the answer entity is sought for in a large array of unstructured information. In contrast, the DeepMind approach only need to select an entity from a given document, which is guaranteed to contain the answer.

Finding an answer in a small document is usually an easier problem. My claim is backed up by data in Table 2 of the above-mentioned DeepMind paper. From this table we can see, that in 85% of all cases a correct answer is among top 10 most frequent entities. An easier problem does not mean an easy solution and, in fact, I do find Deepmind accuracy numbers to be impressive. DeepMind neural models beat a simpler word distance model by 20% without doing any feature engineering.

That said, finding an answer entity in a large collection is a more challenging problem, but it is essentially reduced to the problem of finding an answer in a much smaller (pseudo) document. Such a document is created using (mostly) information retrieval techniques by obtaining document snippets that are textually similar to a question. There are alternative approaches to finding relevant pieces of information such as knowledge bases (see also my older post here), but they do not seem to produce many answers.

Reducing the problem to finding an answer in a smaller collection is not enough. A more accurate extraction relies on a number of models and heuristics that we do not know (they are IBM's secret sauce). However, judging by IBM Watson publications, the key heuristic seems to be an answer type. For example, if the question is about a person, the system would extract people's name and focus on analyzing most frequent once. In that, we want to exclude person names that are already mentioned in the question text. To reiterate, the complete answer selection model is, of course, much more complicated and includes many more features such as textual similarity.

To conclude, I want to note that the original IBM Watson approach employs a lot of feature engineering. However, because the search in a large collection is reduced to a search in a much smaller subset of potentially relevant snippets, good results may be obtained by replacing manual feature engineering (and heuristics some of which are known from 60s) with neural network approaches similar to what is described in the DeepMind paper.

Summing up values inside 128 bit SSE or 256 bit AVX register: It's not easy to vectorize!

Imagine that you have 4 floating point values stored in a 128 bit SSE register or 8 floating point values stored in a 256 bit AVX register. How do you sum them up efficiently? One naive solution involves storing the values of the register to memory and summing them up using scalar operations. Is it possible to vectorize the summation efficiently? In other words, is it possible to sum up using SSE or AVX operations?

I tried some obvious solutions, in particular, an approach that uses the so called "horizontal" addition. I also tried to extract floating point values using SIMD operations and sum them up using scalar addition, but without storing the original floating point values to memory.

None of the approaches worked faster than the simpler scalar approach. Maybe the reason is that SIMD operations have high latency. In contrast, scalar additions have smaller latency and the CPU can execute several of them in parallel thanks to superscalarity. Yet, I am not 100% sure that this explanation is good. In any case, it seems to be hard to sum up values of SSE/AVX registers efficiently. As usual, my code is available for scrutiny.

UPDATE: in the first versions there was a bug that didn't show up, because I tested using only integer numbers. After fixing this bug, performance of a shuffle-based version increased a bit, but it's still much slower than the scalar addition.