log in | contact | about 

Accurate BM25 similarity for Lucene: a follow-up

A couple of months ago, I published a post on improving BM25 Lucene similarity by getting rid of lossy document length encoding. I demonstrated that for a community QA retrieval task, effectiveness of Lucene's BM25 ranking scheme can be quite a bit lower compared to the lossless BM25 implementation. However, I did not test using standard TREC collections. Now I am filling this gap. To summarize my results, the difference between two similarity implementations on standard collections is noticeably smaller compared to the difference on a community QA task. Yet, this difference still exists. One may think that community QA tasks are quirky and, perhaps, biased in some way. However, I tend to think that this discrepancy stems from the difference in the average query length: community QA queries are much longer than TREC-Web queries. For this reason, they may be more sensitive to inaccuracies in the ranking algorithm. In particular, Stack Overflow queries are the longest and this is the collection where the difference between two BM25 implementations is the largest. Note that this only a hypothesis: Additional experiments to refute/support this hypothesis are, of course, welcome. Below, I describe my experiments in more detail. The code is on GitHub.

For this set of experiments, I use subsets of two sizeable TREC collections: ClueWeb09 and ClueWeb12. Each of these subsets (called category B subsets) comprise about 50 million HTML documents. While the document collections are large, query (or topic) sets are quite modest. For ClueWeb09, I use 500 first topics (and respective relevance judgements) from the Million Query Track. I do not use any further topics, because their relevance judgments are too sparse (many queries have no judgments at all). For ClueWeb12 my original plan was to use a standard NIST TREC collection of queries. Unfortunately, it has merely 100 queries/topics. For this reason, I do not get anything even close to statistically significant differences. Plus, as we learn from our simulations, such small topic sets of queries are quite unreliable.

For these reasons, I use the derivative collection UQV100 created by Peter Bailey and colleageus. Bailey et al. took TREC Web topics (years 2012-2013) and created several query variants of each topic via crowdsourcing. For example, the topic raspberry pi generated variants such as: amazon raspberry pi, buy raspberry pi, cost of raspberry pi, and so on. Then, for each query variant Bailey et al. generated query responses and judged them. A tricky part here is that they have not released relevance judgements for specific queries. Instead, they have merged relevance judgements for queries within a single topic. I nevertheless assume that all generated queries for the same original topic share the same set of relevance judgements. Implementing this assumption requires duplication of relevance judgements (henceforth, QRELs). Specifically, each query within a topic receives the same set of QRELs (technically, this is done by my script scripts/merge_uqv100.py)

Evaluation results are in the table below. Unlike the the previous post I decide to use more standard IR metrics, namely, ERR@20 and NDCG@20. I also do not measure retrieval time for Web collections, because their indices do not fit into memory of my laptop. Timings for community QA data is given in the previous post.

NDCG@20 ERR@20
Comprehensive (Yahoo Answers!) 10K queries
Lucene BM25 0.1245 0.0064
Accurate BM25 0.1305 0.0067
Accuracy gain 4.8% 5.4%
p-value 2e-16 6e-13
Stack Overflow (10K queries)
Lucene BM25 0.1118 0.0057
Accurate BM25 0.1200 0.0061
Accuracy gain 7.4% 7.9%
p-value 2e-16 2e-16
ClueWeb09/One Million Queries (500 queries)
Lucene BM25 0.2621 0.0826
Accurate BM25 0.2699 0.0860
Accuracy gain 3% 4.1%
p-value 0.014 0.037
ClueWeb12/UQV100 (6099 queries)
Lucene BM25 0.1604 0.1813
Accurate BM25 0.1638 0.1851
Accuracy gain 2.1% 2.1%
p-value 2e-16 7e-7

Comparison of IBM Watson and Deepmind approaches to QA

This is written in response to a Quora question. It asks about differences and similarities between the IBM Watson QA system and a deep neural network system described in the DeepMind paper "Teaching Machines to Read and Comprehend Hermann et al 2015".

First I note that we don’t know exactly how IBM Watson works. However, we can clearly see that two systems solve two different problems. While both systems search for an entity that is answer to a question (henceforth answer entity), in the case of the IBM Watson, the answer entity is sought for in a large array of unstructured information. In contrast, the DeepMind approach only need to select an entity from a given document, which is guaranteed to contain the answer.

Finding an answer in a small document is usually an easier problem. My claim is backed up by data in Table 2 of the above-mentioned DeepMind paper. From this table we can see, that in 85% of all cases a correct answer is among top 10 most frequent entities. An easier problem does not mean an easy solution and, in fact, I do find Deepmind accuracy numbers to be impressive. DeepMind neural models beat a simpler word distance model by 20% without doing any feature engineering.

That said, finding an answer entity in a large collection is a more challenging problem, but it is essentially reduced to the problem of finding an answer in a much smaller (pseudo) document. Such a document is created using (mostly) information retrieval techniques by obtaining document snippets that are textually similar to a question. There are alternative approaches to finding relevant pieces of information such as knowledge bases (see also my older post here), but they do not seem to produce many answers.

Reducing the problem to finding an answer in a smaller collection is not enough. A more accurate extraction relies on a number of models and heuristics that we do not know (they are IBM's secret sauce). However, judging by IBM Watson publications, the key heuristic seems to be an answer type. For example, if the question is about a person, the system would extract people's name and focus on analyzing most frequent once. In that, we want to exclude person names that are already mentioned in the question text. To reiterate, the complete answer selection model is, of course, much more complicated and includes many more features such as textual similarity.

To conclude, I want to note that the original IBM Watson approach employs a lot of feature engineering. However, because the search in a large collection is reduced to a search in a much smaller subset of potentially relevant snippets, good results may be obtained by replacing manual feature engineering (and heuristics some of which are known from 60s) with neural network approaches similar to what is described in the DeepMind paper.

Summing up values inside 128 bit SSE or 256 bit AVX register: It's not easy to vectorize!

Imagine that you have 4 floating point values stored in a 128 bit SSE register or 8 floating point values stored in a 256 bit AVX register. How do you sum them up efficiently? One naive solution involves storing the values of the register to memory and summing them up using scalar operations. Is it possible to vectorize the summation efficiently? In other words, is it possible to sum up using SSE or AVX operations?

I tried some obvious solutions, in particular, an approach that uses the so called "horizontal" addition. I also tried to extract floating point values using SIMD operations and sum them up using scalar addition, but without storing the original floating point values to memory.

None of the approaches worked faster than the simpler scalar approach. Maybe the reason is that SIMD operations have high latency. In contrast, scalar additions have smaller latency and the CPU can execute several of them in parallel thanks to superscalarity. Yet, I am not 100% sure that this explanation is good. In any case, it seems to be hard to sum up values of SSE/AVX registers efficiently. As usual, my code is available for scrutiny.

UPDATE: in the first versions there was a bug that didn't show up, because I tested using only integer numbers. After fixing this bug, performance of a shuffle-based version increased a bit, but it's still much slower than the scalar addition.

Accurate BM25 similarity for Lucene

In this blog post, I explain why Lucene's BM25 implementation is not accurate and propose its efficient replacement. I will also cover the following three little-known topics related to BM25 and sometimes other similarity models:

  1. Lossy document length encoding in Lucene indexing;
  2. An arcane method of index-time boosting (and why you probably don't want to use it);
  3. An omission in Lucene's indexing tutorial related to choosing the right similarity during indexing.

The efficiency and effectiveness of my BM25 replacement is verified using two collections created from community question-answering data. One collection is publicly available so that my experiments can be easily reproduced by people without academic affiliations (see code in my GitHub repo). I think this should be especially interesting to researchers using Lucene's BM25 as a baseline.

Among other similarity models, Lucene employs the BM25 similarity. It is a variant of the TF*IDF scheme, where the normalized term frequency (i.e., TF) is computed using the following formula:

\text{freq} \cdot (k_1 + 1)
\text{freq} + k_1 \cdot \left(1 - b + b \cdot |D| \cdot \text{iboost}^{-2} \cdot |D|^{-1}_{\text{avg}} \right)
}, \textbf{(*)}

where freq is a raw, i.e., unnormalized term frequency, $|D|$ is a document length in words, $|D|_{\text{avg}}$ is an average document length, and iboost is an index-time boosting factor ($k_1$ and $b$ are parameters).

First, let us talk about the boosting factor. It works by reducing the document length and, consequently, the denominator in the equation (*). Hence, increase in iboost leads to increase in the normalized term frequency and, thus, in the overall score. However, the relationship between the index-time boosting factor is quite convoluted and I really doubt that such a boosting scheme is usable in practice.

One may wonder why the index-time boosting is implemented in such an unusual fashion as opposed to introducing a simple multiplicative factor. The reason is that Lucene's API does not support such multiplicative factors directly. Therefore, a developer of a BM25 similarity class had to bundle index-time boosting with computation of the document-length normalization factor. As we read in the documentation for the latest Lucene version: " At indexing time, the indexer calls computeNorm(FieldInvertState), allowing the Similarity implementation to set a per-document value for the field that will be later accessible via LeafReader.getNormValues(String). Lucene makes no assumption about what is in this norm, but it is most useful for encoding length normalization information. "

It is the hook function computeNorm(FieldInvertState) that computes the value $|D| \cdot \text{iboost}^{-2}$ and compresses it into a one-byte value. Because of this lossy compression, there are only 256 possible normalization factors. Therefore, we can precompute the value of $k_1 \cdot \left(1 - b + b \cdot |D| \cdot \text{iboost}^{-2} \cdot |D|^{-1}_{\text{avg}} \right)$ that participates in Eq. (*) and avoid recomputation during query-time.

This memoization technique seems to result in a noticeable speed up, but it also substantially degrades the quality of Lucene's BM25 ranking (due to the lossy normalization compression). In what follows, I will describe experiments where (depending on the effectiveness metric and data type) performance loss is 5-10%. The collections that I use are rather small: 4-6 million short documents. I suspect that the degradation becomes more noticeable as the collection size increases. This may not matter in all applications, of course, but it is quite aggravating if you use Lucene BM25 as one of the baselines in your experiments.

Before I proceed with the experiments, I want to highlight that document length normalization factors may be (and mostly are) incompatible among different similarities. For this reason, one need to use exactly the same similarity during both indexing (see, e.g., my code here) and retrieval. This fact, however, seems to be missing from the Lucene's demo/tutorial file. If you use Lucene 6 and BM25, this will not be a problem, because BM25 is now the default similarity (and BM25 parameters are not used during the computation of the document length normalization). Yet, this would be a problem in Lucene 4 or 5, where the default similarity is different from BM25. Likewise, if you implement a custom similarity class, you may need to specify it both during indexing and retrieval.

For the purpose of experiments, I use two community question answering (QA) data sets: Yahoo! Answers collection L6 (shortly Comprehensive) and Stack Overflow (code excluded). These collections are used to assess effectiveness and efficiency of two BM25 implementation for Lucene. The first implementation is the standard BM25Similarity in Lucene 6. The second implementation (class BM25SimilarityFix) is a modification of Lucene's similarity class. This modification does not use an approximation for the document length.

The access to Yahoo! Answers Comprehensive collection is, unfortunately, restricted to people from academia. The Stack Overflow collection can be freely downloaded, see my GitHub repo for details. From a each collection, I extract questions and their corresponding best answers. As far as I understand, a best answer is selected by the user who asks the question. Questions that are not answered and questions for which there is no selected best answer are ignored. The resulting collections have 4.4 million QA pairs for Comprehensive and 6.2 million QA pairs for StackOverflow

Community QA data allows us to test the quality of retrieval algorithms by measuring the accuracy at the task of retrieving answers by using respective questions as queries. While the overall effectiveness of such method may not be good enough to be useful in practice, we can experiment with large collections of queries without the need to manually annotate thousands (or millions) retrieval results as it is done, e.g., in the TREC evaluation.

More specifically, I first retrieve 100 most highly ranked documents and compare effectiveness using several standard IR metrics: the precision/accuracy at rank one (P@1), the recall at rank 10 (Recall@10), the mean average precision (MAP) and the overall answer recall (which is technically Recall@100). A disadvantage of the community QA data is that there may be more than one relevant answer when users submit similar questions. Given a question, relevant best answers posted for similar questions might be even more relevant than the best answer posted for this question. I personally think that such outcomes should be quite infrequent, but I do not have good numbers to back up my hypothesis. In any case, we will keep in mind that the accuracy at rank 1 (P@1) and the mean reciprocal rank might be slightly biased. However, I do not see a good reason why a better system should not find respective best answers more frequently, in particular, among top 10 highest ranked results. In other words, I think we can pretty much trust metrics such as recall at rank 10 (Recall@10).

The above described collections are used to assess effectiveness and efficiency of two BM25 implementations. To this end, I search using the same set of 10 thousand queries 11 times (each set of queries uses 10 thousands first questions from a collection). The hardware is Intel(R) Xeon(R) CPU E5-1410 @ 2.80GHz with 10 MB of cache and 32 GB of RAM. The tests are run on Linux using Java 8. The first retrieval run is used to "warm up" the index. The results for the following 10 runs are used to compute efficiency.

Most evaluation work is done by the script run_eval_queries, which also computes p-values using the two-sided paired t-test (for this you need R and Python). Before doing this, of course, you would need to create a Lucene index. A more detailed description of the experimental process is given in the README file.

Retrieval time
Retrieval time
SD (ms)
P@1 Recall@10 MAP Recall@100
Comprehensive (Yahoo Answers!)
Lucene BM25 38.2 3.5 0.0722 0.1666 0.1043 0.2925
Accurate BM25 39.5 1.3 0.0768 0.1742 0.1098 0.2969
Gain 3.4% 6.4% 4.6% 5.2% 1.5%
p-value 9E-05 6E-08 1E-12 0.0015
Stack Overflow
Lucene BM25 338.2 24.2 0.065 0.1494 0.0937 0.2927
Accurate BM25 356.4 19.2 0.0712 0.1588 0.1009 0.3037
Gain 5.4% 9.5% 6.3% 7.7% 3.8%
p-value 1E-06 5E-09 2E-16 2E-09

The experimental results are given in the table. First, we can see that the standard Lucene's implementation is, indeed, a tad faster: by 3% for Comprehensive and by 5% for StackOverflow (for a more reliable comparison, though, I should have carried out more experiments, because currently these differences are within one standard deviation from respective means). At the same, the standard implementation is substantially less effective. In particular, for Comprehensive it is 6.4% worse in P@1 and 4.6% worse in Recall@10. These substantial differences are also statistically significant (the statistically significance tests generate tiny p-values).

The difference is smaller if we consider recall at rank 100. This is not especially surprising, because our collections are relatively small (4.4M and 6.2M indexed answers). So, in many cases Lucene is able to find relevant answers, but it cannot rank them high enough using the current implementation of BM25. My guess is that the gap between implementations would increase if many more answers were indexed.

I am not going to speculate whether a 3-5% loss in efficiency is worth a 5-10% gain in accuracy. Ultimately, the user would decide. However, if you employ Lucene BM25 as a baseline in IR experiments, you should probably not use the standard Lucene BM25 similarity due to the potential loss in accuracy. Of course, I may be wrong. So, I encourage the readers to scrutinize my code.

To conclude, I would note that it would be possible to further optimize the existing similarity (while keeping it accurate), if we could recompute normalization factors after the collection is created. Specifically, we could have precomputed the value $k_1 \cdot \left(1 - b + b \cdot |D| \cdot \text{iboost}^{-2} \cdot |D|^{-1}_{\text{avg}} \right) $ that participates in Eq. (*) for each document (technically, we would also need to store the float in the long-type format, however, it is possible by converting a float via floatToRawIntBits). However, such an precomputation is not possible with the current API.

NMSLIB 1.5 is out

We have recently released a new version of our Non-Metric Space Library (NMSLIB). NMSLIB is an efficient and extendible toolkit for searching in generic non-metric spaces. Our killer feature is generality. Despite being general, some of the methods may outperform the celebrated LSH in the spaces where LSH methods seem to work best. For more details, please, check my reflections on the current state of the library and future work.


Subscribe to RSS - blogs