UPDATE: BM25 implementation has changed in recent Lucene versions. For more details, see the post Accurate Lucene BM25 : Redux.

A couple of months ago, I published a post on improving Lucene's BM25 similarity by getting rid of lossy document length encoding. I demonstrated that, for a community QA retrieval task, the effectiveness of Lucene's BM25 ranking scheme can be quite a bit lower than that of a lossless BM25 implementation. However, I did not test on standard TREC collections. Now I am filling this gap. To summarize my results: the difference between the two similarity implementations on standard collections is noticeably smaller than the difference on the community QA task, yet it still exists. One may think that community QA tasks are quirky and, perhaps, biased in some way. However, I tend to think that this discrepancy stems from the difference in average query length: community QA queries are much longer than TREC-Web queries and may therefore be more sensitive to inaccuracies in the ranking algorithm. In particular, Stack Overflow queries are the longest, and this is the collection where the difference between the two BM25 implementations is the largest. Note that this is only a hypothesis: additional experiments to refute or support it are, of course, welcome. Below, I describe my experiments in more detail. The code is on GitHub.
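To make the source of the discrepancy concrete, here is a minimal Python sketch (not Lucene's actual Java code) of how quantizing document lengths into a small number of buckets, in the spirit of Lucene's one-byte norm encoding, perturbs BM25 term scores. The bucket count, length cap, and logarithmic grid are assumptions chosen purely for illustration:

```python
import math

# Illustrative sketch only: BM25 term score with an exact document length
# vs. a length quantized into one of 256 buckets, mimicking the kind of
# lossy norm encoding discussed above (the real Lucene scheme differs in detail).

K1, B = 1.2, 0.75

def bm25_term_score(tf, idf, doc_len, avg_doc_len, k1=K1, b=B):
    """Standard BM25 contribution of a single term."""
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)

def quantize_len(doc_len, num_buckets=256, max_len=10000):
    """Toy lossy encoding: snap a length to one of num_buckets values
    on a logarithmic grid."""
    doc_len = min(max(doc_len, 1), max_len)
    step = math.log(max_len) / (num_buckets - 1)
    bucket = round(math.log(doc_len) / step)
    return math.exp(bucket * step)

if __name__ == '__main__':
    idf, tf, avg_len = 2.0, 3, 500.0
    for length in (37, 420, 1234):
        exact = bm25_term_score(tf, idf, length, avg_len)
        lossy = bm25_term_score(tf, idf, quantize_len(length), avg_len)
        print(f"len={length:5d}  exact={exact:.4f}  lossy={lossy:.4f}")
```

The per-term error is small, but it accumulates over query terms, which is consistent with longer queries being more sensitive to the lossy encoding.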

For this set of experiments, I use subsets of two sizeable TREC collections: ClueWeb09 and ClueWeb12. Each of these subsets (called Category B subsets) comprises about 50 million HTML documents. While the document collections are large, the query (or topic) sets are quite modest. For ClueWeb09, I use the first 500 topics (and the respective relevance judgments) from the Million Query Track. I do not use any further topics, because their relevance judgments are too sparse (many queries have no judgments at all). For ClueWeb12, my original plan was to use a standard NIST TREC collection of queries. Unfortunately, it has merely 100 queries/topics, and with so few queries I do not get anything even close to statistically significant differences. Plus, as we learned from our simulations, such small query sets are quite unreliable.

For these reasons, I use the derivative collection UQV100 created by Peter Bailey and colleagues. Bailey et al. took TREC Web topics (years 2012-2013) and created several query variants of each topic via crowdsourcing. For example, the topic raspberry pi generated variants such as: amazon raspberry pi, buy raspberry pi, cost of raspberry pi, and so on. Then, for each query variant, Bailey et al. generated query responses and judged them. A tricky part here is that they have not released relevance judgments for specific queries. Instead, they merged relevance judgments for all queries within a single topic. I nevertheless assume that all generated queries for the same original topic share the same set of relevance judgments. Implementing this assumption requires duplicating relevance judgments (henceforth, QRELs). Specifically, each query within a topic receives the same set of QRELs (technically, this is done by my script scripts/merge_uqv100.py).
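In a nutshell, the duplication amounts to something like the following simplified sketch; the actual logic lives in scripts/merge_uqv100.py in the repository, and the file names and formats assumed here may differ from the real ones:

```python
# Simplified sketch of QREL duplication: every query variant of a topic
# inherits a copy of the topic-level judgments, written in the standard
# TREC QREL layout: <query_id> 0 <doc_id> <grade>.
from collections import defaultdict

def duplicate_qrels(topic_qrels_path, variants_path, out_path):
    # topic_id -> list of (doc_id, grade) judged for that topic
    topic_qrels = defaultdict(list)
    with open(topic_qrels_path) as f:
        for line in f:
            topic_id, _, doc_id, grade = line.split()
            topic_qrels[topic_id].append((doc_id, grade))

    # Assumed variants file: one "<variant_query_id> <topic_id>" pair per line
    with open(variants_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            variant_id, topic_id = line.split()
            for doc_id, grade in topic_qrels[topic_id]:
                fout.write(f"{variant_id} 0 {doc_id} {grade}\n")
```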

Evaluation results are in the table below. Unlike in the previous post, I use more standard IR metrics, namely ERR@20 and NDCG@20. I also do not measure retrieval time for the Web collections, because their indices do not fit into the memory of my laptop. Timings for the community QA data are given in the previous post.
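For reference, both metrics can be computed from graded relevance judgments roughly as in the sketch below. Gain scaling and cutoff conventions differ between evaluation tools, so this is illustrative rather than the exact evaluation code behind the table:

```python
import math

def ndcg_at_k(grades, ideal_grades, k=20):
    """grades: relevance grades of the ranked documents (top to bottom);
    ideal_grades: all judged grades for the query (for the ideal ranking)."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(ideal_grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

def err_at_k(grades, k=20, max_grade=4):
    """Expected Reciprocal Rank (Chapelle et al.): the user scans the ranked
    list and stops at a document with probability proportional to its grade."""
    err, p_not_stopped = 0.0, 1.0
    for i, g in enumerate(grades[:k]):
        p_rel = (2 ** g - 1) / (2 ** max_grade)
        err += p_not_stopped * p_rel / (i + 1)
        p_not_stopped *= (1 - p_rel)
    return err
```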

                                                NDCG@20    ERR@20
Comprehensive (Yahoo! Answers), 10K queries
  Lucene BM25                                    0.1245    0.0064
  Accurate BM25                                  0.1305    0.0067
  Accuracy gain                                    4.8%      5.4%
  p-value                                         2e-16     6e-13
Stack Overflow, 10K queries
  Lucene BM25                                    0.1118    0.0057
  Accurate BM25                                  0.1200    0.0061
  Accuracy gain                                    7.4%      7.9%
  p-value                                         2e-16     2e-16
ClueWeb09 / Million Query Track, 500 queries
  Lucene BM25                                    0.2621    0.0826
  Accurate BM25                                  0.2699    0.0860
  Accuracy gain                                    3.0%      4.1%
  p-value                                         0.014     0.037
ClueWeb12 / UQV100, 6099 queries
  Lucene BM25                                    0.1604    0.1813
  Accurate BM25                                  0.1638    0.1851
  Accuracy gain                                    2.1%      2.1%
  p-value                                         2e-16      7e-7
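A note on the p-values in the table: one standard way to obtain such values from per-query metric scores is a two-sided paired t-test, sketched below with SciPy; a randomization or sign test over the same per-query differences is an equally common choice.

```python
# One common option for paired per-query significance testing (SciPy assumed).
from scipy import stats

def paired_pvalue(metric_baseline, metric_improved):
    """metric_baseline / metric_improved: per-query metric values, same query order."""
    t_stat, p_value = stats.ttest_rel(metric_improved, metric_baseline)
    return p_value
```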