About five-six years ago, I discovered that a default Lucene BM25 similarity was giving me sub-optimal results, apparently due to a lossy encoding of document lengths (which was a part of Lucene's efficiency trick). I found this when I reimplemented BM25 on my own, but without a lossy document encoding. On my data, the difference was about 10%, which was far from being a trifle. I have run a good number of experiments where this difference was present. It was clearly not a random fluke or mirage. I eventually created a benchmark and published a blog post. I even made some noise on the Lucene dev list and promised to submit a patch. However, this did not happen as I got busy and Lucene changed its internal API.
Recently I was fortunate enough to revisit this problem thanks to Chris Kamphuis, Arjen P. de Vries, and Jimmy Lin who took me aboard their "Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants". They also did most of the work by testing several BM25 variants of which my accurate Lucene similarity was only a small piece. Somewhat surprisingly, the "often maligned approximation of document length" in Lucene performed nearly as well as the accurate similarity. Another result is that there are only very small differences among various BM25 implementations. I think it is an important finding on which I reflect in the very end of the post (please read that last paragraph).
Now, there are two voices in my head: one that "maligns the approximation of the document length" and another that says this approximation is ok. How should we reconcile the voices? Because the scope and the size of the paper did not permit a more thorough experimentation and description, I have carried an additional code analysis that has not been included into the paper. This analysis is below.
My original experiments were run with Lucene 6 (and earlier versions). Lucene 6 does not encode a document length directly. Instead, it approximates the inverse square root of the length. Thus, it introduces an approximation error for basically every possible document length! Lucene 7 supports the old scheme, but already introduces a new encoding of a document length, which stores small numbers (less than 25) exactly and retains four most significant binary digits for large numbers (see my test code), which is basically a variant of sign-free exponent-shifted quarter-precision format (additionally they count only the number of unique terms, which reduces the value of a document length that needs to be encoded). I think that this new approximation scheme is much more accurate .
Thus, I have to disagree a bit with somewhat optimistic conclusions of our paper that it does not matter which BM25 implementations to use. It seems to be true only for sufficiently careful implementations of BM25, including the recent Lucene's one. However, it is also clearly possible to screw up BM25 rather easily.
In conclusion, I would like to note that results of our paper should be treated in a broader context. There is somewhat anecdotal knowledge that various papers reported different effectiveness values for BM25 similarity on identical collections. Some people (including me) tended to think it was due to differences in BM25 implementations. However, the paper by Trotman et al showed that it was likely due to confounding factors such as the choice of lemmatization/stemming, tokenization, stopping, and data cleaning algorithms: Trotman, A., Puurula, A., & Burgess, B. (2014, November). Improvements to BM25 and language models examined. Clearly, our results support the conclusions made by Trotman et al.