









Submitted by srchvrs on Fri, 02/14/2020  08:48
PREAMBLE:When dealing with retrieval, I have been traditionally using TREC NIST evaluation tools (trec_eval and gdeval) for information retrieval. Despite these tools are old, there has been a good amount of effort invested into making them right. Unfortunately, you have to call them as an external tool. Your program forks and runs out of memory. Despite Linux fork is lazy and does not really copy memory, it still happens. It happens even if you use the posix_spawn function and Python's spawntype creation of new processes: multiprocessing.set_start_method('spawn')
The issue: I decided to switch to scikitlearn or a similarlyinterface code (e.g., MatchZoo classes) to compute the IR metrics. I crosscompared results and I have come to the conclusion that very likely all scikitlearnlike packages are fundamentally broken when it comes to computing the mean average precision (MAP) and the normalized discounted cumulative gain NDCG
To compute both of the metrics, one needs two things:
 The list of relevant documents, where the relevance label can be binary or graded
 The list of scored/ranked documents.
Ideally, an evaluation tool could ingest this data directly. However, sklearn and other libraries cut the corner by accepting two arrays: y_score and y_true. Effectively each document is paired with its relevance grade, see, e.g., scikitlearn MAP.
Unfortunately, such an evaluation ignores all relevant documents, which are not returned by the system. In that, both NDCG and MAP have a normalizing factor that depends on the number of relevant documents. For example, in my understanding, if your system finds only 10% of all relevant documents, the scikitlearn MAP would produce a 10x larger MAP score compared to NIST trec_eval (and the Wikipedia formula). NDCG is still affected by this issue but to a lesser degree, because scores for omitted relevant documents will be heavily discounted.
I have created the notebook to illustrate this issue using onequery example and the MAP metric. By the way, for some reason, scikitlearn refuses to compute NDCG on this data and fails with a weird error.
Related reading: MAP is often (but not always) a meaningless metric if you do intrinsic evaluation of the kNN search.
Submitted by srchvrs on Tue, 02/04/2020  20:27
About fivesix years ago, I discovered that a default Lucene BM25 similarity was giving me suboptimal results, apparently due to a lossy encoding of document lengths (which was a part of Lucene's efficiency trick). I found this when I reimplemented BM25 on my own, but without a lossy document encoding. On my data, the difference was about 10%, which was far from being a trifle. I have run a good number of experiments where this difference was present. It was clearly not a random fluke or mirage. I eventually created a benchmark and published a blog post. I even made some noise on the Lucene dev list and promised to submit a patch. However, this did not happen as I got busy and Lucene changed its internal API.
Recently I was fortunate enough to revisit this problem thanks to Chris Kamphuis, Arjen P. de Vries, and Jimmy Lin who took me aboard their "Which BM25 Do You Mean? A LargeScale Reproducibility Study of Scoring Variants". They also did most of the work by testing several BM25 variants of which my accurate Lucene similarity was only a small piece. Somewhat surprisingly, the "often maligned approximation of document length" in Lucene performed nearly as well as the accurate similarity. Another result is that there are only very small differences among various BM25 implementations. I think it is an important finding on which I reflect in the very end of the post (please read that last paragraph).
Now, there are two voices in my head: one that "maligns the approximation of the document length" and another that says this approximation is ok. How should we reconcile the voices? Because the scope and the size of the paper did not permit a more thorough experimentation and description, I have carried an additional code analysis that has not been included into the paper. This analysis is below.
My original experiments were run with Lucene 6 (and earlier versions). Lucene 6 does not encode a document length directly. Instead, it approximates the inverse square root of the length. Thus, it introduces an approximation error for basically every possible document length! Lucene 7 supports the old scheme, but already introduces a new encoding of a document length, which stores small numbers (less than 25) exactly and retains four most significant binary digits for large numbers (see my test code), which is basically a variant of signfree exponentshifted quarterprecision format (additionally they count only the number of unique terms, which reduces the value of a document length that needs to be encoded). I think that this new approximation scheme is much more accurate .
Thus, I have to disagree a bit with somewhat optimistic conclusions of our paper that it does not matter which BM25 implementations to use. It seems to be true only for sufficiently careful implementations of BM25, including the recent Lucene's one. However, it is also clearly possible to screw up BM25 rather easily.
In conclusion, I would like to note that results of our paper should be treated in a broader context. There is somewhat anecdotal knowledge that various papers reported different effectiveness values for BM25 similarity on identical collections. Some people (including me) tended to think it was due to differences in BM25 implementations. However, the paper by Trotman et al showed that it was likely due to confounding factors such as the choice of lemmatization/stemming, tokenization, stopping, and data cleaning algorithms: Trotman, A., Puurula, A., & Burgess, B. (2014, November). Improvements to BM25 and language models examined. Clearly, our results support the conclusions made by Trotman et al.
Submitted by srchvrs on Mon, 02/03/2020  14:59
Submitted by srchvrs on Sat, 02/01/2020  19:46
About one year ago, a Quanta magazine published an article on the universal method to sort complex information. It is concerned with a theoretical work on solving a nearestneighbor search problem for various metric distances. Even more precisely, this work attempts to answer a question about what kind of a distance metric permits an efficient nearest neighbor search. Though this is surely an important and solid theoretical work, the Quanta magazine completely ignores the fact that from the practical perspective this problem has satisfactory solutions. Not only existing methods work for metric distances, good results can often be obtained for weird nonmetric dissimilarities (or distances as we like to call them). Sometimes, they work when these distances are even nonsymmetric!
Are these methods universal? They certainly are not, but nothing is universal in nearestneighbor search. There are challenging data sets, which cannot be searched efficiently even with the Euclidean distance! This issue as well as the history of nonmetric kNN search is briefly surveyed in my thesis. However, in some cases we can do really well by using a treebased or a neighborhoodgraph based approaches. In my thesis, I carried out a series of evaluations to verify this. I am pleased that all of the main results are now formally published, in particular, including two recent SISAP papers:

Boytsov, L., Nyberg. E., 2019. Accurate and Fast Retrieval for Complex Nonmetric Data via Neighborhood Graphs.

Boytsov, L., Nyberg. E., 2019. Pruning Algorithms for LowDimensional Nonmetric kNN Search: A Case Study.
I think these papers are concerned with important research questions and I am going to briefly highlight results.
Neighborhoodgraphs is a class of newold methods, which delivers state of the art results on many data sets. However, little is known how they behave on nonsymmetric distances. We were probably the first to test them on nonsymmetric distances such as KLdivergence [1, 2]. Turns out, however, these tests relied on data sets that were only mildly nonsymmetric. In the followup work, we have really stress tested them and discovered the following:

It is never a good idea to deal with nonsymmetric distances by symmetrizing the distance first and using the symmetrized distance as a part of a filterandrefine pipeline.

However, it is not even necessary. In many cases, indeed, neighborhoodgraphs deliver stateoftheart performance out of the box.

Importantly, one has to be consistent in the order of distance function arguments (although there are exceptions as I describe below). If the indexing procedure relies on a different order (e.g., by mistake), the results could be disastrous (I have made this mistake and it cost me a lot of time).

That said, using a different distance function at index time can produce sometimes better results. Again this is not a universal property. One somewhat obvious choice of possibly better indextime distance function is a symmetrized variant of the original distance. Quite surprisingly, the argumentreversed distance can deliver good results too, but, as I explain above, the results can be disastrous for some other datasets and distances. I think this discovery begs a research question: what is the optimal distancetime function?
Although graphbased retrieval is stateoftheart for highdimensional data it can be an overkill for lowdimensional data, where treebased approaches can work really well. In particular, we compare two approaches to adapt standard metric tree methods to nonmetric similarities. One is the effective piecewiselinear modification of the pruning rule, which we published at NIPS in 2013. In fact, for the Euclidean distance, it is as efficient as the classic projectionbased LSH. However, due to the linear nature of the approximation, it is sometimes not a good fit for nonmetric dissimilarities. In contrast, Tomas Skopal TriGen algorithm can be better in this case.
TriGen is an ingenious algorithm that finds a monotonic distance transformation that makes a dissimilarity look more like metric. However, TriGen has two drawbacks: it does not work out of the box with nonsymmetric distances and its implementation of the distancemodifying transformation can be a bit expensive. What we show is that, perhaps, the best solution is a hybrid: First, we can apply a cheap concave (or near concave) distance transformation such as the square root. Second, we can fit a piecewiselinear decision function for this transformed distance.
In conclusion, I want to emphasize that, although nearestneighbor search has no universal solution, there are a number of working generaldistance approaches. Some good solutions are implemented in NMSLIB, which is the first generic library for metric and nonmetric kNN search.
Submitted by srchvrs on Tue, 01/28/2020  17:00
Eight years ago I started my blog with a post on the origins of dynamic programming. Therein, I argue that the term programming stems from a military definition of the word "program", which simply means planning and logistics. In mathematics, this term was adopted to denote optimization problems and gave rise to several names such as integer, convex, nonlinear, and differentiable programming. I promised to describe how dynamic programming had a somewhat rocky start in computational biology in a followup post, but never delivered on this promise.
It has been a decade of phenomenal success of another programming concept, namely the differential programming. Three neural networks pioneers: Geoffrey Hinton, Yoshua Bengio, and Yann LeCun (with a regretful omission of Jürgen Schmidhuber) received a Turing award for "For conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing." Now it seems to be perfect time to deliver on the promise and wrap up with historical dynamic programming posts.
As I mentioned in my first blog post, dynamic programming is a relatively simple way to solve complex problems through an elegant and efficient recursion. In particular, it is at the heart of evolutionary distances in computational biology and the Levenshtein (also known as an edit) distance in natural language processing. Different as they are, these fields rely on string comparison via variants of an edit distance. The simplest way to compute an unweighted edit distance between strings $a=a_1 a_2 \ldots a_n$ and $b = b_1 b_2 \ldots b_m$ is through a following simple recursion:
$$
d_{i+1,j+1} = \min \left\{
\begin{array}{c}
d_{i,j+1} + 1 \\
d_{i+1,j} + 1 \\
d_{i,j} + (a_{i+1} \ne b_{j+1})\\
\end{array}
\right.
$$
The computational cost is quadratic. Although there are faster averagecase algorithms, there is little hope that the worstcase time is strongly subquadratic.
The formula is simple, but not always immediately obvious. In fact, it took the scientific community almost ten years to fully realize that essentially the same approach provides a solution for several fields: information retrieval, bioinformatics, and speech recognition (see my survey for a list of early works). Furthermore, as I explain below, a renowned mathematician Stanislaw Ulam not only failed to discover the formula, but also failed to recognize the solution when it was presented to him by David Sankoff. Next time when you fail to solve an apparently "simple" dynamic programming puzzle, do not beat yourself up!
The edit distance was first published by Levenshtein (and is often called the Levenshtein distance) in the context of selfcorrecting binary codes (Levenshtein, Vladimir I. "Binary codes capable of correcting deletions, insertions, and reversals." Soviet physics doklady. Vol. 10. No. 8. 1966.). Imagine a noisy channel that modifies binary messages by changing, inserting, and deleting bits occasionally. If such spurious changes are not very frequent, it may still be possible to recover the original message. This is clearly possible but only if the code of one symbol is sufficiently different from other symbol codes (i.e., when the distance is large enough).
Consider an example, where the set of codes contains only two 3bit codes 000 and 111 and the channel cannot modify more than one bit. Then, a noisy version of 000 will always be different from the noisy version of the code 111. Indeed, the noisy version of 000 can have at most one bit equal to one, while the noisy version of 111 always has at least two bits equal to one. By counting the number of unit bits, we can always recover the original code.
Clearly, there is a lot of waste here, because we use only two values out of eight possible ones. However, this is the price we pay to separate the noisy variants of the codes. Levenstein himself was only interested in estimating the amount of waste that is necessary to make the garbled codes separable. He actually proved that the waste would not be that bad: It is possible to use about $2^n/n$ separable code words of length n (bits).
Despite Levenshtein paper is massively cited, I think few people read it, in particular, because it was not possible to access this paper online. For my survey on approximate dictionary searching, I had to visit the actual Russian library to read the paper. It was a bit complicated, because at that time I already resided in the US.
Levenstein apparently did not realize that the distance he introduced could be useful in text processing applications, such as spellchecking, speech recognition, and alignment of DNA (or protein) sequences. In the 60s, the bioinformatics was in its early stage. The structure of the DNA was discovered in 1953, but the first complete gene was not decoded until 1972! Nevertheless, there was apparently a lot of interest in the problem of sequencing and finding similar regions among sequences of different species (note, however, this is not the only goal of finding similar sequences).
The usefulness of the latter method is based on the assumption that similarities in genetic sequences represent common ancestry. Two species start from a common genome and then follow different evolutionary paths, which results in changing certain areas of the original DNA. Yet, most subsequences of the DNA remain very similar. In a simplified model, the Nature "edits" a sequence by randomly deleting, inserting, or modifying certain small areas of the genes. These changes happen with a certain rate: The longer is a time frame, the more changes happen. By measuring the volume of the differences between two species (using some form of the edit distance) we can estimate the time required to evolve from a single common ancestor. (See, e.g., Nei, M., & Zhang, J. (2006). Evolutionary Distance: Estimation for more details.)
Stanislaw Ulam, a famous mathematician who played one of the pivotal roles in the Manhattan project, is often credited with the invention of the evolutionary distance. However, as argued by David Sankoff, he failed to realize that the distance can be computed by a simple dynamic programming algorithm. Turns out that dynamic programming was not so simple after all.
Pages










