
MNIST is super easy and few people know it!

One can never be too surprised by the phenomenal success of the MNIST dataset, which is used in so many image publications. But do people realize how easy this dataset is? One clear measure of hardness is the performance of a simplistic k-NN classifier with the vanilla L2 metric applied directly to pixels. A variant of this measure is the performance of the k-NN classifier combined with some basic unsupervised transformation such as principal component analysis (PCA) or denoising.
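To make the baseline concrete, here is a minimal sketch of a k-NN classifier that uses the L2 metric directly on flattened pixel vectors. The function names and the tiny 2x2 "images" are illustrative assumptions; this is not MNIST data and not the code behind the numbers quoted in this post.

```python
from collections import Counter

def squared_l2(a, b):
    """Squared Euclidean distance between two flattened pixel vectors.
    It ranks neighbors in the same order as the plain L2 distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: squared_l2(train_x[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 2x2 "images" flattened to 4 pixels (hypothetical data, not MNIST).
train_x = [
    [0, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0],  # dark images, label 0
    [9, 9, 9, 9], [9, 8, 9, 9], [8, 9, 9, 9],  # bright images, label 1
]
train_y = [0, 0, 0, 1, 1, 1]

prediction_dark = knn_predict(train_x, train_y, [1, 1, 0, 0])
prediction_bright = knn_predict(train_x, train_y, [8, 8, 9, 9])
print(prediction_dark, prediction_bright)
```

On real data one would load the actual images and use an optimized library, but the classification rule itself is no more complicated than this.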

I created a small poll to assess what people think about MNIST's k-NN search accuracy. I thank everybody for participating. Fortunately, more than one hundred people responded (most of them, I assume, are machine learning practitioners and enthusiasts), so I think the results are rather reliable.

In summary, nearly 40% of the respondents think that the accuracy would be at most 80%, and 45% think the accuracy is 95%. Unfortunately, I did not create an option for 90%; I think it would have had quite a few responses as well. That said, the vanilla k-NN search on pixels has 97% accuracy, and the combination of PCA and the k-NN classifier has nearly 98% accuracy (here is a notebook to back up the 98% claim). In fact, with a bit of additional pre-processing such as deskewing and denoising, one can get nearly 99% accuracy.

It turns out that few people realize how effective the k-NN classifier is on MNIST: only 17% voted for 98%. That said, it does not mean that the k-NN classifier is such a good method overall (it can be good for tabular data, see, e.g., this paper by Shlomo Geva, but not for complex image data: check out, e.g., the numbers for CIFAR and ImageNet). It means, however, that MNIST is very easy. Understandably, people need some toy dataset to play with and to get results quickly. One better alternative is Fashion-MNIST. However, it is not too hard either: a vanilla k-NN classifier has about 85% accuracy, and it is probably possible to push the accuracy close to 90% with a bit of preprocessing. Thus, we may need a comparably small, but much more difficult dataset to replace both of them.



Hello precision my old friend!

PREAMBLE: When dealing with retrieval, I have traditionally been using the TREC NIST evaluation tools (trec_eval and gdeval). Although these tools are old, a good amount of effort has been invested into making them right. Unfortunately, you have to call them as external tools: your program forks and runs out of memory. Although the Linux fork is lazy and does not really copy memory, it still happens. It happens even if you use the posix_spawn function and Python's spawn-type creation of new processes: multiprocessing.set_start_method('spawn')

The issue: I decided to switch to scikit-learn or code with a similar interface (e.g., MatchZoo classes) to compute the IR metrics. I cross-compared the results and came to the conclusion that very likely all scikit-learn-like packages are fundamentally broken when it comes to computing the mean average precision (MAP) and the normalized discounted cumulative gain (NDCG).

To compute both of the metrics, one needs two things:

  1. The list of relevant documents, where the relevance label can be binary or graded
  2. The list of scored/ranked documents.

Ideally, an evaluation tool could ingest this data directly. However, sklearn and other libraries cut corners by accepting two arrays: y_score and y_true. Effectively, each scored document is paired with its relevance grade, see, e.g., scikit-learn MAP.

Unfortunately, such an evaluation ignores all relevant documents that are not returned by the system. This matters because both NDCG and MAP have a normalizing factor that depends on the number of relevant documents. For example, in my understanding, if your system finds only 10% of all relevant documents, the scikit-learn MAP would produce a 10x larger MAP score compared to NIST trec_eval (and the Wikipedia formula). NDCG is also affected by this issue, but to a lesser degree, because scores for omitted relevant documents would be heavily discounted.
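The effect of the normalizing factor can be sketched with a hand-written average precision for a single query. This is an illustration under assumptions, not a call into scikit-learn or trec_eval: the function name, the flag `normalize_by_all_relevant`, and the document identifiers are all hypothetical. With the flag set, the sum of precisions is divided by the total number of relevant documents; with the flag unset, it is divided only by the number of relevant documents that were actually returned, which mimics an evaluation that never sees the missing ones.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids,
                      normalize_by_all_relevant=True):
    """Average precision for one query (hypothetical helper)."""
    relevant = set(relevant_doc_ids)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # The denominator is the only difference between the two variants.
    denom = len(relevant) if normalize_by_all_relevant else hits
    return precision_sum / denom if denom else 0.0

# Ten relevant documents exist, but the system returns only one of them.
relevant = [f"d{i}" for i in range(10)]
ranked = ["d0", "x1", "x2"]

ap_all_relevant = average_precision(ranked, relevant)
ap_returned_only = average_precision(ranked, relevant,
                                     normalize_by_all_relevant=False)
print(ap_all_relevant, ap_returned_only)
```

Here the system finds 10% of the relevant documents, and the second variant reports a score ten times larger than the first.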

I have created a notebook to illustrate this issue using a one-query example and the MAP metric. By the way, for some reason, scikit-learn refuses to compute NDCG on this data and fails with a weird error.

Related reading: MAP is often (but not always) a meaningless metric if you do intrinsic evaluation of the k-NN search.



Accurate Lucene BM25: Redux

About five or six years ago, I discovered that the default Lucene BM25 similarity was giving me sub-optimal results, apparently due to a lossy encoding of document lengths (which was a part of Lucene's efficiency trick). I found this when I reimplemented BM25 on my own, but without the lossy document-length encoding. On my data, the difference was about 10%, which was far from being a trifle. I ran a good number of experiments where this difference was present, so it was clearly not a random fluke or mirage. I eventually created a benchmark and published a blog post. I even made some noise on the Lucene dev list and promised to submit a patch. However, this did not happen as I got busy and Lucene changed its internal API.

Recently, I was fortunate enough to revisit this problem thanks to Chris Kamphuis, Arjen P. de Vries, and Jimmy Lin, who took me aboard their paper "Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants". They also did most of the work by testing several BM25 variants, of which my accurate Lucene similarity was only a small piece. Somewhat surprisingly, the "often maligned approximation of document length" in Lucene performed nearly as well as the accurate similarity. Another result is that there are only very small differences among various BM25 implementations. I think it is an important finding, on which I reflect at the very end of the post (please read that last paragraph).

Now, there are two voices in my head: one that "maligns the approximation of the document length" and another that says this approximation is ok. How should we reconcile the voices? Because the scope and the size of the paper did not permit more thorough experimentation and description, I have carried out an additional code analysis that has not been included in the paper. This analysis is below.

My original experiments were run with Lucene 6 (and earlier versions). Lucene 6 does not encode a document length directly. Instead, it approximates the inverse square root of the length. Thus, it introduces an approximation error for basically every possible document length! Lucene 7 supports the old scheme, but already introduces a new encoding of the document length, which stores small numbers (less than 25) exactly and retains the four most significant binary digits for large numbers (see my test code). This is basically a variant of a sign-free, exponent-shifted quarter-precision format. Additionally, they count only the number of unique terms, which reduces the value of the document length that needs to be encoded. I think that this new approximation scheme is much more accurate.
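To get a feeling for how lossy "keep the four most significant binary digits" is, here is a small sketch. It is not Lucene's encoder: the function `keep_top_bits` and its parameters are hypothetical, and the threshold of 25 is simply taken from the description above. The sketch truncates an integer length to its leading bits and measures the worst relative error over a range of lengths.

```python
def keep_top_bits(n, bits=4, exact_below=25):
    """Approximate a non-negative integer by its `bits` most significant
    binary digits; values below `exact_below` are kept exactly.
    (Hypothetical helper, not Lucene's actual encoding routine.)"""
    if n < exact_below:
        return n
    shift = n.bit_length() - bits
    return (n >> shift) << shift  # zero out the low-order bits

approx_10 = keep_top_bits(10)      # small value, kept exactly
approx_100 = keep_top_bits(100)    # 0b1100100 -> 0b1100000
approx_1000 = keep_top_bits(1000)  # 0b1111101000 -> 0b1111000000

# Worst relative error over a range of plausible document lengths.
worst_relative_error = max(
    (n - keep_top_bits(n)) / n for n in range(1, 100_000)
)
print(approx_10, approx_100, approx_1000, worst_relative_error)
```

With four retained bits, truncation loses less than one eighth of the value, so the relative error of the stored length stays below 12.5% in this sketch.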

Thus, I have to disagree a bit with the somewhat optimistic conclusion of our paper that it does not matter which BM25 implementation one uses. It seems to be true only for sufficiently careful implementations of BM25, including the recent Lucene one. However, it is also clearly possible to screw up BM25 rather easily.

In conclusion, I would like to note that the results of our paper should be treated in a broader context. There is somewhat anecdotal knowledge that various papers reported different effectiveness values for the BM25 similarity on identical collections. Some people (including me) tended to think it was due to differences in BM25 implementations. However, the paper by Trotman et al. showed that it was likely due to confounding factors such as the choice of lemmatization/stemming, tokenization, stopping, and data cleaning algorithms: Trotman, A., Puurula, A., & Burgess, B. (2014, November). Improvements to BM25 and language models examined. Clearly, our results support the conclusions made by Trotman et al.



Bringing a large Russian QA data set to light

"It is achingly apparent that an overwhelming amount of research in speech and language technologies considers exactly one human language: English." (Kyle Gorman) For this reason, Emily Bender has been famously encouraging people to (1) explicitly name the languages they work on and (2) do more work on non-English data. This has become known as the Bender rule.

Despite the importance of multilingual NLP, frankly speaking, it has been difficult to find an opportunity to work on non-English data (in the previous decade my only major opportunity was a stint on cross-lingual metaphor detection). I am therefore very pleased to have recently participated in bringing to light a large Russian question-answering/reading-comprehension (QA) data set, SberQuAD, which was created similarly to SQuAD.

I have been helping my co-authors Pavel Efimov and Pavel Braslavski (who did nearly all the work) to analyze and describe this data set. We have conducted a very thorough analysis and evaluated several powerful models. The full analysis is available online, but here I would like to highlight the following:

SberQuAD was created similarly to Stanford SQuAD. Yet, despite the similarities, all the models perform worse on SberQuAD than on SQuAD, which can be attributed to having only a single answer variant and fewer answers that are named entities. Many answers in SberQuAD still contain an entity, but it is normally only a part of the answer. This stands in contrast to SQuAD, where roughly half of the answers are named entities.



Universal Methods to Sort Complex Information Tested

About one year ago, Quanta Magazine published an article on the universal method to sort complex information. It is concerned with a theoretical work on solving the nearest-neighbor search problem for various metric distances. More precisely, this work attempts to answer the question of what kind of distance metric permits an efficient nearest-neighbor search. Though this is surely an important and solid theoretical work, the Quanta article completely ignores the fact that, from the practical perspective, this problem has satisfactory solutions. Not only do existing methods work for metric distances, but good results can often be obtained for weird non-metric dissimilarities (or distances, as we like to call them). Sometimes, they work even when these distances are non-symmetric!

Are these methods universal? They certainly are not, but nothing is universal in nearest-neighbor search. There are challenging data sets that cannot be searched efficiently even with the Euclidean distance! This issue, as well as the history of non-metric k-NN search, is briefly surveyed in my thesis. However, in some cases we can do really well by using tree-based or neighborhood-graph-based approaches. In my thesis, I carried out a series of evaluations to verify this. I am pleased that all of the main results are now formally published, including, in particular, two recent SISAP papers:

  1. Boytsov, L., Nyberg, E., 2019. Accurate and Fast Retrieval for Complex Non-metric Data via Neighborhood Graphs.

  2. Boytsov, L., Nyberg, E., 2019. Pruning Algorithms for Low-Dimensional Non-metric k-NN Search: A Case Study.

I think these papers are concerned with important research questions, and I am going to briefly highlight the results.

Neighborhood graphs are a class of new-old methods, which deliver state-of-the-art results on many data sets. However, little is known about how they behave on non-symmetric distances. We were probably the first to test them on non-symmetric distances such as KL-divergence [1, 2]. It turns out, however, that these tests relied on data sets that were only mildly non-symmetric. In the follow-up work, we have really stress-tested them and discovered the following:

  1. It is never a good idea to deal with non-symmetric distances by symmetrizing the distance first and using the symmetrized distance as a part of a filter-and-refine pipeline.

  2. However, it is not even necessary. In many cases, indeed, neighborhood graphs deliver state-of-the-art performance out of the box.

  3. Importantly, one has to be consistent in the order of distance function arguments (although there are exceptions as I describe below). If the indexing procedure relies on a different order (e.g., by mistake), the results could be disastrous (I have made this mistake and it cost me a lot of time).

  4. That said, using a different distance function at index time can sometimes produce better results. Again, this is not a universal property. One somewhat obvious choice of a possibly better index-time distance function is a symmetrized variant of the original distance. Quite surprisingly, the argument-reversed distance can deliver good results too, but, as I explain above, the results can be disastrous for some other data sets and distances. I think this discovery raises a research question: what is the optimal index-time distance function?
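The notions used in the list above (a non-symmetric distance, its argument-reversed variant, and a symmetrized variant) can be sketched with the KL-divergence. This is a minimal illustration, not NMSLIB code: the helper names are hypothetical, averaging is only one of several possible ways to symmetrize, and the two toy distributions are made up.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reversed_args(dist):
    """The same distance with the order of arguments swapped."""
    return lambda x, y: dist(y, x)

def symmetrized_avg(dist):
    """One possible symmetrization: the average of both argument orders."""
    return lambda x, y: 0.5 * (dist(x, y) + dist(y, x))

# Two hypothetical distributions over three outcomes.
p = [0.9, 0.05, 0.05]
q = [0.4, 0.3, 0.3]

forward = kl_divergence(p, q)
backward = reversed_args(kl_divergence)(p, q)  # equals KL(q || p)
symmetric = symmetrized_avg(kl_divergence)(p, q)
print(forward, backward, symmetric)
```

Because the forward and the backward values differ, swapping the argument order by mistake at index time amounts to indexing with a different distance function than the one used at query time.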

Although graph-based retrieval is state-of-the-art for high-dimensional data, it can be overkill for low-dimensional data, where tree-based approaches can work really well. In particular, we compare two approaches to adapting standard metric tree methods to non-metric similarities. One is the effective piecewise-linear modification of the pruning rule, which we published at NIPS in 2013. In fact, for the Euclidean distance, it is as efficient as the classic projection-based LSH. However, due to the linear nature of the approximation, it is sometimes not a good fit for non-metric dissimilarities. In contrast, Tomas Skopal's TriGen algorithm can be better in this case.

TriGen is an ingenious algorithm that finds a monotonic distance transformation that makes a dissimilarity look more like a metric. However, TriGen has two drawbacks: it does not work out of the box with non-symmetric distances, and its implementation of the distance-modifying transformation can be a bit expensive. What we show is that, perhaps, the best solution is a hybrid. First, we can apply a cheap concave (or nearly concave) distance transformation such as the square root. Second, we can fit a piecewise-linear decision function for this transformed distance.
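Why a concave transformation helps can be seen on a textbook case: the squared Euclidean distance violates the triangle inequality, whereas its square root (the ordinary Euclidean distance) does not. The sketch below illustrates only this first step of the hybrid; it is neither TriGen nor the piecewise-linear pruning rule, and the helper names and the four points are illustrative assumptions.

```python
import math
from itertools import permutations

def squared_euclidean(x, y):
    """A simple non-metric dissimilarity: the squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def sqrt_transformed(dist):
    """Apply the concave square-root transformation to a distance."""
    return lambda x, y: math.sqrt(dist(x, y))

def count_triangle_violations(points, dist, eps=1e-12):
    """Count ordered triples (x, y, z) with d(x, z) > d(x, y) + d(y, z)."""
    return sum(
        1
        for x, y, z in permutations(points, 3)
        if dist(x, z) > dist(x, y) + dist(y, z) + eps
    )

# Four hypothetical points on a line.
points = [(0.0,), (1.0,), (2.0,), (4.0,)]

violations_before = count_triangle_violations(points, squared_euclidean)
violations_after = count_triangle_violations(
    points, sqrt_transformed(squared_euclidean)
)
print(violations_before, violations_after)
```

For example, for the points 0, 1, and 2 the squared distance from 0 to 2 is 4, which exceeds 1 + 1; after the square root, the same comparison becomes 2 versus 1 + 1 and the inequality holds.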

In conclusion, I want to emphasize that, although nearest-neighbor search has no universal solution, there are a number of working general-distance approaches. Some good solutions are implemented in NMSLIB, which is the first generic library for metric and non-metric k-NN search.


