A Non-Metric Space Library (NMSLIB) is an ongoing effort to build an effective toolkit for searching in generic metric and non-metric spaces. It originated as a personal project of my co-author Bileg (Bilegsaikhan Naidan), who is also a major contributor. In the spring of 2013, we needed to evaluate similarity search methods for non-metric spaces. Yet, no good software suite was available (nor was there a collection of benchmarks). This is why we created one by expanding Bileg's codebase and making it public.
There are important contributions from other people as well. In particular, I would like to mention Yury Malkov and David Novak. Yury has contributed a blazingly fast variant of the proximity/neighborhood graph . David has proposed a way to greatly improve the quality of pivot-based projections for sparse vector spaces.
We presented our main experimental results at NIPS'13 & VLDB'15 and our engineering work at SISAP'13. Some of the experimental results are available in our VLDB'15 report. We also participated in a public evaluation, codenamed ann-benchmarks in 2015 and 2016. Both times our graph-based implementations were quite competitive against state-of-the-art methods designed specifically for the Euclidean and the angular distance: random-projection KD-trees (FLANN and Annoy) and LSH (MIT's FALCONN). Please, check our May 2016 results in the end of the post. It may very well be new state-of-the-art in kNN search.
While our methods are fast in the well-studied Euclidean space equipped with scalar product, they also generalize well to non-Euclidean, non-metric, and non-vector spaces. For example, check our VLDB'15 evaluation. Fast search times often come at the expense of longer indexing times. From this perspective, LSH and random projection methods do outperform NMSLIB. However, the two approaches are not mutually exclusive. For example, it may be possible to rely on recent major advances in LSH (the efficient FALCONN library) to build proximity graphs in the Euclidean space faster.
From the engineering perspective, our implementation is mature. Nearly all methods can run in a multi-threaded mode. We have Python bindings and a multi-threaded query server. In particular, there is a native Java client. In general, we steadily drift in the direction of a production-friendly library. However, our main focus is on the ease of experimentation and comparison to the existing baselines. Here, our work is far from being finished. In particular we now attempt to apply our toolkit to NLP problems. I summarized some thoughts on this topic in a talk at ML lunch. Currently, we work on replacing a term-based retrieval in the ad hoc search task. Some preliminary results will appear in Leo's thesis proposal.
Appendix: our 2016 ann-benchmark results. It is our own re-run using only the best methods on the c4.2xlarge Amazon EC2 instance. The results are very similar to the results obtained by Eric in May-June 2016, but there are some minor differences due to, e.g., selecting a different random subset of queries:
1.19M 100d GloVe, cosine similarity.
1M 128d SIFT features, Euclidean distance: