log in | contact | about 

What is a non-metric space library?

A Non-Metric Space Library (NMSLIB) is an ongoing effort to build an effective toolkit for searching in generic metric and non-metric spaces. It originated as a personal project of my co-author Bileg (Bilegsaikhan Naidan), who is also a major contributor. In the spring of 2013, we needed to evaluate similarity search methods for non-metric spaces. Yet, no good software suite was available (nor was there a collection of benchmarks). This is why we created one by expanding Bileg's codebase and making it public.

There are important contributions from other people as well. In particular, I would like to mention Yury Malkov and David Novak. Yury has contributed a blazingly fast variant of the proximity/neighborhood graph . David has proposed a way to greatly improve the quality of pivot-based projections for sparse vector spaces.

We presented our main experimental results at NIPS'13 & VLDB'15 and our engineering work at SISAP'13. Some of the experimental results are available in our VLDB'15 report. We also participated in a public evaluation, codenamed ann-benchmarks in 2015 and 2016. Both times our graph-based implementations were quite competitive against state-of-the-art methods designed specifically for the Euclidean and the angular distance: random-projection KD-trees (FLANN and Annoy) and LSH (MIT's FALCONN). Please, check our May 2016 results in the end of the post. It may very well be new state-of-the-art in kNN search.

While our methods are fast in the well-studied Euclidean space equipped with scalar product, they also generalize well to non-Euclidean, non-metric, and non-vector spaces. For example, check our VLDB'15 evaluation. Fast search times often come at the expense of longer indexing times. From this perspective, LSH and random projection methods do outperform NMSLIB. However, the two approaches are not mutually exclusive. For example, it may be possible to rely on recent major advances in LSH (the efficient FALCONN library) to build proximity graphs in the Euclidean space faster.

From the engineering perspective, our implementation is mature. Nearly all methods can run in a multi-threaded mode. We have Python bindings and a multi-threaded query server. In particular, there is a native Java client. In general, we steadily drift in the direction of a production-friendly library. However, our main focus is on the ease of experimentation and comparison to the existing baselines. Here, our work is far from being finished. In particular we now attempt to apply our toolkit to NLP problems. I summarized some thoughts on this topic in a talk at ML lunch. Currently, we work on replacing a term-based retrieval in the ad hoc search task. Some preliminary results will appear in Leo's thesis proposal.

Appendix: our 2016 ann-benchmark results. It is our own re-run using only the best methods on the c4.2xlarge Amazon EC2 instance. The results are very similar to the results obtained by Eric in May-June 2016, but there are some minor differences due to, e.g., selecting a different random subset of queries:

1.19M 100d GloVe, cosine similarity.
1M 128d SIFT features, Euclidean distance:

How to declare a constant reference in C++ (not really)

As we may remember, in C++ there are two types of constant pointers. The pointer of the first type (the most common one) can be changed, but not the memory it points to:

  1. const int * const_mem = ... ;
  2. *const_mem = 3; // compile error
The constant pointer of the second type is basically a reference and it cannot be changed, but you can still change respective memory:
  1. int * const const_ptr = ... ;
  2. *const_ptr = 3; // fine!
  3. const_ptr++; // compile error
Of course, you can define a constant pointer to constant memory as well:
  1. const int * const const_ptr_mem = ... ;
  2. *const_ptr_mem = 3; // compile error!
  3. const const_ptr_mem++; // compile error

References, however, are constant by design. You can assign reference a value only once. You cannot change the reference value afterwards! Thus, references are basically constant non-null pointers. Turns out that you can still define a constant reference in C++:

  1. int const & const_ref = 3;
Well, why would such non-sense thing be possible? The answer is that it is not. C and C++ have an extremely quirky way of declaring complex types with complicated rules, which are applied basically in a inside-out right-to-left fashion. Thus, in the previous declaration const still applies to int rather than to int&. In other words, the latter declaration is equivalent to:
  1. const int & const_mem_ref = 3;
To declare a true constant reference, which is unsurprisingly illegal, you need to place the modifier const between the '&' and the variable name:
  1. int & const const_ref = 3;

Bottom line? Hopefully, reading this short note will help one reduce confusion in the future. As usual, simple illustration code is available.

GCC disables isnan and isinf when compiling with -ffast-math flag

This short note is just a reminder that GCC totally ignores functions isinf and isnan when you compile your code with -ffast-math option. The demo code can be found here. One should also be aware that -ffast-math is enabled by a commonly used option -Ofast, but not by -ON option, where N is a number. I also wrote custom checking functions that do not have such problem.

A surprising novel stopword that appears if you use Stanford NLP tokenizer

I recently learned a new stopword that seems to be missing from most of the standard lists of stopwords (for example, it is not on the list of the Lemur/Indri toolkit), which likely means it is pretty novel to the IR community. This stop word is a simple three letter combination: n't. How does it arise? Well, it is a result of tokenization of contractions such as can't or aren't. But don't blindly trust my words, check the tokenization results yourself, e.g., using the following sentences (as a reminder this can be done using an online Stanford tool):

I ain't interested in this.
I can't attend this conference.
Aren't you hungry?
Don't trust me, verify!

Well, this may be correct linguistically, but this is not something the IR community is fully aware of. In particular, a stopword list may contain full contractions such as can't, ain't or don't, but not the suffix n't! If you work with a text where contractions are often, there you go, you have a new stop word! Inclusion of stop words into a query may not necessarily have effect on accuracy, but it will certainly hurt efficiency. BTW, other contractions may also produce interesting tokens, e.g., gonna is tokenized into gon and na. Yet, tokens like na seem to be far less frequent.

Why SAX sucks and XML iterator rulez when you have to parse Wikipedia or a similar collection

When it comes to data analysis data preparation is the most time-consuming task. According to one survey it takes 80% of time. For those who deal with semi-structured text collections such as Wikipedia, one of the most annoying problem is parsing. I am not talking about splitting documents into sentences, obtaining POS tags, and dependency trees. I mean a mundane task of extracting, e.g., Wikipedia title and text. Somewhat unexpectedly, this can be quite a pain in the ass.

Many of the text collections that I deal with have two things in common: (1) they are stored in XML format and (2) they have repeating entries of the same structure enclosed within a pair of unique tags (i.e., tags do not repeat inside the entry itself). In the case of Wikipedia, an entry is a Wikipedia article surrounded by the tags <page> and </page>. Because it is a large XML document, one has typically to resort to using an event-driven method called SAX.

Consider an example of such parsing code written by my fellow student Di Wang. As you can see, an event-driven approach is not easy to implement. A SAX parser tells you a few things like when it encounters a starting tag and everything else is your own headache. Basically, you need to keep some sort of a state variable and keep track of opening/closing tags. Not only is this tedious, but it is also error prone and fragile. You change the format of the document a bit and your code may stop working.

It would be a long post to explain everything what I hate about SAX parsing, but let me simply state that it sux in my opinion. What I would prefer instead is to parse everything using a DOM parser. Then, accessing necessary nodes would be a walk in the park. I do not have to care about parsing details, I can use things like XSLT and all sort of useful helper functions that work with an existing DOM tree. Buuuut, this approach is extremely memory inefficient.

Instead it would be nice to have something like an XML iterator that would go over the list of similarly-structured entities, parse one entry (e.g., a Wikipedia article) at a time, and generate a DOM tree only for this entry. How does one implement such a thing? Recall that each entry is enclosed by the pair of unique tags. Thus we can find the start/end of each entry and parse one entry using a DOM parser. Of course, there are some subtleties to be taken care of. For example, the enclosing tags may occasionally have attributes and document entries may have, e.g., CDATA sections. However, it should not be too complicated to implement such functionality.

This is exactly what I did when I got tired of using pesky SAX parsers. I have been using my "XML iterator" implementation for more than a year, but only recently did I extract the code so it can be used in a standalone fashion. The repository is on GitHub. It contains an XML iterator class as well as a Wikipedia parsing example. It can be executed by calling a script sample_run.sh. The code is in Java (8+). Feel free to (dis)like the code and send me pull requests should you find any problems.

The XML iterator does not do any deep XML parsing. It only extracts the text of document entries (one at a time). An entry should be enclosed by the unique tag. This means that the tag cannot be reused inside the document entry. On obtaining the next entry, you parse it using a DOM parser of your choice. You do not have to use the same DOM parser as I did. You can process DOM trees in a more elegant way than I did. For example, for complex documents, you can use XSLT/XPATH. To conclude I note that this approach is reasonably efficiently (and uses little memory), but it is not as efficient as the SAX parser. So, if the parsing speed is of paramount importance (which I doubt), then SAX is still your best friend.


Subscribe to RSS - blogs