log in | about 

Using Nearest-Neighbor Search in Machine Learning and Natural Language Processing

We have been working on improving our Non-Metric Space Library (NMSLIB), a toolkit for searching in generic spaces. First we carried out a new evaluation for a reasonably diverse data set. The results recently appeared in PVLDB 15.

Second, we participated in a public evaluation. The results confirmed that our implementations are quite competitive. More specifically, the small-world graph approach proposed by Malkov et al. fared very well against FLANN and kgraph.

However, our work is far from being finished. We may now attempt to apply our toolkit to NLP problems. I summarized thoughts on this topic in a talk at ML lunch.

Equally important we try to make our toolkit easier to use. Because we originally cared mostly about efficiency of experimentation and publishing, a few important features are missing. We are now trying to fill the gap.

In addition, we plan to carry out more comprehensive evaluations that would allow us to better understand the problem at hand as well as to devise methods that work well for a broader class of non-metric spaces.

There is no love between academy and industry!

Before the keynote talk at Hawaii, the chair Volker Markl greeted the speaker Juan Loaiza (Senior Vice President of Systems Technology at Oracle) with a Hawaiian lei. A lei is traditionally accompanied with a kiss. The day before, a Turing award winner Mike Stonebraker was greeted in exactly this way. So, Juan Loaiza jokingly reminded that Volker was supposed to kiss him. After Volker ignored the comment, Juan summarized: there is no love between academy and industry.

Anti-advice: Don't tune your BM25 coefficients!

A short anti-advice on tuning: Please, never, ever tune the BM25 coefficients or coefficients of any other retrieval model. Just stick to the values that worked well for "other" data sets. Otherwise, your numbers may be improved by as much as 5%.

Dictionary-less spellchecking and subword language models

It was asked on Quora about dictionary-less spellchecking (this may require subword language models). Here is my brief answer (feel free to vote on Quora).

In the early days, memory was limited, therefore people tried to avoid storing comprehensive dictionaries. Errors could have been detected by finding substrings that looked "unusual" for a given language. For example, if you compile a list of common trigrams, you could flag words containing trigrams that are off the list (see, e.g., Spellchecking by Computer by R. Mitton).

English has little morphology. In morphologically productive languages, it was often possible to compile a list of word generation rules. For example, given a root form of a verb, it would be possible to obtain all possible inflections for singular and plural forms in the simple present and other tenses. This trick provides some compaction in the case of English, however, many more verb forms can be generated in, e.g., Russian (i.e., a more compact representation is possible compared to the comprehensive dictionary that includes all word forms.).

More sophisticated checking can be done in the case of Agglutinative languages (e.g., Turkish, Finish), where a single word can be equivalent to a full-fledged sentence in English. Agglutinative langauges are typically regular: there are rules to add affixes to roots. Regular does not necessarily mean simple, however. For example, in Turkish there are all kind of suffix and root deformations that occur when word parts are assembled (see this paper for details: Design and implementation of a spelling checker for Turkish by A Solak, K Oflazer).

Perhaps, most relevant to the question are sub-word level language models (including, of course, statistical character-level models). Previously, modelling was done using n-gram language models. As pointed out recently by Yoav Goldberg, these models are quite powerful: The unreasonable effectiveness of Character-level Language Models (and why RNNs are still cool).

However, Recurrent Neural Networks (RNNs) seem to be a more effective tool. A couple more relevant references on RNNs for language modelling:

  2. Generating Sequences With Recurrent Neural Networks by Alex Graves

What these models do: they learn how likely is a combination of certain characters or n-grams in a real word. Errors are, therefore, detected by finding character sequences that looked "unusual" for a given language (unusual means the character sequence didn't match training data well).

The links posted by Wenhao Jiang may also relevant, however, they seem to address a different problem: given an already existing word-level language model find the most probable separation of character (or phoneme) sequences into words. Word segmentation IMHO requires even more data than a dictionary: it requires normally a word-level language model that encompasses the dictionary as a special case. In the case of an out-of-vocabulary (OOV) word, the n-gram model also backs off to using a character-level (or subword-level) language model.

A brief overview of query/sentence similarity functions

I was asked on Quora about similarity functions that could be used to compare two pieces of text, e.g., two sentences, or a sentence and a query. Here is my brief answer (feel free to vote on Quora). All similarity functions that I know can be classified into purely lexical, syntactical, based on logic (deduction), and based on vectorial word representations aka word embeddings.

Purely lexical functions include well-known approaches such as the edit distance, the longest common sequence, the degree of (skipped) n-gram overlap, and tf-idf approaches. These are very well-known measures that can be used either as is or normalized by a sentence/document length. A good overview of these approaches is given in the IBM Watson paper "Textual evidence gathering and analysis" by Murdock et al.

In addition to plain-text document representations, there were attempts to represent documents as weighted vectors of relevant concepts. A similarity can then be computed in multiple ways, e.g., using the dot product. One well-known approach dubbed as Explicit Semantic Analysis (ESA) determines which Wikipedia concepts can be associated with text words and phrases. This technique is known as entity-linking.

A quick note on tf-idf measures. These include many classic query-to-text similarity metrics from the information retrieval domain, such as BM25. In one approach, one sentence or document is considered to be a query and another a document. However, there are also symmetrized methods: see e.g., Whissell, John S., and Charles LA Clarke. "Effective measures for inter-document similarity." . In any case, there should be some corpus (a large set of document), so that reliable corpus statistics can be collected.

Syntactic functions take into account not only the words themselves, but also their syntactic relations. For example, if sentences are similar we may expect that both subjects and verbs are similar to some degree. To be able to use syntactic similarity, one needs to obtain either a syntax or a dependency tree using a special parser. Then, trees can be compared using various algorithms. Some of those algorithms (e.g., tree kernels) are discussed in the above mentioned IBM paper. Another good reference to study is "Tree edit models for recognizing textual entailments, paraphrases, and answers to questions" by Heilman and Smith. It looks like in many cases, comparing complete trees is an overkill. Instead, one can use very small pieces (parent-child nodes) instead. For an example, see the paper "Learning to rank answers to non-factoid questions from web collections" by Surdeanu et al.

Sometimes syntactic similarity functions are called semantic (though semantic involved is typically quite shallow). In particular, one can use semantic role labeling instead of dependency parsing.

Logic approaches don't seem to work for complex texts and domains. Nevertheless, see the above-mentioned IBM paper for some references.

Last, but not least, word embeddings can be quite useful. Given two word embeddings for two words, a similarity can be computed as the cosine similarity function between word vectors (or a different similarity function if it works better than the cosine similarity). To compare complete sentences, two approaches can be used. First, you can use special methods to compute sentence-level embeddings. Second, you can just average individual word embeddings and compare averages. This seem to work quite well. One well-known related reference is "A neural network for factoid question answering over paragraphs" by Iyyer et al. There are at least two ways to generate word embeddings. One is neural networks (see word2vec program) and another is plain vanilla Latent semantic analysis (LSA).

To see which function is best, you need a training and a test set. Ideally, you would compute several similarity functions and apply a machine learning method to compute a fusion score.


Subscribe to RSS - srchvrs's blog