## There is (almost) no difference between the cosine similarity and the Euclidean distance when it comes to normalized word embeddings (or other vectors)

One well-studied NLP task is answering analogy questions. For example: to whom the word woman relates in a similar way as the word man relates to the word king. This is a well-researched topic: there have been a lot of progress made and a human-level performance was reported at least ten years ago[1]. There are also some recent results by Mikolov and colleagues, who used neural-network based word embeddings constructed in a non-supervised fashion [2]. More specifically, neural networks feed on huge amounts of text and spit out real-valued word vectors.

There are interesting linguistic regularities related to these vectors. For example, the vector of man minus the vector of king is approximately equal to the vector of woman minus the vector of king. Therefore, argue Mikolov and colleagues, the word queen may be the most similar word to the following vector: king minus man plus woman. This type of regularity holds for many pairs of analogous words. Therefore, analogy question can be answered by solving a maximization problem, which is equivalent to a kNN search[2]:

$$\textbf{queen} = \mbox{argmax}_w\; \mbox{similarity}(w, \textbf{king}- \textbf{man} + \textbf{woman})$$

The similarity among vectors is computed using the cosine similarity. Levy et al. 2004 introduced a multiplicative maximization objective 3COSMUL[3], which was shown to be better than the cosine similarity. However, Pennington et al. could not reproduce this finding on their data sets[4].

Usually word embeddings are normalized so that their Euclidean distance is equal to one. Then, the cosine similarity is equal to the dot product. What is, perhaps, more interesting here (and few people seem to realize this!) is that in this case the cosine similarity produces the same results as the Euclidean distance. The cosine similarity is not equal to the Euclidean distance, but it is obtained by a monotonic transformation, which is a decreasing function, of the Euclidean distance. For a trivial proof, please, refer to the Wikipedia. What does it mean? Obviously, when two vectors have the largest cosine similarity (i.e., they are nearest neighbors with respect to this similarity metric), the Euclidean distances between them is the smallest. However, as noted by Hamed Zamani, there may be a difference if similarity values are used by downstream applications.

My personal pet peeve in regard to the above-mentioned linguistic regularities is as follows: for many word embeddings, the vector closest to the vector king - man + woman is not a queen, but actually a king! In fact, there seems to be a trend to place words already appearing as a part of the analogy question before true answers! (For more examples of this phenomena, please, see slide 13 of my recent talk at the ML lunch.)

To summarize, indeed we can answer analogy questions by carrying out a knn-search in the space of word embeddings, but we have to ignore words already appearing in the question! BTW, this hack is rarely mentioned and in fact it is omitted from in the original Mikolov et al's NAACL paper [2]. It is also apparently missing in the paper on Glove embeddings [4] (though it is indeed mentioned by Levy et al. [3]). This cost me a couple of hours of scratching my head with subsequent reading the code of word2vec when I tried to obtain analogy-question answers myself.

## Using Nearest-Neighbor Search in Machine Learning and Natural Language Processing

We have been working on improving our Non-Metric Space Library (NMSLIB), a toolkit for searching in generic spaces. First we carried out a new evaluation for a reasonably diverse data set. The results recently appeared in PVLDB 15.

Second, we participated in a public evaluation. The results confirmed that our implementations are quite competitive. More specifically, the small-world graph approach proposed by Malkov et al. fared very well against FLANN and kgraph.

However, our work is far from being finished. We may now attempt to apply our toolkit to NLP problems. I summarized thoughts on this topic in a talk at ML lunch.

Equally important we try to make our toolkit easier to use. Because we originally cared mostly about efficiency of experimentation and publishing, a few important features are missing. We are now trying to fill the gap.

In addition, we plan to carry out more comprehensive evaluations that would allow us to better understand the problem at hand as well as to devise methods that work well for a broader class of non-metric spaces.

## There is no love between academy and industry!

Before the keynote talk at Hawaii, the chair Volker Markl greeted the speaker Juan Loaiza (Senior Vice President of Systems Technology at Oracle) with a Hawaiian lei. A lei is traditionally accompanied with a kiss. The day before, a Turing award winner Mike Stonebraker was greeted in exactly this way. So, Juan Loaiza jokingly reminded that Volker was supposed to kiss him. After Volker ignored the comment, Juan summarized: there is no love between academy and industry.

A short anti-advice on tuning: Please, never, ever tune the BM25 coefficients or coefficients of any other retrieval model. Just stick to the values that worked well for "other" data sets. Otherwise, your numbers may be improved by as much as 5%.

## Dictionary-less spellchecking and subword language models

It was asked on Quora about dictionary-less spellchecking (this may require subword language models). Here is my brief answer (feel free to vote on Quora).

In the early days, memory was limited, therefore people tried to avoid storing comprehensive dictionaries. Errors could have been detected by finding substrings that looked "unusual" for a given language. For example, if you compile a list of common trigrams, you could flag words containing trigrams that are off the list (see, e.g., Spellchecking by Computer by R. Mitton).

English has little morphology. In morphologically productive languages, it was often possible to compile a list of word generation rules. For example, given a root form of a verb, it would be possible to obtain all possible inflections for singular and plural forms in the simple present and other tenses. This trick provides some compaction in the case of English, however, many more verb forms can be generated in, e.g., Russian (i.e., a more compact representation is possible compared to the comprehensive dictionary that includes all word forms.).

More sophisticated checking can be done in the case of Agglutinative languages (e.g., Turkish, Finish), where a single word can be equivalent to a full-fledged sentence in English. Agglutinative langauges are typically regular: there are rules to add affixes to roots. Regular does not necessarily mean simple, however. For example, in Turkish there are all kind of suffix and root deformations that occur when word parts are assembled (see this paper for details: Design and implementation of a spelling checker for Turkish by A Solak, K Oflazer).

Perhaps, most relevant to the question are sub-word level language models (including, of course, statistical character-level models). Previously, modelling was done using n-gram language models. As pointed out recently by Yoav Goldberg, these models are quite powerful: The unreasonable effectiveness of Character-level Language Models (and why RNNs are still cool).

However, Recurrent Neural Networks (RNNs) seem to be a more effective tool. A couple more relevant references on RNNs for language modelling:

What these models do: they learn how likely is a combination of certain characters or n-grams in a real word. Errors are, therefore, detected by finding character sequences that looked "unusual" for a given language (unusual means the character sequence didn't match training data well).

The links posted by Wenhao Jiang may also relevant, however, they seem to address a different problem: given an already existing word-level language model find the most probable separation of character (or phoneme) sequences into words. Word segmentation IMHO requires even more data than a dictionary: it requires normally a word-level language model that encompasses the dictionary as a special case. In the case of an out-of-vocabulary (OOV) word, the n-gram model also backs off to using a character-level (or subword-level) language model.