I was asked on Quora about similarity functions that could be used to compare two pieces of text, e.g., two sentences, or a sentence and a query. Here is my brief answer (feel free to vote on Quora). All similarity functions that I know of fall into four categories: purely lexical, syntactic, logic-based (deductive), and those based on vector representations of words, a.k.a. word embeddings.

Purely lexical functions include well-known approaches such as the edit distance, the longest common subsequence, the degree of (skipped) n-gram overlap, and tf-idf approaches. These measures can be used either as is or normalized by sentence/document length. A good overview of these approaches is given in the IBM Watson paper "Textual evidence gathering and analysis" by Murdock et al.
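To make one of these concrete, here is a minimal sketch of a normalized n-gram overlap measure. The whitespace tokenization, the choice of n, and the Jaccard-style normalization are illustrative choices of mine, not prescribed by the references above:

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) from a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(s1, s2, n=2):
    """Jaccard-normalized n-gram overlap between two strings."""
    a, b = ngrams(s1.split(), n), ngrams(s2.split(), n)
    return len(a & b) / len(a | b) if a | b else 0.0

print(ngram_overlap("the cat sat on the mat", "the cat lay on the mat"))
```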

In addition to plain-text document representations, there have been attempts to represent documents as weighted vectors of relevant concepts. A similarity can then be computed in multiple ways, e.g., using the dot product. One well-known approach, dubbed Explicit Semantic Analysis (ESA), determines which Wikipedia concepts can be associated with text words and phrases. The related task of mapping words and phrases to specific entities (e.g., Wikipedia pages) is known as entity linking.
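As an illustration, a document in such a scheme can be stored as a sparse mapping from concepts to weights. The concept names and weights below are invented for the example, and I use the cosine rather than the plain dot product:

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical concept weights for two documents:
doc1 = {"Machine_learning": 0.8, "Statistics": 0.4}
doc2 = {"Machine_learning": 0.6, "Neural_network": 0.5}
print(cosine(doc1, doc2))
```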

A quick note on tf-idf measures. These include many classic query-to-text similarity metrics from the information retrieval domain, such as BM25. In one approach, one sentence or document is considered to be a query and the other a document. However, there are also symmetrized methods: see, e.g., Whissell, John S., and Charles L. A. Clarke. "Effective measures for inter-document similarity." In any case, there should be some corpus (a large set of documents), so that reliable corpus statistics can be collected.
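For reference, here is a minimal BM25 sketch with the common defaults k1 = 1.2 and b = 0.75 and a smoothed, non-negative idf (the variant used in Lucene); a real implementation would precompute document frequencies and the average document length rather than rescanning the corpus:

```python
import math
from collections import Counter

def bm25(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """BM25 score of a document for a query; `corpus` is a list of
    tokenized documents used to gather collection statistics."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_tokens)
    score = 0.0
    for t in set(query_tokens):  # ignore repeated query terms for simplicity
        df = sum(1 for d in corpus if t in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        f = tf[t]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score
```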

Syntactic functions take into account not only the words themselves, but also their syntactic relations. For example, if sentences are similar, we may expect both subjects and verbs to be similar to some degree. To use syntactic similarity, one needs to obtain either a constituency (syntax) tree or a dependency tree using a parser. Then, trees can be compared using various algorithms. Some of those algorithms (e.g., tree kernels) are discussed in the above-mentioned IBM paper. Another good reference to study is "Tree edit models for recognizing textual entailments, paraphrases, and answers to questions" by Heilman and Smith. In many cases, however, comparing complete trees appears to be overkill: one can instead use very small tree fragments such as parent-child node pairs. For an example, see the paper "Learning to rank answers to non-factoid questions from web collections" by Surdeanu et al.
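Here is a minimal sketch of the parent-child idea: extract head-child dependency triples and compare the resulting sets. spaCy is my choice here purely for illustration (the cited papers use other parsers), and this assumes the en_core_web_sm model is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def dep_triples(sentence):
    """Extract (head lemma, dependency relation, child lemma) triples."""
    return {(tok.head.lemma_, tok.dep_, tok.lemma_)
            for tok in nlp(sentence) if tok.dep_ != "ROOT"}

def dep_overlap(s1, s2):
    """Jaccard overlap between the dependency-triple sets of two sentences."""
    a, b = dep_triples(s1), dep_triples(s2)
    return len(a & b) / len(a | b) if a | b else 0.0

print(dep_overlap("The cat chased the mouse.", "A cat chases a small mouse."))
```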

Sometimes syntactic similarity functions are called semantic (though the semantics involved are typically quite shallow). In particular, one can use semantic role labeling instead of dependency parsing.

Logic-based (deductive) approaches don't seem to work for complex texts and domains. Nevertheless, see the above-mentioned IBM paper for some references.

Last, but not least, word embeddings can be quite useful. Given embeddings for two words, a similarity can be computed as the cosine similarity between their vectors (or a different similarity function, if it works better than the cosine similarity). To compare complete sentences, two approaches can be used. First, you can use special methods to compute sentence-level embeddings. Second, you can simply average individual word embeddings and compare the averages; this seems to work quite well. One well-known related reference is "A neural network for factoid question answering over paragraphs" by Iyyer et al. There are at least two ways to generate word embeddings: one is neural networks (see the word2vec program), and another is plain-vanilla Latent Semantic Analysis (LSA).
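A minimal sketch of the averaging approach follows. The tiny embedding table is made up for the example; in practice one would load pre-trained vectors, e.g., produced by word2vec or an LSA model:

```python
import numpy as np

EMB = {  # hypothetical 3-dimensional embeddings, for illustration only
    "cats": np.array([0.9, 0.1, 0.0]),
    "dogs": np.array([0.8, 0.2, 0.1]),
    "sleep": np.array([0.1, 0.9, 0.3]),
    "nap": np.array([0.2, 0.8, 0.4]),
}

def sentence_vec(tokens, dim=3):
    """Average the embeddings of the tokens we have vectors for."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

print(cosine(sentence_vec("cats sleep".split()),
             sentence_vec("dogs nap".split())))
```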

To see which function works best, you need a training set and a test set. Ideally, you would compute several similarity functions and apply a machine learning method to combine them into a single fusion score.
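A minimal sketch of such a fusion follows. The feature values and labels are invented, and scikit-learn's logistic regression is just one reasonable choice of fusion model:

```python
from sklearn.linear_model import LogisticRegression

# Each row holds several similarity scores for a pair of texts, e.g.,
# [n-gram overlap, BM25, dependency overlap, embedding cosine].
X_train = [[0.30, 4.1, 0.20, 0.85],
           [0.05, 0.7, 0.00, 0.40],
           [0.50, 6.3, 0.35, 0.90],
           [0.10, 1.2, 0.05, 0.35]]
y_train = [1, 0, 1, 0]  # 1 = similar pair, 0 = dissimilar pair

model = LogisticRegression().fit(X_train, y_train)
fused = model.predict_proba([[0.25, 3.0, 0.15, 0.80]])[0, 1]
print(f"fused similarity score: {fused:.3f}")
```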