A brief overview of query/sentence similarity functions

I was asked on Quora about similarity functions that could be used to compare two pieces of text, e.g., two sentences, or a sentence and a query. Here is my brief answer (feel free to vote on Quora). All similarity functions that I know can be classified into purely lexical, syntactical, based on logic (deduction), and based on vectorial word representations aka word embeddings.

Purely lexical functions include well-known approaches such as the edit distance, the longest common subsequence, the degree of (skipped) n-gram overlap, and tf-idf approaches. These measures can be used either as is or normalized by the sentence/document length. A good overview of these approaches is given in the IBM Watson paper "Textual evidence gathering and analysis" by Murdock et al.
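As an illustration of the lexical measures listed above, here is a minimal sketch in plain Python. The function names are mine, the inputs are assumed to be strings or lists of tokens, and the n-gram overlap is computed as a Jaccard coefficient, which is only one of several reasonable ways to define the degree of overlap.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ngram_overlap(a, b, n=2):
    """Jaccard overlap between the sets of n-grams of two token lists."""
    def grams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    ga, gb = grams(a), grams(b)
    union = ga | gb
    if not union:
        return 0.0  # both texts are shorter than n tokens
    return len(ga & gb) / len(union)
```

To normalize by length, one could, for example, divide the edit distance by the length of the longer text.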

In addition to plain-text document representations, there were attempts to represent documents as weighted vectors of relevant concepts. The similarity can then be computed in multiple ways, e.g., using the dot product. One well-known approach dubbed Explicit Semantic Analysis (ESA) determines which Wikipedia concepts can be associated with text words and phrases. This technique is known as entity-linking.

A quick note on tf-idf measures. These include many classic query-to-text similarity metrics from the information retrieval domain, such as BM25. In one approach, one sentence or document is considered to be a query and the other a document. However, there are also symmetrized methods: see, e.g., Whissell, John S., and Charles LA Clarke. "Effective measures for inter-document similarity." In any case, there should be some corpus (a large set of documents), so that reliable corpus statistics can be collected.
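To make the query-versus-document setup concrete, here is a hedged sketch of one common BM25 variant in plain Python. The function names, the default parameters k1 and b, and the particular idf formula are my choices for illustration. The symmetrized version simply averages the two directions, which is an assumption of this sketch and not necessarily the method of Whissell and Clarke.

```python
import math
from collections import Counter


def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Score `doc` against `query`; all texts are lists of tokens.

    `corpus` is a non-empty list of tokenized documents used only to
    collect statistics (document frequencies and the average length).
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    doc_freq = Counter(term for d in corpus for term in set(d))
    tf = Counter(doc)
    score = 0.0
    for term in set(query):  # query term frequency is ignored here
        if tf[term] == 0:
            continue
        n_t = doc_freq[term]
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def symmetric_bm25(text_a, text_b, corpus):
    """Average of both directions (an illustrative symmetrization)."""
    return 0.5 * (bm25_score(text_a, text_b, corpus)
                  + bm25_score(text_b, text_a, corpus))
```

With a tiny corpus the statistics are, of course, unreliable; in practice they would be collected from a large document set.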

Syntactic functions take into account not only the words themselves, but also their syntactic relations. For example, if sentences are similar we may expect that both subjects and verbs are similar to some degree. To be able to use syntactic similarity, one needs to obtain either a syntax or a dependency tree using a special parser. Then, trees can be compared using various algorithms. Some of those algorithms (e.g., tree kernels) are discussed in the above-mentioned IBM paper. Another good reference to study is "Tree edit models for recognizing textual entailments, paraphrases, and answers to questions" by Heilman and Smith. It looks like in many cases, comparing complete trees is overkill. Instead, one can use very small pieces (parent-child nodes). For an example, see the paper "Learning to rank answers to non-factoid questions from web collections" by Surdeanu et al.

Sometimes syntactic similarity functions are called semantic (though the semantics involved is typically quite shallow). In particular, one can use semantic role labeling instead of dependency parsing.

Logic approaches don't seem to work for complex texts and domains. Nevertheless, see the above-mentioned IBM paper for some references.

Last, but not least, word embeddings can be quite useful. Given two word embeddings for two words, a similarity can be computed as the cosine similarity between the word vectors (or a different similarity function if it works better than the cosine similarity). To compare complete sentences, two approaches can be used. First, you can use special methods to compute sentence-level embeddings. Second, you can just average individual word embeddings and compare the averages. This seems to work quite well. One well-known related reference is "A neural network for factoid question answering over paragraphs" by Iyyer et al. There are at least two ways to generate word embeddings. One is neural networks (see the word2vec program) and another is plain vanilla latent semantic analysis (LSA).
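The averaging approach can be sketched in a few lines of plain Python. Here `embeddings` is assumed to be a dictionary that maps a word to its vector (e.g., loaded from a word2vec model); the function names and the choice to skip out-of-vocabulary words are mine.

```python
import math


def average_embedding(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known words in this sentence
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(tokens_a, tokens_b, embeddings):
    """Cosine similarity between the averaged word embeddings."""
    ea = average_embedding(tokens_a, embeddings)
    eb = average_embedding(tokens_b, embeddings)
    if ea is None or eb is None:
        return 0.0
    return cosine(ea, eb)
```

Real embeddings have hundreds of dimensions, so in practice one would use a numeric library instead of Python lists, but the computation is the same.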

To see which function is best, you need a training and a test set. Ideally, you would compute several similarity functions and apply a machine learning method to compute a fusion score.

The first use of plus operator in an online search engine

Remember the old plus operator obsoleted by Google in 2011? In many search engines, including Google before October 2011, this operator was used to indicate a mandatory (or an important) keyword. Do you know when the plus operator was first used in an online search engine? I bet few would guess that it happened half a century ago, in the pre-Internet era:

In 1965, TEXTIR permitted users to do some search term weighting. By preceding a term with a plus sign, a searcher could direct TEXTIR to increase the score assigned of that word, and thus raise the score of the source document that contained that word.

How was online search possible before the Internet? One could use a phone line (and apparently a dial-up modem):

Queries were sent to SDC's Q-32 computer in Santa Monica via telephone from a Teletype Model 35 terminal ... In response, the system ... transmitted the texts of retrieved reports back by Teletype in relevance rank order.

Source: A History of Online Information Services 1963-1976 by C. P. Bourne and T.B. Hahn.

The online search service, one of the first of its kind, was developed and provided by the System Development Corporation (SDC). SDC is considered to be the first software company in the world.

It is not the ideas that are overrated, it is the implementation that is undervalued

I think that we, as a society, have come to an important realization: The notion of the Idea Person, who effortlessly produces a stream of ingenious ideas to be implemented by less intelligent underlings, needs to be deflated. At least, many of us do understand that good ideas are not born easily. In contrast, a good idea is a result of a tedious selection process that involves experimentation, reading, backtracking, hard work, and exchange of knowledge. It is also not unusual that the idea evolves substantially in the course of implementation. Yet, little or no credit goes to an Implementation Person.

As a result of the existing imbalance, some people have come to another extreme conclusion: Ideas are not valuable. Here, I have to disagree. Not all ideas are worthless. The problem is that it is hard to distinguish between a good and a bad idea until an implementation is attempted. Nevertheless, a good idea is an important ingredient of progress: Success is not possible without proper implementation, but it is not possible without good ideas either. As it was put by my co-author Anna, it is not the ideas that are overrated, it is the implementation that is undervalued.

Efficient grape counting in your vineyard via passive computer vision

Believe it or not, the USA is the largest consumer of wine: it guzzles more than 10% of all wine produced on the planet. However, it lags somewhat in production. It turns out that maximizing grape yield relies heavily on measurements during the growing season, in particular, on crop estimation. If certain areas are underperforming, it is often possible to fix the issue by, e.g., additional irrigation and fertilization.

Crop estimation is an expensive labor-intensive process that was previously carried out only by humans. The Robotics Institute of Carnegie Mellon University (in collaboration with Cornell University and stakeholders) works on developing automated measuring techniques. At Carnegie Mellon University, the group is led by Stephen Nuske.

What is truly astonishing is that the proposed technology relies only on passive vision techniques, which are considered too unreliable to be used outside a lab. Unlike self-driving cars requiring expensive laser-powered sensing devices called LIDARs, the proposed technology uses only a camera. The camera resides on a small cart that drives at a speed of about 5mph (if I remember correctly, there is also a flash to neutralize variability in lighting). While the cart is driving, the camera takes overlapping pictures of grape vines. The obtained images are processed to detect individual grapes and count them!

Although image recognition algorithms have reached a certain level of maturity, it is still challenging to detect individual grapes, because there are millions of potential locations to check in a single picture. This is especially hard when grapes have not ripened yet (and consequently both leaves and grapes are green). However, the researchers from the Robotics Institute of Carnegie Mellon University can count grapes even in real time! To accomplish this complex task, they use a combination of a quick high-recall low-precision filtering algorithm and a more accurate algorithm that removes false matches. The high-recall low-precision algorithm is an ensemble of two relatively simple key-point detection algorithms. The approach is described in a series of publications. The overall accuracy seems to be pretty good and the technology might be commercialized in the not-so-distant future.

To conclude, I would like to note that, in addition to grape counting in your vineyard, Stephen Nuske worked on several other cool projects, where passive vision was applied to real-world problems. These may be interesting to both practitioners and lab scientists specializing in computer vision.

Algorithms to merge sorted lists or arrays

I have written a rather thorough description of algorithms that one can use to merge sorted lists or arrays of integers. Feel free to vote for this description on Quora. Here I decided to duplicate my answer (slightly revised and improved).

The choice of the merging algorithm depends on (1) the distribution of data and (2) the hardware that you use. There are several major approaches, which can also be combined:

  1. Classic k-way merge with the priority queue. I believe it's described in Knuth. All the lists should be sorted in advance. You read the smallest value from each list and put them into the queue. More specifically, you put the pair (value, list id). Then you extract the smallest value using the queue and output it. If it came from list K, you take the next smallest value from list K (removing it from the list) and push the pair (value, K) to the priority queue. And so on and so forth.

    A priority queue is not especially fast, in particular, because working with a queue entails a lot of branching (which can be slow on both CPUs and GPUs due to branch misprediction). Therefore, other approaches may sometimes be more efficient.

  2. Pairwise merge sort. It is a well-known algorithm, so I won't describe it here. However, if you merge two lists where one is much shorter than the other, other methods can do better.

    In particular, you can iterate through the shorter list and find an insertion point in the longer list using an exponential search (a fancier and more efficient version of the binary search). We used this approach in the context of list intersection, but the same method works well for unions.

  3. Using bitmasks. If your lists are represented as bitmasks, merge is super fast. Extraction of the result can be a bit tricky. However, using modern CPU instructions, you can do it rather easily. Daniel Lemire covered the topic of bitmap decoding extensively. Alternatively, one can use hashing.

    Encoding the whole list as a bitmap can be wasteful. This is why people use hybrid approaches where only a part of the list is encoded as a bitmap. If you have a sorted list as an input, it may actually make sense to convert it first to a bitmap and then carry out a union/intersection using the bitmap.

  4. Using the ScanCount algorithm. Imagine that the minimum number is zero and the maximum number is M. You can create a table with M+1 elements that are all set to zero initially. To carry out a merge, you have to iterate over the lists that you merge. If, during the iteration, you encounter the number X, you set the element X in the table to one (or increment it if you need to know the number of lists that contain the number). Finally, you iterate over the (M+1)-element table and check which elements are non-zero. Bonus: input lists do not have to be sorted!

    The table may have byte or bit elements. Zeroing table elements before merging can be done in several ways. One very simple approach is via the library function memset (it is called memset in C/C++ and may have a different name in other languages). Though this seems naive, memset can zero about 10 billion integers per second for cache-resident data. See the test program here. ScanCount can be surprisingly efficient.

    To fit data into cache, you need to reuse the same small table of M ~ 60-100K elements. In practice, of course, your numbers will be larger than M. However, you can split your inputs and process each split separately.

  5. To conclude, I would mention that there are more advanced so-called adaptive algorithms, which I don't remember off the top of my head. Google something like "adaptive list intersection", or "adaptive list merging".
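Approaches 1, 2, and 4 from the list above can be sketched in Python as follows. This is only an illustration of the logic under my own naming: the function names are hypothetical, the inputs are assumed to be lists of non-negative integers, and Python will not show the low-level performance effects (branching, cache behavior) discussed above.

```python
import heapq
from bisect import bisect_left


def kway_merge(lists):
    """Approach 1: k-way merge of sorted lists with a priority queue."""
    # Each heap entry is (value, list id, position within that list).
    heap = [(lst[0], k, 0) for k, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        value, k, pos = heapq.heappop(heap)
        out.append(value)
        if pos + 1 < len(lists[k]):
            heapq.heappush(heap, (lists[k][pos + 1], k, pos + 1))
    return out


def exponential_search(arr, target, start):
    """Smallest index i >= start with arr[i] >= target (len(arr) if none)."""
    bound = 1
    while start + bound < len(arr) and arr[start + bound] < target:
        bound *= 2  # gallop: probe offsets 1, 2, 4, 8, ...
    lo = start + bound // 2
    hi = min(start + bound, len(arr))
    return bisect_left(arr, target, lo, hi)


def merge_skewed(short, long):
    """Approach 2: merge a short sorted list into a much longer one."""
    out, pos = [], 0
    for value in short:
        nxt = exponential_search(long, value, pos)
        out.extend(long[pos:nxt])  # copy the skipped run in one go
        out.append(value)
        pos = nxt
    out.extend(long[pos:])
    return out


def scancount_union(lists, max_value):
    """Approach 4: ScanCount union; inputs need not be sorted."""
    table = bytearray(max_value + 1)  # zero-initialized byte table
    for lst in lists:
        for x in lst:
            table[x] = 1
    return [x for x, flag in enumerate(table) if flag]
```

Note that the first two functions keep duplicates, whereas the ScanCount sketch returns each number once. For the first approach, Python's standard library already provides heapq.merge, which does the same job lazily.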

