 

Anti-advice: Don't tune your BM25 coefficients!

A short piece of anti-advice on tuning: please, never, ever tune the BM25 coefficients, or the coefficients of any other retrieval model. Just stick to the values that worked well for "other" data sets. Otherwise, your numbers might improve by as much as 5%.
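For reference, the coefficients in question are k1 (term-frequency saturation) and b (document-length normalization). Below is a minimal sketch of the standard BM25 formula in plain Python, just to show where these knobs sit; the names, defaults (k1=1.2, b=0.75), and the idf variant are the conventional ones, not tied to any particular library.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with BM25.

    doc_freq maps a term to the number of documents containing it.
    k1 controls term-frequency saturation, b controls length normalization:
    these are the coefficients the post (ironically) tells you not to tune.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```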



Dictionary-less spellchecking and subword language models

A question was asked on Quora about dictionary-less spellchecking (which may require subword language models). Here is my brief answer (feel free to vote on Quora).

In the early days, memory was limited, so people tried to avoid storing comprehensive dictionaries. Errors could be detected by finding substrings that looked "unusual" for a given language. For example, if you compile a list of common trigrams, you can flag words containing trigrams that are off the list (see, e.g., Spellchecking by Computer by R. Mitton).
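A minimal sketch of this trigram heuristic (the padding characters, list size, and toy word list here are arbitrary choices; in practice the trigram list would be harvested from a large corpus of correctly spelled text):

```python
from collections import Counter

def common_trigrams(words, top_k=5000):
    """Harvest the top_k most frequent character trigrams from known-good words."""
    counts = Counter()
    for w in words:
        padded = f"^{w.lower()}$"          # mark word boundaries
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return {t for t, _ in counts.most_common(top_k)}

def looks_misspelled(word, trigram_set):
    """Flag a word if any of its trigrams is absent from the common-trigram list."""
    padded = f"^{word.lower()}$"
    return any(padded[i:i + 3] not in trigram_set
               for i in range(len(padded) - 2))

# Toy usage: in practice the trigram list comes from a large corpus.
good = ["hello", "world", "spelling", "checker", "language"]
trigrams = common_trigrams(good)
print(looks_misspelled("hello", trigrams))   # False
print(looks_misspelled("xqzvk", trigrams))   # True
```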

English has little morphology. In morphologically productive languages, it is often possible to compile a list of word-generation rules. For example, given the root form of a verb, one can generate all of its inflections for singular and plural forms in the simple present and other tenses. This trick provides only modest compaction for English; however, many more verb forms can be generated in, e.g., Russian, so the rules give a much more compact representation than a comprehensive dictionary that lists every word form.
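To make the idea of rule-based generation concrete, here is a toy sketch for regular English verbs only; real rule sets, especially for Russian, are far larger and must handle stem alternations and exceptions.

```python
def regular_verb_forms(root):
    """Generate a few forms of a regular English verb from its root,
    instead of storing every inflected form in a dictionary."""
    third_person = root + ("es" if root.endswith(("s", "sh", "ch", "x", "z")) else "s")
    past = root + ("d" if root.endswith("e") else "ed")
    progressive = (root[:-1] if root.endswith("e") else root) + "ing"
    return {root, third_person, past, progressive}

print(regular_verb_forms("walk"))   # {'walk', 'walks', 'walked', 'walking'}
print(regular_verb_forms("smile"))  # {'smile', 'smiles', 'smiled', 'smiling'}
```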

More sophisticated checking can be done in the case of agglutinative languages (e.g., Turkish, Finnish), where a single word can be equivalent to a full-fledged sentence in English. Agglutinative languages are typically regular: there are rules for adding affixes to roots. Regular does not necessarily mean simple, however. For example, in Turkish there are all kinds of suffix and root deformations that occur when word parts are assembled (see this paper for details: Design and Implementation of a Spelling Checker for Turkish by A. Solak and K. Oflazer).

Perhaps most relevant to the question are subword-level language models (including, of course, statistical character-level models). Traditionally, such modelling was done with n-gram language models. As Yoav Goldberg recently pointed out, these models are quite powerful: The Unreasonable Effectiveness of Character-level Language Models (and why RNNs are still cool).

However, Recurrent Neural Networks (RNNs) seem to be a more effective tool. A couple more relevant references on RNNs for language modelling:

  1. Subword Language Modeling with Neural Networks by Mikolov et al.
  2. Generating Sequences With Recurrent Neural Networks by Alex Graves

What these models do is learn how likely a given combination of characters or n-grams is in a real word. Errors are therefore detected by finding character sequences that look "unusual" for a given language (unusual meaning that the character sequence does not match the training data well).
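Here is a sketch of the n-gram flavor of this idea: a character-level trigram model with add-one smoothing, scoring a word by its average per-character log-probability. Words scoring far below typical values would be flagged; the smoothing choice and the toy training list are assumptions of this example, not part of any of the cited papers.

```python
import math
from collections import Counter

class CharTrigramLM:
    """Character-level trigram language model with add-one smoothing.
    A low average log-probability suggests an 'unusual' (likely misspelled) word."""

    def __init__(self, words):
        self.context_counts = Counter()
        self.trigram_counts = Counter()
        self.vocab = set()
        for w in words:
            padded = "^^" + w.lower() + "$"
            self.vocab.update(padded)
            for i in range(2, len(padded)):
                ctx, ch = padded[i - 2:i], padded[i]
                self.trigram_counts[(ctx, ch)] += 1
                self.context_counts[ctx] += 1

    def avg_logprob(self, word):
        padded = "^^" + word.lower() + "$"
        V = len(self.vocab)
        total = 0.0
        for i in range(2, len(padded)):
            ctx, ch = padded[i - 2:i], padded[i]
            num = self.trigram_counts[(ctx, ch)] + 1     # add-one smoothing
            den = self.context_counts[ctx] + V
            total += math.log(num / den)
        return total / (len(padded) - 2)

lm = CharTrigramLM(["spelling", "checking", "language", "model", "character"])
print(lm.avg_logprob("spelling") > lm.avg_logprob("zqxvgk"))  # True: the real word scores higher
```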

The links posted by Wenhao Jiang may also be relevant; however, they seem to address a different problem: given an existing word-level language model, find the most probable segmentation of a character (or phoneme) sequence into words. Word segmentation, IMHO, requires even more data than a dictionary: it normally requires a word-level language model, which encompasses the dictionary as a special case. In the case of an out-of-vocabulary (OOV) word, such an n-gram model also backs off to a character-level (or subword-level) language model.
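For contrast with the spellchecking task, here is a minimal sketch of that segmentation problem: dynamic programming over a unigram word model (a toy stand-in for a full word-level language model, with no OOV back-off).

```python
import math

def segment(text, word_logprob, max_word_len=20):
    """Split an unspaced character string into the most probable word sequence
    under a unigram word model (Viterbi-style dynamic programming).
    word_logprob maps known words to log-probabilities."""
    n = len(text)
    best = [(-math.inf, -1)] * (n + 1)    # (score, back-pointer) per prefix length
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_logprob and best[start][0] > -math.inf:
                score = best[start][0] + word_logprob[word]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None                        # no segmentation found (would need OOV back-off)
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return list(reversed(words))

lp = {w: math.log(p) for w, p in
      {"the": 0.06, "cat": 0.01, "cats": 0.002, "at": 0.02, "sat": 0.01}.items()}
print(segment("thecatsat", lp))  # ['the', 'cat', 'sat']
```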



A brief overview of query/sentence similarity functions

I was asked on Quora about similarity functions that could be used to compare two pieces of text, e.g., two sentences, or a sentence and a query. Here is my brief answer (feel free to vote on Quora). All similarity functions that I know of can be classified as purely lexical, syntactic, logic-based (deduction), or based on vectorial word representations, aka word embeddings.

Purely lexical functions include well-known approaches such as the edit distance, the longest common subsequence, the degree of (skipped) n-gram overlap, and tf-idf approaches. These are very well-known measures that can be used either as is or normalized by sentence/document length. A good overview of these approaches is given in the IBM Watson paper "Textual evidence gathering and analysis" by Murdock et al.
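A minimal sketch of a few of these lexical measures for two tokenized sentences (plain n-gram overlap rather than skipped n-grams, and simple length normalization); difflib's ratio serves as a rough edit-distance-style similarity.

```python
import difflib

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(a, b, n=2):
    """Jaccard overlap of token n-grams."""
    A, B = ngrams(a, n), ngrams(b, n)
    return len(A & B) / len(A | B) if A | B else 0.0

def lcs_length(a, b):
    """Longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

s1 = "the cat sat on the mat".split()
s2 = "a cat sat on a mat".split()
print(ngram_overlap(s1, s2, n=2))                       # bigram Jaccard overlap
print(lcs_length(s1, s2) / max(len(s1), len(s2)))       # LCS, normalized by sentence length
print(difflib.SequenceMatcher(None, s1, s2).ratio())    # edit-distance-style similarity
```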

In addition to plain-text document representations, there have been attempts to represent documents as weighted vectors of relevant concepts. A similarity can then be computed in multiple ways, e.g., using the dot product. One well-known approach, dubbed Explicit Semantic Analysis (ESA), determines which Wikipedia concepts can be associated with the words and phrases of a text; mapping text to such concepts is known as entity linking.
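The comparison step itself is simple; the sketch below uses made-up concept weights (the hard part, mapping text to Wikipedia concepts, is what ESA and entity linking actually do and is not shown here).

```python
def dot(u, v):
    """Dot product of two sparse concept vectors stored as {concept: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u                      # iterate over the smaller vector
    return sum(w * v.get(c, 0.0) for c, w in u.items())

# Hypothetical concept weights that an ESA-style analysis might assign to two texts.
doc1 = {"Information_retrieval": 0.8, "BM25": 0.5, "Wikipedia": 0.1}
doc2 = {"Information_retrieval": 0.6, "Search_engine": 0.4}
print(dot(doc1, doc2))  # 0.48
```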

A quick note on tf-idf measures: these include many classic query-to-text similarity metrics from the information retrieval domain, such as BM25. In one approach, one sentence or document is treated as a query and the other as a document. However, there are also symmetrized methods: see, e.g., Whissell, John S., and Charles L. A. Clarke, "Effective measures for inter-document similarity." In any case, there should be some corpus (a large set of documents) from which reliable corpus statistics can be collected.
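As an illustration of why a corpus is needed, here is a symmetric tf-idf cosine between two short texts (not the specific measures of Whissell and Clarke): the idf weights come from a separate, here entirely hypothetical, document-frequency table.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """tf-idf weights for one text, with idf taken from corpus statistics."""
    tf = Counter(tokens)
    return {t: c * math.log(num_docs / (1 + doc_freq.get(t, 0))) for t, c in tf.items()}

def cosine(u, v):
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

# Toy document frequencies over a hypothetical 1000-document corpus.
df = {"the": 990, "cat": 40, "sat": 25, "mat": 10, "dog": 45}
a = "the cat sat on the mat".split()
b = "the dog sat on the mat".split()
print(cosine(tfidf_vector(a, df, 1000), tfidf_vector(b, df, 1000)))
```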

Syntactic functions take into account not only the words themselves, but also their syntactic relations. For example, if sentences are similar, we may expect that both subjects and verbs are similar to some degree. To use syntactic similarity, one needs to obtain either a constituency (syntax) tree or a dependency tree using a special parser. Then, trees can be compared using various algorithms. Some of those algorithms (e.g., tree kernels) are discussed in the above-mentioned IBM paper. Another good reference to study is "Tree edit models for recognizing textual entailments, paraphrases, and answers to questions" by Heilman and Smith. It looks like in many cases comparing complete trees is overkill. Instead, one can use very small pieces (parent-child node pairs). For an example, see the paper "Learning to rank answers to non-factoid questions from web collections" by Surdeanu et al.
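A rough sketch of the "small pieces" idea, assuming spaCy and its en_core_web_sm model are installed: compare the sets of (head lemma, relation, child lemma) triples produced by a dependency parser.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this English model is installed

def dep_triples(sentence):
    """Parent-child dependency pieces: (head lemma, relation, child lemma)."""
    return {(tok.head.lemma_, tok.dep_, tok.lemma_)
            for tok in nlp(sentence) if tok.dep_ != "ROOT"}

def syntactic_overlap(s1, s2):
    """Jaccard overlap of dependency triples as a crude syntactic similarity."""
    a, b = dep_triples(s1), dep_triples(s2)
    return len(a & b) / len(a | b) if a | b else 0.0

print(syntactic_overlap("The cat chased the mouse.",
                        "A cat chased a small mouse."))
```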

Sometimes syntactic similarity functions are called semantic (though the semantics involved is typically quite shallow). In particular, one can use semantic role labeling instead of dependency parsing.

Logic-based approaches don't seem to work well for complex texts and domains. Nevertheless, see the above-mentioned IBM paper for some references.

Last, but not least, word embeddings can be quite useful. Given the embeddings of two words, a similarity can be computed as the cosine similarity between the word vectors (or a different similarity function, if it works better than the cosine). To compare complete sentences, two approaches can be used. First, you can use special methods to compute sentence-level embeddings. Second, you can simply average the individual word embeddings and compare the averages; this seems to work quite well. One well-known related reference is "A Neural Network for Factoid Question Answering over Paragraphs" by Iyyer et al. There are at least two ways to generate word embeddings: one is neural networks (see the word2vec program), and another is plain-vanilla Latent Semantic Analysis (LSA).
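A sketch of the averaging approach, assuming you already have an embedding lookup (e.g., loaded from word2vec output) as a dict from word to NumPy vector; the tiny 3-dimensional vectors here are made up purely for illustration.

```python
import numpy as np

def sentence_vector(tokens, embeddings):
    """Average the embeddings of in-vocabulary words (a simple sentence embedding)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional embeddings; real ones (word2vec, LSA) have hundreds of dimensions.
emb = {"cat": np.array([0.9, 0.1, 0.0]), "dog": np.array([0.8, 0.2, 0.1]),
       "sat": np.array([0.1, 0.9, 0.0]), "ran": np.array([0.0, 0.8, 0.2])}
v1 = sentence_vector("the cat sat".split(), emb)
v2 = sentence_vector("a dog ran".split(), emb)
print(cosine(v1, v2))
```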

To see which function works best, you need a training set and a test set. Ideally, you would compute several similarity functions and apply a machine learning method to combine them into a fusion score.
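The fusion step could be as simple as a logistic regression over several similarity scores, assuming scikit-learn and a labeled set of sentence pairs; the feature values and labels below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [lexical_overlap, tfidf_cosine, embedding_cosine] for one sentence pair;
# labels: 1 = similar, 0 = not similar (tiny made-up training set).
X_train = np.array([[0.7, 0.8, 0.9],
                    [0.1, 0.2, 0.3],
                    [0.6, 0.5, 0.8],
                    [0.0, 0.1, 0.4]])
y_train = np.array([1, 0, 1, 0])

fusion = LogisticRegression().fit(X_train, y_train)
print(fusion.predict_proba([[0.5, 0.6, 0.7]])[:, 1])  # fused similarity score in [0, 1]
```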



The first use of plus operator in an online search engine

Remember the old plus operator that Google retired in 2011? In many search engines, including Google before October 2011, this operator was used to indicate a mandatory (or important) keyword. Do you know when the plus operator was first used in an online search engine? I bet few would guess that it happened half a century ago, in the pre-Internet era:

In 1965, TEXTIR permitted users to do some search term weighting. By preceding a term with a plus sign, a searcher could direct TEXTIR to increase the score assigned to that word, and thus raise the score of the source document that contained that word.

How was online search possible before the Internet? One could use a phone line (and, apparently, a dial-up modem):

Queries were sent to SDC's Q-32 computer in Santa Monica via telephone from a Teletype Model 35 terminal ... In response, the system ... transmitted the texts of retrieved reports back by Teletype in relevance rank order.

Source: A History of Online Information Services, 1963-1976 by C. P. Bourne and T. B. Hahn.

The online search service, one of the first of its kind, was developed and provided by the System Development Corporation (SDC). SDC is considered to be the first software company in the world.



It is not the ideas that are overrated, it is the implementation that is undervalued

I think that we, as a society, have come to an important realization: the notion of the Idea Person, who effortlessly produces a stream of ingenious ideas to be implemented by less intelligent underlings, needs to be deflated. At least, many of us do understand that good ideas are not born easily. Rather, a good idea is the result of a tedious selection process that involves experimentation, reading, backtracking, hard work, and exchange of knowledge. It is also not unusual for an idea to evolve substantially in the course of implementation. Yet, little or no credit goes to an Implementation Person.

As a result of the existing imbalance, some people have come to the opposite extreme conclusion: ideas are not valuable. Here, I have to disagree. Not all ideas are worthless. The problem is that it is hard to distinguish a good idea from a bad one until an implementation is attempted. Nevertheless, a good idea is an important ingredient of progress: success is not possible without proper implementation, but it is not possible without good ideas either. As my co-author Anna put it, it is not the ideas that are overrated, it is the implementation that is undervalued.


