
Demystifying IBM Watson

This is written in response to a Quora question, which asks about the internals of the IBM Watson question answering (QA) system. Feel free to vote there for my answer! Previously, I briefly compared the IBM Watson approach to that of DeepMind, albeit without going into details of how IBM Watson works. Here I fill this gap.

I am not sure anybody knows exactly what was under the hood. However, there is a series of papers published by IBM, most of which I read end-to-end more than once. One overview paper can be found here. The list of papers can be found here; most PDFs can be easily googled :-) There is also a lengthy (but quite relevant) survey (by IBM Watson team member J. Prager) that covers some of the details of retrieval-based question answering:

Prager, John. "Open-domain question–answering." Foundations and Trends® in Information Retrieval 1.2 (2007): 91-231.

First things first: the IBM Watson team incorporated both symbolic/logical systems and a classic redundancy-based retrieval QA approach into their system. However, there were only a few questions (about 1%) that they were able to answer by logical inference and by querying structured knowledge sources.

I would reiterate that the vast majority of questions are answered using a carefully tuned retrieval-based system, which heavily relies on the fact that Jeopardy answers are factoids: short noun phrases such as named entities (e.g., dates, names of famous persons, or city names). Hence, the QA system does not really need to answer a question, e.g., by synthesizing an answer or by doing some complicated inference. It should instead extract a potential answer and collect enough statistical evidence that this answer is correct.

And, indeed, a retrieval-based factoid QA system finds passages lexically matching the question and extracts potential answers from these passages. It then uses a carefully tuned statistical model to figure out which candidate answers are good. This model likely does not involve any sophisticated reasoning of the kind humans are capable of. That said, I still consider IBM Watson one of the greatest achievements in the AI field.
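
To make this concrete, here is a minimal sketch of such a redundancy-based pipeline. It is purely illustrative: the retrieval, candidate-extraction, and scoring functions are toy stand-ins of my own, not components of Watson.

    # A toy redundancy-based factoid QA pipeline (an illustration, not Watson's code):
    # retrieve lexically matching passages, extract short candidate answers,
    # and prefer candidates supported by many passages.
    import re
    from collections import Counter

    def retrieve_passages(question, corpus, top_k=50):
        # Rank passages by simple bag-of-words overlap with the question.
        q_terms = set(re.findall(r"\w+", question.lower()))
        return sorted(corpus,
                      key=lambda p: len(q_terms & set(re.findall(r"\w+", p.lower()))),
                      reverse=True)[:top_k]

    def extract_candidates(passage):
        # Toy stand-in for named-entity / noun-phrase extraction: capitalized tokens.
        return re.findall(r"\b[A-Z][a-z]+\b", passage)

    def answer(question, corpus):
        q_terms = set(re.findall(r"\w+", question.lower()))
        counts = Counter()
        for passage in retrieve_passages(question, corpus):
            for cand in extract_candidates(passage):
                if cand.lower() not in q_terms:  # do not propose question words as answers
                    counts[cand] += 1            # redundancy: frequent candidates win
        return counts.most_common(3)

    corpus = ["John Tory was re-elected as the mayor of Toronto in 2018.",
              "The mayor of Toronto, John Tory, attended the ceremony.",
              "Ottawa is the capital of Canada."]
    print(answer("Who is the mayor of Toronto?", corpus))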

The fact that Jeopardy questions are long greatly helps to find the so-called candidate passages, which are likely to contain an answer. Finding these passages is based largely on the lexical overlap between the question and the answer passage. Stephen Wolfram even ran an experiment where he found that a single search engine can find candidate passages for nearly 70% of all answers.

Furthermore, there is a good coverage of Jeopardy topics in Wikipedia. I cite: "We conducted an experiment to evaluate the coverage of Wikipedia articles on Jeopardy! questions and found that the vast majority of Jeopardy! answers are titles of Wikipedia documents [10]. Of the roughly 5% of Jeopardy! answers that are not Wikipedia titles, some included multiple entities, each of which is a Wikipedia title, such as Red, White, and Blue, whereas others were sentences or verb phrases, such as make a scarecrow or fold an American flag." Chu-Carroll, Jennifer, et al. "Finding needles in the haystack: Search and candidate generation." IBM Journal of Research and Development 56.3.4 (2012): 6-1.

I have to say that just throwing a bag-of-words query into a search engine can be a suboptimal approach, but the IBM Watson team wrote a bunch of complex question-rewriting procedures (in Prolog!) to ensure these queries were good. Not all candidate passages are generated in this way: I have covered another generation approach in another blog post.

After candidate passages are retrieved, IBM Watson extracts potential answers, which is not a trivial task. How does it find them? The actual model is surely rather complicated, but it largely looks for named entities and more generic noun phrases. However, not all entities/phrases are weighted equally. What affects the weights? Three things (illustrated by a toy scoring sketch below):

  1. The type of the question and the type of the entity (or rather their compatibility score);
  2. The existence of additional supporting evidence;
  3. How frequently these entities/noun phrases appear in candidate passages.

For example, if the question is "Who is the mayor of Toronto?" we know that the answer is a person. Hence, we can downweight named entities whose type is not a person. The actual answer-typing process is surely more complicated, and there is a separate paper describing it in more detail:

Murdock, J. William, et al. "Typing candidate answers using type coercion." IBM Journal of Research and Development 56.3.4 (2012): 7-1.

Importantly, incorporating other types of relations (e.g., spatial or temporal) in addition to answer-question type compatibility did not seem to result in substantial improvements (though some gains were observed). See the results in Tables 1 and 2 of the paper:

Kalyanpur, Aditya, et al. "Structured data and inference in DeepQA." IBM Journal of Research and Development 56.3.4 (2012): 10-1.

Furthermore, for each candidate entry X, we can try to construct a query like "X is the mayor of Toronto" and find passages with good lexical overlap with this additional evidencing query. If such passages exist, they provide evidence that X is, indeed, an answer to the question.
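
Here is a tiny sketch of this kind of evidence scoring; the hypothesis template and the overlap measure are deliberately simplistic stand-ins, not the actual Watson passage scorers.

    # Toy evidence scoring: form a hypothesis statement for each candidate X and
    # check how well retrieved passages lexically support it (simplistic on purpose).
    import re

    def support_score(hypothesis, passages):
        h_terms = set(re.findall(r"\w+", hypothesis.lower()))
        return max(len(h_terms & set(re.findall(r"\w+", p.lower()))) / len(h_terms)
                   for p in passages)

    passages = ["John Tory is the mayor of Toronto.",
                "Ottawa is the capital of Canada."]
    for cand in ["John Tory", "Ottawa"]:
        hypothesis = f"{cand} is the mayor of Toronto"
        print(cand, round(support_score(hypothesis, passages), 2))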

There is a separate paper devoted to the evidencing process:

Murdock, J. William, et al. "Textual evidence gathering and analysis." IBM Journal of Research and Development 56.3.4 (2012): 8-1.

Last, but not least, the ranking approach (for candidate answers) takes into account the (weighted) number of occurrences. In other words, we expect true answers to appear more frequently in retrieved candidate passages. Although this assumption seems a bit simplistic, it works well due to redundancy: there are a lot of answer passages for simple, well-known factoids. A nice paper exploring this phenomenon was written by Jimmy Lin:

Lin, Jimmy. "An exploration of the principles underlying redundancy-based factoid question answering." ACM Transactions on Information Systems (TOIS) 25.2 (2007): 6.



Efficient and Accurate Non-Metric k-NN Search with Applications to Text Matching: We Need More k-NN Search!

The final version of my thesis "Efficient and Accurate Non-Metric k-NN Search with Applications to Text Matching" is now available online. An important by-product of my research is an efficient NMSLIB library, which I develop jointly with other folks. In a podcast with Radim Řehůřek (Gensim author) I discuss this project, its goals, and its history in detail.

Although efficiency is an important part of the thesis, it is not primarily about efficiency. Most importantly, I try to deliver the following messages:

  1. We have very flexible retrieval tools, in particular, graph-based retrieval algorithms, which can work well for a wide variety of similarity functions. In other words, we do not have to limit ourselves to inner-product similarities (e.g., the Euclidean distance) or even to metric spaces (see the small NMSLIB sketch right after this list).
  2. When "queries" are long, these algorithms can challenge traditional term-based inverted files. So, in the future, I expect retrieval systems to rely less on classic term-based inverted files and more on generic k-NN search algorithms (including graph-based retrieval algorithms). I think it is not a question of "IF", but rather a question of "WHEN".

Graph-based retrieval is an old new idea, which has been around for more than twenty years. The idea is beautifully simple: build a graph where sufficiently close points are connected by edges. Such graphs come in various flavors and can be collectively called neighborhood graphs or proximity graphs. Given a neighborhood graph, nearest-neighbor and other queries can be answered (mostly only approximately) by traversing the graph in the direction of the query (starting from, e.g., a random node). I cover the history of this idea in my thesis in more detail, but the earliest reference for this approach that I know of is the seminal paper by Sunil Arya and David Mount (BTW, David Mount is a co-author of the well-known ANN library).

Despite this early discovery, the practicality of graph-based retrieval in high-dimensional spaces was limited because we did not know how to construct neighborhood graphs efficiently. As often happens in science, a number of fancy methods were proposed (while a simpler working one was overlooked). Luckily, it was discovered that the graph can be constructed incrementally: each new data point is inserted by running a graph-based retrieval algorithm on the partially built graph to find the point's nearest neighbors. A summit (or at least a local maximum) of this endeavour is the Hierarchical Navigable Small World graph (HNSW) method, which combines efficient graph-pruning heuristics and a multi-layer, multi-resolution graph topology with a bunch of efficiency tricks.
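
For intuition, here is a toy pure-Python sketch of both ideas: greedy traversal of a neighborhood graph and the incremental construction that reuses this traversal to connect each new point. Real implementations (SW-graph, HNSW) add priority queues, pruning heuristics, and a multi-layer topology; this sketch only shows the principle.

    # A toy neighborhood graph: greedy best-first search plus incremental construction.
    # Each new point is connected to the (approximate) nearest nodes found by searching
    # the partially built graph.
    import random
    import numpy as np

    def greedy_search(graph, points, query, start, dist):
        cur = start
        while True:
            # Move to the neighbor closest to the query; stop at a local minimum.
            best = min(graph[cur], key=lambda v: dist(points[v], query), default=cur)
            if dist(points[best], query) >= dist(points[cur], query):
                return cur
            cur = best

    def build_graph(points, num_tries, dist):
        graph = {0: set()}
        for i in range(1, len(points)):
            # Several greedy searches from random entry points approximate the
            # nearest neighbors of the new point among already inserted ones.
            found = {greedy_search(graph, points, points[i], random.randrange(i), dist)
                     for _ in range(num_tries)}
            graph[i] = set(found)
            for j in found:
                graph[j].add(i)  # keep the graph undirected
        return graph

    l2 = lambda a, b: float(np.linalg.norm(a - b))
    points = [np.random.rand(16) for _ in range(500)]
    graph = build_graph(points, num_tries=5, dist=l2)

    query = np.random.rand(16)
    approx = greedy_search(graph, points, query, random.randrange(len(points)), l2)
    exact = min(range(len(points)), key=lambda i: l2(points[i], query))
    print("approximate nearest:", approx, " exact nearest:", exact)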

It was also known (but not well known) that graph-based retrieval algorithms can work for generic (mostly metric) distances. So, I personally was interested in pushing these (and other) methods even further and applying them to non-metric and non-symmetric similarities. One ultimate objective was to replace or complement a standard term-based inverted file in the text-retrieval scenario. Well, the idea of applying k-NN search to text retrieval is not novel (see, again, my thesis for some references). However, I do not think that anybody has shown convincingly that this is a viable approach.

There are a lot of difficulties on the way towards this objective. First of all, it is not clear which representations of texts and queries one can use (I have somewhat explored this direction, but the problem is clearly quite hard). Ideally, we would represent everything as dense vectors, but I do not think that the cosine similarity between dense vectors is particularly effective in the domain of ad hoc text retrieval (it works better for classification, though). I am also convinced that in many cases where dense representations work well, a combination of dense and sparse bag-of-words representations works even better. Should we embrace these hybrid representations in the future, we cannot use traditional term-based inverted files directly (i.e., without doing a simpler search with subsequent re-ranking). Instead, we are likely to rely on more generic algorithms for k-nearest-neighbor (k-NN) search.

Second, instead of trying to search using a complex similarity, we can use such a similarity only for re-ranking. Of course, there are obviously limits to the re-ranking approach. However, a bag-of-words retrieval pipeline with re-ranking (possibly with some query rewriting) is a baseline that is hard to beat.
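
A sketch of this baseline pipeline is below: a cheap sparse (BM25-like) first stage followed by re-ranking of the top candidates with a more expensive score. The term-overlap scorer and the hashed "embedding" are toy stand-ins, and the hybrid sparse-plus-dense re-ranking score is just one possible choice.

    # A toy retrieve-then-re-rank pipeline: a cheap sparse (BM25-like) first stage,
    # followed by re-scoring of the top candidates with a more expensive hybrid
    # (sparse + dense) similarity. All scorers here are simplistic stand-ins.
    import re
    import numpy as np

    def tokens(text):
        return re.findall(r"\w+", text.lower())

    def sparse_score(query, doc):
        # Stand-in for BM25: plain term overlap.
        return len(set(tokens(query)) & set(tokens(doc)))

    def dense_vec(text, dim=64):
        # Stand-in for a learned embedding: hash terms into a dense vector.
        v = np.zeros(dim)
        for t in tokens(text):
            v[hash(t) % dim] += 1.0
        return v

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def retrieve_and_rerank(query, docs, top_k=100, alpha=0.7):
        # Stage 1: cheap sparse retrieval produces a short candidate list.
        candidates = sorted(docs, key=lambda d: sparse_score(query, d), reverse=True)[:top_k]
        # Stage 2: only these candidates are re-scored with the expensive similarity.
        hybrid = lambda d: alpha * sparse_score(query, d) + \
                           (1 - alpha) * cosine(dense_vec(query), dense_vec(d))
        return sorted(candidates, key=hybrid, reverse=True)

    docs = ["graph based nearest neighbor search",
            "bm25 is a strong sparse baseline",
            "dense and sparse hybrid retrieval"]
    print(retrieve_and_rerank("sparse and dense retrieval", docs, top_k=2))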

Third, k-NN search is a notoriously hard problem, which in many cases cannot be solved exactly without sequentially comparing the query with every data point (the so-called brute-force search). This is due to a well-known phenomenon called the curse of dimensionality. Often we have to resort to approximate search algorithms, but these algorithms are not necessarily accurate. How much inaccuracy is acceptable? From my experimental results I conclude that the leeway is quite small: we can trade a bit of accuracy for extra efficiency, but not too much.

Because approximate k-NN search leads to a loss in accuracy, in my opinion, it does not make sense to use it with simple similarities like BM25. Instead, we should be trying to construct a similarity that beats BM25 by a good margin and do retrieval using this fancier similarity. My conjecture is that by doing so we can be more accurate and more efficient at the same time! This is one of the central ideas of my thesis. On one collection I got promising results supporting this conjecture (which are, BTW, an improvement over our CIKM'16 results). However, more needs to be done, in particular, by comparing against potentially stronger baselines.

In conclusion, I note that this work would have been impossible without encouragement, inspiration, help, and advice of many people. Foremost, I would like to thank my advisor Eric Nyberg for his guidance, encouragement, patience, and assistance. I greatly appreciate his participation in writing a grant proposal to fund my research topic. I also thank my thesis committee: Jamie Callan, James Allan, Alex Hauptmann, and Shane Culpepper for their feedback.

I express deep and sincere gratitude to my family. I am especially thankful to my wife Anna, who made this adventure possible, and to my mother Valentina who encouraged and supported both me and Anna.

I thank my co-authors Bileg Naidan, David Novak, and Yury Malkov, each of whom greatly helped me. Bileg sparked my interest in non-metric search methods and laid the foundation of our NMSLIB library. Yury made key improvements to the graph-based search algorithms. David greatly improved the performance of pivot-based search algorithms, which allowed us to obtain the first strong results for text retrieval.

I thank Chris Dyer for the discussion of IBM Model 1; Nikita Avrelin and Alexander Ponomarenko for implementing the first version of SW-graph in NMSLIB; Yubin Kim and Hamed Zamani for the discussion of pseudo-relevance feedback techniques (Hamed also greatly helped with Galago); Chenyan Xiong for the helpful discussion on embeddings and entities; Daniel Lemire for providing the implementation of the SIMD intersection algorithm; Lawrence Cayton for providing the data sets, the bbtree code, and answering our questions; Christian Beecks for answering questions regarding the Signature Quadratic Form Distance; Giuseppe Amato and Eric S. Tellez for help with data sets; Lu Jiang for the helpful discussion of image retrieval algorithms; Vladimir Pestov for the discussion on the curse of dimensionality; Mike Denkowski for the references on BLEU-style metrics; Karina Figueroa Mora for proposing experiments with the metric VP-tree applied directly to non-metric data. I also thank Stacey Young, Jennifer Lucas, and Kelly Widmaier for their help.

I also greatly appreciate the support from the National Science Foundation, which has been funding this project for two years.



Accurate Lucene BM25: Redux

About five or six years ago, I discovered that the default Lucene BM25 similarity was giving me sub-optimal results, apparently due to a lossy encoding of document lengths (which was part of Lucene's efficiency tricks). I found this when I reimplemented BM25 on my own, but without the lossy document-length encoding. On my data, the difference was about 10%, which is far from a trifle. I ran a good number of experiments where this difference was present; it was clearly not a random fluke or mirage. I eventually created a benchmark and published a blog post. I even made some noise on the Lucene dev list and promised to submit a patch. However, this did not happen, as I got busy and Lucene changed its internal API.
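
For reference, here is a plain textbook-style BM25 scorer (a sketch, not the Lucene implementation): the document length enters the normalization term in the denominator, which is why a lossy length encoding perturbs every score.

    # Textbook-style BM25 (a plain reference sketch, not the Lucene implementation).
    import math

    def bm25(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
        doc_len = len(doc_terms)
        score = 0.0
        for t in set(query_terms):
            tf = doc_terms.count(t)
            if tf == 0 or t not in doc_freq:
                continue
            idf = math.log(1.0 + (num_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)  # <-- document length enters here
            score += idf * tf * (k1 + 1.0) / (tf + norm)
        return score

    doc = "the quick brown fox jumps over the lazy dog".split()
    print(round(bm25(["fox", "dog"], doc, {"fox": 10, "dog": 50}, num_docs=1000, avg_doc_len=8.0), 3))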

Recently I was fortunate enough to revisit this problem thanks to Chris Kamphuis, Arjen P. de Vries, and Jimmy Lin, who took me aboard their "Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants". They also did most of the work by testing several BM25 variants, of which my accurate Lucene similarity was only a small piece. Somewhat surprisingly, the "often maligned approximation of document length" in Lucene performed nearly as well as the accurate similarity. Another result is that there are only very small differences among the various BM25 implementations. I think this is an important finding, on which I reflect at the very end of the post (please read that last paragraph).

Now, there are two voices in my head: one that "maligns the approximation of the document length" and another that says this approximation is ok. How should we reconcile these voices? Because the scope and size of the paper did not permit more thorough experimentation and description, I have carried out an additional code analysis that was not included in the paper. This analysis is below.

My original experiments were run with Lucene 6 (and earlier versions). Lucene 6 does not encode a document length directly. Instead, it approximates the inverse square root of the length. Thus, it introduces an approximation error for basically every possible document length! Lucene 7 still supports the old scheme, but already introduces a new encoding of the document length, which stores small numbers (less than 25) exactly and retains the four most significant binary digits for larger numbers (see my test code). This is basically a variant of a sign-free, exponent-shifted, quarter-precision format (additionally, only the number of unique terms is counted, which reduces the value of the document length that needs to be encoded). I think that this new approximation scheme is much more accurate.
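
Here is a small sketch of the encoding idea described above (not Lucene's actual SmallFloat code): lengths up to a threshold are stored exactly, while larger lengths keep only their four most significant binary digits, which bounds the relative error.

    # A sketch of the described encoding idea (not Lucene's actual SmallFloat code):
    # small lengths are kept exactly, larger ones keep only the four most significant
    # binary digits, so the relative quantization error stays bounded.

    def encode_length(length, exact_up_to=24, kept_bits=4):
        if length <= exact_up_to:
            return length
        shift = max(0, length.bit_length() - kept_bits)
        return (length >> shift) << shift  # zero out everything but the top bits

    for n in (5, 24, 100, 1000, 12345):
        approx = encode_length(n)
        print(n, approx, "rel. error = %.3f" % (abs(n - approx) / n))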

Thus, I have to disagree a bit with the somewhat optimistic conclusion of our paper that it does not matter which BM25 implementation one uses. This seems to be true only for sufficiently careful implementations of BM25, including the recent Lucene one. However, it is clearly also possible to screw up BM25 rather easily.

In conclusion, I would like to note that results of our paper should be treated in a broader context. There is somewhat anecdotal knowledge that various papers reported different effectiveness values for BM25 similarity on identical collections. Some people (including me) tended to think it was due to differences in BM25 implementations. However, the paper by Trotman et al showed that it was likely due to confounding factors such as the choice of lemmatization/stemming, tokenization, stopping, and data cleaning algorithms: Trotman, A., Puurula, A., & Burgess, B. (2014, November). Improvements to BM25 and language models examined. Clearly, our results support the conclusions made by Trotman et al.



Bringing a large Russian QA data set to light

"It is achingly apparent that an overwhelming amount of research in speech and language technologies considers exactly one human language: English." (Kyle Gorman) For this reason Emily Bender has been famously encouraging people to (1) explicitly name languages they work on (2) do more work on non-English-data. This has become known as a Bender rule.

Despite the importance of multilingual NLP, frankly speaking, I have had few opportunities to work on non-English data (in the previous decade my only major opportunity was a stint on cross-lingual metaphor detection). I am therefore very pleased to have recently participated in bringing to light SberQuAD, a large Russian question-answering/reading-comprehension (QA) data set, which was created similarly to SQuAD.

I have been helping my co-authors Pavel Efimov and Pavel Braslavski (who did nearly all the work) to analyze and describe this data set. We have conducted a very thorough analysis and evaluated several powerful models. The full analysis is available online, but here I would like to highlight the following:

SberQuAD was created similarly to Stanford SQuAD. Yet, despite the similarities, all the models perform worse on SberQuAD than on SQuAD, which can be attributed to having only a single answer variant and fewer answers that are named entities. Many answers in SberQuAD still contain an entity, but it is normally only a part of the answer. This stands in contrast to SQuAD, where roughly half of the answers are named entities.



Universal Methods to Sort Complex Information Tested

About one year ago, Quanta Magazine published an article on a universal method to sort complex information. It is concerned with theoretical work on solving the nearest-neighbor search problem for various metric distances. More precisely, this work attempts to answer the question of which kinds of distance metrics permit efficient nearest-neighbor search. Though this is surely an important and solid theoretical work, the Quanta article completely ignores the fact that, from the practical perspective, this problem has satisfactory solutions. Not only do existing methods work for metric distances, but good results can often be obtained for weird non-metric dissimilarities (or distances, as we like to call them). Sometimes they even work when these distances are non-symmetric!

Are these methods universal? They certainly are not, but nothing is universal in nearest-neighbor search. There are challenging data sets that cannot be searched efficiently even with the Euclidean distance! This issue, as well as the history of non-metric k-NN search, is briefly surveyed in my thesis. However, in some cases we can do really well by using tree-based or neighborhood-graph-based approaches. In my thesis, I carried out a series of evaluations to verify this. I am pleased that all of the main results are now formally published, including, in particular, two recent SISAP papers:

  1. Boytsov, L., Nyberg. E., 2019. Accurate and Fast Retrieval for Complex Non-metric Data via Neighborhood Graphs.

  2. Boytsov, L., Nyberg. E., 2019. Pruning Algorithms for Low-Dimensional Non-metric k-NN Search: A Case Study.

I think these papers are concerned with important research questions, and I am going to briefly highlight their results.

Neighborhood graphs are a class of new-old methods that deliver state-of-the-art results on many data sets. However, little is known about how they behave with non-symmetric distances. We were probably the first to test them on non-symmetric distances such as the KL-divergence [1, 2]. It turns out, however, that these tests relied on data sets that were only mildly non-symmetric. In the follow-up work, we have really stress-tested them and discovered the following:

  1. It is never a good idea to deal with non-symmetric distances by first symmetrizing the distance and then using the symmetrized distance as a part of a filter-and-refine pipeline.

  2. Luckily, symmetrization is not even necessary: in many cases, neighborhood graphs deliver state-of-the-art performance out of the box.

  3. Importantly, one has to be consistent in the order of distance function arguments (although there are exceptions, as I describe below). If the indexing procedure relies on a different order (e.g., by mistake), the results can be disastrous (I made this mistake and it cost me a lot of time). A tiny illustration of the argument-order issue follows this list.

  4. That said, using a different distance function at index time can sometimes produce better results. Again, this is not a universal property. One somewhat obvious choice of a possibly better index-time distance function is a symmetrized variant of the original distance. Quite surprisingly, the argument-reversed distance can deliver good results too, but, as I explain above, the results can be disastrous for some other data sets and distances. I think this discovery begs a research question: what is the optimal index-time distance function?
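
Here is the promised tiny illustration of the argument-order issue, using the KL-divergence on random data (plain numpy, nothing NMSLIB-specific): the left-argument, right-argument, and symmetrized variants of the distance can produce different neighbor lists.

    # KL-divergence is non-symmetric, so KL(query, x) and KL(x, query) may rank
    # neighbors differently, and a symmetrized variant differs from both.
    import numpy as np
    rng = np.random.default_rng(0)

    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))

    data = rng.dirichlet(np.ones(8), size=1000)  # random points on the probability simplex
    query = rng.dirichlet(np.ones(8))

    def top3(dist_to_query):
        return sorted(range(len(data)), key=dist_to_query)[:3]

    print("by KL(query, x):", top3(lambda i: kl(query, data[i])))
    print("by KL(x, query):", top3(lambda i: kl(data[i], query)))
    print("symmetrized KL: ", top3(lambda i: kl(query, data[i]) + kl(data[i], query)))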

Although graph-based retrieval is state-of-the-art for high-dimensional data, it can be overkill for low-dimensional data, where tree-based approaches can work really well. In particular, we compare two approaches to adapting standard metric tree methods to non-metric similarities. One is an effective piecewise-linear modification of the pruning rule, which we published at NIPS in 2013. In fact, for the Euclidean distance, it is as efficient as classic projection-based LSH. However, due to the linear nature of the approximation, it is sometimes not a good fit for non-metric dissimilarities. In contrast, Tomáš Skopal's TriGen algorithm can do better in this case.

TriGen is an ingenious algorithm that finds a monotonic distance transformation making a dissimilarity look more like a metric. However, TriGen has two drawbacks: it does not work out of the box with non-symmetric distances, and its implementation of the distance-modifying transformation can be a bit expensive. What we show is that, perhaps, the best solution is a hybrid: first, we can apply a cheap concave (or nearly concave) distance transformation such as the square root; second, we can fit a piecewise-linear decision function to this transformed distance.
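
The effect of such a concave transform is easy to see on a toy example of my own (much simpler than TriGen itself): the squared Euclidean distance violates the triangle inequality, but a square-root transform removes the violations while preserving the neighbor order.

    # Counting triangle-inequality violations before and after a concave (sqrt) transform.
    # The squared Euclidean distance is non-metric; its square root (the Euclidean
    # distance) is a metric, so the violations disappear while neighbor order is kept.
    import numpy as np
    rng = np.random.default_rng(1)

    def violations(dist_fn, points, num_triples=20000):
        count = 0
        for _ in range(num_triples):
            a, b, c = points[rng.choice(len(points), 3, replace=False)]
            if dist_fn(a, c) > dist_fn(a, b) + dist_fn(b, c) + 1e-12:
                count += 1
        return count

    points = rng.normal(size=(200, 5))
    sq_l2 = lambda x, y: float(np.sum((x - y) ** 2))
    print("squared L2 violations:", violations(sq_l2, points))
    print("after sqrt transform: ", violations(lambda x, y: np.sqrt(sq_l2(x, y)), points))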

In conclusion, I want to emphasize that, although nearest-neighbor search has no universal solution, there are a number of working general-distance approaches. Some good solutions are implemented in NMSLIB, which is the first generic library for metric and non-metric k-NN search.


