log in | about 
 

Demystifying IBM Watson

This is written in response to a Quora question, which asks about internals of IBM Watson question answering (QA) system. Feel free to vote there for my answer! Previously I briefly compared IBM Watson approach to that of DeepMind, albeit without going into details of how IBM Watson works. Here I fill this gap.

I am not sure anybody knows exactly what was under the hood. However, there is a series of papers published by IBM most of which I read end-to-end more than once. One overview paper can be found here. The list of papers can be found here, most PDFs can be easily googled :-) There is also a lengthy (but quite relevant) survey (by an IBM Watson team member J. Prager) that covers some the details of the retrieval-based question answering:

Prager, John. "Open-domain question–answering." Foundations and Trends® in Information Retrieval 1.2 (2007): 91-231.

First things first: IBM Watson team incorporated both symbolic/logical systems and a classic redundancy-based retrieval QA into their system. However, there are only few questions (about 1%) that they were able to answer by logical inference and querying of structured knowledge sources.

I would reiterate that a vast majority of questions are answered using a carefully tuned retrieval-based system, which heavily relies on the fact that Jeopardy answers are factoids: short noun phrases such as named entities (e.g., dates, names of famous persons, or city names). Hence, the QA system does not really need to answer a question, e.g., by synthesizing an answer, or by doing some complicated inference. It should instead extract a potential answer and collect enough statistical evidence that this answer is correct.

And, indeed, a retrieval-based factoid QA system finds passages lexically matching the question and extracts potential answers from these passages. It then uses a carefully tuned statistical model to figure out which candidate answers are good. This model likely does not involve any sophisticated reasoning that humans are capable of. That said, I still consider IBM Watson as one of the greatest achievements in the AI field.

The fact that Jeopardy questions are long greatly helps to find the so-called candidate passages, which are likely to contain an answer. Finding these passages is based largely on the lexical overlap between the question and the answer passage. Stephen Wolfram even ran an experiment where he found that a single search engine can find candidate passages for nearly 70% of all answers.

Furthermore, there is a good coverage of Jeopardy topics in Wikipedia. I cite: "We conducted an experiment to evaluate the coverage of Wikipedia articles on Jeopardy! questions and found that the vast majority of Jeopardy! answers are titles of Wikipedia documents [10]. Of the roughly 5% of Jeopardy! answers that are not Wikipedia titles, some included multiple entities, each of which is a Wikipedia title, such as Red, White, and Blue, whereas others were sentences or verb phrases, such as make a scarecrow or fold an American flag." Chu-Carroll, Jennifer, et al. "Finding needles in the haystack: Search and candidate generation." IBM Journal of Research and Development 56.3.4 (2012): 6-1.

I have to say that just throwing a bag-of-words query into a search engine can be a suboptimal approach, but the IBM Watson team wrote a bunch of complex question-rewriting procedures (in Prolog!) to ensure these queries were good. Not all candidate passages are generated in this way: I have covered another generation approach in another blog post.

After candidate passages are retrieved, IBM Watson extracts potential answers, which is not a trivial task. How does it find them? The actual model is sure rather complicated, but it would largely look for named entities and more generic noun phrases. However, not all entities/phrases are weighted equally. What affects the weights? Three things:

  1. A type of the question and the type of the entity (or rather their compatibility score);
  2. Existence of additional supporting evidence;
  3. How frequently these entities/noun phrases appear in candidate passages.

For example, if the question is "Who is the mayor of Toronto?" we know that the answer is a person. Hence, we can downweigh named entities whose type is not a person. The actual answer typing processing is surely more complicated, and there is a separate paper describing it in more detail:

Murdock, J. William, et al. "Typing candidate answers using type coercion." IBM Journal of Research and Development 56.3.4 (2012): 7-1.

What is important is that incorporating other types of relations (e.g., spatial or temporal) in addition to the answer-question type compatibility did not seem to result in substantial improvements (though some gains were observed). See results in Tables 1 and 2 of the paper:

Kalyanpur, Aditya, et al. "Structured data and inference in DeepQA." IBM Journal of Research and Development 56.3.4 (2012): 10-1.

Furthermore, for each candidate entry X, we can try to construct a query like "X is a mayor of Toronto" and find matching passages with good lexical overlap with this additional evidencing query. If such passages exist, they provide evidence that X is, indeed, an answer to the question.

There is a separate paper devoted to the evidencing process:

Murdock, J. William, et al. "Textual evidence gathering and analysis." IBM Journal of Research and Development 56.3.4 (2012): 8-1.

Last, but not least, the ranking approach (for candidate answers) takes into account the (weighted) number of occurrences. In other words, we expect true answers to appear more frequently in retrieved candidate passages. Although this assumption seems to be a bit simplistic it works well due to redundancy: There are lot of answer passages for simple well-known factoids. A nice paper exploring this phenomenon was written by Jimmy Lin:

Lin, Jimmy. "An exploration of the principles underlying redundancy-based factoid question answering." ACM Transactions on Information Systems (TOIS) 25.2 (2007): 6.



Efficient and Accurate Non-Metric k-NN Search with Applications to Text Matching: We Need More k-NN Search!

The final version of my thesis "Efficient and Accurate Non-Metric k-NN Search with Applications to Text Matching" is now available online. An important by-product of my research is an efficient NMSLIB library, which I develop jointly with other folks. In a podcast with Radim Řehůřek (Gensim author) I discuss this project, its goals, and its history in detail.

Although efficiency is an important part of the thesis, it is primarily not about efficiency. Most importantly, I try to deliver the following messages:

  1. We have very flexible retrieval tools, in particular, graph-based retrieval algorithms, which can work well for a wide variety of similarity functions. In other words, we do not have to limit ourselves neither to inner-product similarities (e.g., the Euclidean distance) nor to even metric spaces.
  2. When "queries" are long, these algorithms can challenge traditional term-based inverted files. So, in the future, I expect retrieval systems to rely less on classic term-based inverted files and more on generic k-NN search algorithms (including graph-based retrieval algorithms). I think it is not a question of "IF", but rather a question of "WHEN".

Graph-based retrieval is an old new idea, which has been around for more than twenty years. This idea is beautifully simple: Build a graph where sufficiently close points are connected by edges. Such graphs come in various flavors and can be collectively called neighborhood graphs or proximity graphs. Given a neighborhood graph, nearest neighbor and other queries can be answered (mostly only approximately) by traversing the graph in a direction towards the query (and starting from, e.g., a random node). I cover the history of this idea in my thesis in more detail, but the earliest reference for this approach that I know is the seminal paper by Sunil Arya and David Mount (BTW, David Mount is co-author of the well-known ANN library).

Despite this early discovery, the practicality of graph-based retrieval in high-dimensional spaces was limited because we did not know how to construct neighborhood graphs efficiently. As it often happens in science, a number of fancy methods were proposed (while overlooking a simpler working one). Luckily, it was discovered that the graph can be constructed by iteratively building the graph and using a graph-based retrieval algorithm to find nearest neighbors for a new data point. A summit (or at least a local maximum) of this endevour is a Hierarchical Navigable Small World graph (HNSW) method, which combines efficient pruning graph-pruning heuristics, a multi-layer and multi-resolution graph topology with a bunch of efficiency tricks.

It was also known (but not well-known) that graph-based retrieval algorithms can work for generic (mostly metric) distances. So, I personally was interested in pushing these (and other) methods even further and applying them to non-metric and non-symmetric similarities. One ultimate objective was to replace or complement a standard term-based inverted file in the text retrieval scenario. Well, the idea to apply k-NN search to text retrieval is not novel (see, again, my thesis for some references). However, I do no think that anybody has shown convincingly that this is a viable approach.

On the way towards achieving this objective, there are a lot of difficulties. First of all, it is not clear which representations of text and queries one can use (I have somewhat explored this direction, but the problem is clearly quite hard). Ideally, we would represent everything as dense vectors, but I do not think that the cosine similarity between dense vectors is particularly effective in the domain of adhoc text retrieval (it works better for classification, though). I am also convinced that in many cases whenever dense representations work well, a combination of dense and sparse bag-of-words representations works even better. Should we embrace these hybrid representations in the future, we cannot use traditional term-based inverted files directly (i.e., without doing a simpler search with subsequent re-ranking). Instead, we are likely to rely on more generic algorithms for k-nearest neighbor (k-NN) search.

Second, instead of trying to search using a complex similarity, we can use such a similarity only for re-ranking. Of course, there should be obviously limits to the re-ranking approach. However, a re-ranking bag-of-words pipeline (possibly with some query rewriting) is a baseline that is hard to beat.

Third, k-NN search is a notoriously hard problem, which in many cases cannot be solved exactly without sequentially comparing the query with every data point (the so called brute-force search). This is due to a well-known phenomenon called the curse of dimensionality. Often we have to resort to using approximate search algorithms, but these algorithms are not necessarily accurate. How much inaccuracy is ok? From my experimental results I conclude that the leeway is quite small: We can trade a bit of accuracy for extra efficiency, but not too much.

Because approximate k-NN search leads to loss in accuracy, in my opinion, it does not make sense to use it with simple similarities like BM25. Instead, we should be trying to construct a similarity that beats BM25 by a good margin and do retrieval using this fancier similarity. My conjecture is that by doing so we can be more accurate and more efficient at the same time! This is one of the central ideas of my thesis. On one collection I got promising results supporting this conjecture (which is BTW an improvement of our CIKM'16 results). However, more needs to be done, in particular, by comparing against potentially stronger baselines.

In conclusion, I note that this work would have been impossible without encouragement, inspiration, help, and advice of many people. Foremost, I would like to thank my advisor Eric Nyberg for his guidance, encouragement, patience, and assistance. I greatly appreciate his participation in writing a grant proposal to fund my research topic. I also thank my thesis committee: Jamie Callan, James Allan, Alex Hauptmann, and Shane Culpepper for their feedback.

I express deep and sincere gratitude to my family. I am especially thankful to my wife Anna, who made this adventure possible, and to my mother Valentina who encouraged and supported both me and Anna.

I thank my co-authors Bileg Naidan, David Novak, and Yury Malkov each of whom greatly helped me. Bileg sparked my interest in non-metric search methods and laid the foundation of our NMSLIB library. Yury made key improvements to the graph-based search algorithms. David greatly improved performance of pivot-based search algorithms, which allowed us to obtain first strong results for text retrieval.

I thank Chris Dyer for the discussion of IBM Model 1; Nikita Avrelin and Alexander Ponomarenko for implementing the first version of SW-graph in NMSLIB; Yubin Kim and Hamed Zamani for the discussion of pseudo-relevance feedback techniques (Hamed also greatly helped with Galago); Chenyan Xiong for the helpful discussion on embeddings and entities; Daniel Lemire for providing the implementation of the SIMD intersection algorithm; Lawrence Cayton for providing the data sets, the bbtree code, and answering our questions; Christian Beecks for answering questions regarding the Signature Quadratic Form Distance; Giuseppe Amato and Eric S. Tellez for help with data sets; Lu Jiang for the helpful discussion of image retrieval algorithms; Vladimir Pestov for the discussion on the curse of dimensionality; Mike Denkowski for the references on BLUE-style metrics; Karina Figueroa Mora for proposing experiments with the metric VP-tree applied directly to non-metric data. I also thank Stacey Young, Jennifer Lucas, and Kelly Widmaier for their help.

I also greatly appreciate the support from the National Science Foundation, which has been funding this project for two years.



On the worthiness of PhD

This is written in response to a Quora question, which asks whether a PhD was worth it. Feel free to vote there for my answer on Quora!

A bit of a background: after about 10 years working on various database, infrastructure, and web projects, I decided to get more research experience. You can consider it a mid-career change with a caveat that I did not completely change the field (software development and computer science), but rather transitioned to work on more exciting problems.

I do not think it was worthwhile financially, but I have never thought it would be. All in all, I think I might eventually break even. Was it worthwhile otherwise? Have I achieved my goals?

First of all, I started working on problems which I previously had little chance to work on. Before joining the program, I was working a bit on information retrieval (IR) applications and infrastructure. If you are generic web/database developer, you will likely be pigeonholed into one of these exciting positions for the rest of your life. Now, I have projects in speech recognition, NLP, and, IR. In particular, I believe this work can improve doctors lives and prevent their burnout.

Second, I believe my PhD studies was a beginning of a mind expansion journey that I do not intended to finish (till death do us part).

Third, because I worked hard to get a degree from a recognized institution, I get quite a bit of attention from recruiters.

Have I achieved all my goals? The answer is no and it is still a work in progress. I have become an applied scientist, but I am still quite interested in working on more fundamental problems.

I am generally satisfied with my PhD studies. I believe it did open some new doors. However, consider the following:

  1. I am in the booming field of computer science. Furthermore, I am in the booming sub-field of speech and language processing.
  2. Although I have not become famous, two of the research libraries I co-authored have become reasonably well-known and my papers get some citations.
  3. I have obtained my PhD from a recognized institution a bit faster than my department average (by design a US PhD is supposed to take about six years). Doing so was not a walk in the park! I know people who got stuck for 10 years.
  4. That my original background was applied math and software engineering (combined with substantial real-world experience) was certainly quite helpful in achieving this goal.
  5. In that, I am still not quite sure what I am going to do in the near future.

Given an about 50% failure rate on a way to get a PhD degree, the potential sleep deprivation, the burnout, loss of interest to research, and other possibly unlucky circumstances, I can easily imagine how getting a PhD can be an extremely frustrating experience in terms of morale, finance, and health.



Robert Mercer's contribution to the development of machine translation technologies

This is written in response to a Quora question, which asks about Robert Mercer's contribution to the development of machine translation technologies. Feel free to vote there for my answer on Quora!

Robert Mercer (Peter Brown and a few other folks) played a pivotal and crucial role in the creation of the first modern translation models. They were able to create the first modern large scale noisy-channel translation system and publish the first paper on the subject. They created a series of IBM Model X models and spearheaded a new research direction (which is huge nowadays).

Recently Robert achieved an ACL lifetime achievement award for his pioneering work on machine translation. He was recently interviewed on the topic and there is a nice transcript of the story that uncovers a lot of historical details: Twenty Years of Bitext.



How do we make the architecture more efficient for machine learning systems, such as TensorFlow, without just adding more CPUs, GPUs, or ASCIs?

This is written in response to a Quora question, which asks about improving the efficiency of machine learning models without increasing hardware capacity. Feel free to vote there for my answer on Quora!

Efficiency in machine learning in general and deep learning in particular is a huge topic. Depending on what is the goal, different tricks can be applied.

  1. If the model is too large, or you have an ensemble, you can train a much smaller student model that mimics behavior of a large model. You can train to predict directly the probability distribution (for classification). The classic paper: "Distilling the Knowledge in a Neural Network" by Hinton et al., 2015.

  2. Use a simpler model and/or smaller model, which parallelizes well. For example, one reason transformer neural models are effective is that they are easier/faster to train compared to LSTMs.

  3. If the model does not fit into memory, you can train it using mixed precision: "Mixed precision training" by Narang et al 2018.

  4. Another trick, which comes at the expense of run-time, consists in discarding some of the tensors during training and recomputing them when necessary: "Low-Memory Neural Network Training: A Technical Report" Sohoni et al, 2019. There is a Google library for this: "Introducing GPipe, an Open Source Library for Efficiently Training Large-scale Neural Network Models."

  5. There is a tons of work on quantization (see, e.g., Fixed Point Quantization of Deep Convolutional Networks" by Lin et al 2016) and pruning of neural networks ("The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks" by Frankle and Carbin.) I do not remember a reference, but it is possible to train quantized models directly so that they use less memory.



Pages

Subscribe to RSS - blogs