Nov 2024 UPDATE: In simple terms and using modern terminology, IBM Watson was very similar to a retrieval-augmented QA system. Unlike modern systems, it was not truly generative, but rather extractive: the system would find relevant entities in retrieved passages and "aggregate" them. We can call this generation, but only with some stretch of the imagination. Moreover, unlike retrieval-augmented LLMs, which have their own "neural memory" that can be used to answer some questions without retrieval, IBM Watson relied solely on retrieval from unstructured (and, to a small degree, structured) data collections.
This is written in response to a Quora question, which asks about the internals of the IBM Watson question answering (QA) system. Feel free to vote there for my answer! Previously, I briefly compared the IBM Watson approach to that of DeepMind, albeit without going into details of how IBM Watson works. Here I fill this gap.
I am not sure anybody knows exactly what was under the hood. However, there is a series of papers published by IBM, most of which I have read end-to-end more than once. One overview paper can be found here. The list of papers can be found here; most PDFs can be easily googled :-) There is also a lengthy (but quite relevant) survey (by IBM Watson team member J. Prager) that covers some of the details of retrieval-based question answering:
Prager, John. "Open-domain question–answering." Foundations and Trends® in Information Retrieval 1.2 (2007): 91-231.
First things first: the IBM Watson team incorporated both symbolic/logical systems and classic redundancy-based retrieval QA into their system. However, there were only a few questions (about 1%) that they were able to answer by logical inference and querying of structured knowledge sources.
I would reiterate that the vast majority of questions are answered using a carefully tuned retrieval-based system, which relies heavily on the fact that Jeopardy answers are factoids: short noun phrases such as named entities (e.g., dates, names of famous persons, or city names). Hence, the QA system does not really need to answer a question by, e.g., synthesizing an answer or doing some complicated inference. Instead, it should extract a potential answer and collect enough statistical evidence that this answer is correct.
And, indeed, a retrieval-based factoid QA system finds passages lexically matching the question and extracts potential answers from these passages. It then uses a carefully tuned statistical model to figure out which candidate answers are good. This model likely does not involve any of the sophisticated reasoning that humans are capable of. That said, I still consider IBM Watson to be one of the greatest achievements in the AI field.
The fact that Jeopardy questions are long greatly helps to find the so-called candidate passages, which are likely to contain an answer. Finding these passages is based largely on the lexical overlap between the question and the answer passage. Stephen Wolfram even ran an experiment where he found that a single search engine can find candidate passages for nearly 70% of all answers.
Furthermore, Wikipedia provides good coverage of Jeopardy topics. I quote: "We conducted an experiment to evaluate the coverage of Wikipedia articles on Jeopardy! questions and found that the vast majority of Jeopardy! answers are titles of Wikipedia documents [10]. Of the roughly 5% of Jeopardy! answers that are not Wikipedia titles, some included multiple entities, each of which is a Wikipedia title, such as Red, White, and Blue, whereas others were sentences or verb phrases, such as make a scarecrow or fold an American flag." Chu-Carroll, Jennifer, et al. "Finding needles in the haystack: Search and candidate generation." IBM Journal of Research and Development 56.3.4 (2012): 6-1.
I have to say that just throwing a bag-of-words query at a search engine can be suboptimal, but the IBM Watson team wrote a bunch of complex question-rewriting procedures (in Prolog!) to ensure these queries were good. Not all candidate passages are generated in this way: I have covered another generation approach in a separate blog post.
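To make the passage-retrieval step a bit more concrete, here is a minimal Python sketch of IDF-weighted lexical-overlap scoring. This is only an illustration of the general idea: the tokenizer, the IDF formula, and the scoring below are my own simplistic choices and have nothing to do with Watson's actual (Prolog-based) query generation or its search components.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """A crude tokenizer: lowercase the text and keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def idf_weights(passages):
    """Smoothed inverse document frequencies over the passage collection."""
    n = len(passages)
    df = Counter()
    for p in passages:
        df.update(set(tokenize(p)))
    return {t: math.log(1.0 + n / df[t]) for t in df}

def retrieve_candidate_passages(question, passages, top_k=5):
    """Rank passages by IDF-weighted lexical overlap with the question."""
    idf = idf_weights(passages)
    q_tokens = set(tokenize(question))
    scored = []
    for p in passages:
        overlap = q_tokens & set(tokenize(p))
        scored.append((sum(idf[t] for t in overlap), p))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```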
After candidate passages are retrieved, IBM Watson extracts potential answers, which is not a trivial task. How does it find them? The actual model is surely rather complicated, but it largely looks for named entities and more generic noun phrases. However, not all entities/phrases are weighted equally. What affects the weights? Three things (see the sketch after the list):
- The type of the question and the type of the entity (or, rather, their compatibility score);
- The existence of additional supporting evidence;
- How frequently these entities/noun phrases appear in candidate passages.
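To give a flavor of how such signals might be combined, below is a toy sketch. The capitalized-phrase "extractor" and the hand-picked linear weights are invented for illustration; Watson extracted candidates using proper NLP machinery (parsing, NER, noun-phrase chunking) and combined many more features with models trained on data.

```python
import re
from collections import Counter

def extract_candidates(passages):
    """Toy candidate extraction: treat runs of capitalized words as potential
    answers and count their occurrences across the retrieved passages."""
    counts = Counter()
    for p in passages:
        counts.update(m.strip() for m in re.findall(r"(?:[A-Z][a-z]+\s?)+", p))
    return counts

def score_candidate(type_score, evidence_score, frequency,
                    w_type=2.0, w_evidence=1.5, w_freq=1.0):
    """Linear combination of the three signals listed above; the weights here
    are hand-picked, whereas Watson learned its weights from training data."""
    return w_type * type_score + w_evidence * evidence_score + w_freq * frequency
```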
For example, if the question is "Who is the mayor of Toronto?" we know that the answer is a person. Hence, we can downweight named entities whose type is not a person. The actual answer-typing process is surely more complicated, and there is a separate paper describing it in more detail:
Murdock, J. William, et al. "Typing candidate answers using type coercion." IBM Journal of Research and Development 56.3.4 (2012): 7-1.
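Purely to illustrate the downweighting idea (this is not the actual type-coercion component), one could imagine something like the following, where a mismatching type softly discounts a candidate rather than discarding it; the hard-coded type table is, of course, a made-up stand-in:

```python
# A hard-coded candidate-type table, purely for illustration: Watson inferred
# candidate types dynamically rather than looking them up in a fixed table.
CANDIDATE_TYPES = {
    "John Tory": "person",
    "Toronto": "city",
    "1834": "date",
}

def type_compatibility(candidate, expected_type, penalty=0.2):
    """Return 1.0 if the candidate's type matches the expected answer type,
    otherwise a soft penalty (typing is noisy, so we do not discard outright)."""
    return 1.0 if CANDIDATE_TYPES.get(candidate) == expected_type else penalty

# "Who is the mayor of Toronto?" -> the expected answer type is "person":
print(type_compatibility("John Tory", "person"))  # 1.0
print(type_compatibility("1834", "person"))       # 0.2
```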
What is important is that incorporating other types of relations (e.g., spatial or temporal) in addition to answer-question type compatibility did not seem to result in substantial improvements (though some gains were observed). See the results in Tables 1 and 2 of the paper:
Kalyanpur, Aditya, et al. "Structured data and inference in DeepQA." IBM Journal of Research and Development 56.3.4 (2012): 10-1.
Furthermore, for each candidate answer X, we can construct a query like "X is the mayor of Toronto" and find passages with good lexical overlap with this additional evidencing query. If such passages exist, they provide evidence that X is, indeed, an answer to the question.
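Here is a rough sketch of this evidencing step, with a made-up hypothesis template and a simple overlap score standing in for Watson's much more sophisticated passage-scoring analytics:

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def evidence_score(candidate, hypothesis_template, passages):
    """Plug the candidate into a declarative hypothesis (e.g., 'John Tory is
    the mayor of Toronto') and score how well the best-matching passage
    lexically overlaps with that hypothesis."""
    hypothesis = tokens(hypothesis_template.format(X=candidate))
    best = 0.0
    for p in passages:
        best = max(best, len(hypothesis & tokens(p)) / max(len(hypothesis), 1))
    return best

# Toy usage with illustrative passages:
passages = ["John Tory is the 65th mayor of Toronto.",
            "Toronto is the largest city in Canada."]
print(evidence_score("John Tory", "{X} is the mayor of Toronto", passages))
```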
There is a separate paper devoted to the evidencing process:
Murdock, J. William, et al. "Textual evidence gathering and analysis." IBM Journal of Research and Development 56.3.4 (2012): 8-1.
Last, but not least, the ranking approach (for candidate answers) takes into account the (weighted) number of occurrences. In other words, we expect true answers to appear more frequently in retrieved candidate passages. Although this assumption may seem a bit simplistic, it works well due to redundancy: there are a lot of answer passages for simple, well-known factoids. A nice paper exploring this phenomenon was written by Jimmy Lin:
Lin, Jimmy. "An exploration of the principles underlying redundancy-based factoid question answering." ACM Transactions on Information Systems (TOIS) 25.2 (2007): 6.
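To tie things together, here is a toy example of redundancy-based ranking, where each occurrence of a candidate is weighted by the retrieval score of the passage it comes from, and candidates that already appear in the question are filtered out. Again, the extractor, the scores, and the passages are invented for illustration.

```python
import re
from collections import defaultdict

def naive_extractor(text):
    """Toy candidate extraction: runs of capitalized words."""
    return [m.strip() for m in re.findall(r"(?:[A-Z][a-z]+\s?)+", text)]

def rank_candidates(scored_passages, question, extractor=naive_extractor):
    """Rank candidates by occurrence counts weighted by the retrieval score of
    the passage each occurrence comes from; drop candidates that already
    appear in the question, since they cannot be the answer."""
    counts = defaultdict(float)
    for score, passage in scored_passages:
        for candidate in extractor(passage):
            if candidate.lower() not in question.lower():
                counts[candidate] += score
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage: (retrieval score, passage) pairs with made-up scores.
scored_passages = [
    (2.3, "John Tory was elected mayor of Toronto in 2014."),
    (1.9, "Mayor John Tory spoke at Toronto city hall."),
    (0.7, "Ottawa is the capital of Canada."),
]
print(rank_candidates(scored_passages, "Who is the mayor of Toronto?"))
```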
If you find this mini-survey useful, feel free to cite it:
@misc{Boytsov_2018,
title={Demystifying IBM Watson},
url={http://searchivarius.org/blog/demystifying-ibm-watson},
author={Boytsov, Leonid},
year={2018},
month={Jun}}