
A classic factoid question answering system looks for facts about simple entities: names, locations, dates, and so on. One common misconception is that such facts are always retrieved from a knowledge base. Perhaps Google, which cares a lot about the precision of its factoid answers, works exactly this way. However, for a less precise system aiming for broader coverage, a direct answer lookup is not sufficient: despite being very precise, it can be used to answer only a handful of questions.

More commonly, factoids are extracted using a simple information retrieval-based algorithm. To this end, a question is superficially transformed and subsequently fed as a query to an information retrieval engine. The result of such a query is a set of passages or documents that match the query well. We then extract entities (e.g., person or location names) from text snippets or document titles and finally rank them. One of the most useful ranking features is the answer type. For example, the answer to the question "What is the capital of the USA?" is a location, so we can ignore (or downweigh) entities that are not locations. Another common feature is popularity, i.e., how often an entity appears in the search results: the true answer tends to co-occur with query words more frequently than unrelated entities do.
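The ranking step above can be sketched in a few lines. This is a deliberately simplified illustration (the function and the toy passage data are invented here, not taken from any actual system): candidates of the wrong answer type are discarded, and the rest are ranked by how often they occur in the retrieved text.

```python
from collections import Counter

def rank_candidates(entities, expected_type):
    """entities: list of (name, type) pairs extracted from retrieved passages.

    Filters out entities whose type does not match the expected answer
    type, then ranks the survivors by popularity (occurrence count).
    """
    counts = Counter(name for name, _ in entities)
    typed = {name for name, etype in entities if etype == expected_type}
    return sorted(typed, key=lambda name: counts[name], reverse=True)

# Toy extractions for "What is the capital of the USA?" (expected type: location)
extracted = [
    ("Washington", "location"), ("USA", "location"),
    ("Washington", "location"), ("George Washington", "person"),
    ("Congress", "organization"), ("Washington", "location"),
]
print(rank_candidates(extracted, "location"))  # "Washington" ranks first
```

A real system would downweigh rather than drop type mismatches, since answer-type detection is itself noisy, but the filter-then-count structure is the same.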

The retrieval-based approach works pretty well unless there is a substantial mismatch between question words and the words that appear in the text. Query reformulation and expansion may sometimes help to overcome this vocabulary mismatch. Yet, this is not always possible, because adding synonyms improves recall at the expense of precision. Consider the following Jeopardy! question: "Unlike most sea animals, in the Sea Horse this pair of sense organs can move independently of one another" (answer: eyes). As argued in the paper "Finding needles in the haystack: Search and candidate generation", it is quite hard to find the answer using standard retrieval techniques, because the question mentions an obscure fact. It is much easier to lift the veil of obscurity by checking knowledge base entries related to "sense organs". The IBM Watson creators call this type of search PRISMATIC search (after the PRISMATIC knowledge base).
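To make the contrast with passage retrieval concrete, here is a minimal sketch of what a structured lookup of this kind might look like. The frames and field names below are invented for illustration only; the real PRISMATIC knowledge base stores syntactic frames (e.g., subject-verb-object tuples) automatically mined from a large corpus.

```python
# Toy "knowledge base" of relation frames (entirely made up for this example).
frames = [
    {"subject": "eye", "verb": "be", "object": "sense organ"},
    {"subject": "chameleon eye", "verb": "move", "object": "independently"},
    {"subject": "ear", "verb": "be", "object": "sense organ"},
]

def candidates_for(relation_object):
    # Instead of retrieving passages, pull the subjects of frames whose
    # object slot matches a key concept from the question.
    return [f["subject"] for f in frames if f["object"] == relation_object]

print(candidates_for("sense organ"))  # -> ['eye', 'ear']
```

The point is that the lookup keys on a structured relation ("X is a sense organ") rather than on bag-of-words overlap, so an obscure fact about sea horses never needs to appear verbatim in any retrieved passage.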

Further, they say that the PRISMATIC search is both precise and has good coverage:
"On a test set of 3,508 Jeopardy! questions, the PRISMATIC candidate generator produces answers for 42.6% of them (1,495 questions). ... In addition, the PRISMATIC candidate generator produces far fewer wrong candidate answers than other candidate generation techniques. On average, 1 out of 57 PRISMATIC candidates is the correct answer to a given question, compared with 1 out of 134 candidates from the rest of the candidate generation pipeline."

One may interpret the previous statement as follows: the binary recall of this answer-generation technique is close to 50%. That is, for roughly half of the questions, the PRISMATIC search generates a short list of answers, one of which is correct! If this were true, it would be great news for all QA folks. However, in the above-mentioned paper "Finding needles in the haystack: Search and candidate generation", the creators of Watson present a more systematic evaluation of the PRISMATIC search. They find that the related binary recall (see Table 1) is only 8%. In other words, this strategy is potentially useful for fewer than one question in ten.

It is a bit unclear how to reconcile these two statements. One good suggestion came from Petr Baudis, who proposed the following explanation: when the authors say that the PRISMATIC candidate generator produces answers for 42.6% of all questions, this does not necessarily mean "42.6% of questions for which at least one correct answer appears on the candidate list". It may simply mean that in 42.6% of all cases PRISMATIC produces some answers, but, as we learn from the other paper, these answers are mostly incorrect. A binary recall of 8% is still substantially better than the binary recall of the answer lookup (which is 3.5%, see Table 1), but much lower than that of the retrieval-based approach (which is almost 80%).
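Under this reading, the two figures are easy to reconcile numerically. Assuming the 42.6% and 8% figures refer to comparable question sets (an assumption, since they come from different reports), conditioning on PRISMATIC firing at all gives:

```python
produces_answers = 0.426  # fraction of questions with a non-empty candidate list
binary_recall = 0.08      # fraction of questions whose list contains the answer

# Given that PRISMATIC produces candidates at all, how often is the
# correct answer actually among them?
hit_rate_given_fired = binary_recall / produces_answers
print(f"{hit_rate_given_fired:.1%}")  # roughly 19%
```

So even when PRISMATIC returns something, only about one list in five contains the correct answer, which is entirely consistent with it also producing "far fewer wrong candidates" per question than other generators.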

To conclude, the PRISMATIC search appears to be a good complementary strategy that fits somewhere between the classic retrieval-based approach and the exact answer lookup. Yet, it is not a replacement for classic information retrieval.