log in | about 
 

The first search algorithm based on user behavior was invented more than 60 years ago

The first search algorithm based on user behavior was invented more than 60 years ago. I learned this from a seminal paper authored by Yehoshua Bar-Hillel. Bar-Hillel was an influential logician, philosopher, and linguist. He was also known as a proponent of the categorial grammar. Being a logician, Bar-Hillel was very critical of statistical and insufficiently rigorous methods. So, he wrote opinionatedly:

A colleague of mine, a well-known expert in information theory, proposed recently, as a useful tool for literature search, the compiling of pair-lists of documents that are requested together by users of libraries. He even suggested, if I understood him rightly, that the frequency of such co-requests might conceivably serve as an indicator of the degree of relatedness of the topics treated in these documents. I believe that this proposal should be treated with the greatest reserve.

On one hand, Bar-Hillel was very critical. On the other hand, he was also politic and cited his friend invention anonymously. This left us wondering: Who was that prominent information theorist?



Not all date extractors are born equal: on using the right extractor in Stanford NLP toolkit

If you use a Stanford NLP toolkit, how do you extract dates? One may be tempted to directly use the statistical named entity recognizer, included in the toolkit. A demo of this NER is provided online. One immediate catch here is that there are several pre-trained statistical models. The demo code published online is using a 3-class model, which doesn't include dates! One should be careful enough to use the model english.muc.7class.distsim.crf.ser.gz.

The 7-class Muc-trained model is working ok, but there are a couple of issues. First of all, it often fails to detect a complete date. Go to the Stanford NER demo page, select the model english.muc.7class.distsim.crf.ser.gz and enter the text "Barack Hussein Obama was born on 4 August 1961.". The output would be like this:

Barack Hussein Obama was born on 4 August 1961.

Potential tags:
  LOCATION
  TIME
  PERSON
  ORGANIZATION
  MONEY
  PERCENT
  DATE

As you can see, the month and the year were tagged, but not the date of the month. BTW, not all of the Barack Obama's name was tagged either. Surely, I used a bit non-standard format of the date, but this format occurs frequently on the Web. Another issue is that the statistical tagger does not support date standardization. For example, given the dates August 1961 and 4 August 1961, the statistical NER cannot provide standardized date representations such as 1961-08 and 1961-08-04, which are easy to process and compare.

How big is the deal? My evidence is mostly anecdotal as I do not have a large enough sample to obtain reliable results. Yet, in one of our custom question answering pipeline, I gained about 20% in accuracy by using a rule-based Stanford Temporal Tagger (SUTime), instead of the statistical NER.

Interestingly, the SUTime is enabled automatically with the StanfordCoreNLP pipeline by including the NER annotator. The catch, again, is that it is not included when you use the statistical NER directly. Not only the SUTime has better recall and precision, but it also returns dates in the normalized form. An example of using the SUTime is provided by Stanford folks.



On teaching kids to program

What is a good framework to teach kids programming? For several years, I and my son were playing with Scratch. We have been making a slow progress and finally reached a point of understanding basic concepts. Scratch is a great environment, which hides a lot of implementation details from you. You have a few basic graphic primitives (the stage and the sprites) and a bunch of functions to manipulate them. Coding is easy because you create scripts by drag-and-dropping basic programming primitives: loops, conditionals, etc... Overall, Scratch is a powerful framework that allows one to create simple games/programs quickly (implementation speed is important here!).

The Scratch paradigm is not exactly event-driven programming. Yet, it involves parallel processing. For example, if one sprite (a hero) bumps into another sprite (a villain), it is easy to detect and process this event. However, there will be one script that controls the behavior of the hero and another script that controls the behavior of the villain. There are "naturally" occurring events (e.g, one sprite touches another) and programmatically initiated messages (broadcasts).

If you want to start with Scratch, there is a good project book. Beware, though, that there seems to be bugs in the program descriptions: be prepared to debug and fix them.

One big problem with the Scratch is that it is all parallel. As many of you know full well, parallelism entails race conditions. This is truly bad news, because even full-grown bearded software programmers have troubles with race conditions. How can we expect kids to tackle them easily? This is why it may be easier to start with a (mostly) single-thread programming.

There is a good a review of platforms/languages that can be taught after or instead of the Scratch (via Greg Linden). One option it suggests is JavaScript. After pondering for a while, I think this option is best. JavaScript is not terribly hard, there are exceptional interactive tutorials, and it is a production language, which is used in Web-development.

The latter reason seems to be quite appealing. Teens like to do cool things. And what can be cooler than an interactive HTML page that you create yourself?

UPDATE: interestingly, it seems that we will be studying Python instead. In that, it seems that the Code Academy is a good place to start.



On an inconsistency in IBM Watson papers

A classic factoid question answering system is looking for facts about simple entities, such as names, locations, dates, etc... One common misconception is that such facts are always retrieved from a knowledge base. Perhaps, Google, which cares a lot about precision of its factoid answers, works exactly this way. However, for a less precise system aiming for a broader coverage, a direct answer lookup is not sufficient. Despite being very precise, the answer lookup can be used to answer only a handful of questions.

More commonly, factoids are extracted using a simple information retrieval-based algorithm. To this end, a question is superficially transformed and is subsequently fed as a query to an information retrieval engine. The result of such a query is a set of passages or documents which match the query well. Then, we extract entities (e.g., person or location names) from text snippets or document titles and finally rank them. One of the most useful ranking features is an answer type. For example, an answer to the question "What is the capital of the USA" is a location, so that we can ignore (or downweigh) entities that are not locations. Another common feature is a popularity (i.e., how often an entity appears in search results). The true answer tends to co-locate with query words, more frequently than unrelated entities.

The retrieval-based approach works pretty well, unless there is a substantial mismatch between question words and words that appear in the text. Query reformulation and expansion may sometime help to overcome the vocabulary mismatch. Yet, this is not always possible, because adding synonyms improves recall at the expense of precision. Consider the following Jeopardy! question: "Unlike most sea animals, in the Sea Horse this pair of sense organs can move independently of one another" (answer eyes). As argued in the paper "Finding needles in the haystack: Search and candidate generation", it is quite hard to find an answer using standard retrieval techniques, because the question mentions an obscure fact. In that, it is much easier to lift the veil of obscurity by checking knowledge base entries related to "sense organs". The IBM Watson creators call this type of search as the PRISMATIC search (named after the knowledge base PRISMATIC).

Further, they say that the PRISMATIC search is both precise and has a good coverage:
"On a test set of 3,508 Jeopardy! questions, the PRISMATIC candidate generator produces answers for 42.6% of them (1,495 questions). ... In addition, the PRISMATIC candidate generator produces far fewer wrong candidate answers than other candidate generation techniques. On average, 1 out of 57 PRISMATIC candidates is the correct answer to a given question, compared with 1 out of 134 candidates from the rest of the candidate generation pipeline."

One may interpret the previous statement as follows: The binary recall of this answer-generation technique is close to 50%. That is, for roughly half of the questions, the PRISMATIC search generates a short list of answers, one of which is the correct one! If this were true, it would have been great news for all QA folks. However, in another paper "Finding needles in the haystack: Search and candidate generation" creators of Watson present a more systematic evaluation of the PRISMATIC search. They find that a related binary recall (see Table 1) is only 8%. In other words, this strategy is potentially useful only for about one question out of ten.

It is a bit unclear as to how reconcile both statements. One good suggestion came from Petr Baudis. Petr proposed the following explanation: When authors say that the PRISMATIC candidate generator produces answers for 42.6% of all questions, this does not necessarily mean "42.6% of questions for which we get at least one correct answer on the candidate list". It may simply mean that in 42.6% of all cases, the PRISMATIC produces some answers, but, as we learn from another paper, they are mostly incorrect. The binary recall of 8% is still substantially better than the binary recall of the answer lookup (which is 3.5%, see Table 1), but much lower than that of the retrieval-based approach (which is almost 80%).

To conclude, the PRISMATIC search appears to be a good complementary strategy that fits somewhere between the classic retrieval-based approach and the exact answer lookup. Yet, it is not a replacement for the classic information retrieval-based approach.



What is a non-metric space library?

A Non-Metric Space Library (NMSLIB) is an ongoing effort to build an effective toolkit for searching in generic metric and non-metric spaces. It originated as a personal project of my co-author Bileg (Bilegsaikhan Naidan), who is also a major contributor. In the spring of 2013, we needed to evaluate similarity search methods for non-metric spaces. Yet, no good software suite was available (nor was there a collection of benchmarks). This is why we created one by expanding Bileg's codebase and making it public.

There are important contributions from other people as well. In particular, I would like to mention Yury Malkov and David Novak. Yury has contributed a blazingly fast variant of the proximity/neighborhood graph . David has proposed a way to greatly improve the quality of pivot-based projections for sparse vector spaces.

We presented our main experimental results at NIPS'13 & VLDB'15 and our engineering work at SISAP'13. Some of the experimental results are available in our VLDB'15 report. We also participated in a public evaluation, codenamed ann-benchmarks in 2015 and 2016. Both times our graph-based implementations were quite competitive against state-of-the-art methods designed specifically for the Euclidean and the angular distance: random-projection KD-trees (FLANN and Annoy) and LSH (MIT's FALCONN). Please, check our May 2016 results in the end of the post. It may very well be new state-of-the-art in kNN search.

While our methods are fast in the well-studied Euclidean space equipped with scalar product, they also generalize well to non-Euclidean, non-metric, and non-vector spaces. For example, check our VLDB'15 evaluation. Fast search times often come at the expense of longer indexing times. From this perspective, LSH and random projection methods do outperform NMSLIB. However, the two approaches are not mutually exclusive. For example, it may be possible to rely on recent major advances in LSH (the efficient FALCONN library) to build proximity graphs in the Euclidean space faster.

From the engineering perspective, our implementation is mature. Nearly all methods can run in a multi-threaded mode. We have Python bindings and a multi-threaded query server. In particular, there is a native Java client. In general, we steadily drift in the direction of a production-friendly library. However, our main focus is on the ease of experimentation and comparison to the existing baselines. Here, our work is far from being finished. In particular we now attempt to apply our toolkit to NLP problems. I summarized some thoughts on this topic in a talk at ML lunch. Currently, we work on replacing a term-based retrieval in the ad hoc search task. Some preliminary results will appear in Leo's thesis proposal.

Appendix: our 2016 ann-benchmark results. It is our own re-run using only the best methods on the c4.2xlarge Amazon EC2 instance. The results are very similar to the results obtained by Eric in May-June 2016, but there are some minor differences due to, e.g., selecting a different random subset of queries:

1.19M 100d GloVe, cosine similarity.
1M 128d SIFT features, Euclidean distance:


Pages

Subscribe to RSS - blogs