We believe that text retrieval can and should benefit from generic k-NN search algorithms. To support this conjecture, we carried out a series of experiments and published a paper as well as accompanying software. A high-level summary of the paper is given in the talk, whose text we also post online (slides are available as well).

What is this all about? Why should one use k-NN search? In a classic filter-and-refine pipeline, you would usually obtain a candidate result set by filtering with TFxIDF. What if we replace TFxIDF with some expensive-to-compute but accurate similarity? Clearly, we would no longer be able to use text-based inverted files to answer queries efficiently. At the same time, a brute-force comparison of the query against every document would be terribly slow. However, we can try to answer queries using a distance-based approximate k-NN search algorithm. If such an approach is sufficiently fast, we get a practical tool to find documents that are hard or impossible to find using TFxIDF-based retrieval.
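To make the pipeline concrete, here is a minimal Python sketch of filter-and-refine and of the k-NN alternative discussed above. Everything in it (the toy vectors, the cheap_score and expensive_sim functions) is a hypothetical placeholder for illustration, not the setup from the paper; a real system would swap the brute-force loop for an approximate k-NN index.

import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((1000, 64))   # toy document vectors (think TFxIDF weights)
query = rng.random(64)          # toy query vector

def cheap_score(q, D):
    """Stand-in for TFxIDF scoring over an inverted file: a fast dot product."""
    return D @ q

def expensive_sim(q, d):
    """Stand-in for an accurate but costly similarity (e.g., a learned model)."""
    return -np.linalg.norm(q - d)

# Classic filter-and-refine: a cheap pass selects candidates,
# then the expensive similarity re-ranks them.
candidates = np.argsort(-cheap_score(query, docs))[:100]
refined = sorted(candidates, key=lambda i: -expensive_sim(query, docs[i]))[:10]

# The k-NN alternative: search directly in the expensive similarity's space.
# Brute force (below) is exactly the "terribly slow" option; an approximate
# k-NN index over this space would replace both the loop and the cheap filter.
knn = sorted(range(len(docs)), key=lambda i: -expensive_sim(query, docs[i]))[:10]

print("filter-and-refine:", refined)
print("exact k-NN       :", knn)

The point of the sketch is the structural difference: filter-and-refine never scores documents that the cheap filter discards, while a k-NN search over the expensive similarity's own space can, in principle, reach any document.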

I would not claim that we have fully achieved this objective, but we have arguably taken a solid step toward it. In fact, the phrase "Let's replace" in the title of the paper means only that we see such a replacement as an important goal.