









Submitted by srchvrs on Sat, 06/16/2018  18:57
This is written in response to a Quora question, which asks about internals of IBM Watson question answering (QA) system. Feel free to vote there for my answer! Previously I briefly compared IBM Watson approach to that of DeepMind, albeit without going into details of how IBM Watson works. Here I fill this gap.
I am not sure anybody knows exactly what was under the hood. However, there is a series of papers published by IBM most of which I read endtoend more than once. One overview paper can be found here. The list of papers can be found here, most PDFs can be easily googled :) There is also a lengthy (but quite relevant) survey (by an IBM Watson team member J. Prager) that covers some the details of the retrievalbased question answering:
Prager, John. "Opendomain question–answering." Foundations and Trends® in Information Retrieval 1.2 (2007): 91231.
First things first: IBM Watson team incorporated both symbolic/logical systems and a classic redundancybased retrieval QA into their system. However, there are only few questions (about 1%) that they were able to answer by logical inference and querying of structured knowledge sources.
I would reiterate that a vast majority of questions are answered using a carefully tuned retrievalbased system, which heavily relies on the fact that Jeopardy answers are factoids: short noun phrases such as named entities (e.g., dates, names of famous persons, or city names). Hence, the QA system does not really need to answer a question, e.g., by synthesizing an answer, or by doing some complicated inference. It should instead extract a potential answer and collect enough statistical evidence that this answer is correct.
And, indeed, a retrievalbased factoid QA system finds passages lexically matching the question and extracts potential answers from these passages. It then uses a carefully tuned statistical model to figure out which candidate answers are good. This model likely does not involve any sophisticated reasoning that humans are capable of. That said, I still consider IBM Watson as one of the greatest achievements in the AI field.
The fact that Jeopardy questions are long greatly helps to find the socalled candidate passages, which are likely to contain an answer. Finding these passages is based largely on the lexical overlap between the question and the answer passage. Stephen Wolfram even ran an experiment where he found that a single search engine can find candidate passages for nearly 70% of all answers.
I have to say that just throwing a bagofwords query into a search engine can be a suboptimal approach, but the IBM Watson team wrote a bunch of complex questionrewriting procedures (in Prolog!) to ensure these queries were good. Not all candidate passages are generated in this way: I have covered another generation approach in my blog post.
After candidate passages are retrieved, IBM Watson extracts potential answers, which is not a trivial task. How does it find them? The actual model is sure rather complicated, but it would largely look for named entities and more generic noun phrases. However, not all entities/phrases are weighted equally. What affects the weights? Three things:
 A type of the question and the type of the entity (or rather their compatibility score);
 Existence of additional supporting evidence;
 How frequently these entities/noun phrases appear in candidate passages.
For example, if the question is "Who is the mayor of Toronto?" we know that the answer is a person. Hence, we can downweigh named entities whose type is not a person. The actual answer typing processing is surely more complicated, and there is a separate paper describing it in more detail:
Murdock, J. William, et al. "Typing candidate answers using type coercion." IBM Journal of Research and Development 56.3.4 (2012): 71.
What is important is that incorporating other types of relations (e.g., spatial or temporal) in addition to the answerquestion type compatibility did not seem to result in substantial improvements (though some gains were observed). See results in Tables 1 and 2 of the paper:
Kalyanpur, Aditya, et al. "Structured data and inference in DeepQA." IBM Journal of Research and Development 56.3.4 (2012): 101.
Furthermore, for each candidate entry X, we can try to construct a query like "X is a mayor of Toronto" and find matching passages with good lexical overlap with this additional evidencing query. If such passages exist, they provide evidence that X is, indeed, an answer to the question.
There is a separate paper devoted to the evidencing process:
Murdock, J. William, et al. "Textual evidence gathering and analysis." IBM Journal of Research and Development 56.3.4 (2012): 81.
Last, but not least, the ranking approach (for candidate answers) takes into account the (weighted) number of occurrences. In other words, we expect true answers to appear more frequently in retrieved candidate passages. Although this assumption seems to be a bit simplistic it works well due to redundancy: There are lot of answer passages for simple wellknown factoids. A nice paper exploring this phenomenon was written by Jimmy Lin:
Lin, Jimmy. "An exploration of the principles underlying redundancybased factoid question answering." ACM Transactions on Information Systems (TOIS) 25.2 (2007): 6.
Submitted by srchvrs on Tue, 05/15/2018  19:32
The final version of my thesis "Efficient and Accurate NonMetric kNN Search with Applications to Text Matching" is now available online. An important byproduct of my research is an efficient NMSLIB library, which I develop jointly with other folks. In a podcast with Radim Řehůřek (Gensim author) I discuss this project, its goals, and its history in detail.
Although efficiency is an important part of the thesis, it is primarily not about efficiency. Most importantly, I try to deliver the following messages:
 We have very flexible retrieval tools, in particular, graphbased retrieval algorithms, which can work well for a wide variety of similarity functions. In other words, we do not have to limit ourselves neither to innerproduct similarities (e.g., the Euclidean distance) nor to even metric spaces.
 When "queries" are long, these algorithms can challenge traditional termbased inverted files. So, in the future, I expect retrieval systems to rely less on classic termbased inverted files and more on generic kNN search algorithms (including graphbased retrieval algorithms). I think it is not a question of "IF", but rather a question of "WHEN".
Graphbased retrieval is an old new idea, which has been around for more than twenty years. This idea is beautifully simple: Build a graph where sufficiently close points are connected by edges. Such graphs come in various flavors and can be collectively called neighborhood graphs or proximity graphs. Given a neighborhood graph, nearest neighbor and other queries can be answered (mostly only approximately) by traversing the graph in a direction towards the query (and starting from, e.g., a random node). I cover the history of this idea in my thesis in more detail, but the earliest reference for this approach that I know is the seminal paper by Sunil Arya and David Mount (BTW, David Mount is coauthor of the wellknown ANN library).
Despite this early discovery, the practicality of graphbased retrieval in highdimensional spaces was limited because we did not know how to construct neighborhood graphs efficiently. As it often happens in science, a number of fancy methods were proposed (while overlooking a simpler working one). Luckily, it was discovered that the graph can be constructed by iteratively building the graph and using a graphbased retrieval algorithm to find nearest neighbors for a new data point. A summit (or at least a local maximum) of this endevour is a Hierarchical Navigable Small World graph (HNSW) method, which combines efficient pruning graphpruning heuristics, a multilayer and multiresolution graph topology with a bunch of efficiency tricks.
It was also known (but not wellknown) that graphbased retrieval algorithms can work for generic (mostly metric) distances. So, I personally was interested in pushing these (and other) methods even further and applying them to nonmetric and nonsymmetric similarities. One ultimate objective was to replace or complement a standard termbased inverted file in the text retrieval scenario. Well, the idea to apply kNN search to text retrieval is not novel (see, again, my thesis for some references). However, I do no think that anybody has shown convincingly that this is a viable approach.
On the way towards achieving this objective, there are a lot of difficulties. First of all, it is not clear which representations of text and queries one can use (I have somewhat explored this direction, but the problem is clearly quite hard). Ideally, we would represent everything as dense vectors, but I do not think that the cosine similarity between dense vectors is particularly effective in the domain of adhoc text retrieval (it works better for classification, though). I am also convinced that in many cases whenever dense representations work well, a combination of dense and sparse bagofwords representations works even better. Should we embrace these hybrid representations in the future, we cannot use traditional termbased inverted files directly (i.e., without doing a simpler search with subsequent reranking). Instead, we are likely to rely on more generic algorithms for knearest neighbor (kNN) search.
Second, instead of trying to search using a complex similarity, we can use such a similarity only for reranking. Of course, there should be obviously limits to the reranking approach. However, a reranking bagofwords pipeline (possibly with some query rewriting) is a baseline that is hard to beat.
Third, kNN search is a notoriously hard problem, which in many cases cannot be solved exactly without sequentially comparing the query with every data point (the so called bruteforce search). This is due to a wellknown phenomenon called the curse of dimensionality. Often we have to resort to using approximate search algorithms, but these algorithms are not necessarily accurate. How much inaccuracy is ok? From my experimental results I conclude that the leeway is quite small: We can trade a bit of accuracy for extra efficiency, but not too much.
Because approximate kNN search leads to loss in accuracy, in my opinion, it does not make sense to use it with simple similarities like BM25. Instead, we should be trying to construct a similarity that beats BM25 by a good margin and do retrieval using this fancier similarity. My conjecture is that by doing so we can be more accurate and more efficient at the same time! This is one of the central ideas of my thesis. On one collection I got promising results supporting this conjecture (which is BTW an improvement of our CIKM'16 results). However, more needs to be done, in particular, by comparing against potentially stronger baselines.
In conclusion, I note that this work would have been impossible without encouragement, inspiration, help, and advice of many people. Foremost, I would like to thank my advisor Eric Nyberg for his guidance, encouragement, patience, and assistance. I greatly appreciate his participation in writing a grant proposal to fund my research topic. I also thank my thesis committee: Jamie Callan, James Allan, Alex Hauptmann, and Shane Culpepper for their feedback.
I express deep and sincere gratitude to my family. I am especially thankful to my wife Anna, who made this adventure possible, and to my mother Valentina who encouraged and supported both me and Anna.
I thank my coauthors Bileg Naidan, David Novak, and Yury Malkov each of whom greatly helped me. Bileg sparked my interest in nonmetric search methods and laid the foundation of our NMSLIB library. Yury made key improvements to the graphbased search algorithms. David greatly improved performance of pivotbased search algorithms, which allowed us to obtain first strong results for text retrieval.
I thank Chris Dyer for the discussion of IBM Model 1; Nikita Avrelin and Alexander Ponomarenko for implementing the first version of SWgraph in NMSLIB; Yubin Kim and Hamed Zamani for the discussion of pseudorelevance feedback techniques (Hamed also greatly helped with Galago); Chenyan Xiong for the helpful discussion on embeddings and entities; Daniel Lemire for providing the implementation of the SIMD intersection algorithm; Lawrence Cayton for providing the data sets, the bbtree code, and answering our questions; Christian Beecks for answering questions regarding the Signature Quadratic Form Distance; Giuseppe Amato and Eric S. Tellez for help with data sets; Lu Jiang for the helpful discussion of image retrieval algorithms; Vladimir Pestov for the discussion on the curse of dimensionality; Mike Denkowski for the references on BLUEstyle metrics; Karina Figueroa Mora for proposing experiments with the metric VPtree applied directly to nonmetric data. I also thank Stacey Young, Jennifer Lucas, and Kelly Widmaier for their help.
I also greatly appreciate the support from the National Science Foundation, which has been funding this project for two years.
Submitted by srchvrs on Sun, 07/13/2014  18:59
A NonMetric Space Library (NMSLIB) is an ongoing effort to build an effective toolkit for searching in generic metric and nonmetric spaces. It originated as a personal project of my coauthor Bileg (Bilegsaikhan Naidan), who is also a major contributor. In the spring of 2013, we needed to evaluate similarity search methods for nonmetric spaces. Yet, no good software suite was available (nor was there a collection of benchmarks). This is why we created one by expanding Bileg's codebase and making it public.
There are important contributions from other people as well. In particular, I would like to mention Yury Malkov and David Novak. Yury has contributed a blazingly fast variant of the proximity/neighborhood graph . David has proposed a way to greatly improve the quality of pivotbased projections for sparse vector spaces.
We presented our main experimental results at NIPS'13 & VLDB'15 and our engineering work at SISAP'13. Some of the experimental results are available in our VLDB'15 report. We also participated in a public evaluation, codenamed annbenchmarks in 2015 and 2016. Both times our graphbased implementations were quite competitive against stateoftheart methods designed specifically for the Euclidean and the angular distance: randomprojection KDtrees (FLANN and Annoy) and LSH (MIT's FALCONN). Please, check our May 2016 results in the end of the post. It may very well be new stateoftheart in kNN search.
While our methods are fast in the wellstudied Euclidean space equipped with scalar product, they also generalize well to nonEuclidean, nonmetric, and nonvector spaces. For example, check our VLDB'15 evaluation. Fast search times often come at the expense of longer indexing times. From this perspective, LSH and random projection methods do outperform NMSLIB. However, the two approaches are not mutually exclusive. For example, it may be possible to rely on recent major advances in LSH (the efficient FALCONN library) to build proximity graphs in the Euclidean space faster.
From the engineering perspective, our implementation is mature. Nearly all methods can run in a multithreaded mode. We have Python bindings and a multithreaded query server. In particular, there is a native Java client. In general, we steadily drift in the direction of a productionfriendly library. However, our main focus is on the ease of experimentation and comparison to the existing baselines. Here, our work is far from being finished. In particular we now attempt to apply our toolkit to NLP problems. I summarized some thoughts on this topic in a talk at ML lunch. Currently, we work on replacing a termbased retrieval in the ad hoc search task. Some preliminary results will appear in Leo's thesis proposal.
Appendix: our 2016 annbenchmark results. It is our own rerun using only the best methods on the c4.2xlarge Amazon EC2 instance. The results are very similar to the results obtained by Eric in MayJune 2016, but there are some minor differences due to, e.g., selecting a different random subset of queries:
1.19M 100d GloVe, cosine similarity.

1M 128d SIFT features, Euclidean distance:

Submitted by srchvrs on Sun, 05/27/2018  23:21
Last week I shared my time between work and hacking at a forth machine translation marathon in the Americas. This event organized jointly by CMU and Amazon (sponsored by Amazon) was a lot of fun. My small subteam of two people got familiar with OpenNMT, trained EnglishGerman and EnglishUkrainian models, as well as implemented an idea of our team lead. Hey, we have even gotten a tiny 0.5 gain in BLEU for the EnglishUkrainian pair!
We certainly learned a lot of lessons, most of which generalize well beyond the neural machine translation task. One is related to implementation of custom neural modules and PyTorch. Unlike TensorFlow and many other packages, PyTorch belongs to a new crop of neural frameworks, where a neural network (computation) graph is dynamic. What does it mean? It means that you do not have define a computation graph in advance. You can simply write a tensormanipulating code and PyTorch will do all the backpropagation and parameter updating automatically. Another wellknown package with a similar functionality is DyNet.
There are ups and downs to dynamic computation graphs. For one thing, it is much simpler to debug them. For another, there is a lot of magic going behind the scenes, which you need to understand. First of all, one needs to remember that the computation graph is defined by a sequence of manipulations on Tensors and Variables (Variable is a Tensor wrapper that got deprecated in the recent PyTorch). Your sequences should be valid and properly linked so that all the Tensors of interest have a chance to be updated during backpropagation.
Tensorlevel manipulations can easily get hairy. To simplify things, PyTorch introduces an abstraction layer called Module. A Module is a basic building block that has some parameters and a function forward to turn inputs to outputs. If all is done properly, given only the forward function PyTorch can compute the gradients via backpropagation and update the model parameters. The nice thing about PyTorch is that you can easily write a new module by combining several existing ones. There is no need to write arcane description of layers! As we can see from this PyTorch example, we can define a forward network computation in a very straightforward way. Even if you have not seen a line of PyTorch code before, you can easily figure out that this module applies two 2dconvolutions each of which is followed by a RELU nonlinearity:
class Model(nn.Module): def __init__(self): super(Model, self).__init__() self.conv1 = nn.Conv2d(1, 20, 5) self.conv2 = nn.Conv2d(20, 20, 5) def forward(self, x): x = F.relu(self.conv1(x)) return F.relu(self.conv2(x))
But here is a catch that is barely mentioned in PyTorch documentation. Although writing the forward function is sufficient to compute the gradients, it is apparently not sufficient to determine which tensors represent module's parameters. In the above example, the module includes two convolutional neural networks, each of which has parameters to be updated. How does PyTorch know this? Well, turns out that PyTorch overloads the function __setattr__! Thus, it surreptitiously "registers" each submodule when a user makes an assignment like this one:
self.conv1 = nn.Conv2d(1, 20, 5)
Unfortunately, such automatic registration does not work all the time. Imagine, for example, you want to aggregate several submodules whose number is not known in advance. It is very natural to save them all in a list:
class Model(nn.Module): def __init__(self): super(Model, self).__init__() self.sub_modules = [] self.sub_modules.append(nn.Conv2d(1, 20, 5)) # append a module to the list self.sub_modules.append(nn.Conv2d(20, 20, 5)) # append a module to the list
Yet, this is where PyTorch magic stops: If you place the modules to the list, PyTorch will not be able to update their parameters. As a fix, you need to register them explicitly. One way to do this is to call the function add_module
Submitted by srchvrs on Sun, 04/29/2018  23:54
A treebased machine learning method splits the space into regions and fits a regionspecific classification/regression function (typically just a constant). Additional flexibility and capacity can be achieved by bagging and boosting elementary trees, but at the core of these methods is their space partitioning capabilities. A few months ago I started to wonder if neural networks can do a similar thing. My gut feeling was that neural nets should be doing something of the kind, but the mechanics of the neural net space partitioning was unclear to me. I even posted a question on StackExchange, yet, I have not gotten an answer.
However, I recently discovered an insightful video lecture by Ian Goodfellow where he elaborates on the nature of adversarial examples and discusses possible ways to defend from an attacker (spoiler alert: there seems to be no working method yet!). Basically, he argues that adversarial examples is a common Achilles' heel, which comes—in fact somewhat surprisingly—from excessive linearity of existing approaches. Ian Goodfellow argues that this issue plagues almost all standard machine learning approaches including SVMs, treebased methods, and neural networks. What is even more surprising, adversarial examples are transferable across different algorithms: An adversarial example designed to fool, e.g., a decision tree is very likely to fool some other method as well!
Fig 1. A hyperplane separating positive and negative examples.
Ian also notes that neural nets divide the space using an approximately piecewise linear function. This observation basically answers my question about the space partitioning capabilities of neural networks. For neural nets with only truly piecewise linear activation functions (such as RELU), the division of the space is truly linear. As a side comment, this kinda explains (at least for me) the ability of neural networks to generalize quite well despite having so many parameters. A wellknown deficiency of the linear and piecewise linear models (as noted by Ian) is that they can be super confident in regions where they encountered no training data. For example, in Figure 1 the linear classifier would have a very strong "opinion" about the negativity of the point marked by the red question mark. However, this confidence is not fully substantiated, because there are no training points nearby.
Adversarial examples notwithstanding, I think it may be useful to get a better understanding of the space partitioning "mechanics" of the neural networks. In my case, such understanding came from reading a very recent paper AN UPPERBOUND ON THE REQUIRED SIZE OF A NEURAL NETWORK CLASSIFIER by Hossein Valavi and Peter J. Ramadge. The paper is unfortunately paywalled, but the slides are publicly available.
Fig 2. A convex shape (triangle) classifier.
In this paper, authors remind us that neural nets possess two powerful capabilities: an ability to distinguish data points separated by a given hyperplane and a capacity to implement arbitrary logic circuits (they also prove facts about complexity of such circuits, but these proofs are not very relevant to our discussion). Without going into technicalities right away (but some are nevertheless discussed below), this means that neural nets can easily divide a realvector space using polytopes, e.g., convex ones. Let us consider a simple example of a piecewise linear classifier in Fig. 2. As we can see positive and negative training examples can be separated from each other using three planar hyperplanes, simply speaking by three lines. Positive and negative examples are located on the opposite sides of these separating lines. Therefore, a condition of belonging to a negative class can be written as : $\langle x  Q_1, P_1  Q_1 \rangle < 0 \mbox{ and } \langle x  Q_2, P_2  Q_2 \rangle < 0 \mbox{ and } \langle x  Q_3, P_3  Q_3 \rangle < 0$. This condition can be easily implemented with a two layer neural networks with the sign activation function.
Because neural nets can have tons of parameters, they can slice and dice the space at least as well as treebased learning algorithms. In fact, if necessary, given enough parameters, we can always achieve a perfect separation of data points from different classes (assuming consistent labeling). For example, we can place each positiveclass datapoint into its own hyperhube and implement neural nets that "fire" only for training data points in these hypercubes. We will then combine these nets using an OR circuit. Of course, such construct is extremely inefficient and does not generalize well, but it IMHO illustrates the basic spacepartitioning powers of the neural nets.
In conclusion, I discuss some technical details, which are primarily of the theoretical interest. For simplicity of exposition I consider only the binary classification scenario. First, let us consider again a hyperplane that divides the space into two subspaces. The claim is that we can always construct a neural network that maps points on one side of the hyperplane to zeros (negative) and points on the other side of the hyperplane to ones (positive). This fact is trivial for neural networks with the sign activation function (a sign of the scalar product between data points and the hyperplaneorthogonal vector defines the class of a data point). We illustrate this point in Figure 1. However, it is more work to prove for differentiable activation functions such as RELU (which are actually used in practice nowadays). This is done in Lemma 3.1 of the mentioned paper.
Once we divide the space into regions and implement all neural networks that fire exactly when a data point is in its region, we need to combine these classifiers using a boolean circuit. Consider again an example in Fig. 2 In this case a boolean circuit is a simple threevariable conjunction.
There are certainly many ways to implement a binary logic circuit using a neural network. One of these is briefly outlined Corollary 3.3. This Corollary is a bit hard to parse, but, apparently, authors propose to solve a problem in two steps. First, they suggest to map each combination of $n$ zeros and ones to a singlecoordinate data point in a $2^n$dimensional space. Because circuit values can be true or false, there will be "true" or "false" data points in the target $2^n$dimensional space. If coordinate values of true and false points are different, these points are linearly separable (e.g., a "true" data point has exactly one dimension with value one and a "false" data point has exactly one dimension with value two). Thus, according to their Lemma 3.1, there would exist a second neural network based function that maps true points to one and false points to zeros.
Authors do not describe how to implement the mapping to the $2^n$ dimensional space, but we can use the following combination of RELUactivated threshold circuits:
$\max(0, \sum 2^i \cdot x_i  k + 1)  2\max(0, \sum 2^i \cdot x_i  k) + \max(0, \sum 2^i \cdot x_i  k  1) $.
Given a binary input variables $x_i$ this function is equal to one if and only if the input is the same as the binary representation of $k$.
UPDATE: quite interestingly a recent paper provides a more specific characterization of functions represented by feedforward neural networks with RELU activations. Namely, this is a subclass of rational functions belonging to the family of tropical rational maps (a tropical map is piecewise linear).
Pages










