## Bringing a large Russian QA data set to light

"It is achingly apparent that an overwhelming amount of research in speech and language technologies considers exactly one human language: English." (Kyle Gorman) For this reason Emily Bender has been famously encouraging people to (1) explicitly name languages they work on (2) do more work on non-English-data. This has become known as a Bender rule.

Despite the importance of multilingual NLP, frankly speaking, it has been difficult to have an opportunity to work on non-English data (in the previous decade my only major opportunity was a stint on cross-lingual metaphor detection). I am therefore very pleased to have been recently participating in bringing to light a large Russian question-answering/reading-comprehension (QA) data set SberQuAD, which was created similarly to SQuAD.

I have been helping my co-authors Pavel Efimov and Pavel Braslavski (who did nearly all the work) to analyze and describe this data set. We have conducted a very thorough analysis and evaluated several powerful models. The full analysis is available online, but here I would like to highlight the following:

## Universal Methods to Sort Complex Information Tested

About one year ago, a Quanta magazine published an article on the universal method to sort complex information. It is concerned with a theoretical work on solving a nearest-neighbor search problem for various metric distances. Even more precisely, this work attempts to answer a question about what kind of a distance metric permits an efficient nearest neighbor search. Though this is surely an important and solid theoretical work, the Quanta magazine completely ignores the fact that from the practical perspective this problem has satisfactory solutions. Not only existing methods work for metric distances, good results can often be obtained for weird non-metric dissimilarities (or distances as we like to call them). Sometimes, they work when these distances are even non-symmetric!

Are these methods are universal? They certainly are not, but nothing is universal in nearest-neighbor search. There are challenging data sets, which cannot be searched efficiently even with the Euclidean distance! This issue as well as the history of non-metric k-NN search is briefly surveyed in my thesis. However, in some cases we can do really well by using a tree-based or a neighborhood-graph based approaches. In my thesis, I carried out a series of evaluations to verify this. I am pleased that all of the main results are now formally published, in particular, including two recent SISAP papers:

1. Boytsov, L., Nyberg. E., 2019. Accurate and Fast Retrieval for Complex Non-metric Data via Neighborhood Graphs.

2. Boytsov, L., Nyberg. E., 2019. Pruning Algorithms for Low-Dimensional Non-metric k-NN Search: A Case Study.

I think these papers are concerned with important research questions and I am going to briefly highlight results.

Neighborhood-graphs is a class of new-old methods, which delivers state of the art results on many data sets. However, little is known how they behave on non-symmetric distances. We were probably the first to test them on non-symmetric distances such as KL-divergence [1, 2]. Turns out, however, these tests relied on data sets that were only mildly non-symmetric. In the follow-up work, we have really stress tested them and discovered the following:

1. It is never a good idea to deal with non-symmetric distances by symmetrizing the distance first and using the symmetrized distance as a part of a filter-and-refine pipeline.

2. However, it is not even necessary. In many cases, indeed, neighborhood-graphs deliver state-of-the-art performance out of the box.

3. Importantly, one has to be consistent in the order of distance function arguments (although there are exceptions as I describe below). If the indexing procedure relies on a different order (e.g., by mistake), the results could be disastrous (I have made this mistake and it cost me a lot of time).

4. That said, using a different distance function at index time can produce sometimes better results. Again this is not a universal property. One somewhat obvious choice of possibly better index-time distance function is a symmetrized variant of the original distance. Quite surprisingly, the argument-reversed distance can deliver good results too, but, as I explain above, the results can be disastrous for some other datasets and distances. I think this discovery begs a research question: what is the optimal distance-time function?

Although graph-based retrieval is state-of-the-art for high-dimensional data it can be an overkill for low-dimensional data, where tree-based approaches can work really well. In particular, we compare two approaches to adapt standard metric tree methods to non-metric similarities. One is the effective piecewise-linear modification of the pruning rule, which we published at NIPS in 2013. In fact, for the Euclidean distance, it is as efficient as the classic projection-based LSH. However, due to the linear nature of the approximation, it is sometimes not a good fit for non-metric dissimilarities. In contrast, Tomas Skopal TriGen algorithm can be better in this case.

TriGen is an ingenious algorithm that finds a monotonic distance transformation that makes a dissimilarity look more like metric. However, TriGen has two drawbacks: it does not work out of the box with non-symmetric distances and its implementation of the distance-modifying transformation can be a bit expensive. What we show is that, perhaps, the best solution is a hybrid: First, we can apply a cheap concave (or near concave) distance transformation such as the square root. Second, we can fit a piecewise-linear decision function for this transformed distance.

In conclusion, I want to emphasize that, although nearest-neighbor search has no universal solution, there are a number of working general-distance approaches. Some good solutions are implemented in NMSLIB, which is the first generic library for metric and non-metric k-NN search.

## Early life of dynamic programming (Concluding part)

Eight years ago I started my blog with a post on the origins of dynamic programming. Therein, I argue that the term programming stems from a military definition of the word "program", which simply means planning and logistics. In mathematics, this term was adopted to denote optimization problems and gave rise to several names such as integer, convex, non-linear, and differentiable programming. I promised to describe how dynamic programming had a somewhat rocky start in computational biology in a follow-up post, but never delivered on this promise.

It has been a decade of phenomenal success of another programming concept, namely the differential programming. Three neural networks pioneers: Geoffrey Hinton, Yoshua Bengio, and Yann LeCun (with a regretful omission of Jürgen Schmidhuber) received a Turing award for "For conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing." Now it seems to be perfect time to deliver on the promise and wrap up with historical dynamic programming posts.

As I mentioned in my first blog post, dynamic programming is a relatively simple way to solve complex problems through an elegant and efficient recursion. In particular, it is at the heart of evolutionary distances in computational biology and the Levenshtein (also known as an edit) distance in natural language processing. Different as they are, these fields rely on string comparison via variants of an edit distance. The simplest way to compute an unweighted edit distance between strings $a=a_1 a_2 \ldots a_n$ and $b = b_1 b_2 \ldots b_m$ is through a following simple recursion:

$$d_{i+1,j+1} = \min \left\{ \begin{array}{c} d_{i,j+1} + 1 \\ d_{i+1,j} + 1 \\ d_{i,j} + (a_{i+1} \ne b_{j+1})\\ \end{array} \right.$$

The computational cost is quadratic. Although there are faster average-case algorithms, there is little hope that the worst-case time is strongly sub-quadratic.

The formula is simple, but not always immediately obvious. In fact, it took the scientific community almost ten years to fully realize that essentially the same approach provides a solution for several fields: information retrieval, bioinformatics, and speech recognition (see my survey for a list of early works). Furthermore, as I explain below, a renowned mathematician Stanislaw Ulam not only failed to discover the formula, but also failed to recognize the solution when it was presented to him by David Sankoff. Next time when you fail to solve an apparently "simple" dynamic programming puzzle, do not beat yourself up!

The edit distance was first published by Levenshtein (and is often called the Levenshtein distance) in the context of self-correcting binary codes (Levenshtein, Vladimir I. "Binary codes capable of correcting deletions, insertions, and reversals." Soviet physics doklady. Vol. 10. No. 8. 1966.). Imagine a noisy channel that modifies binary messages by changing, inserting, and deleting bits occasionally. If such spurious changes are not very frequent, it may still be possible to recover the original message. This is clearly possible but only if the code of one symbol is sufficiently different from other symbol codes (i.e., when the distance is large enough).

Consider an example, where the set of codes contains only two 3-bit codes 000 and 111 and the channel cannot modify more than one bit. Then, a noisy version of 000 will always be different from the noisy version of the code 111. Indeed, the noisy version of 000 can have at most one bit equal to one, while the noisy version of 111 always has at least two bits equal to one. By counting the number of unit bits, we can always recover the original code.

Clearly, there is a lot of waste here, because we use only two values out of eight possible ones. However, this is the price we pay to separate the noisy variants of the codes. Levenstein himself was only interested in estimating the amount of waste that is necessary to make the garbled codes separable. He actually proved that the waste would not be that bad: It is possible to use about $2^n/n$ separable code words of length n (bits).

Despite Levenshtein paper is massively cited, I think few people read it, in particular, because it was not possible to access this paper online. For my survey on approximate dictionary searching, I had to visit the actual Russian library to read the paper. It was a bit complicated, because at that time I already resided in the US.

Levenstein apparently did not realize that the distance he introduced could be useful in text processing applications, such as spell-checking, speech recognition, and alignment of DNA (or protein) sequences. In the 60s, the bioinformatics was in its early stage. The structure of the DNA was discovered in 1953, but the first complete gene was not decoded until 1972! Nevertheless, there was apparently a lot of interest in the problem of sequencing and finding similar regions among sequences of different species (note, however, this is not the only goal of finding similar sequences).

The usefulness of the latter method is based on the assumption that similarities in genetic sequences represent common ancestry. Two species start from a common genome and then follow different evolutionary paths, which results in changing certain areas of the original DNA. Yet, most subsequences of the DNA remain very similar. In a simplified model, the Nature "edits" a sequence by randomly deleting, inserting, or modifying certain small areas of the genes. These changes happen with a certain rate: The longer is a time frame, the more changes happen. By measuring the volume of the differences between two species (using some form of the edit distance) we can estimate the time required to evolve from a single common ancestor. (See, e.g., Nei, M., & Zhang, J. (2006). Evolutionary Distance: Estimation for more details.)

Stanislaw Ulam, a famous mathematician who played one of the pivotal roles in the Manhattan project, is often credited with the invention of the evolutionary distance. However, as argued by David Sankoff, he failed to realize that the distance can be computed by a simple dynamic programming algorithm. Turns out that dynamic programming was not so simple after all.

## Adversarial AI vs Evil AI in layman's terms

This is written in response to a Quora question asking to explain in layman's terms the difference between adversarial and evil AI. Feel free to vote on my answer on Quora.

This is an excellent question! For starters, in my opinion, the current AI heavily relies on statistical learning methods, which are rather basic. For this reason it is nowhere near producing sufficiently intelligent machines, let alone machines that can have feelings, emotions, free will etc. There are algorithmic and hardware limitations that I cover in my blog post (also available as a Quora answer).

Modern AI cannot be evil in a traditional sense human sense of the word, however, it can can cause a lot of harm as any other immature technology. For example, despite the famous claim by a Turing award winner G. Hinton that we would have to stop training radiologists roughly today, there is mounting evidence that deep learning methods for image analysis do not always work well.

Furthermore, statistical methods (aka AI) are becoming ubiquitous tools of decision making (money lending, job search, and even jailing people). However, statistical learning methods are not fair and can be biased against certain groups of people. From this perspective AI can be considered evil. Of course, humans are biased too, but human opinions are diverse and we, humans, tend to improve. Having a single black-box uncontrollable decision algorithm that becomes more and more biased is a scary perspective.

Modern AI is not reliable and immature: It works only in very constrained environments. Why is that? Because statistical learning is a rear-mirror-view approach that makes future decision based on patterns observed in the past (aka training data). Once the actual (test) data diverges from training data in terms of the statistical properties, performance of modern AI decreases quite sharply.

In fact, it is possible to tweak the data slightly to decrease the performance of an AI system. This is called an adversarial attack. For example, there is research showing that addition of distractor phrases does not confuse humans much, but it completely “destroys” performance of a natural language understanding system. For the reference, the modern history of adversarial examples started from the famous paper by Szegedy et al 2013. They showed that small image perturbations, which are too small to be noticed by humans, completely confuse deep neural networks.

In summary, the adversarial AI has nothing to do with the evil AI. It concerns primarily with devising methods to fool modern statistical learning methods with (adversarial) examples as well as with methods to defend against such attacks. Clearly, we want models that can withstand adversarial attacks. This is a difficult objective and a lot of researchers specialize in the so called adversarial AI.

## Robert Mercer's contribution to the development of machine translation technologies

This is written in response to a Quora question, which asks about Robert Mercer's contribution to the development of machine translation technologies. Feel free to vote there for my answer on Quora!

Robert Mercer (Peter Brown and a few other folks) played a pivotal and crucial role in the creation of the first modern translation models. They were able to create the first modern large scale noisy-channel translation system and publish the first paper on the subject. They created a series of IBM Model X models and spearheaded a new research direction (which is huge nowadays).

Recently Robert achieved an ACL lifetime achievement award for his pioneering work on machine translation. He was recently interviewed on the topic and there is a nice transcript of the story that uncovers a lot of historical details: Twenty Years of Bitext.