
My $0.05 on the street value of pre-trained embeddings

This is written in response to a Quora question, which asks about the street value of pre-trained models. Feel free to vote there for my answer!

This is an interesting question. There’s clearly no definitive answer to it. My personal impression (partly based on my own experience) is that with a reasonable amount of training data, pre-training and/or data augmentation is not especially useful (if at all). In particular:

  1. In a recent paper by Facebook, this is demonstrated for an image-detection/segmentation task: Re-thinking ImageNet pre-training. He et al. 2018.
  2. A couple of recent chilling results:
    1. Researchers from Google and Carnegie Mellon University showed that a 300x (!) increase in the number of training examples only modestly improves performance. I think it is an especially interesting result, because the data is only weakly supervised (i.e., it is the most realistic big-data scenario).
    2. Unsupervised training does not work yet for truly low-resource languages: Two New Evaluation Data-Sets for Low-Resource Machine Translation: Nepali–English and Sinhala–English. Guzman et al 2018.
  3. Here is one example from the speech-recognition domain: Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. Rao et al 2018, where pre-training works, but gains are rather modest: "We find CTC pre-training to be helpful improving WER 13.9%→13.2% for voice-search and 8.4%→8.0% for voice-dictation".

Thus, if you are interested in obtaining SOTA results on the dataset of interest, you may need to be very clever and efficient in obtaining tons of training data. That said, pre-training certainly allows one to achieve better results in many cases, especially when the amount of training data is small. This can be really useful for bootstrapping. See, e.g., the following radiology paper: Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. Shin Hoo-Chang et al. 2016.

That said, AI is a fast-developing field, and we see particularly impressive advances in transfer learning for NLP. This line of work largely started with the following great papers (particularly the Allen AI ELMO paper):

  1. Semi-supervised Sequence Learning. Andrew M. Dai, Q. Le. 2015.
  2. ELMO paper: Deep contextualized word representations. Peters et al 2018.

Recently, we have seen quite a few improvements on this with papers from Open AI (GPT), Google AI (BERT), and Microsoft (I think it’s called Big Bird, but I am a bit uncertain). These improvements are huge and very encouraging. Let us not forget that the road to these successes has been paved by two seminal papers, which largely started neural NLP:

  1. Natural language processing (almost) from scratch. R Collobert, J Weston, L Bottou, M Karlen. 2011.
  2. Distributed representations of words and phrases and their compositionality. 2013. T Mikolov, I Sutskever, K Chen, GS Corrado, J Dean.

Each paper proposed its own variant of neural word embeddings, learned in an unsupervised fashion. This was clearly a demonstration of the street value of a pre-trained model in NLP. Furthermore, the first paper, which was a bit ahead of its time, went much further and presented possibly the first suite of core neural NLP tools (for POS tagging, named entity recognition, and parsing). It is worth mentioning that there are also earlier and less-known papers on neural NLP including (but not limited to) a seminal neural language modeling paper by Y. Bengio.

In conclusion, I would also note that a lot of pre-training has been done in a supervised fashion. Perhaps this is a limiting factor, as the amount of supervised data is relatively small. We may be seeing this change with more effective unsupervised pre-training methods. This has become quite obvious in the NLP domain. However, there is a positive trend in the image community too. For example, in this recent tutorial (scroll to the section on unsupervised pre-training), there are a couple of links to recent unsupervised training approaches that rival ImageNet pre-training.

Some further reading: A good overview of transfer learning is given by S. Ruder.

Soviet-era joke, which is still relevant today

To understand this Soviet-era joke you need to know two things:

  1. In an ideologically-driven society it is dangerous not to agree with the official/dominant point of view planted by autocratic country rulers.

  2. Planting requires an incessant flood of propaganda making us think that we are doing much better than we actually are.

Now the story:

A state propaganda agent came to give a presentation in a mental facility. He presents in front of a crowd of mentally ill people and tells them how well the economy does, how unbelievably fast it grows, and how much better life will be in the near future.

There is a standing ovation, but one man abstains from participation.

— Why don’t you applaud? asks the official.

— I am not mentally ill, I am a nurse here.

PS: There is a possibly obvious, but not-so-funny, side of this joke: It remains relevant today in a variety of domains (well beyond politics).

Will natural language processing engineers find it hard to get work in the future (once computers are capable of near-perfect text and speech processing)?

This is written in response to a Quora question, which asks if most NLP engineers will be out of jobs once computers are capable of near-perfect text and speech processing. Feel free to vote there for my answer on Quora! There is also an interesting recent blog post by a Carnegie Mellon professor Jeffrey P. Bigham, where he expresses similar concerns and argues that in the near future artificial intelligence should be treated as a human-computer interaction problem rather than purely an algorithmic problem. I fully agree with this point of view!

So, will NLP engineers be out of jobs? Yes, absolutely! The future is highly uncertain. As believed by Marvin Minsky, an imperfect human race will create a new robotic race that will not suffer from human limitations and which will inherit the Earth and surrounding planets. As humans have outcompeted and replaced many other species, these robots will outcompete and replace humans. We should not, however, fear the future but fulfill our evolutionary destiny and welcome our robot overlords.

As impressive as they are, existing AI systems only seem to be intelligent. We do not know how many dozens, hundreds, and, possibly, thousands of years it will take to create truly intelligent machines. In fact, we do not truly know what it means to be intelligent and what is required to be intelligent. A recent high-profile paper: “Building Machines That Learn and Think Like People”, 2016 by Lake et al., tries to find some answers, but its conclusions are far from being definitive.

Ray Kurzweil famously predicted the singularity to happen in 2045 based on the exponential growth of computational capacity. However, the size of the transistor is already about 100x the size of an atom. My guess would be that the current technology has the potential for a 10x increase in capacity. It also seems that there is no production-ready immediate replacement on the horizon. In particular, it is not clear when (and if) 3-d chips will be available.

At the same time, the best GPUs have about 20 billion transistors, while the human brain has 100 billion neurons, each of which has 10K connections (synapses) on average. How many transistors are necessary to create an artificial neuron? One of the most advanced custom neural chips, TrueNorth, implements one million spiking neurons and 256 million synapses on a chip with 5.5 billion transistors and a typical power draw of 70 milliwatts.

Thus, it takes about 20 transistors per synapse. Even if we assume that an artificial neuron is as powerful as the real one (which is likely very far from the truth), the current technology is six freaking orders of magnitude behind a human brain! Size notwithstanding, power consumption is also an enormous challenge. According to the above-cited report, if TrueNorth is scaled up to the size of the human brain it would require 10,000 times more energy!
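The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the arithmetic using the round numbers quoted in the text; the variable names are mine, and the result is an order-of-magnitude guess rather than a measurement.

```python
import math

# Figures quoted in the text (all of them rough).
truenorth_transistors = 5.5e9   # transistors on one TrueNorth chip
truenorth_synapses = 256e6      # synapses implemented by that chip
gpu_transistors = 20e9          # transistors on a top-end GPU
brain_neurons = 100e9           # neurons in a human brain
synapses_per_neuron = 10e3      # average connections per neuron

# Transistor cost of one artificial synapse: about 21.5, i.e., "about 20".
transistors_per_synapse = truenorth_transistors / truenorth_synapses

# Transistors needed for a brain-sized synapse count at that cost.
brain_synapses = brain_neurons * synapses_per_neuron
required_transistors = brain_synapses * transistors_per_synapse

# How far a single GPU is from that budget, in orders of magnitude (about 6).
gap = required_transistors / gpu_transistors
orders_of_magnitude = math.log10(gap)

print(f"{transistors_per_synapse:.1f} transistors per synapse")
print(f"gap: {gap:.2e} ({orders_of_magnitude:.1f} orders of magnitude)")
```

Note that the estimate counts synapses only: it assumes that a synapse, not a neuron, is the unit that dominates the transistor budget.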

Furthermore, it is highly unrealistic to assume that an artificial neuron is nearly as complex as a real one. For example, the following book argues that even a single-cell organism (albeit a rather large one) can exhibit extremely complex behaviors, which include sensing and hunting: Wetware: A Computer in Every Living Cell: Dennis Bray. C. elegans has only about 300 neurons, but it has a basic sensory system and muscle control! It can reproduce and mate.

As my co-author and friend Daniel Lemire noted, our planes do not fly like birds and submarines do not swim like fishes. We do not have to mimic the human brain to solve artificial intelligence tasks. We may not even need a brain-like structure to create a truly thinking machine. However, I would argue that we (using the phrase of the Turing award winner Richard Hamming) simply do not have an attack, i.e., a reasonable way to approach this difficult problem.

Another good observation from Daniel Lemire is that we should expect the unexpected because experts can easily be wrong. For example, there were some predictions about the impossibility of flight in the early 20th century. Although we should expect breakthroughs anytime, I do not think that the impossibility of flight for heavier-than-air machines is a good example. The first gliders appeared well before the first propelled planes. In fact, some people had very clear ideas about how planes should and could fly. This is not true for general artificial intelligence: we have not built the first gliders yet.

Traveling to the stars is clearly a difficult problem. Few people would argue with that. However, for some reason everybody thinks that artificial intelligence is something that is just a few (dozen) years away. Well, it could be so. But it could also be harder than interstellar travel.

Even if we can create a human-size neural network, we do not know how to program it efficiently. A state-of-the-art approach to training a model consists in collecting a huge amount of data and making a neural network that finds a mapping from inputs to outputs. This approach has truly revolutionized speech and vision and has improved, to some degree, text processing. However, it might be just a gigantic “fuzzy” memory.

This approach is also incredibly brittle and data greedy. We do not know if we can scale it from hundreds to millions of layers. There are a number of recent papers showing that it is very easy to “poison” training data. For example, in the IMDB sentiment dataset the error rate can be driven from 12% to 23% by adding only 3% poisoned data. State-of-the-art CNNs fail (accuracy drops from 90+% to 10-%) for color modified CIFAR-10 images that are easily classified by humans.

Another big success of neural networks is speech recognition. Perhaps it is the biggest success so far. For clean speech we can get near-human recognition rates. However, on noisy data, and especially when multiple speakers are present (the infamous cocktail party setting), the results are quite subhuman. The cocktail party setting is especially bad. It is a big success if you can reduce the word error rate from 90% down to 50% or to 30% (i.e., a computer misses every second or third word).
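To make the metric concrete, here is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The function name and the example sentences are my own illustration and do not come from any of the cited systems.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("the reference transcript must not be empty")

    # previous[j] is the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1        # reference word was missed
            insertion = current[j - 1] + 1    # spurious hypothesis word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# Made-up example: two missed words and one substituted word out of six.
wer = word_error_rate("turn the kitchen lights off please",
                      "turn kitchen light off")
print(f"WER = {wer:.0%}")
```

In this made-up example, three of the six reference words are lost or wrong, which corresponds to the 50% regime mentioned above.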

One clear issue with the current approaches is that clean training data can be quite expensive to obtain. For the existing not-so-clean data (collected in a semi-supervised fashion), there may be little benefit from scaling (an already huge) training set by a further two (!) orders of magnitude.

For example, recent work by researchers from Google and Carnegie Mellon University has shown that a 300x (!) increase in the number of training examples only modestly improves performance. There is a lot of hope that reinforcement learning will solve these issues, but it does not seem to work yet.

All in all, judging by a good number of publications and blog posts that I have been reading in the last six years, we can now do well in a number of constrained domains. However, the success depends mostly on the existence of human-created training data and tons of engineering effort. In that, I suspect that the success of end-to-end systems (i.e., no engineering effort to modularize the problem and synthesize a system from multiple sometimes handcrafted models) is still limited.

Extending existing techniques to new domains requires many years of work from skilled engineers and scientists. I do not see how this can change in the near future. I actually expect that we will need many more scientists and engineers to continue making good progress. Brace yourself, it looks like there are megatons of work ahead.

Demystifying IBM Watson

This is written in response to a Quora question, which asks about the internals of the IBM Watson question answering (QA) system. Feel free to vote there for my answer! Previously I briefly compared the IBM Watson approach to that of DeepMind, albeit without going into details of how IBM Watson works. Here I fill this gap.

I am not sure anybody knows exactly what was under the hood. However, there is a series of papers published by IBM, most of which I read end-to-end more than once. One overview paper can be found here. The list of papers can be found here, most PDFs can be easily googled :-) There is also a lengthy (but quite relevant) survey (by an IBM Watson team member J. Prager) that covers some of the details of retrieval-based question answering:

Prager, John. "Open-domain question–answering." Foundations and Trends® in Information Retrieval 1.2 (2007): 91-231.

First things first: The IBM Watson team incorporated both symbolic/logical systems and a classic redundancy-based retrieval QA into their system. However, there are only a few questions (about 1%) that they were able to answer by logical inference and querying of structured knowledge sources.

I would reiterate that a vast majority of questions are answered using a carefully tuned retrieval-based system, which heavily relies on the fact that Jeopardy answers are factoids: short noun phrases such as named entities (e.g., dates, names of famous persons, or city names). Hence, the QA system does not really need to answer a question, e.g., by synthesizing an answer, or by doing some complicated inference. It should instead extract a potential answer and collect enough statistical evidence that this answer is correct.

And, indeed, a retrieval-based factoid QA system finds passages lexically matching the question and extracts potential answers from these passages. It then uses a carefully tuned statistical model to figure out which candidate answers are good. This model likely does not involve any sophisticated reasoning that humans are capable of. That said, I still consider IBM Watson as one of the greatest achievements in the AI field.

The fact that Jeopardy questions are long greatly helps to find the so-called candidate passages, which are likely to contain an answer. Finding these passages is based largely on the lexical overlap between the question and the answer passage. Stephen Wolfram even ran an experiment where he found that a single search engine can find candidate passages for nearly 70% of all answers.
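A toy sketch of passage retrieval by lexical overlap is given below. It is not Watson's retrieval component: the stopword list, the scoring formula (the fraction of question terms found in the passage), and the example passages are assumptions that I made up to illustrate why a long clue helps.

```python
import re

# A tiny, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "this", "and", "to"}


def content_terms(text):
    """Lowercase the text and keep the set of non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {token for token in tokens if token not in STOPWORDS}


def overlap_score(question, passage):
    """Fraction of the question's content terms that occur in the passage."""
    question_terms = content_terms(question)
    if not question_terms:
        return 0.0
    return len(question_terms & content_terms(passage)) / len(question_terms)


def rank_passages(question, passages):
    """Sort passages by decreasing lexical overlap with the question."""
    return sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)


clue = "This Canadian city on Lake Ontario is the capital of the province of Ontario"
passages = [
    "Ottawa is the capital of Canada.",
    "Toronto, a Canadian city on the shore of Lake Ontario, "
    "is the capital of the province of Ontario.",
    "Lake Superior is the largest of the Great Lakes.",
]
best_passage = rank_passages(clue, passages)[0]
print(best_passage)
```

The longer the clue, the more content terms it contributes, and the easier it becomes to separate a passage that matches nearly all of them from passages that share a single word.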

Furthermore, there is a good coverage of Jeopardy topics in Wikipedia. I cite: "We conducted an experiment to evaluate the coverage of Wikipedia articles on Jeopardy! questions and found that the vast majority of Jeopardy! answers are titles of Wikipedia documents [10]. Of the roughly 5% of Jeopardy! answers that are not Wikipedia titles, some included multiple entities, each of which is a Wikipedia title, such as Red, White, and Blue, whereas others were sentences or verb phrases, such as make a scarecrow or fold an American flag." Chu-Carroll, Jennifer, et al. "Finding needles in the haystack: Search and candidate generation." IBM Journal of Research and Development 56.3.4 (2012): 6-1.

I have to say that just throwing a bag-of-words query into a search engine can be a suboptimal approach, but the IBM Watson team wrote a bunch of complex question-rewriting procedures (in Prolog!) to ensure these queries were good. Not all candidate passages are generated in this way: I have covered another generation approach in another blog post.

After candidate passages are retrieved, IBM Watson extracts potential answers, which is not a trivial task. How does it find them? The actual model is surely rather complicated, but it would largely look for named entities and more generic noun phrases. However, not all entities/phrases are weighted equally. What affects the weights? Three things:

  1. The type of the question and the type of the entity (or rather their compatibility score);
  2. Existence of additional supporting evidence;
  3. How frequently these entities/noun phrases appear in candidate passages.

For example, if the question is "Who is the mayor of Toronto?" we know that the answer is a person. Hence, we can downweight named entities whose type is not a person. The actual answer typing process is surely more complicated, and there is a separate paper describing it in more detail:

Murdock, J. William, et al. "Typing candidate answers using type coercion." IBM Journal of Research and Development 56.3.4 (2012): 7-1.

What is important is that incorporating other types of relations (e.g., spatial or temporal) in addition to the answer-question type compatibility did not seem to result in substantial improvements (though some gains were observed). See results in Tables 1 and 2 of the paper:

Kalyanpur, Aditya, et al. "Structured data and inference in DeepQA." IBM Journal of Research and Development 56.3.4 (2012): 10-1.

Furthermore, for each candidate entry X, we can try to construct a query like "X is a mayor of Toronto" and find matching passages with good lexical overlap with this additional evidencing query. If such passages exist, they provide evidence that X is, indeed, an answer to the question.

There is a separate paper devoted to the evidencing process:

Murdock, J. William, et al. "Textual evidence gathering and analysis." IBM Journal of Research and Development 56.3.4 (2012): 8-1.

Last, but not least, the ranking approach (for candidate answers) takes into account the (weighted) number of occurrences. In other words, we expect true answers to appear more frequently in retrieved candidate passages. Although this assumption seems to be a bit simplistic, it works well due to redundancy: There are a lot of answer passages for simple well-known factoids. A nice paper exploring this phenomenon was written by Jimmy Lin:

Lin, Jimmy. "An exploration of the principles underlying redundancy-based factoid question answering." ACM Transactions on Information Systems (TOIS) 25.2 (2007): 6.
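The three signals discussed above (type compatibility, supporting evidence, and frequency) can be combined in a toy scoring function. To be clear, this is a sketch under my own assumptions and not the DeepQA model: the formula, the weights, the function name rank_candidates, and the placeholder candidates are all hypothetical.

```python
from collections import Counter


def rank_candidates(mentions, expected_type, evidence_counts,
                    type_mismatch_weight=0.2, evidence_weight=0.5):
    """Rank candidate answers extracted from retrieved passages.

    mentions: (candidate text, entity type) pairs, one per occurrence.
    expected_type: the answer type inferred from the question.
    evidence_counts: number of passages matching an evidencing query such as
        "X is a mayor of Toronto" for each candidate X.
    """
    frequency = Counter(text for text, _ in mentions)
    entity_type = dict(mentions)

    scores = {}
    for text, count in frequency.items():
        # Down-weight (rather than discard) candidates of the wrong type.
        type_score = 1.0 if entity_type[text] == expected_type else type_mismatch_weight
        support = count + evidence_weight * evidence_counts.get(text, 0)
        scores[text] = type_score * support
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder mentions for the question "Who is the mayor of Toronto?"
mentions = [
    ("Candidate A", "PERSON"), ("Toronto", "CITY"), ("Candidate A", "PERSON"),
    ("Candidate B", "PERSON"), ("Toronto", "CITY"), ("Toronto", "CITY"),
]
evidence_counts = {"Candidate A": 2, "Candidate B": 0}
ranking = rank_candidates(mentions, "PERSON", evidence_counts)
for text, score in ranking:
    print(f"{text}: {score:.2f}")
```

In this toy data, "Toronto" is the most frequent phrase, yet it ends up last because its type does not match the question. A real system would learn such weights from data instead of hard-coding them.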

If you find this mini-survey useful, feel free to cite it:

title={Demystifying IBM Watson},
author={Boytsov, Leonid},

Getting up to speed with neural machine translation: How not to burn yourself with PyTorch

Last week I split my time between work and hacking at the fourth machine translation marathon in the Americas. This event, organized jointly by CMU and Amazon (and sponsored by Amazon), was a lot of fun. My small sub-team of two people got familiar with OpenNMT, trained English-German and English-Ukrainian models, and implemented an idea of our team lead Adithya Renduchintala. Hey, we have even gotten a tiny 0.5 gain in BLEU for the English-Ukrainian pair!

We certainly learned a lot of lessons, most of which generalize well beyond the neural machine translation task. One is related to the implementation of custom neural modules in PyTorch. Unlike TensorFlow and many other packages, PyTorch belongs to a new crop of neural frameworks, where a neural network (computation) graph is dynamic. What does it mean? It means that you do not have to define a computation graph in advance. You can simply write tensor-manipulating code and PyTorch will take care of back-propagation automatically, so that the parameters can be updated. Another well-known package with similar functionality is DyNet.

There are ups and downs to dynamic computation graphs. For one thing, it is much simpler to debug them. For another, there is a lot of magic going on behind the scenes, which you need to understand. First of all, one needs to remember that the computation graph is defined by a sequence of manipulations on Tensors and Variables (Variable is a Tensor wrapper that has been deprecated in recent PyTorch versions). Your sequences should be valid and properly linked so that all the Tensors of interest have a chance to be updated during back-propagation.

Tensor-level manipulations can easily get hairy. To simplify things, PyTorch introduces an abstraction layer called Module. A Module is a basic building block that has some parameters and a function forward to turn inputs into outputs. If all is done properly, given only the forward function PyTorch can compute the gradients via back-propagation and update the model parameters. The nice thing about PyTorch is that you can easily write a new module by combining several existing ones. There is no need to write an arcane description of layers! As we can see from this PyTorch example, we can define a forward network computation in a very straightforward way. Even if you have not seen a line of PyTorch code before, you can easily figure out that this module applies two 2d-convolutions, each of which is followed by a ReLU non-linearity:

    import torch.nn as nn
    import torch.nn.functional as F

    class Model(nn.Module):
        def __init__(self):
            super(Model, self).__init__()
            # Two 2d-convolution layers.
            self.conv1 = nn.Conv2d(1, 20, 5)
            self.conv2 = nn.Conv2d(20, 20, 5)

        def forward(self, x):
            # Each convolution is followed by a ReLU non-linearity.
            x = F.relu(self.conv1(x))
            return F.relu(self.conv2(x))

But here is a catch that is barely mentioned in the PyTorch documentation. Although writing the forward function is sufficient to compute the gradients, it is apparently not sufficient to determine which tensors represent the module's parameters. In the above example, the module includes two convolutional layers, each of which has parameters to be updated. How does PyTorch know this? Well, it turns out that PyTorch overrides the function __setattr__! Thus, it surreptitiously "registers" each submodule when a user makes an assignment like this one:

    self.conv1 = nn.Conv2d(1, 20, 5)
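The registration trick can be imitated in a few lines of plain Python. The following is a toy sketch of the mechanism and not PyTorch's actual implementation: ToyModule, ToyParameter, ToyLayer, and ToyModel are hypothetical classes invented for this illustration.

```python
class ToyParameter:
    """Stand-in for a trainable tensor; it only carries a name."""

    def __init__(self, name):
        self.name = name


class ToyModule:
    """Toy imitation of a module that registers children on assignment."""

    def __init__(self):
        # Create the registries without going through our own __setattr__.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        # Intercept every attribute assignment and record what matters.
        if isinstance(value, ToyParameter):
            self._parameters[name] = value
        elif isinstance(value, ToyModule):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        """Yield own parameters, then those of registered sub-modules."""
        yield from self._parameters.values()
        for child in self._modules.values():
            yield from child.parameters()


class ToyLayer(ToyModule):
    def __init__(self, label):
        super().__init__()
        self.weight = ToyParameter(label + ".weight")
        self.bias = ToyParameter(label + ".bias")


class ToyModel(ToyModule):
    def __init__(self):
        super().__init__()
        self.conv1 = ToyLayer("conv1")  # registered via __setattr__
        self.conv2 = ToyLayer("conv2")  # registered via __setattr__


parameter_names = [p.name for p in ToyModel().parameters()]
print(parameter_names)
```

The key point is that registration is triggered by the type of the assigned value: anything that is neither a parameter nor a module is stored as an ordinary attribute and is never inspected again.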

Unfortunately, such automatic registration does not work all the time. Imagine, for example, you want to aggregate several sub-modules whose number is not known in advance. It is very natural to save them all in a list:

    class Model(nn.Module):
        def __init__(self):
            super(Model, self).__init__()
            self.sub_modules = []
            self.sub_modules.append(nn.Conv2d(1, 20, 5))   # append a module to the list
            self.sub_modules.append(nn.Conv2d(20, 20, 5))  # append a module to the list

Yet, this is where PyTorch magic stops: If you place the modules in a plain Python list, PyTorch will not be able to update their parameters. As a fix, you need to register them explicitly. One way to do this is to explicitly call the function add_module. A less tedious way is to use a combination of nn.ModuleList and nn.Sequential. Please, read a discussion here for more details.
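A minimal sketch of the difference, assuming a reasonably recent PyTorch installation: the two classes below hold the same layers, but only the nn.ModuleList variant exposes their parameters. The class names and the helper count_parameters are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlainListModel(nn.Module):
    """Sub-modules kept in a plain Python list: they are NOT registered."""

    def __init__(self):
        super().__init__()
        self.sub_modules = [nn.Conv2d(1, 20, 5), nn.Conv2d(20, 20, 5)]


class ModuleListModel(nn.Module):
    """Same layers wrapped in nn.ModuleList: they ARE registered."""

    def __init__(self):
        super().__init__()
        self.sub_modules = nn.ModuleList(
            [nn.Conv2d(1, 20, 5), nn.Conv2d(20, 20, 5)]
        )

    def forward(self, x):
        for layer in self.sub_modules:
            x = F.relu(layer(x))
        return x


def count_parameters(module):
    """Total number of scalar parameters visible to an optimizer."""
    return sum(p.numel() for p in module.parameters())


plain_count = count_parameters(PlainListModel())        # 0: nothing to optimize
registered_count = count_parameters(ModuleListModel())  # 520 + 10020 = 10540

output = ModuleListModel()(torch.zeros(1, 1, 28, 28))
print(plain_count, registered_count, tuple(output.shape))
```

An optimizer constructed from PlainListModel().parameters() would receive an empty parameter list, so checking the parameter count (or printing the model) is a cheap sanity check after writing a custom module. When the layers are simply applied one after another, nn.Sequential(*layers) registers them as well and provides forward for free.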

