
Robert Mercer's contribution to the development of machine translation technologies

This is written in response to a Quora question, which asks about Robert Mercer's contribution to the development of machine translation technologies. Feel free to vote there for my answer on Quora!

Robert Mercer (together with Peter Brown and a few other folks) played a pivotal role in the creation of the first modern translation models. They built the first modern large-scale noisy-channel translation system and published the first paper on the subject. They created a series of "IBM Model X" models and spearheaded a new research direction (which is huge nowadays).

Robert recently received an ACL lifetime achievement award for his pioneering work on machine translation. He was also interviewed on the topic, and there is a nice transcript of the story that uncovers a lot of historical details: Twenty Years of Bitext.

How do we make the architecture more efficient for machine learning systems, such as TensorFlow, without just adding more CPUs, GPUs, or ASICs?

This is written in response to a Quora question, which asks about improving the efficiency of machine learning models without increasing hardware capacity. Feel free to vote there for my answer on Quora!

Efficiency in machine learning in general, and deep learning in particular, is a huge topic. Depending on the goal, different tricks can be applied.

  1. If the model is too large, or you have an ensemble, you can train a much smaller student model that mimics the behavior of the large model. For classification, the student can be trained to predict the large model's output probability distribution directly. The classic paper: "Distilling the Knowledge in a Neural Network" by Hinton et al., 2015.

  2. Use a simpler and/or smaller model that parallelizes well. For example, one reason transformer neural models are effective is that they are easier and faster to train than LSTMs.

  3. If the model does not fit into memory, you can train it using mixed precision: "Mixed Precision Training" by Micikevicius et al., 2018 (Narang is a co-author).

  4. Another trick, which comes at the expense of run-time, consists in discarding some of the tensors during training and recomputing them when necessary: "Low-Memory Neural Network Training: A Technical Report" Sohoni et al, 2019. There is a Google library for this: "Introducing GPipe, an Open Source Library for Efficiently Training Large-scale Neural Network Models."

  5. There is a ton of work on quantization (see, e.g., "Fixed Point Quantization of Deep Convolutional Networks" by Lin et al., 2016) and pruning of neural networks ("The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks" by Frankle and Carbin). I do not remember a reference, but it is possible to train quantized models directly so that they use less memory.
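As a rough illustration of items 1 and 5, here is a minimal pure-Python sketch, not a production implementation. The first part computes a distillation loss in the spirit of Hinton et al.: both teacher and student logits are softened with a temperature before the cross-entropy is taken. The second part shows plain symmetric 8-bit quantization of a weight vector, which is a generic textbook scheme rather than the method of Lin et al. All function names, the default temperature, and the toy numbers are assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.

    The temperature**2 factor follows the scaling suggested in the
    distillation paper; the default temperature is an arbitrary choice.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    cross_entropy = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    return cross_entropy * temperature ** 2


def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float weights."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    # Hypothetical logits for a 3-class problem.
    teacher = [4.0, 1.0, 0.5]
    good_student = [3.5, 1.2, 0.4]
    bad_student = [0.5, 1.0, 4.0]
    print("loss (student close to teacher):", distillation_loss(good_student, teacher))
    print("loss (student far from teacher):", distillation_loss(bad_student, teacher))

    # Hypothetical weights; each value is stored as one small integer plus a shared scale.
    weights = [0.31, -1.27, 0.04, 0.9]
    q, scale = quantize_int8(weights)
    print("quantized:", q)
    print("restored: ", dequantize(q, scale))
```

In a real setup the distillation term is usually combined with the ordinary loss on the true labels, and the quantization error per weight is bounded by half of the scale.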

Benefits of GRUs over LSTMs

This is written in response to a Quora question, which asks about the benefits of GRU over LSTMs. Feel free to vote there for my answer on Quora!

The primary advantage is the speed of training and inference: a GRU has two gates instead of three (and fewer parameters). However, the simpler design comes at the expense of inferior capabilities (in theory). There is a paper arguing that LSTMs can count, but GRUs cannot.
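To make the parameter difference concrete, here is a small sketch that counts the weights of a single recurrent layer. It assumes the textbook formulation in which every gate (and the candidate state) has one input weight matrix, one recurrent weight matrix, and one bias vector; some library implementations use an extra bias vector per gate, so their totals differ slightly. The function names and layer sizes are hypothetical.

```python
def recurrent_block_params(input_size, hidden_size):
    """Parameters of one gate-like block: input weights, recurrent weights, bias."""
    return hidden_size * input_size + hidden_size * hidden_size + hidden_size


def lstm_params(input_size, hidden_size):
    # Input, forget, and output gates plus the candidate cell update: 4 blocks.
    return 4 * recurrent_block_params(input_size, hidden_size)


def gru_params(input_size, hidden_size):
    # Update and reset gates plus the candidate hidden state: 3 blocks.
    return 3 * recurrent_block_params(input_size, hidden_size)


if __name__ == "__main__":
    input_size, hidden_size = 128, 256  # arbitrary example sizes
    lstm = lstm_params(input_size, hidden_size)
    gru = gru_params(input_size, hidden_size)
    print("LSTM parameters:", lstm)
    print("GRU parameters: ", gru)
    print("GRU / LSTM ratio:", gru / lstm)
```

Under this counting, a GRU layer always has three quarters of the parameters of an LSTM layer with the same input and hidden sizes.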

The loss of computational expressivity may not matter much in practice. In fact, there is recent work showing that a trimmed, single-gate LSTM can be quite effective: "The unreasonable effectiveness of the forget gate" by van der Westhuizen and Lasenby, 2018.

What are some direct implications of Wittgenstein’s work on natural language processing?

This is written in response to a Quora question, which asks about direct implications of Wittgenstein’s work on natural language processing. Feel free to vote there for my answer on Quora!

How could Wittgenstein have influenced modern NLP? Yorick Wilks, cited by the question asker, hints at three possible aspects:

  1. Distributional semantics
  2. Symbolic representations and computations
  3. Empiricism

Wittgenstein likely played an important role in the establishment of distributional semantics. We mostly cite Firth’s famous "You shall know a word by the company it keeps", but this was preceded by Wittgenstein’s "For a large class of cases—though not for all—in which we employ the word ‘meaning’ it can be defined thus: the meaning of a word is its use in the language." This formulation was given in his “Philosophical Investigations”, published posthumously in 1953, but he started to champion this idea as early as the 1930s. It likely influenced later thinkers and possibly even Firth.

Let’s move on to symbolic representations. In his earlier work, Wittgenstein postulated that the world is a totality of facts, i.e., logical propositions (a view called logical atomism). It is not totally clear what the practical consequences of this statement would be (should it be implemented as an NLP paradigm). In addition, Wittgenstein rejected logical atomism later in life. He also declared that it is not possible/productive to define words by mental representations or references to real objects: instead, one should focus exclusively on word use. This sounds very "anti-ontology" to me.

Last, but not least, modern NLP has a statistical foundation. However, Wittgenstein never advocated an empirical approach to language understanding. I have found evidence that he dismissed weak empiricism.

My $0.05 on the street value of pre-trained embeddings

This is written in response to a Quora question, which asks about the street value of pre-trained models. Feel free to vote there for my answer!

This is an interesting question. There’s clearly no definitive answer to it. My personal impression (partly based on my own experience) is that, with a reasonable amount of training data, pre-training and/or data augmentation is not especially useful (if at all). In particular:

  1. In a recent paper by Facebook, this is demonstrated for an image-detection/segmentation task: Re-thinking ImageNet pre-training. He et al. 2018.
  2. A couple of recent chilling results:
    1. Researchers from Google and Carnegie Mellon University showed that a 300x (!) increase in the number of training examples only modestly improves performance. I think it is an especially interesting result, because the data is only weakly supervised (i.e., it is the most realistic big-data scenario).
    2. Unsupervised training does not work yet for truly low-resource languages: Two New Evaluation Data-Sets for Low-Resource Machine Translation: Nepali–English and Sinhala–English. Guzman et al 2018.
  3. Here is one example from the speech-recognition domain: Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. Rao et al 2018, where pre-training works, but gains are rather modest: "We find CTC pre-training to be helpful improving WER 13.9%→13.2% for voice-search and 8.4%→8.0% for voice-dictation".
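To put the quoted speech-recognition numbers in perspective, the following snippet converts the absolute WER changes into relative error reductions. The helper name is made up for this example; the four WER values are the ones quoted above.

```python
def relative_reduction(before, after):
    """Relative reduction in percent when an error rate drops from before to after."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # WER values (in percent) quoted from the paper for the two tasks.
    results = {"voice-search": (13.9, 13.2), "voice-dictation": (8.4, 8.0)}
    for task, (before, after) in results.items():
        print(f"{task}: {relative_reduction(before, after):.1f}% relative WER reduction")
```

Both reductions come out at roughly 5% relative, which is consistent with calling the gains rather modest.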

Thus, if you are interested in obtaining SOTA results on the dataset of interest, you may need to be very clever and efficient in obtaining tons of training data. That said, pre-training certainly allows one to achieve better results in many cases, especially when the amount of training data is small. This can be really useful for bootstrapping. See, e.g., the following radiology paper: Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. Shin Hoo-Chang et al. 2016.

That said, AI is a fast-developing field, and we see particularly impressive advances in transfer learning for NLP. This series of advances largely started with the following great papers (particularly the Allen AI ELMO paper):

  1. Semi-supervised Sequence Learning. Andrew M. Dai, Q. Le. 2015.
  2. ELMO paper: Deep contextualized word representations. Peters et al 2018.

Recently, we have seen quite a few improvements on this with papers from OpenAI (GPT), Google AI (BERT), and Microsoft (I think it’s called Big Bird, but I am a bit uncertain). These improvements are huge and very encouraging. Let us not forget that the road to these successes has been paved by two seminal papers, which largely started neural NLP:

  1. Natural language processing (almost) from scratch. R Collobert, J Weston, L Bottou, M Karlen. 2011.
  2. Distributed representations of words and phrases and their compositionality. 2013. T Mikolov, I Sutskever, K Chen, GS Corrado, J Dean.

Each paper proposed its own variant of neural word embeddings, learned in an unsupervised fashion. This was clearly a demonstration of the street value of a pre-trained model in NLP. Furthermore, the first paper, which was a bit ahead of its time, went much further and presented possibly the first suite of core neural NLP tools (for POS tagging, named entity recognition, and parsing). It is worth mentioning that there are also earlier and less-known papers on neural NLP including (but not limited to) a seminal neural language modeling paper by Y. Bengio.

In conclusion, I would also note that a lot of pre-training has been done in a supervised fashion. Perhaps this is a limiting factor, as the amount of supervised data is relatively small. We may be seeing this change with more effective unsupervised pre-training methods. This has become quite obvious in the NLP domain. However, there is a positive trend in the image community too. For example, in this recent tutorial (scroll to the part on unsupervised pre-training), there are a couple of links to recent unsupervised training approaches that rival ImageNet pre-training.

Some further reading: A good overview of transfer learning is given by S. Ruder.

