 

This post is written in response to a Quora question, which asks about the street value of pre-trained models. Feel free to vote for my answer there!

This is an interesting question. There’s clearly no definitive answer to it. My personal impression (partly based on my own experience) is that, given a reasonable amount of training data, pre-training and/or data augmentation is not especially useful (if useful at all). In particular (a short sketch of what augmentation means follows the list):

  1. In a recent paper from Facebook, this is demonstrated for object detection and instance segmentation: Rethinking ImageNet Pre-training. He et al. 2018.
  2. A couple of recent sobering results:
    1. Researchers from Google and Carnegie Mellon University showed that a 300x (!) increase in the number of training examples improves performance only modestly. I find this result especially interesting because the data is only weakly supervised (i.e., it is the most realistic big-data scenario).
    2. Unsupervised training does not yet work for truly low-resource languages: Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English. Guzmán et al. 2018.
  3. Here is one example from the speech-recognition domain: Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer. Rao et al. 2018. Pre-training works here, but the gains are rather modest: "We find CTC pre-training to be helpful improving WER 13.9%→13.2% for voice-search and 8.4%→8.0% for voice-dictation".
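
Since the discussion above contrasts pre-training with data augmentation, here is a minimal sketch of what image-side augmentation typically looks like (assuming PyTorch/torchvision; this is purely illustrative and not taken from any of the cited papers):

```python
# A minimal, illustrative augmentation pipeline (PyTorch/torchvision assumed).
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(224),          # random crop + rescale
    T.RandomHorizontalFlip(),          # mirror the image half of the time
    T.ColorJitter(0.2, 0.2, 0.2),      # perturb brightness/contrast/saturation
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),   # ImageNet statistics
])

# Each training epoch then sees a slightly different version of every image, e.g.:
# dataset = torchvision.datasets.ImageFolder("train/", transform=train_transform)
```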

Thus, if you are interested in obtaining SOTA results on your dataset of interest, you may need to be very clever and efficient about collecting tons of training data. That said, pre-training certainly allows one to achieve better results in many cases, especially when the amount of training data is small, which makes it really useful for bootstrapping. See, e.g., the following radiology paper: Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. Shin et al. 2016.
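
To make the bootstrapping scenario concrete, here is a minimal transfer-learning sketch (assuming PyTorch/torchvision): reuse an ImageNet-pre-trained network and retrain only its final classifier on a small target dataset. The class count and setup are hypothetical, not the configuration from the radiology paper above.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 2  # hypothetical small medical-imaging task

# Load a network with ImageNet-pre-trained weights (classic torchvision API).
model = models.resnet18(pretrained=True)

# Freeze the pre-trained feature extractor...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the classification head with a fresh, trainable layer.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# A standard training loop over the (small) labeled target dataset goes here:
# for images, labels in train_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```

With very little labeled data it is common to train only the new head first and unfreeze (fine-tune) deeper layers later, if at all.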

In any case, AI is a fast-developing field, and we are seeing particularly impressive advances in transfer learning for NLP. This line of work largely took off with the following great papers (most notably the AllenAI ELMo paper):

  1. Semi-supervised Sequence Learning. Dai and Le. 2015.
  2. The ELMo paper: Deep Contextualized Word Representations. Peters et al. 2018.
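
To make the idea of re-using a pre-trained language model concrete, here is a minimal feature-extraction sketch. It uses the Hugging Face transformers library and a BERT-style model rather than ELMo itself, so treat the model name and setup as assumptions for illustration:

```python
# Use a pre-trained language model as a frozen feature extractor
# (Hugging Face `transformers` assumed; BERT-style stand-in for ELMo).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "Pre-trained representations transfer surprisingly well."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token; a small task-specific classifier or tagger
# can be trained on top of these, or the whole model can be fine-tuned end to end.
token_embeddings = outputs.last_hidden_state   # shape: (1, num_tokens, hidden_size)
```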

Recently, we have seen quite a few improvements on this with papers from OpenAI (GPT), Google AI (BERT), and Microsoft (I think it is called Big Bird, but I am a bit uncertain). These improvements are huge and very encouraging. Let us not forget that the road to these successes was paved by two seminal papers, which largely started neural NLP:

  1. Natural Language Processing (Almost) from Scratch. Collobert et al. 2011.
  2. Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al. 2013.

Each paper proposed its own variant of neural word embeddings, learned in an unsupervised fashion. This was clearly a demonstration of the street value of a pre-trained model in NLP. Furthermore, the first paper, which was a bit ahead of its time, went much further and presented possibly the first suite of core neural NLP tools (for POS tagging, chunking, named entity recognition, and semantic role labeling). It is worth mentioning that there are also earlier and less well-known papers on neural NLP, including (but not limited to) the seminal neural language modeling paper by Y. Bengio.
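
As a toy illustration of such unsupervised word embeddings, here is a sketch using gensim's word2vec implementation (gensim 4.x API assumed; the tiny corpus is made up, and in practice one would train on a large corpus or simply load publicly released pre-trained vectors):

```python
from gensim.models import Word2Vec

# Tiny made-up corpus: each "sentence" is a list of tokens.
corpus = [
    ["neural", "networks", "learn", "word", "representations"],
    ["pre-trained", "embeddings", "transfer", "to", "new", "tasks"],
    ["word", "embeddings", "capture", "distributional", "similarity"],
]

# Train skip-gram embeddings (sg=1) in an unsupervised fashion.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1)

vector = model.wv["embeddings"]           # a 50-dimensional vector for one word
similar = model.wv.most_similar("word")   # nearest neighbours in embedding space
```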

In conclusion, I would also note that a lot of pre-training has been done in a supervised fashion. Perhaps this is a limiting factor, as the amount of supervised data is relatively small. We may be seeing this change with more effective unsupervised pre-training methods. This has already become quite obvious in the NLP domain, but there is a positive trend in the image community too. For example, in this recent tutorial (scroll to the section on unsupervised pre-training), there are a couple of links to recent unsupervised training approaches that rival ImageNet pre-training.
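
To give a flavour of what unsupervised (more precisely, self-supervised) pre-training can look like on the image side, here is a toy sketch of a rotation-prediction pretext task, in the spirit of rotation-based methods; it is my own illustration, not the approach from the tutorial linked above:

```python
# Toy self-supervised pretext task: predict how each image was rotated
# (0/90/180/270 degrees). No human labels are needed; the "labels" are
# the rotations we apply ourselves.
import torch
import torch.nn as nn

def rotate_batch(images):
    """Rotate each image by a random multiple of 90 degrees; the rotation index is the label."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

encoder = nn.Sequential(                  # small CNN backbone to be pre-trained
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
rotation_head = nn.Linear(32, 4)          # 4-way rotation classifier (pretext task only)

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(rotation_head.parameters()))
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)        # stand-in for a batch of unlabeled images
rotated, labels = rotate_batch(images)
loss = criterion(rotation_head(encoder(rotated)), labels)
loss.backward()
optimizer.step()

# After pretext training, `encoder` can be fine-tuned on the real (small) labeled task.
```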

Some further reading: a good overview of transfer learning is given by S. Ruder.