log in | about 

A brief overview of classic and adversarial data augmentation techniques for speech recognition

I was plowing through papers published in ICASP and InterSpeech in 2018-2019 recently and I would like to summarize my observations. I focused primarily on data augmentation techniques and techniques for noisy/far-field speech recognition adaptation. To make this post more accessible for a reader not familiar with automatic speech recognition (ASR), I also briefly overview speech recognition architectures and their history. The post has several parts: architectures with a short historical overview, non-adversarial data augmentation, adversarial training, adversarial examples, GANs, and data augmentation with GANs, student-teacher approaches (and self-training), and some concluding remarks. Although my literature review focuses on recent papers, I have discovered a bunch of noteworthy historical papers, which I am happy to cite here.


It is difficult to talk about data augmentation and adaptation techniques without briefly overviewing modern architectures for speech recognition. Early speech recognition systems (developed in 1960s) used the dynamic programming algorithm for recognition of isolated words. These approaches were successful, see, e.g.,: Velichko, V. M., and N. G. Zagoruyko. "Automatic recognition of 200 words." International Journal of Man-Machine Studies 2.3 (1970): 223-234. However, early systems struggled with continuous speech. A substantial progress has been made by modeling the output using a Hidden Markov Model (HMM). The Markov model combines the acoustic model scores and prior phoneme probabilities into a single utterance score, which can be further improved from interpolation with language model scores. A phoneme probability or score can be computed using handcrafted features, but later this was replaced by Gaussian Mixture Models (GMM), hence, the name HMM-GMM (more details here). The first published HMM-based recognition system was Carnegie Mellon's Dragon. Fred Jelinek published on the topic one year later.

All early continuous-speech systems (and many modern ones) rely on dictionaries and grammars to carry out a constrained decoding process with the help of the beam-search. The term beam-search was possibly coined by the Carnegie Mellon professor Raj Reddy (who received the Turing award for his contributions to AI). However, as noted by Fred Jelinek, the beam-search-like algorithm itself had been around for quite a while. Crucially, an audio recording is divided into short frames (e.g., 10ms) that are further converted into spectral features, e.g., log-Mel filterbanks. For each frame, or a combination of time-adjacent frames, an acoustic model produces a distribution of phoneme probabilities. Phonemes were later replaced with senones, which are context-dependent and/or state-dependent phonemes (states of an HMM). With the introduction of neural networks it was quite natural to replace GMM-based acoustic models with neural ones. Various options were considered in late 80s and early 90s including fully-connected (DNN) and recurrent neural networks (RNN), see the following book for a summary of approaches: Bourlard, Herve A., and Nelson Morgan. "Connectionist Speech Recognition: A Hybrid Approach." (1993). These approaches are the backbone of modern hybrid HMM-DNN and HMM-RNN (HMM-LSTM) systems. Despite the early introduction, they did not outperform HMM-GMMs until about 2012: Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition." IEEE Signal processing magazine 29 (2012).

The training process of hybrid HMM-NN systems can be rather complicated: A flat start requires training an HMM-GMM system first. In a nutshell, one trains a series of progressively more accurate models by intermittent labeling and acoustic model training steps. The labeling, also known as forced-alignment, takes transcript, an existing model, and identifies timestamps of individual phonemes. This requires pronunciation dictionaries! Once we have timestamp and phoneme information, we can train an acoustic model using a standard cross-entropy training (sequence training can improve upon cross-entropy training later). In principle, it is possible to train hybrid HMM-NN systems end-to-end (i.e., without initial HMM-GMM training): Hadian, Hossein, et al. "End-to-end Speech Recognition Using Lattice-free MMI." Interspeech. 2018.

The acoustic model of the hybrid NMM-NN system has been most commonly a variant of a feed-forward neural network such as a TDNN: A time delay neural network architecture for efficient modeling of long temporal contexts. 2015, V Peddinti, D Povey, S Khudanpur, or a recurrent neural network, mostly an LSTM. There are also recent proposals to use Transformers, e.g.: Transformer-based Acoustic Modeling for Hybrid Speech Recognition, Wang et al 2019.

There are newer end-to-end architectures that do not have separate language and acoustic models. In addition, they can learn a language model as well. Most notable are: CTC and listen-attend-and-spell (LAS). LAS, which is basically a speech-to-text translation model with attention, seems to perform better than CTC. However, according to the page "WER are WE?", it is still mostly outperformed by hybrid HMM-NN models. LAS was traditionally relying on LSTMs, but it may benefit from using Transformers: Karita, Shigeki, et al. "A comparative study on transformer vs rnn in speech applications." arXiv preprint arXiv:1909.06317 (2019). However, there are numerous difficulties in using transformers, which are discussed by Desh Raj.

Non-adversarial data augmentation

One of the most iconic augmentation approaches consists in applying additive noise to create a "noisified" training data. The additive noise can be random or originate from a real source. For example, you can collect a bank of realistic background noises (e.g., sound of traffic, coughing, etc) and combine them additively. A more advanced technique is a famous room simulator, which emulates the effect of reverberations that happen in confined spaces. By varying the size of the room, the relative location of the sound source, and the microphone one can generate tons of extra training data. Although many attribute the room simulator to Kim et al 2017, it was pioneered in the following publication: J.B. Allen and D.A. Berkley. Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America. 1979. Kim et al 2017 do cite Allen and Berkley though.

The room simulator is a relatively complicated approach. A Google's SpecAugment is a much simpler method, which operates directly on log-Mel filterbanks, which represent audio signal spectrum. Authors of SpecAugment directly modify these features using a combination of warping, time, and frequency masking. They were able to substantially improve the performance of the end-to-end listen-attend-and-spell (LAS) model and claim SOTA on Librispeech. However, better results can be achieved using a hybrid HMM-DNN model.

There is also an interesting trend to use speech-synthesis tools for data augmentation. Guo et al generate massive amounts of speech using TTS to train an automatic corrector of the ASR output: Guo, Jinxi, Tara N. Sainath, and Ron J. Weiss. "A spelling correction model for end-to-end speech recognition." ICASSP 2019. A recent paper by Polyak et al. proposed an encoder-decoder speaker-conversion approach trained in a speech reconstruction task (they quantize audio into 256 classes and train using a cross-entropy loss). The encoder is designed to be speaker-independent (or rather speaker-universal) while the decoder is a Wave-Net based decoder parameterized by a latent speaker embedding. Given a new audio sample one can first encode it and then decode using various speaker embeddings, thus, doing voice "transplantation" without any parallel data. McCarthy et al. used this approach to boost performance in a speech translation task: McCarthy, Arya D., Liezl Puzon, and Juan Pino. "SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation. ICASSP 2020.

Adversarial training, adversarial examples, GANs, and data augmentation with GANs

Generative Adversarial Networks (GANs) is a popular adversarial-training technique, where two separate neworks (a generator and a discriminator) play a game. All data points are obtained by first sampling a value of a latent random variable $z$ and then transforming it using the generator network. The generator neural network strives to produce fake data with an objective to fool a discriminator. The discriminator neural network, in turn, learns how to distinguish real data points from fake one.

There are a number of GAN variants including conditional GANs, which generate from both the latent variable $z$ and some input variable $x$. Conditional GANs can produce appealing and nearly photorealistic images (a model aka BigGAN). An overview of many models can be found in Kurach, Karol, et al. "The GAN landscape: Losses, architectures, regularization, and normalization." (2018). There is an interesting paper arguing that gradient-penalized GANs and SVMs can be derived from the same framework: Jolicoeur-Martineau, Alexia, and Ioannis Mitliagkas. "Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs." (2019).

One particularly interesting approach is the famous CycleGAN Zhu, Jun-Yan, et al. "Unpaired image-to-image translation using cycle-consistent adversarial networks." ICCV 2017, which permits to train an image-to-image or speech-to-speech translation model without using a parallel collection of examples (ie.., source-target pairs of data points). I believe CycleGAN has inspired the unsupervised machine translation, although, due to the discrete nature of language data it cannot be applied directly.

Despite a tremendous number of papers on the topic, there has been a limited success in using GANs as data augmentation tools. In particular, the authors of the recent paper "Seeing is Not Necessarily Believing: Limitations of BigGANs for Data Augmentation. Suman Ravuri, Oriol Vinyals, 2019" show that BigGANs cannot produce good-quality training data (at least in their study). It was, therefore, especially interesting for me to study the usefulness of adversarial training for speech applications.

As Ian Godfellow et al. write in their famous paper that introduced GANs: "GANs are often confused with the related concept of “adversarial examples”. Adversarial examples are examples found by using gradient-based optimization directly on the input to a classification network, in order to find examples that are similar to the data yet misclassified". These examples are easy to create for images, because they can be considered as continuous data, but difficult for text, which is discrete. Audio data is also continuous, but creation of adversarial examples was somewhat challenging for audio as well. First of all, in an actual over-the-air attack, the attacking system needs to attack based on the future, i.e., uncertain events (i.e., before the completion of an utterance). However, even in the digital-only attack where the attacking system does know the utterance, it is apparently difficult to create a low-magnitude noise that changes the output of the ASR system. For an interested reader, there is a recent Interspeech paper on this topic: Neekhara, Paarth, et al. "Universal adversarial perturbations for speech recognition systems." Interspeech (2019).

I would like to note that GANs is one technique based on the idea of adversarial training, but it is not the only one. In the very basic form, adversarial training is simply a multi-task learning, where one of the tasks consists in predicting a domain of interest. For example, to train a more robust system one can have an auxiliary loss (and an additional sub-network) that predicts whether speech is clean or noisy. An interesting variant of this approach involves reversal of gradients provided by the additional sub-network: Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in ICML 2015.

Adversarial training with gradient reversal can be loosely interpreted as a GAN network, where discriminator and generator have the common sub-network (see Figure 1 in the Ganin and Lempitsky paper). The objective of training is to teach this common sub-network to produce domain-agnostic features. In other words, it learns how to remove domain-specific traces while maintaining the accuracy on the main task (and minimizing the main loss such as recognition accuracy). The additional sub-network, in contrast, learns how to distinguish domain traces no matter how small they are.

There are two recent ICASSP/Interspeech 2019 papers that use adversarial training that I would like to highlight. In Meng, Zhong, Jinyu Li, and Yifan Gong. "Adversarial speaker adaptation." ICASSP 2019 use an adversarial loss to perform speaker adaptation. They do it with a classic hybrid HMM-RNN system. In a hybrid system, there is a separate acoustic model that predicts a distribution of senones (basically context-dependent phones) from a speech-frame. Because the speaker-dependent data is scarce, a speaker-dependent (SD) model can overfit easily. The standard approach to deal with this is the so-called KL-divergence (KLD) training. It is a regularization technique, which "forces" senone distribution produced by the SD model to be close to the distribution of the speaker-independent (SI) model. The explored alternative to KLD training is the adversarial training with gradient reversal. Overall, the paper achieves good gains in the order of 10% over the SI model (due to adaptation). The baseline KLD training allows to outperform the SI model by 5%. Thus, adversarial training leads to an additional 5% accuracy gain compared to the KLD training.

In Liu, Bin, et al. "Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition." Interspeech 2019, the authors argue that various speech-enhancement front-ends introduce distortions. This is claimed to affect end-to-end speech systems quite a bit. They use an adversarial loss, albeit without gradient reversal to fix this. They achieve a small gain compared to re-training the system on "noisified" training data (apparently using additive noise). However, the overall character error rates of about 50% on the noisy data seem to be a bit too high. One may need more drastic solutions to resolve the problem.

Similar to enhancing low-resolution images (aka superresolution), one can use GANs for speech enhancement and noise reduction. Although this may not necessarily improve speech models, it can still improve a perception of sound by humans. I would like to highlight the following two papers:

  1. SEGAN: Speech Enhancement Generative Adversarial Network,” in Proc. INTERSPEECH, 2017
  2. A Qualcomm Technologies paper: Li, Sen, et al. "Speech bandwidth extension using generative adversarial networks." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

The Speech Enhancement GAN (SEGAN) paper operates directly on the waveform. To train and test SEGAN authors add artificial (apparently, additive) noise to an existing clean set. Thus, SEGAN is trained on the parallel data to translate a "noisified" version of the signal into the original, i.e., clean signal. For both objective and subjective (i.e., human perception) tests SEGAN outperformed the Wiener filtering baseline.

The Qualcomm paper by Li, Sen et al. applies conditional GANs to the bandwidth extension: A certain transmission medium may truncate high frequencies to save bandwidth, which results in audio quality loss. There are a number of techniques to restore missing frequencies. Sen Li et al. apply the GAN to line spectral frequencies and outperform classic non-neural approaches. Sen Li et al. mention that deep neural networks were applied to this problem in the past: Wang, Yingxue, et al. "Speech bandwidth expansion based on deep neural networks", Interspeech 2015, but they do not compare to the work of Wang et al 2015. They use, however, their approach to pre-train the GAN generator.

Regular conditional GANs require parallel data (also called paired data) for training. There are attempts to apply the above mentioned CycleGAN approach in the scenario when such data does not exist. In particular, Fang, Fuming, et al. 2018 apply CycleGAN to male-to-female (and vice versa) foice conversion. Zhao, Shengkui, et al. 2019 apply CycleGAN to train the speech recognition system. The paper uses a hybrid and not end-to-end ASR system. If I read their Table 1 correctly, they achieve modest (order of 5%) gains compared to models that use similarly complex acoustic modelling networks. However, they do not compare to training on noise-enhanced data!

In their ICASSP 2018 paper "Exploring speech enhancement with generative adversarial networks for robust speech recognition" Donahue and et al carry out a more comprehensive evaluation, where they compare GANs against some of the best data augmentation techniques, which include both additive and reverberant noise. As a source of reverberant noise they the above-mentioned room simulator technique. A combination of the additive and reverberant noise training is referred to as the multi-style training (MTR).

Donahue et al first trained a previously mentioned SEGAN model to enhance audio and remove noise (i.e., using a GAN as a frontend). This approach was not beneficial. Then, they explored a frequency-domain SEGAN (FSEGAN), which operates on spectral features, namely, log-Mel filterbanks. Compared to applying the GAN to raw audio, FSEGAN requires less computation and is claimed to be more resistant against reverberant noise. The classic filter banks aim to mimic a nonlinear human ear perception of sound, which is more discriminative at lower frequencies. In contrast, Donahue et al use a large number of equally-spaced filters.

Although conditional GANs can have a latent vector variable $z$, which can be used to sample several outputs given a single input, authors find that generators learn to largely ignore $z$. In fact, removing $z$ improves quality and turns the conditional GAN into a fully deterministic "image translator". The overall loss function is a combination of the adversarial loss and the L1-reconstruction loss.

Donahue et al run experiments with an artificially noisified variant of the famous WSJ corpus and create both training and testing sets with added noise. Perhaps surprising to some readers, the model trained on clean data performs abysmally poor on the noisy variant with word error rate (WER) increasing from 11.9% to 72.2%! They further experiment with a speech enhancer trained with an adversarial GAN loss and with a simple reconstruction loss (L1). Speech enhancing models are used in two ways: First they are used as a speech-enhancing front-end that takes original noisy audio features and "cleans/enhances" them. Then, the original speech recognition model is used. This does not require re-training of the speech system. Second, the speech enhancer is used to produce an additional set of features. This requires re-training the model.

When SEGAN is used as a front-end it does not help and the WER becomes as large as 80%. The front-end based on the proposed FSEGAN does reduce WER to 33%. However, simply re-training the model on the noisy training data produces a much better WER of 20%. When speech enhancers are used to augment original features, it becomes possible to beat training on the noisy data. However, the GAN-based enhancer performs worse than the enhancer based on L1 reconstruction loss. It seems that an idea of using a GAN-based enhancing front-end is quite popular in the computer vision domain. However, many papers, see, e.g., this recent publication do not compare against enhancers trained with a simpler reconstruction loss.

Teacher-student approaches and self-training

There are a number of training and adaptation techniques, which use a teacher-student approach. In this approach, an output of one model, i.e., a teacher, is used to train another model, i.e., a student. Two common teacher-student approaches consist in learning from (1) the teacher-provided labels and (2) the teacher-provided distribution of classes (by minimizing the KL-divergence between the distributions). In the speech recognition context, this approach seems to have been applied only to hybrid HMM-NN systems, which, as I mentioned before, have a separate acoustic model that predicts a distribution of senones (basically context-dependent phones) from a speech-frame.

The teacher-student approach has become well-known after Hinton et al. published a paper on knowledge distillation. However, the technique is much older. In modern history, it was published earlier by Li et al.: J. Li, R. Zhao, J.-T. Huang and Y. Gong, “Learning small-size DNN with output-distribution-based criteria,” in Proc. Interspeech, 2014. If we dig a bit deeper, we discover that the teacher-student approach was a hot topic sixty years ago:

  1. Probability of error of some adaptive pattern-recognition machines H Scudder - IEEE Transactions on Information Theory, 1965.
  2. Spragins, John. "Learning without a teacher." IEEE transactions on information theory 12.2 (1966): 223-230.

An approach related to the student-teacher learning is self-training. In this case, the model is used to label/classify a large amount of data that does not have human annotations. This labeling is then used to train a new model. Thus, the model itself is its own teacher! It is not quite clear why this approach is useful: I suspect it is some form of the regularization. Lo and behold, there is a recent theoretical paper seemingly supporting my hunch.

In any case, there are multiple reports that a self-training (self-distillation) approach can work pretty well. In particular, Amazon claims to have achieved 11-13% reduction in the WER after self-training on as much as one million hours of speech: Parthasarathi, Sree Hari Krishnan, and Nikko Strom. "Lessons from building acoustic models with a million hours of speech." ICASSP 2019. In this work, authors employ an HMM-LSTM hybrid system. They first label a lot of data without transcripts and select hypotheses with high confidence scores. Then, they re-train the acoustic model on these labels. The teacher-student approach has also been growing in popularity as an adaptation technique. I spotted quite a few recent adaptation papers that rely on it:

  1. Li, Jinyu, et al. "Developing far-field speaker system via teacher-student learning." ICASSP 2018.
  2. Tan, Tian, Yanmin Qian, and Dong Yu. "Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition." ICASSP 2018.
  3. Kim, Jaeyoung, Mostafa El-Khamy, and Jungwon Lee. "Bridgenets: Student-teacher transfer learning based on recursive neural networks and its application to distant speech recognition." ICASSP 2018.
  4. Suzuki, Takahito, et al. "Knowledge Distillation for Throat Microphone Speech Recognition." Interspeech 2019.

In particular, in the first paper. Li et al 2018 first label close-speech data using an accurate model. Then, they replay the close-talk audio with an artificial mouth through the air. Thus, they get parallel far-field data. Because the simulated data should largely retain the close-speech phone timestamps, one can train a student model to mimic senone distribution of a teacher model.

Concluding remarks

The objective of my literature review was to understand the value of GANs as data augmentation and enhancing techniques, especially, for the speech recognition domain. However, I have found no strong evidence that they can outperform traditional data augmentation methods such as the acoustic room simulator. They may still be useful as speech enhancers for humans, albeit it is not clear to me if they consistently outperform enhancers trained with a reconstruction loss.

I also note that there might be some confusion between a "true" GANs (that have separate discriminator and generator networks) and a more generic concept of adversarial training. The latter can be simply a multi-task training procedure where one of the tasks consists in predicting a domain (possibly with reversing gradients). Unconditional GANs always have a latent random variable that permits sampling. This might allow us to generate more training data. However, such training data is not sufficiently realistic for the purpose of training a machine learning model.

Furthermore, in many applications we use a conditional GAN (which is basically a translation network) without a latent random variable, because it works better this way. There may be a modest gain by enhancing original, potentially noisy, features with a GAN-based enhancer. However, a comparable effect can be achieved by training an enhancer with a simpler reconstruction loss (and without the discriminator network). Doing so usually requires a parallel data set, e.g., a set of noisy speech aligned with the clean speech, but, quite interestingly, a reconstruction loss can still be useful even in the absence of paired data (in particular, for a speaker conversion). When parallel data is not available, one might also benefit from using a CycleGAN that can learn a domain translation model without any aligned data. Unfortunately, there is only limited evidence that such an approach is useful (and it is not clear if it can outperform training with the room simulator).


I thank Desh Raj for the discussion/references related to Transformer architectures and Arya McCarthy for the discussion/references of TTS-based augmentation.

If you find this mini-survey useful, feel free to cite it:

title={A brief overview of classic and adversarial data augmentation techniques for speech recognition},
author={Boytsov, Leonid},

MNIST is super easy and few people know it!

One can never be too surprised by the phenomenal success of the MNIST dataset, which is used in so many image publications. But do people realize how easy this dataset is? One clear measure of hardness is performance of a simplistic k-NN classifier with vanilla L2 metric directly on pixels. As a variant: performance of the k-NN classifier with some basic unsupervised transformations such as the principal component analysis (PCA) or denoising.

I created a small poll to assess what people think about MNIST's k-NN search accuracy. I thank everybody for participation: Fortunately, more than one hundred people responded (most of them are machine learning practitioners and enthusiasts I assume). So, I think the results are rather reliable.

In summary, nearly 40% of the respondents think that the accuracy would be at most 80%, 45% think the accuracy is 95%. Unfortunately, I did not create the option for 90%. I think it would have had quite a few responses as well. That said the vanilla k-NN search on pixels has 97% accuracy and the combination of the PCA and the k-NN classifier has nearly 98% accuracy (here is a notebook to back up 98% claim.). In fact, with a bit of additional pre-processing such as deskewing and denoising, one can get a nearly 99% accuracy.

Turns out that few people realize how effective the k-NN classifier is on MNIST: only 17% voted for 98%. That said, it does not mean that the k-NN classifier is such a good method overall (it can be good for tabular data, see, e.g., this paper by Shlomo Geva, but not for complex image data, check, e.g., out numbers for CIFAR and IMAGENET). It means, however, that MNIST is very easy. Understandably, people need some toy dataset to play and quickly get results with. One better alternative is the fashion MNIST. However, it is not too hard either. A vanilla k-NN classifier has about 85% accuracy and it is probably possible to push the accuracy close to 90% with a bit of preprocessing. Thus, we may need a comparably small, but much more difficult dataset to replace both of them.

Hello precision my old friend!

PREAMBLE:When dealing with retrieval, I have been traditionally using TREC NIST evaluation tools (trec_eval and gdeval) for information retrieval. Despite these tools are old, there has been a good amount of effort invested into making them right. Unfortunately, you have to call them as an external tool. Your program forks and runs out of memory. Despite Linux fork is lazy and does not really copy memory, it still happens. It happens even if you use the posix_spawn function and Python's spawn-type creation of new processes: multiprocessing.set_start_method('spawn')

The issue: I decided to switch to scikit-learn or a similarly-interface code (e.g., MatchZoo classes) to compute the IR metrics. I cross-compared results and I have come to the conclusion that very likely all scikit-learn-like packages are fundamentally broken when it comes to computing the mean average precision (MAP) and the normalized discounted cumulative gain NDCG

To compute both of the metrics, one needs two things:

  1. The list of relevant documents, where the relevance label can be binary or graded
  2. The list of scored/ranked documents.

Ideally, an evaluation tool could ingest this data directly. However, sklearn and other libraries cut the corner by accepting two arrays: y_score and y_true. Effectively each document is paired with its relevance grade, see, e.g., scikit-learn MAP.

Unfortunately, such an evaluation ignores all relevant documents, which are not returned by the system. In that, both NDCG and MAP have a normalizing factor that depends on the number of relevant documents. For example, in my understanding, if your system finds only 10% of all relevant documents, the scikit-learn MAP would produce a 10x larger MAP score compared to NIST trec_eval (and the Wikipedia formula). NDCG is still affected by this issue but to a lesser degree, because scores for omitted relevant documents will be heavily discounted.

I have created the notebook to illustrate this issue using one-query example and the MAP metric. By the way, for some reason, scikit-learn refuses to compute NDCG on this data and fails with a weird error.

Related reading: MAP is often (but not always) a meaningless metric if you do intrinsic evaluation of the k-NN search.

Accurate Lucene BM25 : Redux

About five-six years ago, I discovered that a default Lucene BM25 similarity was giving me sub-optimal results, apparently due to a lossy encoding of document lengths (which was a part of Lucene's efficiency trick). I found this when I reimplemented BM25 on my own, but without a lossy document encoding. On my data, the difference was about 10%, which was far from being a trifle. I have run a good number of experiments where this difference was present. It was clearly not a random fluke or mirage. I eventually created a benchmark and published a blog post. I even made some noise on the Lucene dev list and promised to submit a patch. However, this did not happen as I got busy and Lucene changed its internal API.

Recently I was fortunate enough to revisit this problem thanks to Chris Kamphuis, Arjen P. de Vries, and Jimmy Lin who took me aboard their "Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants". They also did most of the work by testing several BM25 variants of which my accurate Lucene similarity was only a small piece. Somewhat surprisingly, the "often maligned approximation of document length" in Lucene performed nearly as well as the accurate similarity. Another result is that there are only very small differences among various BM25 implementations. I think it is an important finding on which I reflect in the very end of the post (please read that last paragraph).

Now, there are two voices in my head: one that "maligns the approximation of the document length" and another that says this approximation is ok. How should we reconcile the voices? Because the scope and the size of the paper did not permit a more thorough experimentation and description, I have carried an additional code analysis that has not been included into the paper. This analysis is below.

My original experiments were run with Lucene 6 (and earlier versions). Lucene 6 does not encode a document length directly. Instead, it approximates the inverse square root of the length. Thus, it introduces an approximation error for basically every possible document length! Lucene 7 supports the old scheme, but already introduces a new encoding of a document length, which stores small numbers (less than 25) exactly and retains four most significant binary digits for large numbers (see my test code), which is basically a variant of sign-free exponent-shifted quarter-precision format (additionally they count only the number of unique terms, which reduces the value of a document length that needs to be encoded). I think that this new approximation scheme is much more accurate .

Thus, I have to disagree a bit with somewhat optimistic conclusions of our paper that it does not matter which BM25 implementations to use. It seems to be true only for sufficiently careful implementations of BM25, including the recent Lucene's one. However, it is also clearly possible to screw up BM25 rather easily.

In conclusion, I would like to note that results of our paper should be treated in a broader context. There is somewhat anecdotal knowledge that various papers reported different effectiveness values for BM25 similarity on identical collections. Some people (including me) tended to think it was due to differences in BM25 implementations. However, the paper by Trotman et al showed that it was likely due to confounding factors such as the choice of lemmatization/stemming, tokenization, stopping, and data cleaning algorithms: Trotman, A., Puurula, A., & Burgess, B. (2014, November). Improvements to BM25 and language models examined. Clearly, our results support the conclusions made by Trotman et al.

Bringing a large Russian QA data set to light

"It is achingly apparent that an overwhelming amount of research in speech and language technologies considers exactly one human language: English." (Kyle Gorman) For this reason Emily Bender has been famously encouraging people to (1) explicitly name languages they work on (2) do more work on non-English-data. This has become known as a Bender rule.

Despite the importance of multilingual NLP, frankly speaking, it has been difficult to have an opportunity to work on non-English data (in the previous decade my only major opportunity was a stint on cross-lingual metaphor detection). I am therefore very pleased to have been recently participating in bringing to light a large Russian question-answering/reading-comprehension (QA) data set SberQuAD, which was created similarly to SQuAD.

I have been helping my co-authors Pavel Efimov and Pavel Braslavski (who did nearly all the work) to analyze and describe this data set. We have conducted a very thorough analysis and evaluated several powerful models. The full analysis is available online, but here I would like to highlight the following:

SberQuAD was created similarly to Stanford SQuAD. Yet, despite the similarities, all the models perform worse on SberQuAD than on SQuAD, which can be attributed to having only a single answer variant and fewer answers that are named entities. A lot of answers in SberQuAD still often contain an entity, but it is normally only a part of an answer. This stands in contrast to SQuAD where roughly half of the answers are named entities.


Subscribe to RSS - blogs