I was plowing through papers published in ICASP and InterSpeech in 2018-2019 recently and I would like to summarize my observations. I focused primarily on data augmentation techniques and techniques for noisy/far-field speech recognition adaptation. To make this post more accessible for a reader not familiar with automatic speech recognition (ASR), I also briefly overview speech recognition architectures and their history. The post has several parts: architectures with a short historical overview, non-adversarial data augmentation, adversarial training, adversarial examples, GANs, and data augmentation with GANs, student-teacher approaches (and self-training), and some concluding remarks. Although my literature review focuses on recent papers, I have discovered a bunch of noteworthy historical papers, which I am happy to cite here.

#### Architectures

It is difficult to talk about data augmentation and adaptation techniques without briefly overviewing modern architectures for speech recognition. Early speech recognition systems (developed in 1960s) used the dynamic programming algorithm for recognition of isolated words. These approaches were successful, see, e.g.,: Velichko, V. M., and N. G. Zagoruyko. "Automatic recognition of 200 words." International Journal of Man-Machine Studies 2.3 (1970): 223-234. However, early systems struggled with continuous speech. A substantial progress has been made by modeling the output using a Hidden Markov Model (HMM). The Markov model combines the acoustic model scores and prior phoneme probabilities into a single utterance score, which can be further improved from interpolation with language model scores. A phoneme probability or score can be computed using handcrafted features, but later this was replaced by Gaussian Mixture Models (GMM), hence, the name HMM-GMM (more details here). The first published HMM-based recognition system was Carnegie Mellon's Dragon. Fred Jelinek published on the topic one year later.

All early continuous-speech systems (and many modern ones) rely on dictionaries and grammars to carry out a constrained decoding process with the help of the beam-search. The term beam-search was possibly coined by the Carnegie Mellon professor Raj Reddy (who received the Turing award for his contributions to AI). However, as noted by Fred Jelinek, the beam-search-like algorithm itself had been around for quite a while. Crucially, an audio recording is divided into short frames (e.g., 10ms) that are further converted into spectral features, e.g., log-Mel filterbanks. For each frame, or a combination of time-adjacent frames, an acoustic model produces a distribution of phoneme probabilities. Phonemes were later replaced with senones, which are context-dependent and/or state-dependent phonemes (states of an HMM). With the introduction of neural networks it was quite natural to replace GMM-based acoustic models with neural ones. Various options were considered in late 80s and early 90s including fully-connected (DNN) and recurrent neural networks (RNN), see the following book for a summary of approaches: Bourlard, Herve A., and Nelson Morgan. "Connectionist Speech Recognition: A Hybrid Approach." (1993). These approaches are the backbone of modern hybrid HMM-DNN and HMM-RNN (HMM-LSTM) systems. Despite the early introduction, they did not outperform HMM-GMMs until about 2012: Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition." IEEE Signal processing magazine 29 (2012).

The training process of hybrid HMM-NN systems can be rather complicated: A flat start requires training an HMM-GMM system first. In a nutshell, one trains a series of progressively more accurate models by intermittent labeling and acoustic model training steps. The labeling, also known as forced-alignment, takes transcript, an existing model, and identifies timestamps of individual phonemes. This requires pronunciation dictionaries! Once we have timestamp and phoneme information, we can train an acoustic model using a standard cross-entropy training (sequence training can improve upon cross-entropy training later). In principle, it is possible to train hybrid HMM-NN systems end-to-end (i.e., without initial HMM-GMM training): Hadian, Hossein, et al. "End-to-end Speech Recognition Using Lattice-free MMI." Interspeech. 2018.

The acoustic model of the hybrid NMM-NN system has been most commonly a variant of a feed-forward neural network such as a TDNN: A time delay neural network architecture for efficient modeling of long temporal contexts. 2015, V Peddinti, D Povey, S Khudanpur, or a recurrent neural network, mostly an LSTM. There are also recent proposals to use Transformers, e.g.: Transformer-based Acoustic Modeling for Hybrid Speech Recognition, Wang et al 2019.

There are newer end-to-end architectures that do not have separate language and acoustic models. In addition, they can learn a language model as well. Most notable are: CTC and listen-attend-and-spell (LAS). LAS, which is basically a speech-to-text translation model with attention, seems to perform better than CTC. However, according to the page "WER are WE?", it is still mostly outperformed by hybrid HMM-NN models. LAS was traditionally relying on LSTMs, but it may benefit from using Transformers: Karita, Shigeki, et al. "A comparative study on transformer vs rnn in speech applications." arXiv preprint arXiv:1909.06317 (2019). However, there are numerous difficulties in using transformers, which are discussed by Desh Raj.

One of the most iconic augmentation approaches consists in applying additive noise to create a "noisified" training data. The additive noise can be random or originate from a real source. For example, you can collect a bank of realistic background noises (e.g., sound of traffic, coughing, etc) and combine them additively. A more advanced technique is a famous room simulator, which emulates the effect of reverberations that happen in confined spaces. By varying the size of the room, the relative location of the sound source, and the microphone one can generate tons of extra training data. Although many attribute the room simulator to Kim et al 2017, it was pioneered in the following publication: J.B. Allen and D.A. Berkley. Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America. 1979. Kim et al 2017 do cite Allen and Berkley though.

The room simulator is a relatively complicated approach. A Google's SpecAugment is a much simpler method, which operates directly on log-Mel filterbanks, which represent audio signal spectrum. Authors of SpecAugment directly modify these features using a combination of warping, time, and frequency masking. They were able to substantially improve the performance of the end-to-end listen-attend-and-spell (LAS) model and claim SOTA on Librispeech. However, better results can be achieved using a hybrid HMM-DNN model.

There is also an interesting trend to use speech-synthesis tools for data augmentation. Guo et al generate massive amounts of speech using TTS to train an automatic corrector of the ASR output: Guo, Jinxi, Tara N. Sainath, and Ron J. Weiss. "A spelling correction model for end-to-end speech recognition." ICASSP 2019. A recent paper by Polyak et al. proposed an encoder-decoder speaker-conversion approach trained in a speech reconstruction task (they quantize audio into 256 classes and train using a cross-entropy loss). The encoder is designed to be speaker-independent (or rather speaker-universal) while the decoder is a Wave-Net based decoder parameterized by a latent speaker embedding. Given a new audio sample one can first encode it and then decode using various speaker embeddings, thus, doing voice "transplantation" without any parallel data. McCarthy et al. used this approach to boost performance in a speech translation task: McCarthy, Arya D., Liezl Puzon, and Juan Pino. "SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation. ICASSP 2020.

Generative Adversarial Networks (GANs) is a popular adversarial-training technique, where two separate neworks (a generator and a discriminator) play a game. All data points are obtained by first sampling a value of a latent random variable $z$ and then transforming it using the generator network. The generator neural network strives to produce fake data with an objective to fool a discriminator. The discriminator neural network, in turn, learns how to distinguish real data points from fake one.

There are a number of GAN variants including conditional GANs, which generate from both the latent variable $z$ and some input variable $x$. Conditional GANs can produce appealing and nearly photorealistic images (a model aka BigGAN). An overview of many models can be found in Kurach, Karol, et al. "The GAN landscape: Losses, architectures, regularization, and normalization." (2018). There is an interesting paper arguing that gradient-penalized GANs and SVMs can be derived from the same framework: Jolicoeur-Martineau, Alexia, and Ioannis Mitliagkas. "Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs." (2019).

One particularly interesting approach is the famous CycleGAN Zhu, Jun-Yan, et al. "Unpaired image-to-image translation using cycle-consistent adversarial networks." ICCV 2017, which permits to train an image-to-image or speech-to-speech translation model without using a parallel collection of examples (ie.., source-target pairs of data points). I believe CycleGAN has inspired the unsupervised machine translation, although, due to the discrete nature of language data it cannot be applied directly.

Despite a tremendous number of papers on the topic, there has been a limited success in using GANs as data augmentation tools. In particular, the authors of the recent paper "Seeing is Not Necessarily Believing: Limitations of BigGANs for Data Augmentation. Suman Ravuri, Oriol Vinyals, 2019" show that BigGANs cannot produce good-quality training data (at least in their study). It was, therefore, especially interesting for me to study the usefulness of adversarial training for speech applications.

As Ian Godfellow et al. write in their famous paper that introduced GANs: "GANs are often confused with the related concept of “adversarial examples”. Adversarial examples are examples found by using gradient-based optimization directly on the input to a classification network, in order to find examples that are similar to the data yet misclassified". These examples are easy to create for images, because they can be considered as continuous data, but difficult for text, which is discrete. Audio data is also continuous, but creation of adversarial examples was somewhat challenging for audio as well. First of all, in an actual over-the-air attack, the attacking system needs to attack based on the future, i.e., uncertain events (i.e., before the completion of an utterance). However, even in the digital-only attack where the attacking system does know the utterance, it is apparently difficult to create a low-magnitude noise that changes the output of the ASR system. For an interested reader, there is a recent Interspeech paper on this topic: Neekhara, Paarth, et al. "Universal adversarial perturbations for speech recognition systems." Interspeech (2019).

I would like to note that GANs is one technique based on the idea of adversarial training, but it is not the only one. In the very basic form, adversarial training is simply a multi-task learning, where one of the tasks consists in predicting a domain of interest. For example, to train a more robust system one can have an auxiliary loss (and an additional sub-network) that predicts whether speech is clean or noisy. An interesting variant of this approach involves reversal of gradients provided by the additional sub-network: Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in ICML 2015.

Adversarial training with gradient reversal can be loosely interpreted as a GAN network, where discriminator and generator have the common sub-network (see Figure 1 in the Ganin and Lempitsky paper). The objective of training is to teach this common sub-network to produce domain-agnostic features. In other words, it learns how to remove domain-specific traces while maintaining the accuracy on the main task (and minimizing the main loss such as recognition accuracy). The additional sub-network, in contrast, learns how to distinguish domain traces no matter how small they are.

In Liu, Bin, et al. "Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition." Interspeech 2019, the authors argue that various speech-enhancement front-ends introduce distortions. This is claimed to affect end-to-end speech systems quite a bit. They use an adversarial loss, albeit without gradient reversal to fix this. They achieve a small gain compared to re-training the system on "noisified" training data (apparently using additive noise). However, the overall character error rates of about 50% on the noisy data seem to be a bit too high. One may need more drastic solutions to resolve the problem.

Similar to enhancing low-resolution images (aka superresolution), one can use GANs for speech enhancement and noise reduction. Although this may not necessarily improve speech models, it can still improve a perception of sound by humans. I would like to highlight the following two papers:

1. SEGAN: Speech Enhancement Generative Adversarial Network,” in Proc. INTERSPEECH, 2017
2. A Qualcomm Technologies paper: Li, Sen, et al. "Speech bandwidth extension using generative adversarial networks." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

The Speech Enhancement GAN (SEGAN) paper operates directly on the waveform. To train and test SEGAN authors add artificial (apparently, additive) noise to an existing clean set. Thus, SEGAN is trained on the parallel data to translate a "noisified" version of the signal into the original, i.e., clean signal. For both objective and subjective (i.e., human perception) tests SEGAN outperformed the Wiener filtering baseline.

The Qualcomm paper by Li, Sen et al. applies conditional GANs to the bandwidth extension: A certain transmission medium may truncate high frequencies to save bandwidth, which results in audio quality loss. There are a number of techniques to restore missing frequencies. Sen Li et al. apply the GAN to line spectral frequencies and outperform classic non-neural approaches. Sen Li et al. mention that deep neural networks were applied to this problem in the past: Wang, Yingxue, et al. "Speech bandwidth expansion based on deep neural networks", Interspeech 2015, but they do not compare to the work of Wang et al 2015. They use, however, their approach to pre-train the GAN generator.

Regular conditional GANs require parallel data (also called paired data) for training. There are attempts to apply the above mentioned CycleGAN approach in the scenario when such data does not exist. In particular, Fang, Fuming, et al. 2018 apply CycleGAN to male-to-female (and vice versa) foice conversion. Zhao, Shengkui, et al. 2019 apply CycleGAN to train the speech recognition system. The paper uses a hybrid and not end-to-end ASR system. If I read their Table 1 correctly, they achieve modest (order of 5%) gains compared to models that use similarly complex acoustic modelling networks. However, they do not compare to training on noise-enhanced data!

In their ICASSP 2018 paper "Exploring speech enhancement with generative adversarial networks for robust speech recognition" Donahue and et al carry out a more comprehensive evaluation, where they compare GANs against some of the best data augmentation techniques, which include both additive and reverberant noise. As a source of reverberant noise they the above-mentioned room simulator technique. A combination of the additive and reverberant noise training is referred to as the multi-style training (MTR).

Donahue et al first trained a previously mentioned SEGAN model to enhance audio and remove noise (i.e., using a GAN as a frontend). This approach was not beneficial. Then, they explored a frequency-domain SEGAN (FSEGAN), which operates on spectral features, namely, log-Mel filterbanks. Compared to applying the GAN to raw audio, FSEGAN requires less computation and is claimed to be more resistant against reverberant noise. The classic filter banks aim to mimic a nonlinear human ear perception of sound, which is more discriminative at lower frequencies. In contrast, Donahue et al use a large number of equally-spaced filters.

Although conditional GANs can have a latent vector variable $z$, which can be used to sample several outputs given a single input, authors find that generators learn to largely ignore $z$. In fact, removing $z$ improves quality and turns the conditional GAN into a fully deterministic "image translator". The overall loss function is a combination of the adversarial loss and the L1-reconstruction loss.

Donahue et al run experiments with an artificially noisified variant of the famous WSJ corpus and create both training and testing sets with added noise. Perhaps surprising to some readers, the model trained on clean data performs abysmally poor on the noisy variant with word error rate (WER) increasing from 11.9% to 72.2%! They further experiment with a speech enhancer trained with an adversarial GAN loss and with a simple reconstruction loss (L1). Speech enhancing models are used in two ways: First they are used as a speech-enhancing front-end that takes original noisy audio features and "cleans/enhances" them. Then, the original speech recognition model is used. This does not require re-training of the speech system. Second, the speech enhancer is used to produce an additional set of features. This requires re-training the model.

When SEGAN is used as a front-end it does not help and the WER becomes as large as 80%. The front-end based on the proposed FSEGAN does reduce WER to 33%. However, simply re-training the model on the noisy training data produces a much better WER of 20%. When speech enhancers are used to augment original features, it becomes possible to beat training on the noisy data. However, the GAN-based enhancer performs worse than the enhancer based on L1 reconstruction loss. It seems that an idea of using a GAN-based enhancing front-end is quite popular in the computer vision domain. However, many papers, see, e.g., this recent publication do not compare against enhancers trained with a simpler reconstruction loss.

#### Teacher-student approaches and self-training

There are a number of training and adaptation techniques, which use a teacher-student approach. In this approach, an output of one model, i.e., a teacher, is used to train another model, i.e., a student. Two common teacher-student approaches consist in learning from (1) the teacher-provided labels and (2) the teacher-provided distribution of classes (by minimizing the KL-divergence between the distributions). In the speech recognition context, this approach seems to have been applied only to hybrid HMM-NN systems, which, as I mentioned before, have a separate acoustic model that predicts a distribution of senones (basically context-dependent phones) from a speech-frame.

The teacher-student approach has become well-known after Hinton et al. published a paper on knowledge distillation. However, the technique is much older. In modern history, it was published earlier by Li et al.: J. Li, R. Zhao, J.-T. Huang and Y. Gong, “Learning small-size DNN with output-distribution-based criteria,” in Proc. Interspeech, 2014. If we dig a bit deeper, we discover that the teacher-student approach was a hot topic sixty years ago:

1. Probability of error of some adaptive pattern-recognition machines H Scudder - IEEE Transactions on Information Theory, 1965.
2. Spragins, John. "Learning without a teacher." IEEE transactions on information theory 12.2 (1966): 223-230.

An approach related to the student-teacher learning is self-training. In this case, the model is used to label/classify a large amount of data that does not have human annotations. This labeling is then used to train a new model. Thus, the model itself is its own teacher! It is not quite clear why this approach is useful: I suspect it is some form of the regularization. Lo and behold, there is a recent theoretical paper seemingly supporting my hunch.

In any case, there are multiple reports that a self-training (self-distillation) approach can work pretty well. In particular, Amazon claims to have achieved 11-13% reduction in the WER after self-training on as much as one million hours of speech: Parthasarathi, Sree Hari Krishnan, and Nikko Strom. "Lessons from building acoustic models with a million hours of speech." ICASSP 2019. In this work, authors employ an HMM-LSTM hybrid system. They first label a lot of data without transcripts and select hypotheses with high confidence scores. Then, they re-train the acoustic model on these labels. The teacher-student approach has also been growing in popularity as an adaptation technique. I spotted quite a few recent adaptation papers that rely on it:

In particular, in the first paper. Li et al 2018 first label close-speech data using an accurate model. Then, they replay the close-talk audio with an artificial mouth through the air. Thus, they get parallel far-field data. Because the simulated data should largely retain the close-speech phone timestamps, one can train a student model to mimic senone distribution of a teacher model.

#### Concluding remarks

The objective of my literature review was to understand the value of GANs as data augmentation and enhancing techniques, especially, for the speech recognition domain. However, I have found no strong evidence that they can outperform traditional data augmentation methods such as the acoustic room simulator. They may still be useful as speech enhancers for humans, albeit it is not clear to me if they consistently outperform enhancers trained with a reconstruction loss.

I also note that there might be some confusion between a "true" GANs (that have separate discriminator and generator networks) and a more generic concept of adversarial training. The latter can be simply a multi-task training procedure where one of the tasks consists in predicting a domain (possibly with reversing gradients). Unconditional GANs always have a latent random variable that permits sampling. This might allow us to generate more training data. However, such training data is not sufficiently realistic for the purpose of training a machine learning model.

Furthermore, in many applications we use a conditional GAN (which is basically a translation network) without a latent random variable, because it works better this way. There may be a modest gain by enhancing original, potentially noisy, features with a GAN-based enhancer. However, a comparable effect can be achieved by training an enhancer with a simpler reconstruction loss (and without the discriminator network). Doing so usually requires a parallel data set, e.g., a set of noisy speech aligned with the clean speech, but, quite interestingly, a reconstruction loss can still be useful even in the absence of paired data (in particular, for a speaker conversion). When parallel data is not available, one might also benefit from using a CycleGAN that can learn a domain translation model without any aligned data. Unfortunately, there is only limited evidence that such an approach is useful (and it is not clear if it can outperform training with the room simulator).

#### Acknowledgements

I thank Desh Raj for the discussion/references related to Transformer architectures and Arya McCarthy for the discussion/references of TTS-based augmentation.

If you find this mini-survey useful, feel free to cite it:

@misc{Boytsov2020_04_01,
title={A brief overview of classic and adversarial data augmentation techniques for speech recognition},
url={http://searchivarius.org/blog/data_augm_gan_2020},
author={Boytsov, Leonid},
year={2020},
month={Apr}}