
Demystifying IBM Watson

This is written in response to a Quora question, which asks about the internals of the IBM Watson question answering (QA) system. Feel free to vote there for my answer! Previously I briefly compared the IBM Watson approach to that of DeepMind, albeit without going into details of how IBM Watson works. Here I fill this gap.

I am not sure anybody knows exactly what was under the hood. However, there is a series of papers published by IBM, most of which I read end-to-end more than once. One overview paper can be found here. The list of papers can be found here, most PDFs can be easily googled :-) There is also a lengthy (but quite relevant) survey (by an IBM Watson team member J. Prager) that covers some of the details of retrieval-based question answering:

Prager, John. "Open-domain question–answering." Foundations and Trends® in Information Retrieval 1.2 (2007): 91-231.

First things first: The IBM Watson team incorporated both symbolic/logical systems and a classic redundancy-based retrieval QA into their system. However, there are only a few questions (about 1%) that they were able to answer by logical inference and querying of structured knowledge sources.

I would reiterate that a vast majority of questions are answered using a carefully tuned retrieval-based system, which heavily relies on the fact that Jeopardy answers are factoids: short noun phrases such as named entities (e.g., dates, names of famous persons, or city names). Hence, the QA system does not really need to answer a question, e.g., by synthesizing an answer, or by doing some complicated inference. It should instead extract a potential answer and collect enough statistical evidence that this answer is correct.

And, indeed, a retrieval-based factoid QA system finds passages lexically matching the question and extracts potential answers from these passages. It then uses a carefully tuned statistical model to figure out which candidate answers are good. This model likely does not involve any sophisticated reasoning that humans are capable of. That said, I still consider IBM Watson as one of the greatest achievements in the AI field.

The fact that Jeopardy questions are long greatly helps to find the so-called candidate passages, which are likely to contain an answer. Finding these passages is based largely on the lexical overlap between the question and the answer passage. Stephen Wolfram even ran an experiment where he found that a single search engine can find candidate passages for nearly 70% of all answers.

Furthermore, there is a good coverage of Jeopardy topics in Wikipedia. I cite: "We conducted an experiment to evaluate the coverage of Wikipedia articles on Jeopardy! questions and found that the vast majority of Jeopardy! answers are titles of Wikipedia documents [10]. Of the roughly 5% of Jeopardy! answers that are not Wikipedia titles, some included multiple entities, each of which is a Wikipedia title, such as Red, White, and Blue, whereas others were sentences or verb phrases, such as make a scarecrow or fold an American flag." Chu-Carroll, Jennifer, et al. "Finding needles in the haystack: Search and candidate generation." IBM Journal of Research and Development 56.3.4 (2012): 6-1.

I have to say that just throwing a bag-of-words query into a search engine can be a suboptimal approach, but the IBM Watson team wrote a bunch of complex question-rewriting procedures (in Prolog!) to ensure these queries were good. Not all candidate passages are generated in this way: I have covered another generation approach in another blog post.

After candidate passages are retrieved, IBM Watson extracts potential answers, which is not a trivial task. How does it find them? The actual model is surely rather complicated, but it would largely look for named entities and more generic noun phrases. However, not all entities/phrases are weighted equally. What affects the weights? Three things:

  1. The type of the question and the type of the entity (or rather their compatibility score);
  2. Existence of additional supporting evidence;
  3. How frequently these entities/noun phrases appear in candidate passages.

For example, if the question is "Who is the mayor of Toronto?" we know that the answer is a person. Hence, we can downweight named entities whose type is not a person. The actual answer-typing process is surely more complicated, and there is a separate paper describing it in more detail:

Murdock, J. William, et al. "Typing candidate answers using type coercion." IBM Journal of Research and Development 56.3.4 (2012): 7-1.

What is important is that incorporating other types of relations (e.g., spatial or temporal) in addition to the answer-question type compatibility did not seem to result in substantial improvements (though some gains were observed). See results in Tables 1 and 2 of the paper:

Kalyanpur, Aditya, et al. "Structured data and inference in DeepQA." IBM Journal of Research and Development 56.3.4 (2012): 10-1.

Furthermore, for each candidate entry X, we can try to construct a query like "X is a mayor of Toronto" and find matching passages with good lexical overlap with this additional evidencing query. If such passages exist, they provide evidence that X is, indeed, an answer to the question.

There is a separate paper devoted to the evidencing process:

Murdock, J. William, et al. "Textual evidence gathering and analysis." IBM Journal of Research and Development 56.3.4 (2012): 8-1.

Last, but not least, the ranking approach (for candidate answers) takes into account the (weighted) number of occurrences. In other words, we expect true answers to appear more frequently in retrieved candidate passages. Although this assumption seems to be a bit simplistic, it works well due to redundancy: There are a lot of answer passages for simple well-known factoids. A nice paper exploring this phenomenon was written by Jimmy Lin:

Lin, Jimmy. "An exploration of the principles underlying redundancy-based factoid question answering." ACM Transactions on Information Systems (TOIS) 25.2 (2007): 6.
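
To make the interplay of these signals more concrete, here is a minimal sketch of redundancy-based candidate ranking. It is not how Watson actually scores answers: the function name, the multiplicative combination of votes and type compatibility, the default compatibility value, and the toy data are all my assumptions for illustration.

```python
from collections import defaultdict


def rank_candidates(passages, type_compat, default_compat=0.1):
    """Rank candidates by passage-weighted frequency times type compatibility.

    passages: list of (retrieval_score, [candidate strings found in the passage]).
    type_compat: hypothetical dict mapping a candidate to a compatibility
                 score in [0, 1] with the expected answer type.
    """
    votes = defaultdict(float)
    for score, candidates in passages:
        for cand in set(candidates):  # count a candidate once per passage
            votes[cand] += score      # better passages cast heavier votes
    scored = {c: v * type_compat.get(c, default_compat) for c, v in votes.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Toy data for "Who is the mayor of Toronto?" (placeholder names, made-up scores).
passages = [
    (1.0, ["Person A", "Toronto"]),
    (0.8, ["Person A", "Ontario"]),
    (0.5, ["Toronto", "Person B"]),
]
type_compat = {"Person A": 1.0, "Person B": 1.0, "Toronto": 0.1, "Ontario": 0.1}
ranking = rank_candidates(passages, type_compat)
```

In this toy example "Toronto" occurs in two passages, but it is pushed below both person candidates because its type is incompatible with the question.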

If you find this mini-survey useful, feel free to cite it:

@misc{Boytsov_2018,
title={Demystifying IBM Watson},
url={http://searchivarius.org/blog/demystifying-ibm-watson},
author={Boytsov, Leonid},
year={2018},
month={Jun}}



Dear childless employee

Preamble: This blog post is inspired by a recent outrage at Facebook and Twitter in regard to parents getting extra time off.

Dear childless employee. We are really sorry to hear that many of you feel so lonely and frustrated nowadays. I believe it can cause a lot of real distress and I also wish employers paid more attention to mental health issues. It should also be covered better through a short term disability insurance or a similar policy, which is regretfully lacking. Understandably, some of you are frustrated that parents have gotten a bit more time off. Remember, however, that this is not a permanent benefit, but rather a short-term measure.

Our family was able to work productively when our daycare was closed, but we are totally sympathetic to people who were not able to do so and we are ready to pick up the slack. We are ready even though we are not as young as the vast majority of Facebook employees and we have had our difficult times when we slept close to five hours a day for many years in a row.

Whether giving parents some preferential treatment is fair is a difficult question, which needs to be considered in a broader social context. Here, there is a typical conservative opinion, which is basically "screw you, you are totally on your own" and a more liberal one, which asserts that (some) redistribution of benefits is good for society in the long run. Whether for-profit companies should be responsible for solving any social issues is a tricky question too. We do not have a full agreement on this even in our family.

Understandably, one trend is to hire mostly young employees, who have lower salary expectations and can more readily put in longer hours. However, there is another trend to create healthier and more diverse workplaces, which are welcoming to women and minorities, because it may benefit us all in the long run. Remember that lack of adequate parental leave disproportionately affects women, who are often default caregivers.

From this perspective, there is nothing unfair in supporting parents through these difficult times: It is just an integral part of building a healthier workplace. Likewise, we should have support for overworked and overstressed people. I wish unexpected parental leaves were handled via a special insurance (or fund), which is similar to the disability insurance. However, we do not have such government policy and the current pandemic situation is unprecedented.

Being a parent is certainly a privilege and some of it is supported through your taxes. We greatly appreciate this help. However, let us also not forget that societies do love babies: They just do not like to put effort into their upbringing. In theory, we have an overpopulation threat, but, in practice, birth rates seem to be plummeting everywhere and especially in the developed countries. Among these, the US has been doing pretty well, but even here the average is 1.7 births per woman.

To stay competitive, the US will need many more smart and hardworking people. I speculate that the US can easily absorb 100-200 million people over a period of three-five decades, but immigration is a difficult topic and it has become tricky to invite even highly qualified people. It is quite sad because a skilled workforce is not a burden but a driver of innovation and economic growth.

In conclusion, my dear childless employee, I would like to remind you that one day you may become a parent too. Whether this happens or not should certainly be your personal choice, which could come with a lot of work and years of sleep deprivation. It could also come with a long commute, because good schools are in the suburbs and not where the offices are. If this ever happens, I really hope that your future managers will have some sympathy for your long commute and will not insist you have to be in the office every day. On the plus side, if you are lucky, parenting can also be quite rewarding, so I hope you might enjoy it as we do now.



On the differences between CPU and GPU or why we cannot use GPU for everything

This is written in response to a Quora question. It is a somewhat vague question wondering why we cannot use GPU hardware for all computation tasks. Feel free to vote there for my answer on Quora!

CPU and GPU are fundamentally very different computational devices, but not many people realize it. CPU has a few low-latency cores, elaborate large caches and flow control (prefetching, branch prediction, etc) and a large relatively inexpensive RAM. GPU is a massively parallel device, which uses an expensive high-throughput memory. GPU memory is optimized for throughput, but not necessarily for latency.

Each GPU core is slow, but there can be thousands of them. When a GPU starts thousands of threads, each thread knows its “number” and uses this number to figure out which part of the “puzzle” it needs to solve (by loading and storing corresponding areas of memory). For example, to carry out a scalar product between two vectors, it is fine to start a GPU thread to multiply just two vector elements. However, it is quite unusual from the perspective of a software developer who has been programming CPUs all their life.
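
To illustrate the "each thread knows its number" idea, here is a small sketch that emulates this programming model sequentially in Python. It is only an illustration: `dot_product_kernel` and `launch` are hypothetical names, and a real GPU would run the threads concurrently and do the final summation as a parallel reduction.

```python
def dot_product_kernel(thread_id, a, b, partial):
    """Work done by one emulated GPU thread: multiply a single pair of elements."""
    if thread_id < len(a):  # guard: there may be more threads than elements
        partial[thread_id] = a[thread_id] * b[thread_id]


def launch(kernel, num_threads, *args):
    """Sequential stand-in for a parallel launch of num_threads threads."""
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
partial = [0.0] * len(a)
launch(dot_product_kernel, 8, a, b, partial)  # 8 threads, only 4 have work
result = sum(partial)  # on a GPU this would be a parallel reduction
```

Note that the kernel contains no loop over the data: the loop is replaced by the thread index, which is what feels unusual when coming from CPU programming.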

GPU designers make a number of trade-offs that are very different from the CPU trade-offs (in terms of flow control, cache size, and management, etc), which are particularly well suited for parallelizable tasks. However, it does not make GPU universally faster than CPU. GPU works well for massively parallel tasks such as matrix multiplication, but it can be quite inefficient for tasks where massive parallelization is impossible or difficult.

Given a large number of “data-hungry” cores, it is IMHO more important (than in the case of the CPU) to have a high-bandwidth memory (but higher memory latency can be tolerated). Yet, due to a high cost of the GPU memory, its amount is limited. Thus, GPU often relies on external lower-bandwidth memory (such as CPU RAM) to fetch data. If we did not have CPU memory, loading data directly from the disk (even from an SSD disk) would slow down many GPU workloads quite substantially. In some cases, this problem can be resolved by connecting GPUs using a fast interconnect (NVLink, InfiniBand), but it comes with an extra cost and does not resolve all the issues related to having only very limited memory.

Some answers claim that all GPU cores can do only the same thing, but it is only partially correct. However, cores in the same group (warp) do operate in a lock-step. To process a branch operation, GPU needs to stop some of the cores in the warp and restart them when the branch finishes. Different warps can operate independently (e.g., execute different CUDA kernels).

Furthermore, GPU cores are simpler than CPU cores primarily in terms of the flow control. Yet, they are far from primitive and support a wide range of arithmetic operations (including lower-precision fast operations). Unlike a CPU, which manages its caches automatically, a GPU has fast shared memory, which is managed explicitly by a software developer (there is also a small L1 cache). Shared memory is essentially a manually-managed cache.

Note that not all GPUs support recursive calls (those that do seem to be pretty restrictive about the recursion depth) and none of the GPUs that I know support virtual memory. In particular, the current CUDA recursion depth seems to be 24. GPUs do not have interrupts and lack support for communication with external IO devices. All these limitations make it difficult or impossible to use GPU as the main processing unit that can run an operating system (see also the following paper for more details: GPUfs: the case for operating system services on GPUs. M Silberstein, B Ford, E Witchel, 2014). I am convinced that future computation systems are going to be hybrid systems that combine low-latency very generic processing units and high-throughput specialized units suitable for massively parallel tasks.



A brief overview of classic and adversarial data augmentation techniques for speech recognition

I was plowing through papers published in ICASSP and InterSpeech in 2018-2019 recently and I would like to summarize my observations. I focused primarily on data augmentation techniques and techniques for noisy/far-field speech recognition adaptation. To make this post more accessible for a reader not familiar with automatic speech recognition (ASR), I also briefly overview speech recognition architectures and their history. The post has several parts: architectures with a short historical overview, non-adversarial data augmentation, adversarial training, adversarial examples, GANs, and data augmentation with GANs, student-teacher approaches (and self-training), and some concluding remarks. Although my literature review focuses on recent papers, I have discovered a bunch of noteworthy historical papers, which I am happy to cite here.

Architectures

It is difficult to talk about data augmentation and adaptation techniques without briefly overviewing modern architectures for speech recognition. Early speech recognition systems (developed in the 1960s) used the dynamic programming algorithm for recognition of isolated words. These approaches were successful, see, e.g.: Velichko, V. M., and N. G. Zagoruyko. "Automatic recognition of 200 words." International Journal of Man-Machine Studies 2.3 (1970): 223-234. However, early systems struggled with continuous speech. Substantial progress has been made by modeling the output using a Hidden Markov Model (HMM). The Markov model combines the acoustic model scores and prior phoneme probabilities into a single utterance score, which can be further improved by interpolation with language model scores. A phoneme probability or score can be computed using handcrafted features, but later this was replaced by Gaussian Mixture Models (GMM), hence, the name HMM-GMM (more details here). The first published HMM-based recognition system was Carnegie Mellon's Dragon. Fred Jelinek published on the topic one year later.

All early continuous-speech systems (and many modern ones) rely on dictionaries and grammars to carry out a constrained decoding process with the help of the beam-search. The term beam-search was possibly coined by the Carnegie Mellon professor Raj Reddy (who received the Turing award for his contributions to AI). However, as noted by Fred Jelinek, the beam-search-like algorithm itself had been around for quite a while. Crucially, an audio recording is divided into short frames (e.g., 10 ms) that are further converted into spectral features, e.g., log-Mel filterbanks. For each frame, or a combination of time-adjacent frames, an acoustic model produces a distribution of phoneme probabilities. Phonemes were later replaced with senones, which are context-dependent and/or state-dependent phonemes (states of an HMM). With the introduction of neural networks it was quite natural to replace GMM-based acoustic models with neural ones. Various options were considered in the late 80s and early 90s including fully-connected (DNN) and recurrent neural networks (RNN), see the following book for a summary of approaches: Bourlard, Herve A., and Nelson Morgan. "Connectionist Speech Recognition: A Hybrid Approach." (1993). These approaches are the backbone of modern hybrid HMM-DNN and HMM-RNN (HMM-LSTM) systems. Despite the early introduction, they did not outperform HMM-GMMs until about 2012: Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition." IEEE Signal processing magazine 29 (2012).
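
As a tiny illustration of the framing step mentioned above, here is a sketch that cuts a recording into non-overlapping 10 ms frames. The function name and the non-overlapping layout are my simplifying assumptions: practical front-ends typically use overlapping windows (e.g., a 25 ms window shifted by 10 ms) and then compute spectral features for each window.

```python
def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split a list of samples into consecutive non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame is shorter than one sample")
    num_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(num_frames)]


# One second of (dummy) audio sampled at 16 kHz.
frames = split_into_frames(list(range(16000)), sample_rate=16000)
```

The acoustic model then sees one feature vector per frame (or per small group of adjacent frames) rather than raw samples.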

The training process of hybrid HMM-NN systems can be rather complicated: A flat start requires training an HMM-GMM system first. In a nutshell, one trains a series of progressively more accurate models by alternating labeling and acoustic model training steps. The labeling, also known as forced-alignment, takes a transcript and an existing model and identifies timestamps of individual phonemes. This requires pronunciation dictionaries! Once we have timestamp and phoneme information, we can train an acoustic model using a standard cross-entropy training (sequence training can improve upon cross-entropy training later). In principle, it is possible to train hybrid HMM-NN systems end-to-end (i.e., without initial HMM-GMM training): Hadian, Hossein, et al. "End-to-end Speech Recognition Using Lattice-free MMI." Interspeech. 2018.

The acoustic model of the hybrid HMM-NN system has been most commonly a variant of a feed-forward neural network such as a TDNN: A time delay neural network architecture for efficient modeling of long temporal contexts. 2015, V Peddinti, D Povey, S Khudanpur, or a recurrent neural network, mostly an LSTM. There are also recent proposals to use Transformers, e.g.: Transformer-based Acoustic Modeling for Hybrid Speech Recognition, Wang et al 2019.

There are newer end-to-end architectures that do not have separate language and acoustic models. In addition, they can learn a language model as well. Most notable are CTC and listen-attend-and-spell (LAS). LAS, which is basically a speech-to-text translation model with attention, seems to perform better than CTC. However, according to the page "WER are WE?", it is still mostly outperformed by hybrid HMM-NN models. LAS has traditionally relied on LSTMs, but it may benefit from using Transformers: Karita, Shigeki, et al. "A comparative study on transformer vs rnn in speech applications." arXiv preprint arXiv:1909.06317 (2019). However, there are numerous difficulties in using transformers, which are discussed by Desh Raj.

Non-adversarial data augmentation

One of the most iconic augmentation approaches consists in applying additive noise to create "noisified" training data. The additive noise can be random or originate from a real source. For example, you can collect a bank of realistic background noises (e.g., sound of traffic, coughing, etc.) and combine them additively with the clean speech. A more advanced technique is a famous room simulator, which emulates the effect of reverberations that happen in confined spaces. By varying the size of the room and the relative locations of the sound source and the microphone, one can generate tons of extra training data. Although many attribute the room simulator to Kim et al 2017, it was pioneered in the following publication: J.B. Allen and D.A. Berkley. Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America. 1979. Kim et al 2017 do cite Allen and Berkley though.
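
Here is a minimal sketch of the additive-noise variant. It mixes a noise recording into a clean one at a requested signal-to-noise ratio (SNR). Controlling the mix through an SNR in decibels is a common convention that I assume here, and the function name, the sine-wave "speech", and the Gaussian "noise" are placeholders rather than anything from the cited papers.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaling the noise so the mix has the requested SNR (dB)."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)  # average power
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        return list(speech)  # silent noise: nothing to add
    # Choose the gain so that p_speech / (gain^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]  # stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]                       # stand-in for a noise bank entry
noisy = mix_at_snr(speech, noise, snr_db=10)
```

Sampling the noise source and the SNR at random for every training utterance yields many different "noisified" copies of the same clean recording.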

The room simulator is a relatively complicated approach. Google's SpecAugment is a much simpler method, which operates directly on log-Mel filterbanks, which represent the audio signal spectrum. The authors of SpecAugment directly modify these features using a combination of time warping, time masking, and frequency masking. They were able to substantially improve the performance of the end-to-end listen-attend-and-spell (LAS) model and claim SOTA on Librispeech. However, better results can be achieved using a hybrid HMM-DNN model.
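
The masking part of this idea is easy to sketch. The code below is my simplified illustration, not the reference implementation: it applies a single time mask and a single frequency mask, omits time warping entirely, and fills masked cells with zeros. The function and parameter names are hypothetical.

```python
import random


def mask_spectrogram(spec, max_time_mask, max_freq_mask, rng):
    """Zero out one random block of frames and one random block of frequency bins.

    spec: list of frames, each a list of filterbank values (time x frequency).
    Returns a masked copy; the input is left untouched.
    """
    num_frames, num_bins = len(spec), len(spec[0])
    out = [list(frame) for frame in spec]

    # Time masking: blank t_width consecutive frames.
    t_width = rng.randint(0, min(max_time_mask, num_frames))
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [0.0] * num_bins

    # Frequency masking: blank f_width consecutive bins in every frame.
    f_width = rng.randint(0, min(max_freq_mask, num_bins))
    f_start = rng.randint(0, num_bins - f_width)
    for frame in out:
        for f in range(f_start, f_start + f_width):
            frame[f] = 0.0
    return out


spec = [[1.0] * 8 for _ in range(20)]  # dummy 20 frames x 8 filterbank channels
masked = mask_spectrogram(spec, max_time_mask=5, max_freq_mask=3, rng=random.Random(0))
```

Because the masks are re-sampled every time an utterance is seen, the model never gets exactly the same features twice.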

There is also an interesting trend to use speech-synthesis tools for data augmentation. Guo et al generate massive amounts of speech using TTS to train an automatic corrector of the ASR output: Guo, Jinxi, Tara N. Sainath, and Ron J. Weiss. "A spelling correction model for end-to-end speech recognition." ICASSP 2019. A recent paper by Polyak et al. proposed an encoder-decoder speaker-conversion approach trained in a speech reconstruction task (they quantize audio into 256 classes and train using a cross-entropy loss). The encoder is designed to be speaker-independent (or rather speaker-universal) while the decoder is a WaveNet-based decoder parameterized by a latent speaker embedding. Given a new audio sample one can first encode it and then decode using various speaker embeddings, thus, doing voice "transplantation" without any parallel data. McCarthy et al. used this approach to boost performance in a speech translation task: McCarthy, Arya D., Liezl Puzon, and Juan Pino. "SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation." ICASSP 2020.

Adversarial training, adversarial examples, GANs, and data augmentation with GANs

Generative Adversarial Networks (GANs) are a popular adversarial-training technique, where two separate networks (a generator and a discriminator) play a game. Generated data points are obtained by first sampling a value of a latent random variable $z$ and then transforming it using the generator network. The generator neural network strives to produce fake data with an objective to fool a discriminator. The discriminator neural network, in turn, learns how to distinguish real data points from fake ones.
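
For reference, this game is usually written as the minimax objective from the original GAN paper. The symbols other than $z$ are standard notation that I introduce here: $D$ is the discriminator, $G$ is the generator, $p_{\text{data}}$ is the distribution of real data, and $p_z$ is the prior over the latent variable.

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$$

The discriminator tries to increase both terms by assigning high probability to real points and low probability to generated ones, whereas the generator can only affect the second term and tries to decrease it.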

There are a number of GAN variants including conditional GANs, which generate from both the latent variable $z$ and some input variable $x$. Conditional GANs can produce appealing and nearly photorealistic images (e.g., the model known as BigGAN). An overview of many models can be found in Kurach, Karol, et al. "The GAN landscape: Losses, architectures, regularization, and normalization." (2018). There is an interesting paper arguing that gradient-penalized GANs and SVMs can be derived from the same framework: Jolicoeur-Martineau, Alexia, and Ioannis Mitliagkas. "Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs." (2019).

One particularly interesting approach is the famous CycleGAN (Zhu, Jun-Yan, et al. "Unpaired image-to-image translation using cycle-consistent adversarial networks." ICCV 2017), which makes it possible to train an image-to-image or speech-to-speech translation model without using a parallel collection of examples (i.e., source-target pairs of data points). I believe CycleGAN has inspired the unsupervised machine translation, although, due to the discrete nature of language data, it cannot be applied directly.

Despite a tremendous number of papers on the topic, there has been limited success in using GANs as data augmentation tools. In particular, the authors of the recent paper "Seeing is Not Necessarily Believing: Limitations of BigGANs for Data Augmentation. Suman Ravuri, Oriol Vinyals, 2019" show that BigGANs cannot produce good-quality training data (at least in their study). It was, therefore, especially interesting for me to study the usefulness of adversarial training for speech applications.

As Ian Goodfellow et al. write in their famous paper that introduced GANs: "GANs are often confused with the related concept of “adversarial examples”. Adversarial examples are examples found by using gradient-based optimization directly on the input to a classification network, in order to find examples that are similar to the data yet misclassified". These examples are easy to create for images, because they can be considered as continuous data, but difficult for text, which is discrete. Audio data is also continuous, but creation of adversarial examples was somewhat challenging for audio as well. First of all, in an actual over-the-air attack, the attacking system needs to act on future, i.e., uncertain, events (before the completion of an utterance). However, even in the digital-only attack where the attacking system does know the utterance, it is apparently difficult to create a low-magnitude noise that changes the output of the ASR system. For an interested reader, there is a recent Interspeech paper on this topic: Neekhara, Paarth, et al. "Universal adversarial perturbations for speech recognition systems." Interspeech (2019).

I would like to note that GANs are one technique based on the idea of adversarial training, but not the only one. In the very basic form, adversarial training is simply multi-task learning, where one of the tasks consists in predicting a domain of interest. For example, to train a more robust system one can have an auxiliary loss (and an additional sub-network) that predicts whether speech is clean or noisy. An interesting variant of this approach involves reversal of gradients provided by the additional sub-network: Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in ICML 2015.

Adversarial training with gradient reversal can be loosely interpreted as a GAN, where the discriminator and the generator have a common sub-network (see Figure 1 in the Ganin and Lempitsky paper). The objective of training is to teach this common sub-network to produce domain-agnostic features. In other words, it learns how to remove domain-specific traces while maintaining the accuracy on the main task (i.e., while minimizing the main loss, such as the recognition loss). The additional sub-network, in contrast, learns how to distinguish domain traces no matter how small they are.
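
The gradient-reversal trick itself is tiny, as the following framework-free sketch shows. The class, the helper function, and the scaling factor `lam` are my naming choices, and the gradient values are made up. In a real system the same thing would be implemented as a custom autograd operation in a deep learning framework.

```python
class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        return list(features)  # features reach the domain classifier unchanged

    def backward(self, grad_from_domain_classifier):
        return [-self.lam * g for g in grad_from_domain_classifier]


def shared_feature_gradient(grad_main, grad_domain, grl):
    """Gradient reaching the shared sub-network: main-task gradient plus reversed domain gradient."""
    reversed_grad = grl.backward(grad_domain)
    return [gm + gr for gm, gr in zip(grad_main, reversed_grad)]


grl = GradientReversal(lam=0.5)
# Made-up gradients of the main loss and the domain loss w.r.t. two shared features.
grad = shared_feature_gradient([0.2, -0.1], [0.4, 0.6], grl)
```

The domain sub-network itself is still updated with the ordinary (non-reversed) gradient, so it keeps getting better at spotting the domain, while the shared features are pushed in the direction that makes its job harder.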

There are two recent ICASSP/Interspeech 2019 papers that use adversarial training that I would like to highlight. Meng, Zhong, Jinyu Li, and Yifan Gong ("Adversarial speaker adaptation." ICASSP 2019) use an adversarial loss to perform speaker adaptation. They do it with a classic hybrid HMM-RNN system. In a hybrid system, there is a separate acoustic model that predicts a distribution of senones (basically context-dependent phones) from a speech-frame. Because the speaker-dependent data is scarce, a speaker-dependent (SD) model can overfit easily. The standard approach to deal with this is the so-called KL-divergence (KLD) training. It is a regularization technique, which "forces" the senone distribution produced by the SD model to be close to the distribution of the speaker-independent (SI) model. The explored alternative to KLD training is the adversarial training with gradient reversal. Overall, the paper achieves good gains on the order of 10% over the SI model (due to adaptation). The baseline KLD training outperforms the SI model by 5%. Thus, adversarial training leads to an additional 5% accuracy gain compared to the KLD training.

In Liu, Bin, et al. "Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition." Interspeech 2019, the authors argue that various speech-enhancement front-ends introduce distortions. This is claimed to affect end-to-end speech systems quite a bit. To fix this, they use an adversarial loss, albeit without gradient reversal. They achieve a small gain compared to re-training the system on "noisified" training data (apparently using additive noise). However, the overall character error rates of about 50% on the noisy data seem to be a bit too high. One may need more drastic solutions to resolve the problem.

Similar to enhancing low-resolution images (aka superresolution), one can use GANs for speech enhancement and noise reduction. Although this may not necessarily improve speech models, it can still improve a perception of sound by humans. I would like to highlight the following two papers:

  1. "SEGAN: Speech Enhancement Generative Adversarial Network," in Proc. INTERSPEECH, 2017
  2. A Qualcomm Technologies paper: Li, Sen, et al. "Speech bandwidth extension using generative adversarial networks." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

The Speech Enhancement GAN (SEGAN) model operates directly on the waveform. To train and test SEGAN, the authors add artificial (apparently, additive) noise to an existing clean set. Thus, SEGAN is trained on parallel data to translate a "noisified" version of the signal into the original, i.e., clean signal. In both objective and subjective (i.e., human perception) tests, SEGAN outperformed the Wiener filtering baseline.

The Qualcomm paper by Li, Sen et al. applies conditional GANs to the bandwidth extension: A certain transmission medium may truncate high frequencies to save bandwidth, which results in audio quality loss. There are a number of techniques to restore missing frequencies. Sen Li et al. apply the GAN to line spectral frequencies and outperform classic non-neural approaches. Sen Li et al. mention that deep neural networks were applied to this problem in the past: Wang, Yingxue, et al. "Speech bandwidth expansion based on deep neural networks", Interspeech 2015, but they do not compare to the work of Wang et al 2015. They use, however, their approach to pre-train the GAN generator.

Regular conditional GANs require parallel data (also called paired data) for training. There are attempts to apply the above-mentioned CycleGAN approach in the scenario when such data does not exist. In particular, Fang, Fuming, et al. 2018 apply CycleGAN to male-to-female (and vice versa) voice conversion. Zhao, Shengkui, et al. 2019 apply CycleGAN to train the speech recognition system. The paper uses a hybrid and not end-to-end ASR system. If I read their Table 1 correctly, they achieve modest (order of 5%) gains compared to models that use similarly complex acoustic modelling networks. However, they do not compare to training on noise-enhanced data!

In their ICASSP 2018 paper "Exploring speech enhancement with generative adversarial networks for robust speech recognition" Donahue et al carry out a more comprehensive evaluation, where they compare GANs against some of the best data augmentation techniques, which include both additive and reverberant noise. As a source of reverberant noise they use the above-mentioned room simulator technique. A combination of the additive and reverberant noise training is referred to as the multi-style training (MTR).

Donahue et al first trained a previously mentioned SEGAN model to enhance audio and remove noise (i.e., using a GAN as a frontend). This approach was not beneficial. Then, they explored a frequency-domain SEGAN (FSEGAN), which operates on spectral features, namely, log-Mel filterbanks. Compared to applying the GAN to raw audio, FSEGAN requires less computation and is claimed to be more resistant against reverberant noise. The classic filter banks aim to mimic a nonlinear human ear perception of sound, which is more discriminative at lower frequencies. In contrast, Donahue et al use a large number of equally-spaced filters.

Although conditional GANs can have a latent vector variable $z$, which can be used to sample several outputs given a single input, the authors find that generators learn to largely ignore $z$. In fact, removing $z$ improves quality and turns the conditional GAN into a fully deterministic "image translator". The overall loss function is a combination of the adversarial loss and the L1-reconstruction loss.
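A combined generator objective of this kind can be sketched as follows. This is only an illustration under stated assumptions: the adversarial term is written in the common non-saturating form (minus the log of the discriminator's probability that the enhanced output is real), and the weight `l1_weight` is a hypothetical hyperparameter. The paper's exact formulation and weighting may differ.

```python
import math

def l1_loss(enhanced, clean):
    # Mean absolute difference between enhanced and clean feature values.
    return sum(abs(e - c) for e, c in zip(enhanced, clean)) / len(clean)

def adversarial_loss(disc_prob_real, eps=1e-12):
    # Non-saturating generator loss (an assumption): small when the
    # discriminator believes the enhanced output is real.
    return -math.log(max(disc_prob_real, eps))

def generator_loss(disc_prob_real, enhanced, clean, l1_weight=100.0):
    # Weighted sum of the adversarial and the reconstruction terms.
    # l1_weight is a hypothetical value chosen only for this example.
    return adversarial_loss(disc_prob_real) + l1_weight * l1_loss(enhanced, clean)

if __name__ == "__main__":
    clean = [0.2, 0.5, 0.9]
    enhanced = [0.25, 0.45, 0.8]
    print(generator_loss(0.3, enhanced, clean))
```

Setting the adversarial term aside (i.e., keeping only `l1_loss`) gives the simpler reconstruction-only enhancer.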

Donahue et al. run experiments with an artificially noisified variant of the famous WSJ corpus and create both training and testing sets with added noise. Perhaps surprisingly to some readers, the model trained on clean data performs abysmally poorly on the noisy variant, with the word error rate (WER) increasing from 11.9% to 72.2%! They further experiment with a speech enhancer trained with an adversarial GAN loss and with a simple reconstruction loss (L1). Speech-enhancing models are used in two ways. First, they are used as a speech-enhancing front-end that takes the original noisy audio features and "cleans/enhances" them. Then, the original speech recognition model is used. This does not require re-training of the speech system. Second, the speech enhancer is used to produce an additional set of features. This requires re-training the model.
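For readers less familiar with the metric, the WER is the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and the reference transcript, divided by the number of reference words. Here is a minimal sketch. The function name and the whitespace tokenization are assumptions made for illustration, and real evaluation toolkits apply additional text normalization.

```python
def word_error_rate(reference, hypothesis):
    # Split on whitespace; real scoring tools normalize text more carefully.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic programming over edit distances, one row per reference word.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = cur
    return prev[-1] / len(ref)

if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that the WER can exceed 100% when the hypothesis contains many insertions.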

When SEGAN is used as a front-end, it does not help and the WER becomes as large as 80%. The front-end based on the proposed FSEGAN does reduce the WER to 33%. However, simply re-training the model on the noisy training data produces a much better WER of 20%. When speech enhancers are used to augment the original features, it becomes possible to beat training on the noisy data. However, the GAN-based enhancer performs worse than the enhancer based on the L1 reconstruction loss. It seems that the idea of using a GAN-based enhancing front-end is quite popular in the computer vision domain. However, many papers (see, e.g., this recent publication) do not compare against enhancers trained with a simpler reconstruction loss.

Teacher-student approaches and self-training

There are a number of training and adaptation techniques that use a teacher-student approach. In this approach, the output of one model, i.e., a teacher, is used to train another model, i.e., a student. Two common teacher-student approaches consist in learning from (1) the teacher-provided labels and (2) the teacher-provided distribution of classes (by minimizing the KL-divergence between the distributions). In the speech recognition context, this approach seems to have been applied only to hybrid HMM-NN systems, which, as I mentioned before, have a separate acoustic model that predicts a distribution of senones (basically context-dependent phones) from a speech frame.
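The two variants can be sketched for a single speech frame as follows. This is a minimal illustration, not the training code of any cited system: the function names are hypothetical, the logits stand in for the outputs of the teacher and student acoustic models over senones, and no temperature scaling is applied.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of the same length.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def soft_label_loss(teacher_logits, student_logits):
    # Variant (2): match the teacher's full class distribution.
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

def hard_label_loss(teacher_logits, student_logits, eps=1e-12):
    # Variant (1): cross-entropy against the teacher's most likely class.
    label = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return -math.log(max(softmax(student_logits)[label], eps))

if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [1.5, 1.2, 0.3]
    print(soft_label_loss(teacher, student), hard_label_loss(teacher, student))
```

The soft-label loss is zero only when the student reproduces the teacher's distribution exactly, whereas the hard-label loss ignores how the teacher spreads probability over the remaining classes.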

The teacher-student approach became well known after Hinton et al. published a paper on knowledge distillation. However, the technique is much older. In modern history, it was published earlier by Li et al.: J. Li, R. Zhao, J.-T. Huang and Y. Gong, “Learning small-size DNN with output-distribution-based criteria,” in Proc. Interspeech, 2014. If we dig a bit deeper, we discover that the teacher-student approach was already a hot topic in the 1960s:

  1. Scudder, H. "Probability of error of some adaptive pattern-recognition machines." IEEE Transactions on Information Theory (1965).
  2. Spragins, John. "Learning without a teacher." IEEE Transactions on Information Theory 12.2 (1966): 223-230.

An approach related to student-teacher learning is self-training. In this case, the model is used to label/classify a large amount of data that does not have human annotations. This labeling is then used to train a new model. Thus, the model itself is its own teacher! It is not quite clear why this approach is useful: I suspect it is some form of regularization. Lo and behold, there is a recent theoretical paper seemingly supporting my hunch.
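The self-training loop can be sketched with a toy model. Everything here is an assumption made for illustration: a one-dimensional nearest-centroid classifier stands in for the acoustic model, the distance margin between the two closest centroids stands in for a confidence score, and `min_margin` is a hypothetical threshold for keeping a pseudo-label.

```python
def fit_centroids(points, labels):
    # "Training": one centroid (mean value) per class.
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_with_margin(centroids, x):
    # Returns the closest class and a crude confidence: how much closer
    # the best centroid is than the runner-up (needs at least two classes).
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, second = ranked[0], ranked[1]
    return best, abs(x - centroids[second]) - abs(x - centroids[best])

def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=1.0):
    # 1. Train the teacher on human-labeled data.
    teacher = fit_centroids(labeled_x, labeled_y)
    # 2. Label the unannotated data, keeping only confident predictions.
    pseudo = []
    for x in unlabeled_x:
        label, margin = predict_with_margin(teacher, x)
        if margin >= min_margin:
            pseudo.append((x, label))
    # 3. Train a new model on human labels plus pseudo-labels.
    student = fit_centroids(labeled_x + [x for x, _ in pseudo],
                            labeled_y + [y for _, y in pseudo])
    return student, pseudo

if __name__ == "__main__":
    student, pseudo = self_train([0.0, 10.0], ["a", "b"], [1.0, 9.0, 5.0])
    print(student, pseudo)
```

In this toy run, the ambiguous point 5.0 is discarded because it is equally far from both centroids, while the two confident points shift the student's centroids.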

In any case, there are multiple reports that a self-training (self-distillation) approach can work pretty well. In particular, Amazon claims to have achieved an 11-13% reduction in the WER after self-training on as much as one million hours of speech: Parthasarathi, Sree Hari Krishnan, and Nikko Strom. "Lessons from building acoustic models with a million hours of speech." ICASSP 2019. In this work, the authors employ an HMM-LSTM hybrid system. They first label a lot of data without transcripts and select hypotheses with high confidence scores. Then, they re-train the acoustic model on these labels. The teacher-student approach has also been growing in popularity as an adaptation technique. I spotted quite a few recent adaptation papers that rely on it:

  1. Li, Jinyu, et al. "Developing far-field speaker system via teacher-student learning." ICASSP 2018.
  2. Tan, Tian, Yanmin Qian, and Dong Yu. "Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition." ICASSP 2018.
  3. Kim, Jaeyoung, Mostafa El-Khamy, and Jungwon Lee. "Bridgenets: Student-teacher transfer learning based on recursive neural networks and its application to distant speech recognition." ICASSP 2018.
  4. Suzuki, Takahito, et al. "Knowledge Distillation for Throat Microphone Speech Recognition." Interspeech 2019.

In particular, in the first paper, Li et al. 2018 first label close-talk data using an accurate model. Then, they replay the close-talk audio with an artificial mouth through the air. Thus, they get parallel far-field data. Because the simulated data should largely retain the close-talk phone timestamps, one can train a student model to mimic the senone distribution of a teacher model.

Concluding remarks

The objective of my literature review was to understand the value of GANs as data augmentation and enhancement techniques, especially for the speech recognition domain. However, I have found no strong evidence that they can outperform traditional data augmentation methods such as the acoustic room simulator. They may still be useful as speech enhancers for humans, although it is not clear to me if they consistently outperform enhancers trained with a reconstruction loss.

I also note that there might be some confusion between "true" GANs (that have separate discriminator and generator networks) and a more generic concept of adversarial training. The latter can be simply a multi-task training procedure where one of the tasks consists in predicting a domain (possibly with gradient reversal). Unconditional GANs always have a latent random variable that permits sampling. This might allow us to generate more training data. However, such training data is not sufficiently realistic for the purpose of training a machine learning model.

Furthermore, in many applications we use a conditional GAN (which is basically a translation network) without a latent random variable, because it works better this way. There may be a modest gain from enhancing original, potentially noisy, features with a GAN-based enhancer. However, a comparable effect can be achieved by training an enhancer with a simpler reconstruction loss (and without the discriminator network). Doing so usually requires a parallel data set, e.g., a set of noisy speech aligned with the clean speech, but, quite interestingly, a reconstruction loss can still be useful even in the absence of paired data (in particular, for speaker conversion). When parallel data is not available, one might also benefit from using a CycleGAN that can learn a domain translation model without any aligned data. Unfortunately, there is only limited evidence that such an approach is useful (and it is not clear if it can outperform training with the room simulator).

Acknowledgements

I thank Desh Raj for the discussion/references related to Transformer architectures and Arya McCarthy for the discussion/references of TTS-based augmentation.

If you find this mini-survey useful, feel free to cite it:

@misc{Boytsov2020_04_01,
title={A brief overview of classic and adversarial data augmentation techniques for speech recognition},
url={http://searchivarius.org/blog/data_augm_gan_2020},
author={Boytsov, Leonid},
year={2020},
month={Apr}}



MNIST is super easy and few people know it!

One can never be too surprised by the phenomenal success of the MNIST dataset, which is used in so many image publications. But do people realize how easy this dataset is? One clear measure of hardness is the performance of a simplistic k-NN classifier with the vanilla L2 metric applied directly to pixels. As a variant, one can measure the performance of the k-NN classifier after some basic unsupervised transformations such as principal component analysis (PCA) or denoising.
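To make it clear how little machinery such a baseline involves, here is a minimal brute-force sketch of a k-NN classifier with the L2 metric. The function names and the tiny two-dimensional toy data are assumptions made for illustration. For MNIST, each vector would be the 784 raw pixel values of an image, and one would normally use an optimized library rather than pure Python.

```python
from collections import Counter

def squared_l2(a, b):
    # Squared Euclidean distance; it ranks neighbors the same way as L2.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(train_x, train_y, query, k=3):
    # Brute-force search: sort all training points by distance to the query.
    nearest = sorted(range(len(train_x)),
                     key=lambda i: squared_l2(train_x[i], query))[:k]
    # Majority vote among the labels of the k nearest neighbors.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k=3):
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)

if __name__ == "__main__":
    train_x = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]]
    train_y = [0, 0, 0, 1, 1, 1]
    print(accuracy(train_x, train_y, [[1, 1], [8, 8]], [0, 1]))
```

There is no training step at all: the "model" is simply the stored training set.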

I created a small poll to assess what people think about MNIST's k-NN search accuracy. I thank everybody for participation: Fortunately, more than one hundred people responded (most of them are machine learning practitioners and enthusiasts I assume). So, I think the results are rather reliable.

In summary, nearly 40% of the respondents think that the accuracy would be at most 80%, and 45% think the accuracy is 95%. Unfortunately, I did not create the option for 90%. I think it would have had quite a few responses as well. That said, the vanilla k-NN search on pixels has 97% accuracy and the combination of PCA and the k-NN classifier has nearly 98% accuracy (here is a notebook to back up the 98% claim). In fact, with a bit of additional pre-processing such as deskewing and denoising, one can get nearly 99% accuracy.

It turns out that few people realize how effective the k-NN classifier is on MNIST: only 17% voted for 98%. That said, it does not mean that the k-NN classifier is such a good method overall (it can be good for tabular data, see, e.g., this paper by Shlomo Geva, but not for complex image data: check out, e.g., the numbers for CIFAR and ImageNet). It means, however, that MNIST is very easy. Understandably, people need some toy dataset to play with and quickly get results. One better alternative is Fashion-MNIST. However, it is not too hard either. A vanilla k-NN classifier has about 85% accuracy and it is probably possible to push the accuracy close to 90% with a bit of preprocessing. Thus, we may need a comparably small, but much more difficult dataset to replace both of them.


