Submitted by srchvrs on Mon, 04/19/2021 - 10:42
I just finished listening to the book "AI Superpowers: China, Silicon Valley, and the New World Order." It was written by Kai-Fu Lee, a leading scientist and technologist, who, under the supervision of the Turing Award winner Raj Reddy, created one of the first continuous speech recognition systems. He then held executive positions at several corporations including Microsoft and Google. I largely agree with Kai-Fu Lee's assessment of China's potential, but it is hard to agree with his assessment of AI. The book was written at the peak of the deep learning hype and it completely ignores shortcomings of deep learning such as poor performance on long-tail samples, adversarial samples, or samples coming from a different distribution: It is not clear why these important issues are omitted. As we realize now, "super-human" performance on datasets like Librispeech or Imagenet does show how much progress we have made, but it does not directly translate into viable products. For example, it is not hard to see that current dictation systems are often barely usable and the speech recognition output often requires quite a bit of post-editing.
Given the overly optimistic assessment of deep learning capabilities, it is somewhat unsurprising that Kai-Fu Lee suggests that once AI is better than humans, we should turn into a society of compassionate caregivers and/or social workers. I agree that a large part of the population could fill these roles, which are important and should be well paid! But I personally dream about a society of technologists, where at least 20-50% of the population are scientists, engineers, and tinkerers who have intellectually demanding (or creative) jobs. Some say it would be impossible, but we do not really know. A few centuries ago, only a small fraction of the population was literate: Now nearly everybody can read and write. Very likely, our education system has huge flaws, starting from pre-school and ending at the PhD level: It works as a high-precision but low-recall sieve that selects the most curious, talented, and hardworking, mostly from a small pool of privileged people. I speculate we can do much better than this. In all fairness, Kai-Fu Lee does note that AI may take much longer to deploy. However, my impression is that he does not consider this idea in full seriousness. I would reiterate that a discussion of the difficulties of applying existing tech to real-world problems is nearly completely missing.
Although it is a subject of hot debates and scientific scrutiny alike, I think the current AI systems are exploiting conditional probabilities rather than doing actual reasoning. Therefore, they perform poorly on long-tail and adversarial samples or samples coming from a different distribution. They cannot explain and, most importantly, reconsider their decisions in the presence of extra evidence (like smart and open-minded humans do). Kai-Fu Lee on multiple occasions praises the ability of deep learning systems to capture non-obvious correlations. However, in many cases these correlations are spurious and are present only in the training set.
On the positive side, Kai-Fu Lee seems to care a lot about humans whose jobs are displaced by AI. However, as I mentioned before, he focuses primarily on the apocalyptic scenario where machines are rapidly taking over jobs. Thus, he casually discusses the automation of a profession as tricky as software engineering, whereas in reality it is difficult to fully replace even truckers (despite more than 30 years of research on autonomous driving). More realistically, we are moving towards a society of computer-augmented humans, where computers perform routine tasks and humans set higher-level goals and control their execution. We have been augmented by tools, first simple and now very sophisticated, for hundreds of thousands of years already, but the augmentation process has accelerated recently. It is, however, still very difficult for computers to consume (on their own) raw and unstructured information and convert it into a format that simple algorithms can handle. For example, a lot of mathematics may be automatable once a proper formalization is done, but formalization seems to be a much more difficult process than finding correlations in data.
In conclusion, many of Kai-Fu Lee's statements are impossible to disagree with. Most importantly, China is rapidly becoming a scientific (and AI) powerhouse. In that, there has been a lot of complacency in the US (and other Western countries) with respect to this change. Not only is there little progress in improving basic school education and increasing the spending on fundamental science, but the competitiveness of US companies has been adversely affected by regressive immigration policies (especially during the Trump presidency). True, the West is still leading, but China is catching up quickly. This is especially worrisome given China's recent history of bullying neighboring states. The next Sputnik moment is coming and we had better be prepared.
Submitted by srchvrs on Thu, 04/15/2021 - 13:17
Due to high annotation costs, making the best use of existing human-created training data is an important research direction. We, therefore, carried out a systematic evaluation of the transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In contrast, each of our collections has a substantial number of queries, which enables a full-shot evaluation mode and improves the reliability of our results. Furthermore, since source dataset licenses often prohibit commercial use, we compare transfer learning to training on pseudo-labels generated by a BM25 scorer. We find that training on pseudo-labels—possibly with subsequent fine-tuning using a modest number of annotated queries—can (sometimes) produce a competitive or better model compared to transfer learning. I am quite happy that our study was accepted for presentation at SIGIR 2021.
We have tried to answer several research questions related to the usefulness of transfer learning and pseudo-labeling in the small- and big-data regimes. It was quite interesting to verify the pseudo-labeling results of the now well-known paper by Dehghani, Zamani, and colleagues, "Neural ranking models with weak supervision," where they showed that training a student neural network using BM25 as a teacher model allows one to greatly outperform BM25. Dehghani et al. trained a pre-BERT neural model using an insane amount of computation. However, we thought a BERT-based model, which is already massively pre-trained, could be fine-tuned more effectively. And, indeed, on all the collections we were able to outperform BM25 in just a few hours. However, the gains were rather modest: 5-15%.
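To make the recipe concrete, here is a minimal Python sketch of generating weakly supervised training triples with a BM25 teacher. The toy corpus, the triple format, and the choice of the rank_bm25 package are my illustrative assumptions, not our actual pipeline:

```python
# A minimal sketch of BM25-teacher pseudo-labeling (illustration only).
import random
from rank_bm25 import BM25Okapi

corpus = [
    "bm25 is a classic lexical retrieval model",
    "bert is a pre-trained transformer encoder",
    "pseudo-labels let us train without human annotation",
]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def make_pseudo_labels(queries, n_neg=1):
    """For each query, treat the top BM25 document as a positive and
    randomly sampled lower-ranked documents as negatives."""
    triples = []
    for q in queries:
        scores = bm25.get_scores(q.split())
        ranking = sorted(range(len(corpus)), key=lambda i: -scores[i])
        pos = ranking[0]
        for neg in random.sample(ranking[1:], n_neg):
            triples.append((q, corpus[pos], corpus[neg]))  # (query, pos, neg)
    return triples

# These triples would then feed a standard pairwise loss when
# fine-tuning a BERT-based ranker.
print(make_pseudo_labels(["lexical retrieval model", "train without annotation"]))
```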
In that, we find that transfer learning has mixed success, which is not totally surprising due to a potential distribution shift: Pseudo-labeling, in contrast, uses only in-domain text data. Even though transfer learning and/or pseudo-labeling can both be effective, it is natural to try improving the model using a small number of available in-domain queries. However, this is not always possible due to the "A Little Bit Is Worse Than None" phenomenon, where training on small amounts of in-domain data degrades performance. Previously it was observed on Robust04, but we confirm it can happen elsewhere as well. Clearly, future work should focus on fixing this issue.
We also observe that beating BM25 sometimes requires quite a few queries. Some other groups obtained better results when training/fine-tuning a BERT-based model from scratch using a few queries (without using a transferred model). One reason why this might be the case is that our collections have rather shallow pools of judged documents (compared to TREC collections): MS MARCO has about one positive example per query and other collections have three to four. Possibly, few-shot training can be improved with target-corpus pre-training. We have found, though, that target-corpus pre-training is only marginally useful in the full-data regime. Thus, we did not use it in the few-data regime. In retrospect, this could have made a difference and we need to consider this option in the future, especially IR-specific pre-training approaches such as PROP. Finally, it was also suggested that we compare fine-tuning BERT with fine-tuning a sequence-to-sequence model, as the latter may train more effectively in the small-data regime.
Submitted by srchvrs on Thu, 03/04/2021 - 21:52
We studied the utility of the lexical translation model (IBM Model 1) for English text retrieval, in particular, its neural variants that are trained end-to-end. I am quite happy that our study is going to be presented at ECIR 2021. Using the traditional and/or neural Model 1, we produced the best neural and non-neural runs on the MS MARCO document ranking leaderboard in late 2020. Also, at the time of writing this blog post, our BERT-Model1 submission holds the second place. Besides leaderboarding, we have made several interesting findings related to efficiency, effectiveness, and interpretability, which we describe below. Of course, getting strong results requires more than a good architecture, but we find it interesting that some of the top submissions can be achieved using a partially interpretable model.
First of all, given enough training data, the traditional, i.e., non-neural, IBM Model 1 can substantially boost the performance of a retrieval system: Using the traditional Model 1, we produced the best traditional run on the MS MARCO leaderboard as of 2020/12/06. However, the non-neural Model 1 does not work very well when queries are much shorter than the respective relevant documents. We suspect this was the main reason why this model was not used much by the retrieval community in the past.
We can, nevertheless, come up with an effective neural parametrization of this traditional model, which leads to a substantial improvement on MS MARCO data (for both passage and document retrieval). Furthermore, the resulting context-free neural Model 1 can be pruned: As a result, we get a sparse matrix of conditional translation probabilities. Sparsification does not decrease accuracy, but the sparsified model can run on CPU thousands of times faster compared to a BERT-based ranker. This model can improve the performance of the candidate-generation stage without expensive index-time precomputation or query-time manipulation of large tensors. We are not aware of any other neural re-ranking model that has this nice property.
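To illustrate why the pruned model is so cheap to run, here is a minimal sketch of scoring with a pruned translation table stored as a SciPy sparse matrix. The toy vocabulary, the probability values, and the uniform $P(d|D)$ are made up for illustration and are not our actual model:

```python
# A toy sketch of CPU scoring with a pruned Model 1 translation table.
import numpy as np
from scipy.sparse import csr_matrix

# Toy vocabulary: token -> integer id.
vocab = {"car": 0, "automobile": 1, "engine": 2, "motor": 3}

# Pruned translation probabilities T(q|d): entries below a small
# threshold were dropped, so most of the matrix is zero.
rows = [0, 1, 2, 3, 0, 2]            # query-token ids q
cols = [0, 0, 2, 2, 1, 3]            # document-token ids d
vals = [0.6, 0.3, 0.7, 0.2, 0.5, 0.4]
T = csr_matrix((vals, (rows, cols)), shape=(len(vocab), len(vocab)))

def model1_log_score(query_ids, doc_ids, T, eps=1e-9):
    """log P(Q|D) = sum_q log sum_d T(q|d) * P(d|D), uniform P(d|D)."""
    p_d = 1.0 / len(doc_ids)
    score = 0.0
    for q in query_ids:
        # Sum translation probabilities over the document's tokens.
        p_q = sum(T[q, d] for d in doc_ids) * p_d
        score += np.log(p_q + eps)   # eps guards against zero probabilities
    return score

query = [vocab["car"], vocab["engine"]]
doc = [vocab["automobile"], vocab["motor"]]
print(model1_log_score(query, doc, T))
```

Because the scorer only touches a handful of non-zero entries per query token, it avoids any large tensor operations at query time.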
A neural Model 1 can also be used as an aggregator layer on top of contextualized embeddings produced by BERT. This layer is quite interpretable: BERT-Model1 generates a single similarity score for every pair of query and document tokens, which can be interpreted as a conditional translation probability. Then these scores are combined using a standard product-of-sums formula:
$$
P(Q|D)=\prod\limits_{q \in Q} \sum\limits_{d \in D} T(q|d) P(d|D),
$$
where $Q$ is a query, $q$ is a query token, $D$ is a document, and $d$ is a document token. Although more studies are needed to verify this hypothesis, we think that having an interpretable layer can be useful for model debugging. In any case, this layer has better interpretability compared to prior work, which uses a kernel-based formula by Xiong et al. to compute soft-match counts over contextualized embeddings. Because each query-document token pair produces several soft-match values corresponding to different thresholds, it is problematic to aggregate these values in an explainable way.
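For illustration, here is a hedged PyTorch sketch of such an aggregator layer. Mapping similarities to translation probabilities via a sigmoid and using a uniform $P(d|D)$ are my simplifying assumptions; the actual parametrization in our paper differs:

```python
# A simplified sketch of a Model 1 aggregator over contextualized
# token embeddings. Assumptions (not the paper's exact design):
# T(q|d) is a sigmoid of the dot product, and P(d|D) is uniform.
import torch

def model1_aggregate(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """q_emb: [|Q|, h] query token embeddings (e.g., from BERT);
    d_emb: [|D|, h] document token embeddings. Returns log P(Q|D)."""
    sims = q_emb @ d_emb.T              # one score per (q, d) pair
    T = torch.sigmoid(sims)             # pseudo-probabilities T(q|d)
    p_d = 1.0 / d_emb.shape[0]          # uniform P(d|D)
    p_q = (T * p_d).sum(dim=1)          # sum_d T(q|d) * P(d|D)
    return torch.log(p_q).sum()         # log of the product over q in Q

# Example with random embeddings standing in for real BERT outputs:
q_emb, d_emb = torch.randn(5, 768), torch.randn(200, 768)
print(model1_aggregate(q_emb, d_emb))
```

The key property is that the matrix `T` gives one inspectable score per token pair, which is what makes the layer amenable to debugging.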
In conclusion, we note that this partial interpretability comes virtually for free: It does not degrade efficiency or accuracy. In fact, BERT-Model1 has slightly better accuracy compared to a vanilla BERT ranker (monoBERT) that makes predictions on truncated documents. This small accuracy gain, however, was likely key to obtaining strong results on MS MARCO.
Submitted by srchvrs on Wed, 03/03/2021 - 21:02
It was both sad and enlightening to (virtually) attend a memorial honoring the former Language Technologies Institute director and founder Jaime Carbonell, who passed away one year ago, well before his time. Jaime started working on NLP and machine translation when very few people believed these tasks were doable. It was a risky move, which required a lot of courage, foresight, and energy, not to mention scientific and organizational talent. He had it all and he had a huge impact on the field as a scientist, an advisor, and a leader of an influential language-research institution. I had very few personal interactions with Jaime, but my life was greatly impacted by the institute that Jaime created and that ventured to take an old dude like myself aboard. I hope we, all the former and current students, can become the technology and thought leaders Jaime wanted us to be.
Submitted by srchvrs on Thu, 01/28/2021 - 12:17
This was prompted by several recent posts, in particular, by Zach Lipton's tweet, in which he complained that the ML culture has been revolving around hacking: "The dominant practice of the applied machine learnist has shifted from ad-hoc feature hacking (2000s) to ad-hoc architecture hacking (2010s) to ad-hoc pre-training hacking (2020s)."
This may seem to be just another (relatively innocent) complaint about the lack of rigor and scholarship in the machine learning field. However, in my opinion, it points to a much bigger issue, namely, a divide between theorists and experimentalists; between tinkerers and scholars. There are very different opinions on both sides of this divide. For example, my friend and co-author Daniel Lemire goes rather far by saying that scholarship is conservative while tinkering is progressive. On the other side, we have people eager to label tinkerers and experimentalists as tech bros or mere engineers.
I do not subscribe to either of these extremes. However, I believe that tinkering has been an essential, if not the primary, engine of progress. There is an understandable desire to explain things, which amounts to building a theoretical model of the world. This is clearly super-useful, but there are limitations. First of all, theories are not bullet-proof: They aim to explain experimental data and they evolve over time. One example is the "contest" between the geocentric and heliocentric systems. At some point, the geocentric system was better supported by data, despite being wrong (in the modern understanding). Somewhat similarly, we had to amend Newton's physics and we will probably have to make many amendments to existing theoretical models as well.
Second, theories are limited, often to a great extent. One has to make assumptions (which never truly hold in practice) as well as a lot of simplifications. One of my favorite examples is the theory of locality-sensitive hashing. Another example is parsing in natural language processing (NLP). Parsing is a crucial component of rule-based (and hybrid) NLP. There has been a lot of effort devoted to making parsing effective, in particular, by training (deep) neural network models to do parsing. Despite being improved by deep learning, parsing is not particularly popular nowadays. One problem, in my opinion, is that the linguistic theory behind parsing explains only a limited number of language phenomena. Thus, these theories (so far) have been more useful for debugging existing neural networks than for building fully-functional applications such as question-answering or sentiment analysis systems.
In summary, I would emphasize that theory is certainly useful: not only to fully understand the world, but also to provide insights for tinkering. That said, I believe it is and will continue to be limited, so we cannot dismiss tinkering as some sort of inferior approach to doing science and engineering. Daniel Lemire also notes that tinkering is dangerous, and it is hard to disagree: The dangers need to be mitigated. However, I do not think it is realistic to expect people to wait until fully-formed useful theories appear, in particular, because theory itself depends on tinkerers producing experimental results.