
Exploring Classic and Neural Lexical Translation Models for Information Retrieval: Interpretability, Effectiveness, and Efficiency Benefits

We studied the utility of the lexical translation model (IBM Model 1) for English text retrieval, in particular, its neural variants that are trained end-to-end. I am quite happy that our study is going to be presented at ECIR 2021. Using the traditional and/or neural Model 1, we produced the best neural and non-neural runs on the MS MARCO document ranking leaderboard in late 2020. Also, at the time of writing this blog post, our BERT-Model1 submission holds second place. Besides leaderboarding, we made several interesting findings related to efficiency, effectiveness, and interpretability, which we describe below. Of course, getting strong results requires more than a good architecture, but we find it noteworthy that some of the top submissions can be achieved using a partially interpretable model.

First of all, given enough training data, the traditional, i.e., non-neural, IBM Model 1 can substantially boost the performance of a retrieval system: Using the traditional Model 1, we produced the best traditional run on the MS MARCO leaderboard as of 2020/12/06. However, the non-neural Model 1 does not work very well when queries are much shorter than their relevant documents. We suspect this was the main reason why this model was not used much by the retrieval community in the past.

We can, nevertheless, come up with an effective neural parametrization of this traditional model, which leads to a substantial improvement on MS MARCO data (for both passage and document retrieval). Furthermore, the resulting context-free neural Model 1 can be pruned, which yields a sparse matrix of conditional translation probabilities. Sparsification does not decrease accuracy, but the sparsified model can run on a CPU thousands of times faster than a BERT-based ranker. This model can improve the performance of the candidate-generation stage without expensive index-time precomputation and without query-time manipulation of large tensors. We are not aware of any other neural re-ranking model that has this nice property.
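To make the efficiency argument concrete, here is a minimal sketch (my own illustration, not the actual FlexNeuART code) of re-scoring a document with a pruned, context-free Model 1. The pruned translation table is kept as a plain Python dictionary, and the smoothing weight self_prob as well as the toy table are made-up values for the example:

```python
# A minimal sketch (not the actual FlexNeuART code) of re-scoring with a
# pruned, context-free Model 1 on a CPU. The pruned translation table is a
# plain dictionary mapping (query_token, doc_token) to T(q|d); self_prob is
# a made-up smoothing weight for exact matches.

from collections import Counter
import math

def model1_log_score(query_tokens, doc_tokens, trans_prob, self_prob=0.05, eps=1e-9):
    """log P(Q|D) = sum_{q in Q} log sum_{d in D} T(q|d) * P(d|D)."""
    doc_len = len(doc_tokens)
    doc_tf = Counter(doc_tokens)          # P(d|D) = tf(d) / |D|
    log_score = 0.0
    for q in query_tokens:
        p_q = 0.0
        for d, tf in doc_tf.items():
            t = trans_prob.get((q, d), 0.0)
            if q == d:                    # mix in a self-translation component
                t = self_prob + (1.0 - self_prob) * t
            p_q += t * tf / doc_len
        log_score += math.log(p_q + eps)
    return log_score

# Toy usage with a hypothetical pruned table:
trans_prob = {("car", "automobile"): 0.3, ("car", "car"): 0.6}
print(model1_log_score(["car"], ["an", "automobile", "review"], trans_prob))
```

Because only the non-zero T(q|d) entries are stored and looked up, the per-document cost is roughly proportional to the number of query-document token pairs that survive pruning, which is why no index-time precomputation or large-tensor manipulation is needed.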

A neural Model 1 can also be used as an aggregator layer on top of contextualized embeddings produced by BERT. This layer is quite interpretable: BERT-Model1 generates a single similarity score for every pair of query and document tokens, which can be interpreted as a conditional translation probability. These scores are then combined using the standard Model 1 product-of-sums formula:

$$
P(Q|D)=\prod\limits_{q \in Q} \sum\limits_{d \in D} T(q|d) P(d|D),
$$

where $Q$ is a query and $q$ is a query token, $D$ is a document and $d$ is a document token. Although more studies are needed to verify this hypothesis, we think that having an interpretable layer can be useful for model debugging. In any case, this layer is more interpretable than prior work, which uses a kernel-based formula by Xiong et al. to compute soft-match counts over contextualized embeddings. Because each pair of query-document tokens produces several soft-match values corresponding to different thresholds, it is problematic to aggregate these values in an explainable way.
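For concreteness, here is a minimal sketch of such an aggregation layer over contextualized embeddings. It is my own illustration rather than the exact layer from the paper: the column-wise softmax standing in for T(q|d) and the uniform P(d|D) are simplifying assumptions.

```python
import numpy as np

def bert_model1_score(query_emb, doc_emb, eps=1e-9):
    """log P(Q|D) = sum_{q in Q} log sum_{d in D} T(q|d) * P(d|D).

    query_emb: [|Q|, dim] and doc_emb: [|D|, dim] are contextualized token
    embeddings (e.g., from BERT). As a stand-in for a learned translation
    head, T(q|d) is obtained by a softmax over query tokens of the
    dot-product similarities; P(d|D) is taken to be uniform.
    """
    sim = query_emb @ doc_emb.T                      # [|Q|, |D|] pairwise scores
    # Normalize each column so that, for a given document token d, the scores
    # over query tokens behave like conditional probabilities T(q|d).
    t = np.exp(sim - sim.max(axis=0, keepdims=True))
    t = t / t.sum(axis=0, keepdims=True)
    p_d = 1.0 / doc_emb.shape[0]                     # uniform P(d|D)
    per_query = (t * p_d).sum(axis=1)                # sum_d T(q|d) * P(d|D)
    return float(np.log(per_query + eps).sum())      # log of the product over q

# Toy usage with random stand-ins for BERT embeddings:
rng = np.random.default_rng(0)
q_emb, d_emb = rng.normal(size=(3, 8)), rng.normal(size=(20, 8))
print(bert_model1_score(q_emb, d_emb))
```

In a sketch like this, the intermediate matrix t holds one score per query-document token pair, which is precisely what makes the layer inspectable for debugging.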

In conclusion, we note that this partial interpretability comes virtually for free: It does not degrade efficiency or accuracy. In fact, BERT-Model1 has slightly better accuracy than a vanilla BERT ranker (monoBERT) that makes predictions on truncated documents. This small accuracy gain, however, was likely key to obtaining strong results on MS MARCO.



Remembering Jaime Carbonell

It was both sad and enlightening to (virtually) attend a memorial honoring the former Language Technologies Institute director and founder Jaime Carbonell, whose untimely death came one year ago. Jaime started working on NLP and machine translation when very few people believed these tasks were doable. It was a risky move, which required a lot of courage, foresight, and energy, not to mention scientific and organizational talent. He had it all, and he had a huge impact on the field as a scientist, advisor, and leader of an influential language-research institution. I had very few personal interactions with Jaime, but my life was greatly impacted by the institute that Jaime created and that ventured to take an old dude like myself aboard. I hope we, all the former and current students, can become the technology and thought leaders Jaime wanted us to be.



Theorists vs experimentalists

This was prompted by several recent posts, in particular, by Zach Lipton's tweet, where he complained that all ML culture has been revolving around hacking: "The dominant practice of the applied machine learnist has shifted from ad-hoc feature hacking (2000s) to ad-hoc architecture hacking (2010s) to ad-hoc pre-training hacking (2020s)."

This may seem to be just another (relatively innocent) complaint about the lack of rigor and scholarship in the machine learning field. However, in my opinion, it represents a much bigger issue, namely, a divide between theorists and experimentalists; between tinkerers and scholars. There are very different opinions on both sides of this divide. For example, my friend and co-author Daniel Lemire goes rather far by saying that scholarship is conservative while tinkering is progressive. On the other side, we have people eager to label tinkerers and experimentalists as tech bros or merely engineers.

I do not subscribe to either of these extremes. However, I believe that tinkering has been an essential, if not the primary, engine of progress. There is an understandable desire to explain things, which amounts to building a theoretical model of the world. This is clearly super-useful, but there are limitations. First of all, theories are not bullet-proof: They aim to explain experimental data and they evolve over time. One example is the "contest" between the geocentric and heliocentric systems: At some point, the geocentric system was better supported by data, despite being wrong (in the modern understanding). Somewhat similarly, we had to amend Newton's physics, and we will probably have to make many amendments to existing theoretical models as well.

Second, theories are limited, often to a great extent. One has to make assumptions (which never truly hold in practice) as well as a lot of simplifications. One of my favorite examples is the theory of locality-sensitive hashing. Another example is parsing in natural language processing (NLP). Parsing is a crucial component of rule-based (and hybrid) NLP, and a lot of effort was devoted to making parsing effective, in particular, by training (deep) neural network models to do parsing. Despite being improved by deep learning, parsing is not particularly popular nowadays. One problem, in my opinion, is that the linguistic theory behind parsing explains only a limited number of language phenomena. Thus, these theories have (so far) been more useful for debugging existing neural networks than for building fully functional applications such as question-answering or sentiment-analysis systems.

In summary, I would emphasize that theory is certainly useful: not only to understand the world more fully, but also to provide insights for tinkering. That said, I believe it is and will continue to be limited, so we cannot dismiss tinkering as some sort of inferior approach to doing science/engineering. Daniel Lemire also notes that tinkering is dangerous, and it is hard to disagree: The dangers need to be mitigated. However, I do not think it is realistic to expect people to wait until fully formed, useful theories appear, in particular, because the emergence of such theories depends on tinkerers producing experimental results.



Traditional IR rivals neural models on the MS MARCO Document Ranking Leaderboard

A few days ago I launched a traditional IR system into (lower layers of) the Transformer cloud. Although inferior to most BERT-based models, it outperformed several neural submissions (as well as all non-neural ones), including two submissions that used a large pretrained Transformer model for re-ranking.

My objectives were:

  • To provide a stronger traditional baseline;
  • To develop a first-stage retrieval system,
    which can be efficient and effective without expensive index-time precomputation.

I have posted a short write-up on arXiv describing the submitted system. The write-up comes with two notebooks, which can be used to reproduce the results.

This work was possible largely due to our own flexible retrieval toolkit FlexNeuART (intended pronunciation: flex-noo-art), which was recently presented at the EMNLP OSS Workshop. FlexNeuART was also instrumental in achieving top spots on the MS MARCO document ranking leaderboard in August and November 2020.



Simple advice for runners

I would like to share a couple of tricks that may make running more pleasurable.

The first trick is obvious in hindsight, but it took me quite a while to figure it out on my own: Take a small wet piece of cloth, which can fit into a side or chest pocket, and wipe your face regularly. For longer runs, you may take more than one piece of cloth. The cloth should be pretty small: I find it inconvenient to run with big towel-like wipes!

The second trick concerns the phone, which needs to be stored somewhere, ideally where it can be accessed easily. It turns out that one of the best holders is the so-called Running Buddy pouch. With the Running Buddy, it is easy to open the cover and adjust the volume. It is also quite easy to take the phone in and out of this pouch.

I have been running with these pouches for several years already. They stick very well: If attached properly, they do not fall off, and in the unlikely event of detachment, it is hard not to notice that the pouch is gone. Note that nowadays phones have become very large, so you will likely need to buy the largest size possible! I have a couple of these (just in case), and I buy them from Running Buddy (BTW, I am not affiliated with them in any way!).

Last but not least: If it is hot, I run without a t-shirt, or with only a reflective vest if I run at night. It might offend somebody's feelings (hopefully not), but it is just a very practical way to reduce overheating during the warm season. Somewhat surprisingly, it is not easy to get sunburnt while running. As a side comment, I find 20C (68F) weather pretty uncomfortable for running: I overheat very easily, and my ideal running temperature is 5-10 degrees Celsius (9-18 degrees Fahrenheit) above the freezing point.


