We studied the utility of the lexical translation model (IBM Model 1) for English text retrieval, in particular, its neural variants that are trained end-to-end. I am quite happy that our study is going to be presented at ECIR 2021. Using traditional and/or neural Model 1 we produced best neural and non-neural runs on the MS MARCO document ranking leaderboard in late 2020. Also, at the moment of writing this blog post, our BERT-Model1 submission holds the second place. Besides leaderboarding, we have made several interesting findings related to efficiency, effectiveness, and interpretability, which we describe below. Of course, getting strong results requires more than a good architecture, but we find it interesting that some of the top submissions can be achieved using a partially interpretable model.
First of all, given enough training data the traditional, i.e., non-neural, IBM Model 1 can sufficiently boost performance of a retrieval system: Using the traditional Model 1, we produced the best traditional run on the MS MARCO leaderboard in 2020/12/06. However, the non-neural Model 1 does not work very well when queries are much shorter than respective relevant documents. We suspect this was the main reason why this model was not used much by the retrieval community in the past.
We can, nevertheless, come up with an effective neural parametrization of this traditional model, which leads to a substantial improvement on MS MARCO data (for both passage and document retrieval). Furthermore, the resulting context-free neural Model 1 can be pruned: As a result we get a sparse matrix of conditional probabilities. Sparsification does not decrease accuracy, but the sparsified model can run on CPU thousands of times faster compared to a BERT-based ranker. This model can improve performance of the candidate-generation stage without expensive index-time precomputation and query-time manipulation with large tensors. We are not aware of any other neural re-ranking model that has this nice property.
A neural Model 1 can also be used as an aggregator layer on top of contextualized embeddings produced by BERT. This layer is quite interpretable: BERT-Model1 generates a single similarity score for every pair of a query and a document token, which can be interpreted as a conditional translation probability. Then these scores are combined using a standard product-of-sum formula:
$$
P(Q|D)=\prod\limits_{q \in Q} \sum\limits_{d \in D} T(q|d) P(d|D),
$$
where $Q$ is a query and $q$ is a query token. $D$ is a document and $d$ is a document token. Although more studies are needed to verify this hypothesis: Yet, we think having an interpretable layer can be useful for model debugging. In any case, this layer has a better interpretability compared to prior work, which uses a kernel-based formula by Xiong et al. to compute soft-match counts over contextualized embeddings. Because each pair of query-document tokens produces several soft-match values corresponding to different thresholds, it is problematic to aggregate these values in an explainable way.
In conclusion, we note that this partial interpretability comes virtually for free. It does not degrade efficiency or accuracy. In fact, BERT-Model 1 has slightly better accuracy compared to a vanilla BERT (monoBERT) that makes predictions on truncated documents. This small accuracy gain, however, was likely key to obtaining strong results on MS MARCO.