Due to high annotation costs, making the best use of existing human-created training data is an important research direction. We therefore carried out a systematic evaluation of the transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In contrast, each of our collections has a substantial number of queries, which enables a full-shot evaluation mode and improves the reliability of our results. Furthermore, since source dataset licenses often prohibit commercial use, we compare transfer learning to training on pseudo-labels generated by a BM25 scorer. We find that training on pseudo-labels, possibly with subsequent fine-tuning using a modest number of annotated queries, can (sometimes) produce a competitive or better model compared to transfer learning. I am quite happy that our study has been accepted for presentation at SIGIR 2021.
We have tried to answer several research questions related to the usefulness of transfer learning and pseudo-labeling in the small-data and big-data regimes. It was quite interesting to verify the pseudo-labeling results of the now well-known paper by Dehghani, Zamani, and friends, "Neural ranking models with weak supervision," where they showed that training a student neural network using BM25 as a teacher model allows one to greatly outperform BM25. Dehghani et al. trained a pre-BERT neural model using an insane amount of computation. However, we thought a BERT-based model, which is already massively pre-trained, could be fine-tuned more effectively. And, indeed, on all the collections we were able to outperform BM25 in just a few hours of training. However, the gains were rather modest: 5-15%.
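To make the idea of BM25 as a teacher more concrete, here is a minimal, self-contained sketch of how pseudo-labels could be produced from unlabeled queries and documents. It is not the pipeline from our study: the class `BM25`, the function `make_pseudo_labels`, the whitespace tokenizer, the parameter defaults, and the rule "top-ranked documents are pseudo-positives, the next ones are pseudo-negatives" are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization (a real system would
    # also apply stopping, stemming, etc.).
    return text.lower().split()


class BM25:
    """Tiny in-memory BM25 scorer (illustrative, not an optimized index)."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tf = [Counter(tokenize(d)) for d in docs]
        self.doc_len = [sum(tf.values()) for tf in self.doc_tf]
        self.avg_len = sum(self.doc_len) / len(docs)
        # Document frequency of each term, then a smoothed, non-negative IDF.
        df = Counter()
        for tf in self.doc_tf:
            df.update(tf.keys())
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5))
                    for t, f in df.items()}

    def score(self, query, idx):
        tf = self.doc_tf[idx]
        # Length normalization term shared by all query terms.
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[idx] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total


def make_pseudo_labels(bm25, query, num_pos=1, num_neg=2):
    """Return (query, doc_index, label) triples derived from the BM25 ranking.

    Assumption: the top `num_pos` matching documents are treated as
    pseudo-positives and the next `num_neg` ranked documents as
    pseudo-negatives. The labels are noisy by construction.
    """
    ranked = sorted(range(len(bm25.doc_tf)),
                    key=lambda i: -bm25.score(query, i))
    positives = [i for i in ranked[:num_pos] if bm25.score(query, i) > 0]
    if not positives:
        return []  # BM25 found nothing, so the query yields no training data
    negatives = ranked[len(positives):len(positives) + num_neg]
    return ([(query, i, 1) for i in positives] +
            [(query, i, 0) for i in negatives])


if __name__ == "__main__":
    docs = [
        "bert ranking models for passage retrieval",
        "cooking pasta with tomato sauce",
        "neural ranking with weak supervision",
        "weather forecast for tomorrow",
    ]
    for example in make_pseudo_labels(BM25(docs), "neural ranking models"):
        print(example)
```

The resulting triples could then be fed to whatever training loop is used for the BERT-based ranker in place of human judgments. Note that a pseudo-negative can easily be a relevant document, which is one plausible reason why a student trained this way gains only modestly over its teacher.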
We find that transfer learning has mixed success, which is not totally surprising given a potential distribution shift. Pseudo-labeling, in contrast, uses only in-domain text data. Even though transfer learning and pseudo-labeling can each be effective, it is natural to try improving the model using a small number of available in-domain queries. However, this is not always possible due to the "A Little Bit Is Worse Than None" phenomenon, where training on small amounts of in-domain data degrades performance. Previously it was observed on Robust04, but we confirm it can happen elsewhere as well. Clearly, future work should focus on fixing this issue.
We also observe that beating BM25 sometimes requires quite a few queries. Some other groups obtained better results when training or fine-tuning a BERT-based model from scratch (without using a transferred model) on a few queries. One reason why this might be the case is that our collections have rather shallow pools of judgments per query (compared to TREC collections): MS MARCO has about one positive example per query and the other collections have three to four. Possibly, few-shot training can be improved with target-corpus pre-training. We have found, though, that target-corpus pre-training is only marginally useful in the full-data regime, so we did not use it in the few-data regime. In retrospect, this could have made a difference, and we need to consider this option in the future, especially IR-specific pre-training approaches such as PROP. Finally, it was also suggested that we compare fine-tuning BERT with fine-tuning a sequence-to-sequence model, as the latter may train more effectively in the small-data regime.