PREAMBLE:When dealing with retrieval, I have been traditionally using TREC NIST evaluation tools (trec_eval and gdeval) for information retrieval. Despite these tools are old, there has been a good amount of effort invested into making them right. Unfortunately, you have to call them as an external tool. Your program forks and runs out of memory. Despite Linux fork is lazy and does not really copy memory, it still happens. It happens even if you use the posix_spawn function and Python's spawn-type creation of new processes: multiprocessing.set_start_method('spawn')

The issue: I decided to switch to scikit-learn or a similarly-interface code (e.g., MatchZoo classes) to compute the IR metrics. I cross-compared results and I have come to the conclusion that very likely all scikit-learn-like packages are fundamentally broken when it comes to computing the mean average precision (MAP) and the normalized discounted cumulative gain NDCG

To compute both of the metrics, one needs two things:

1. The list of relevant documents, where the relevance label can be binary or graded
2. The list of scored/ranked documents.

Ideally, an evaluation tool could ingest this data directly. However, sklearn and other libraries cut the corner by accepting two arrays: y_score and y_true. Effectively each document is paired with its relevance grade, see, e.g., scikit-learn MAP.

Unfortunately, such an evaluation ignores all relevant documents, which are not returned by the system. In that, both NDCG and MAP have a normalizing factor that depends on the number of relevant documents. For example, in my understanding, if your system finds only 10% of all relevant documents, the scikit-learn MAP would produce a 10x larger MAP score compared to NIST trec_eval (and the Wikipedia formula). NDCG is still affected by this issue but to a lesser degree, because scores for omitted relevant documents will be heavily discounted.

I have created the notebook to illustrate this issue using one-query example and the MAP metric. By the way, for some reason, scikit-learn refuses to compute NDCG on this data and fails with a weird error.

Related reading: MAP is often (but not always) a meaningless metric if you do intrinsic evaluation of the k-NN search.