A Case Study in Web Search using TREC Algorithms. Amit Singhal, Marcin Kaszkiel - On the importance of anchor text.
A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. Mark D. Smucker, James Allan, Ben Carterette - An evaluation of the permutation/randomization test.
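The paired randomization test evaluated by Smucker et al. can be sketched in a few lines: under the null hypothesis the two systems are exchangeable on each topic, so each per-topic score difference may have its sign flipped at random, and the p-value is the fraction of sign-flipped samples whose mean difference is at least as extreme as the observed one. A minimal sketch (the function name and Monte Carlo trial count are illustrative, not taken from the paper):

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test on per-topic score differences.

    scores_a, scores_b: per-topic effectiveness scores (e.g. AP) for two systems.
    Returns a Monte Carlo estimate of the two-sided p-value.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        # Under H0 the system labels are exchangeable per topic,
        # so each difference keeps or flips its sign with probability 1/2.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            extreme += 1
    return extreme / trials
```

An exact version would enumerate all 2^n sign assignments; the Monte Carlo sampling above is the usual approximation for realistic topic counts.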
A Statistical Analysis of TREC-3 Data. Jean Tague-Sutcliffe, James Blustein - A seminal paper applying multiple-comparison adjustments to IR experiments.
Agreement Among Statistical Significance Tests for Information Retrieval Evaluation at Varying Sample Sizes. Mark D. Smucker, James Allan, Ben Carterette
Bias and the limits of pooling for large collections. Chris Buckley, Darrin Dimmick, Ian Soboroff, Ellen M. Voorhees
Click Models for Web Search. Aleksandr Chuklin, Ilya Markov, Maarten de Rijke
Comparing the Sensitivity of Information Retrieval Metrics. Filip Radlinski, Nick Craswell
Do TREC Web Collections Look Like the Web? Ian Soboroff
Evaluating the performance of information retrieval systems using test collections. Paul Clough, Mark Sanderson - A survey of test collections and evaluation methods.
Expected Reciprocal Rank for Graded Relevance. Olivier Chapelle, Donald Metzler, Ya Zhang, Pierre Grinspan
Forming Test Collections with No System Pooling
How Reliable are the Results of Large-Scale Information Retrieval Experiments? Justin Zobel
Improvements That Don’t Add Up. Timothy G. Armstrong, Alistair Moffat, William Webber, Justin Zobel
Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability. Mark Sanderson, Justin Zobel
Minimal Test Collections for Retrieval Evaluation. Ben Carterette, James Allan, Ramesh Sitaraman
Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. Benjamin A. Carterette
Novelty and diversity in information retrieval evaluation. Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, Ian MacKinnon
Nuggeteer: Automatic Nugget-Based Evaluation using Descriptions and Judgements. Gregory Marton, Alexey Radul
On Rank Correlation in Information Retrieval Evaluation. Massimo Melucci
On the Robustness of Relevance Measures with Incomplete Judgments. Tanuja Bompada, Chi-Chao Chang, John Chen, Ravi Kumar, Rajesh Shenoy
Power and Bias of Subset Pooling Strategies. Gordon V. Cormack, Thomas R. Lynam
Predicting Query Performance. Steve Cronen-Townsend, Yun Zhou, W. Bruce Croft
Quantifying Test Collection Quality Based on the Consistency of Relevance Judgements
Ranking Retrieval Systems without Relevance Judgments. Ian Soboroff, Charles Nicholas, Patrick Cahan
Selecting good expansion terms for pseudo-relevance feedback. Guihong Cao, Jian-Yun Nie, Jianfeng Gao, Stephen Robertson
Statistical inference in retrieval effectiveness evaluation. Jacques Savoy
Statistical Power in Retrieval Experimentation. William Webber, Alistair Moffat, Justin Zobel
Statistical Precision of Information Retrieval Evaluation. Gordon V. Cormack, Thomas R. Lynam
Test Collection Based Evaluation of Information Retrieval Systems. Mark Sanderson
TREC: Experiment and Evaluation in Information Retrieval. Ellen M. Voorhees, Donna K. Harman
Using Statistical Testing in the Evaluation of Retrieval Experiments. David Hull
Validity and power of t-test for comparing MAP and GMAP. Gordon V. Cormack, Thomas R. Lynam