Some datasets that Leonid Boytsov created/collected:

  1. Document identifier data set: an alternative version of the ClueWeb09 gaps data set plus Gov2 gaps.

  2. Datasets provided with the Non-Metric Space Library. This bundle includes, among other data files, TFxIDF sparse vectors for the whole English Wikipedia.

  3. Datasets used in my papers on approximate dictionary search methods. They include the most frequent words from ClueWeb09 (category B), synthetic English and Russian words, as well as DNA sequences extracted from the human genome.

  4. ClueWeb09 gaps. The ClueWeb09 Gap data set represents posting lists extracted from the category B HTML files of the ClueWeb09 collection, provided by Carnegie Mellon University. Category B contains 50 million English pages. Each posting list is encoded as a sequence of gaps, i.e., differences between adjacent document numbers (IDs).

  5. ClueWeb09 tokens. The ClueWeb09 Tokens data set represents tokens extracted from the category B HTML files of the ClueWeb09 collection, provided by Carnegie Mellon University. Category B contains 50 million English pages.
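The gap encoding described in item 4 can be sketched as follows. This is a minimal illustration, not the tooling shipped with the data set; the function names `encode_gaps` and `decode_gaps` are hypothetical.

```python
def encode_gaps(doc_ids):
    """Turn a sorted posting list of document IDs into gaps.

    The first gap is the first ID itself; every subsequent gap is the
    difference between adjacent IDs. Gaps are typically small and thus
    compress well with variable-length integer codes.
    """
    gaps = []
    prev = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def decode_gaps(gaps):
    """Reconstruct the original document IDs via a running sum."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


# Hypothetical posting list for one term:
posting_list = [3, 7, 8, 15, 40]
gaps = encode_gaps(posting_list)  # [3, 4, 1, 7, 25]
assert decode_gaps(gaps) == posting_list
```

Decoding is the inverse prefix-sum of encoding, so the round trip is lossless as long as the original IDs are strictly increasing.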