The ClueWeb09 Gap Data set represents posting lists extracted from the category B HTML files of the ClueWeb09 collection, provided by Carnegie Mellon University. Category B contains 50 million English pages. Each posting list is encoded as a sequence of gaps, i.e., differences between adjacent document numbers (IDs). These gaps are compressed using the variable byte scheme (see the download and decompression instructions below).
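To illustrate what a variable byte scheme does, here is a minimal sketch in Python. It assumes one common layout: seven payload bits per byte, least significant group first, with the high bit set on every byte except the last one of a value. This layout is an assumption made for illustration only; the byte order and flag convention actually used in this data set are defined by the extraction software and its README and may differ. The function names are hypothetical.

```python
def encode_vbyte(values):
    """Encode non-negative integers with an assumed variable byte layout.

    Assumption: 7 payload bits per byte, low-order group first,
    high bit set means "more bytes follow".
    """
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # continuation byte
            v >>= 7
        out.append(v)  # final byte of this value, high bit clear
    return bytes(out)


def decode_vbyte(data):
    """Decode bytes produced by encode_vbyte back into a list of integers."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes belong to the same value
        else:
            values.append(current)
            current = 0
            shift = 0
    if shift:
        raise ValueError("truncated variable-byte sequence")
    return values
```

Small gaps, which dominate in long posting lists, fit in a single byte under such a scheme, which is why gap encoding and variable byte compression are typically combined.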
The document numbers (IDs) are assigned in the same order as documents are stored in warc-files, starting from the folder en0000 and ending with the folder enwp03. Posting lists represent the 1M most frequent words (common stop words are excluded: the link to the stop word file). Different grammar forms (including different verb forms) are conflated using the library Lemmatizer. Note that many posting lists are very large and have more than 10M entries.
The ClueWeb09 data set includes an almost complete English Wikipedia (files in folders enwp01, enwp02, enwp03). Consequently, one can use gap information that relates only to Wikipedia. To this end, one should simply exclude all documents with (zero-based) IDs smaller than 44233099: all documents with IDs larger than or equal to 44233099 belong to Wikipedia. To compute the ID of the n-th document in a posting list, one should sum up the first n gaps.
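The ID computation and the Wikipedia filter can be sketched as follows. This is an illustration in Python, not part of the distributed software; it assumes the gaps of one posting list are already decoded into a list of integers and that the first gap is measured from zero, so that the running sum of gaps yields the document IDs. The function names are hypothetical; only the threshold 44233099 comes from the text.

```python
from itertools import accumulate

# First document ID that belongs to Wikipedia, as stated in the text.
WIKIPEDIA_FIRST_ID = 44233099


def gaps_to_ids(gaps):
    """Return document IDs: the n-th ID is the sum of the first n gaps."""
    return list(accumulate(gaps))


def wikipedia_only(ids):
    """Keep only the IDs that fall into the Wikipedia part of the collection."""
    return [doc_id for doc_id in ids if doc_id >= WIKIPEDIA_FIRST_ID]


def ids_to_gaps(ids):
    """Re-encode a sorted ID list as gaps (first gap measured from zero)."""
    previous = 0
    gaps = []
    for doc_id in ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps
```

For example, the gaps 5, 3, 44233091, 10 correspond to the IDs 5, 8, 44233099, 44233109, of which only the last two belong to Wikipedia.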
The data set is in the public domain and can be used for both commercial and academic purposes. The indexing software underwent thorough testing, including unit testing and comparing results with the outcome of a sequential-search utility. Last but not least, the corresponding retrieval software performed well in TREC evaluations (see Table 2 srchvrs11b). This would not have been possible if the indexing algorithms had serious flaws. Thus, we expect this data to be quite reliable.
Should you decide to use it, please reference this page and/or the tech report directly related to this data set (BibTex file):
Boytsov, L., Belova, A., 2010. Lessons Learned from Indexing Close Word Pairs. In TREC-19: Proceedings of the Nineteenth Text REtrieval Conference. [PDF]
Download and decompression instructions
Note: instructions are given for users of Unix/Linux systems (bash, wget, and 7-zip are required). The data set is divided into several parts (8+ GB in total), which can be downloaded using the following script. When all files are downloaded, they should be unpacked using the following script. Note that both the download and the uncompress scripts also verify MD5 sums.
The file dict represents offsets of compressed postings, while inv keeps the postings themselves. To extract/uncompress gaps, one can use the following software (see the enclosed README for details). After the whole data set is converted using the utility toflat, one obtains a file with the following size and MD5 sum:
- Size: 54897549640 bytes
- MD5: 1d8913e6935bfe708c5a744a6f405848
All MD5 values can also be found in this file.
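The size and MD5 check on the converted file can also be done by hand. The sketch below uses only the Python standard library; the expected size and checksum are the values listed above, while the function names and the idea of passing the output path as an argument are assumptions for illustration (the actual name of the file produced by toflat is not stated here).

```python
import hashlib
import os

# Values listed above for the file produced by toflat.
EXPECTED_SIZE = 54897549640
EXPECTED_MD5 = "1d8913e6935bfe708c5a744a6f405848"


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_size=EXPECTED_SIZE, expected_md5=EXPECTED_MD5):
    """Return True if the file has the expected size and MD5 sum."""
    if os.path.getsize(path) != expected_size:
        return False  # cheap check first; skips hashing a wrong-sized file
    return file_md5(path) == expected_md5
```

Reading in chunks keeps memory use constant, which matters for a file of more than 50 GB.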
In 2013, we updated and expanded the gap data set. This new version can be found here.