ClueWeb09 gaps. The ClueWeb09 Gap data set consists of posting lists extracted from the Category B HTML files of the ClueWeb09 collection, provided by Carnegie Mellon University. Category B contains 50 million English pages. Each posting list is encoded as a sequence of gaps, i.e., differences between adjacent document numbers (IDs).
ClueWeb09 tokens. The ClueWeb09 Tokens data set consists of tokens extracted from the Category B HTML files of the ClueWeb09 collection, provided by Carnegie Mellon University. Category B contains 50 million English pages.
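
To make the gap encoding used by the ClueWeb09 gaps data set concrete, the following minimal sketch converts a sorted posting list into gaps and back. The function names `to_gaps` and `from_gaps` are hypothetical, the example IDs are made up, and the sketch assumes that the first gap is taken relative to 0 (so it equals the first document ID); the actual data set may handle the first element differently.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Turn a strictly increasing list of document IDs into gaps.

    Assumption: the first gap is measured from 0, so it equals the first ID.
    """
    gaps = []
    prev = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Recover the document IDs as the running sum of the gaps."""
    return list(accumulate(gaps))


# Illustrative posting list (made-up IDs, not taken from ClueWeb09).
posting_list = [3, 7, 8, 15, 42]
gaps = to_gaps(posting_list)
restored = from_gaps(gaps)
```

Because the document IDs in a posting list are sorted, the gaps are typically much smaller than the IDs themselves, which is what makes this representation a common input for integer compression.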