The ClueWeb09 Tokens data set consists of tokens extracted from the category B HTML files of the ClueWeb09 collection, provided by Carnegie Mellon University. Category B contains 50 million English pages.
These tokens are stored in a plain-text file in order of decreasing frequency. There are about 25 million tokens; I included only those that occur at least 3 times. The following file contains information on token frequencies. Each line has the following format:
[token frequency] [the first line in the dictionary file that contains a token with the given frequency]
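As an illustration of how this format can be used, here is a minimal Python sketch that looks up the frequency of the token stored on a given line of the dictionary file. It assumes that the square brackets above are placeholders, that the two fields are whitespace-separated integers, and that dictionary line numbers start at 1; none of these details is stated on this page. The function names and the sample values are hypothetical.

```python
import bisect
from typing import Iterable, List, Tuple


def parse_frequency_lines(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Parse lines of the assumed form '<frequency> <first dictionary line>'.

    Returns (frequency, first_line) pairs sorted by first_line.
    Lines that do not have exactly two fields are skipped.
    """
    entries = []
    for raw in lines:
        parts = raw.split()
        if len(parts) != 2:
            continue
        entries.append((int(parts[0]), int(parts[1])))
    entries.sort(key=lambda entry: entry[1])
    return entries


def frequency_at_line(entries: List[Tuple[int, int]], line_no: int) -> int:
    """Return the frequency of the token on dictionary line `line_no`.

    Because the dictionary is sorted by decreasing frequency, all tokens
    with the same frequency occupy a contiguous block of lines. The token
    on `line_no` belongs to the last block that starts at or before it.
    """
    starts = [first_line for _, first_line in entries]
    idx = bisect.bisect_right(starts, line_no) - 1
    if idx < 0:
        raise ValueError("line number precedes the first recorded block")
    return entries[idx][0]


# Hypothetical sample: one token occurs 1000 times (line 1), three tokens
# occur 500 times (lines 2-4), and tokens from line 5 on occur 3 times.
sample = ["1000 1", "500 2", "3 5"]
entries = parse_frequency_lines(sample)
freq_line_3 = frequency_at_line(entries, 3)
```

With a real frequency file, the list passed to `parse_frequency_lines` could simply be an open file object, since iterating over it yields lines.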
The data set is in the public domain and can be used for both commercial and academic purposes. The indexing software underwent thorough testing, including unit testing and comparison of its results with the output of a sequential-search utility. Last but not least, the corresponding retrieval software performed well at TREC: our run had a mean average precision of 0.1498, which was perhaps the best among category B runs (see Table 2). This would not have been possible if the indexing algorithms had serious flaws. Thus, we expect this data to be quite reliable.
Should you decide to use it, please reference this page and/or the tech report directly related to this data set (BibTeX file):
Boytsov, L., Belova, A., 2010. Lessons Learned from Indexing Close Word Pairs. In TREC-19: Proceedings of the Nineteenth Text REtrieval Conference. [PDF]