Data Sets and State-of-the-art (SOTA): WWW | searchivarius.org

11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts  
Anchor text for ClueWeb09 Category A  Djoerd Hiemstra
Anchor text for ClueWeb12  
AOL query logs  
ClueWeb09   - This data set is used in the Text Retrieval Conference (TREC). It comes in two categories, A and B. Category A contains approximately 500 million pages in 10 languages. Category B is a subset of A that contains 50 million pages.
ClueWeb09 related data  
ClueWeb12  Jamie Callan et al. - The successor to ClueWeb09.
Common Crawl   - 3.8 billion docs, 100 TB+, 61 million domains.
Dictionaries for Linking Text, Entities and Ideas   - Wikipedia titles paired with the most common phrases used in links that point to them.
Dotbot   - A private company that crawls the Web. Its goal is to make crawled data publicly available.
Finnegan/Quangle   - Synthetic text databases.
Predictive Web Analytics  
TREC FreeBase Queries (Google, Inc)   - Freebase annotations for TREC Million Query Track and Web Track queries.
Web Crawls created by the Laboratory of Web Algorithmics  
Web Data Commons  
WikiLinks: 40 Million Entities in Context   - See the overview on the Google Research blog.