English»Data Sets and State-of-the-art (SOTA)»WWW

11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts

Anchor text for ClueWeb09 Category A (Djoerd Hiemstra)

Anchor text for ClueWeb12

AOL query logs

ClueWeb09 - Used in the Text Retrieval Conference (TREC). The full crawl contains roughly 1 billion pages in 10 languages. TREC defines two categories: Category A, the English portion of approximately 500 million pages, and Category B, a subset of A with approximately 50 million pages.

ClueWeb09 related data

ClueWeb12 (Jamie Callan et al.) - The successor to ClueWeb09.

Common Crawl - 3.8 billion docs, 100 TB+, 61 million domains.

Dictionaries for Linking Text, Entities and Ideas - Wikipedia titles paired with the most common phrases used in links pointing to them.

Dotbot - A private company that crawls the Web. Its goal is to make crawled data publicly available.

Finnegan/Quangle - Synthetic text databases.

Predictive Web Analytics

TREC Freebase Queries (Google, Inc.) - Freebase annotations for TREC Million Query Track and Web Track queries.

Web Crawls created by the Laboratory of Web Algorithmics

Web Data Commons

WikiLinks: 40 Million Entities in Context - See the overview on the Google Research blog.

Wikipedia