| A downloadable test collection of tweets - Royal Sequiera, Jimmy Lin |
| A Multilingual Corpus of Automatically Extracted Relations from Wikipedia by Google Research |
| bAbI project - tasks for testing text understanding and reasoning.
|
| CARD-660 - a challenging, yet reliable, benchmark for the evaluation of subword and rare word representation techniques.
|
| ClausIE - Clause-Based Open Information Extraction
|
| CoNLL-2005 Shared Task: Semantic Role Labeling: Data & Software |
| CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages |
| Corpus of Contemporary American English |
| Europarl Parallel Corpus |
| Free language lessons for computers: some Google NLP data sets |
| Freebase Annotations of the ClueWeb Corpora, v1 (FACC1) |
| GeoNames - geographical information
|
| Global Terrorism Database (GTD) |
| Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1) |
| Google WebTreebank (English Web Treebank) - over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure.
|
| Groningen Meaning Bank - A free semantically annotated corpus that anyone can edit.
|
| Internet Argument Corpus |
| MCTest - freely available set of 660 stories and associated questions intended for research on the machine comprehension of text.
|
| MedTag - Annotated Corpora - multiple corpora developed at the NCBI and used for research on gene tagging, part of speech tagging, sentence segmentation, and other analysis of medical text, primarily from MEDLINE.
|
| Microsoft Concept Graph |
| MRC Psycholinguistic Database Machine Usable Dictionary |
| NIH Word Sense Disambiguation (WSD) Test Collection |
| NIST TAC Knowledge Base Population (KBP2014) Entity Linking Track |
| NLP-progress - progress in NLP tasks.
|
| Open Academic Graph |
| Relation extraction corpus - Manually annotated by humans (provided by Google). See also Google's blog entry.
|
| Syntactic Ngrams over Time |
| Textual Entailment Resource Pool |
| The Language Goldmine - links to hundreds of linguistic databases and datasets.
|
| The Massively Multilingual Image Dataset (MMID) - Words and their images in 100 languages
|
| The Stanford Natural Language Inference (SNLI) Corpus |
| Timebank and other TimeML corpora |
| United Nations parallel corpora |
| Universal Dependencies |
| Visual Genome - an ongoing effort to connect structured image concepts to language
|
| Webis-QSpell-17 - Webis Query Spelling Corpus 2017 contains 54,772 web queries that were manually spell-checked; alternative spelling variants are provided for 9,171 of them.
|
| WiC: The Word-in-Context Dataset - benchmark for the evaluation of context-sensitive word embeddings.
|
| WOCHAT data - chat sessions
|