A downloadable test collection of tweets - by Royal Sequiera and Jimmy Lin
A Multilingual Corpus of Automatically Extracted Relations from Wikipedia by Google Research
bAbI project - tasks for testing text understanding and reasoning.
CARD-660 - a challenging, yet reliable, benchmark for the evaluation of subword and rare word representation techniques.
ClausIE - Clause-Based Open Information Extraction
CoNLL-2005 Shared Task: Semantic Role Labeling: Data & Software
CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages
Corpus of Contemporary American English
Europarl Parallel Corpus
Free language lessons for computers: some Google NLP data sets
Freebase Annotations of the ClueWeb Corpora, v1 (FACC1)
GeoNames - geographical information
Global Terrorism Database (GTD)
Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)
Google WebTreebank (English Web Treebank) - over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure.
Groningen Meaning Bank - A free semantically annotated corpus that anyone can edit.
Internet Argument Corpus
MCTest - freely available set of 660 stories and associated questions intended for research on the machine comprehension of text.
MedTag - Annotated Corpora - multiple corpora developed at the NCBI and used for research on gene tagging, part of speech tagging, sentence segmentation, and other analysis of medical text, primarily from MEDLINE.
Microsoft Concept Graph
MRC Psycholinguistic Database Machine Usable Dictionary
NIH Word Sense Disambiguation (WSD) Test Collection
NIST TAC Knowledge Base Population (KBP2014) Entity Linking Track
NLP-progress - progress in NLP tasks.
Open Academic Graph
Relation extraction corpus - Manually annotated by humans (provided by Google). See also Google's blog entry.
Syntactic Ngrams over Time
Textual Entailment Resource Pool
The Language Goldmine - links to hundreds of linguistic databases and datasets.
The Massively Multilingual Image Dataset (MMID) - Words and their images in 100 languages
The Stanford Natural Language Inference (SNLI) Corpus
Timebank and other TimeML corpora
United Nations parallel corpora
Universal Dependencies
Visual Genome - an ongoing effort to connect structured image concepts to language
Webis-QSpell-17 - the Webis Query Spelling Corpus 2017 contains 54,772 manually spell-checked web queries; alternative spelling variants are provided for 9,171 of them.
WiC: The Word-in-Context Dataset - benchmark for the evaluation of context-sensitive word embeddings.
WOCHAT data - chat sessions