A downloadable test collection of tweets by Royal Sequiera and Jimmy Lin |
A Multilingual Corpus of Automatically Extracted Relations from Wikipedia by Google Research |
bAbI project - tasks for testing text understanding and reasoning.
|
CARD-660 - a challenging, yet reliable, benchmark for the evaluation of subword and rare word representation techniques.
|
ClausIE - Clause-Based Open Information Extraction
|
CoNLL-2005 Shared Task: Semantic Role Labeling: Data & Software |
CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages |
Corpus of Contemporary American English |
Europarl Parallel Corpus |
Free language lessons for computers: some Google NLP data sets |
Freebase Annotations of the ClueWeb Corpora, v1 (FACC1) |
GeoNames - geographical information
|
Global Terrorism Database (GTD) |
Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1) |
Google WebTreebank (English Web Treebank) - over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure.
|
Groningen Meaning Bank - A free semantically annotated corpus that anyone can edit.
|
Internet Argument Corpus |
MCTest - a freely available set of 660 stories and associated questions intended for research on the machine comprehension of text.
|
MedTag - Annotated Corpora - multiple corpora developed at the NCBI and used for research on gene tagging, part of speech tagging, sentence segmentation, and other analysis of medical text, primarily from MEDLINE.
|
Microsoft Concept Graph |
MRC Psycholinguistic Database Machine Usable Dictionary |
NIH Word Sense Disambiguation (WSD) Test Collection |
NIST TAC Knowledge Base Population (KBP2014) Entity Linking Track |
NLP-progress - progress in NLP tasks.
|
Open Academic Graph |
Relation extraction corpus - manually annotated by humans (provided by Google). See also Google's blog entry.
|
Syntactic Ngrams over Time |
Textual Entailment Resource Pool |
The Language Goldmine - links to hundreds of linguistic databases and datasets.
|
The Massively Multilingual Image Dataset (MMID) - Words and their images in 100 languages
|
The Stanford Natural Language Inference (SNLI) Corpus |
Timebank and other TimeML corpora |
United Nations parallel corpora |
Universal Dependencies |
Visual Genome - an ongoing effort to connect structured image concepts to language
|
Webis-QSpell-17 - the Webis Query Spelling Corpus 2017 contains 54,772 web queries that were manually spell-checked; 9,171 of them come with alternative spelling variants.
|
WiC: The Word-in-Context Dataset - a benchmark for the evaluation of context-sensitive word embeddings.
|
WOCHAT data - chat sessions
|