Data Sets and State-of-the-art (SOTA): Text Mining and NLP

Catalogs/lists

Metaphor detection

Paraphrasing

Question Answering (QA)

Sentiment Analysis and Opinion Mining

Summarization

A downloadable test collection of tweets - by Royal Sequiera and Jimmy Lin

A Multilingual Corpus of Automatically Extracted Relations from Wikipedia by Google Research

bAbI project - tasks for testing text understanding and reasoning.

CARD-660 - a challenging, yet reliable, benchmark for the evaluation of subword and rare word representation techniques.

ClausIE - Clause-Based Open Information Extraction

CoNLL-2005 Shared Task: Semantic Role Labeling: Data & Software

CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages

Corpus of Contemporary American English

Europarl Parallel Corpus

Free language lessons for computers: some Google NLP data sets

Freebase Annotations of the ClueWeb Corpora, v1 (FACC1)

GeoNames - geographical information

Global Terrorism Database (GTD)

Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)

Google WebTreebank (English Web Treebank) - over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure.

Groningen Meaning Bank - A free semantically annotated corpus that anyone can edit.

Internet Argument Corpus

MCTest - a freely available set of 660 stories and associated questions intended for research on the machine comprehension of text.

MedTag - Annotated Corpora - multiple corpora developed at the NCBI and used for research on gene tagging, part of speech tagging, sentence segmentation, and other analysis of medical text, primarily from MEDLINE.

Microsoft Concept Graph

MRC Psycholinguistic Database Machine Usable Dictionary

NIH Word Sense Disambiguation (WSD) Test Collection

NIST TAC Knowledge Base Population (KBP2014) Entity Linking Track

NLP-progress - progress in NLP tasks.

Open Academic Graph

Relation extraction corpus - manually annotated by humans (provided by Google). See also Google's blog entry.

Syntactic Ngrams over Time

Textual Entailment Resource Pool

The Language Goldmine - links to hundreds of linguistic databases and datasets.

The Massively Multilingual Image Dataset (MMID) - Words and their images in 100 languages

The Stanford Natural Language Inference (SNLI) Corpus

Timebank and other TimeML corpora

United Nations parallel corpora

Universal Dependencies

Visual Genome - an ongoing effort to connect structured image concepts to language

Webis-QSpell-17 - the Webis Query Spelling Corpus 2017 contains 54,772 web queries that were manually spell-checked; alternative spelling variants are provided for 9,171 of them.

WiC: The Word-in-Context Dataset - benchmark for the evaluation of context-sensitive word embeddings.

WOCHAT data - chat sessions