Data Sets and State-of-the-art (SOTA): Text Mining and NLP | searchivarius.org
 



Catalogs/lists
 

Metaphor detection
 

Question Answering (QA)
 

Sentiment Analysis and Opinion Mining
 

  


A downloadable test collection of tweets   - by Royal Sequiera and Jimmy Lin
A Multilingual Corpus of Automatically Extracted Relations from Wikipedia by Google Research  
bAbI project   - tasks for testing text understanding and reasoning.
CARD-660   - a challenging, yet reliable, benchmark for the evaluation of subword and rare word representation techniques.
ClausIE   - Clause-Based Open Information Extraction
CoNLL-2005 Shared Task: Semantic Role Labeling: Data & Software  
CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages  
Corpus of Contemporary American English  
Europarl Parallel Corpus  
Free language lessons for computers: some Google NLP data sets  
Freebase Annotations of the ClueWeb Corpora, v1 (FACC1)  
GeoNames   - geographical information
Global Terrorism Database (GTD)  
Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)  
Google WebTreebank (English Web Treebank)   - over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure.
Groningen Meaning Bank   - A free semantically annotated corpus that anyone can edit.
Internet Argument Corpus  
MCTest   - a freely available set of 660 stories and associated questions intended for research on the machine comprehension of text.
MedTag - Annotated Corpora   - multiple corpora developed at the NCBI and used for research on gene tagging, part of speech tagging, sentence segmentation, and other analysis of medical text, primarily from MEDLINE.
Microsoft Concept Graph  
Microsoft Research Paraphrase Phrase Tables  
MRC Psycholinguistic Database Machine Usable Dictionary  
NIH Word Sense Disambiguation (WSD) Test Collection  
NIST TAC Knowledge Base Population (KBP2014) Entity Linking Track  
NLP-progress   - tracks progress and the state of the art on NLP tasks.
Open Academic Graph  
PPDB: The Paraphrase Database  
Relation extraction corpus   - manually annotated by humans (provided by Google). See also Google's blog entry.
Syntactic Ngrams over Time  
Textual Entailment Resource Pool  
The Language Goldmine   - links to hundreds of linguistic databases and datasets.
The Natural Language Decathlon   - a challenging data set that spans ten tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, relation extraction, goal-oriented dialogue, database query generation, and pronoun resolution.
The Stanford Natural Language Inference (SNLI) Corpus  
Timebank and other TimeML corpora  
United Nations parallel corpora  
Universal Dependencies  
Visual Genome   - an ongoing effort to connect structured image concepts to language
Webis-QSpell-17   - the Webis Query Spelling Corpus 2017 contains 54,772 manually spell-checked web queries; alternative spelling variants are provided for 9,171 of them.
WiC: The Word-in-Context Dataset   - benchmark for the evaluation of context-sensitive word embeddings.
WOCHAT data   - chat sessions