Data Sets and Test Collections: Text Mining and NLP | searchivarius.org

Metaphor detection
 

Question Answering (QA)
 

Sentiment Analysis and Opinion Mining
 

   


A Multilingual Corpus of Automatically Extracted Relations from Wikipedia by Google Research  
A Survey of Current Datasets for Vision and Language Research  Ferraro F, Mostafazadeh N, Huang TH, Vanderwende L, Devlin J, Galley M, Mitchell M.
Allen AI sets   - includes, among others, example data sets from the Aristo project.
bAbI project   - tasks for testing text understanding and reasoning.
ClausIE   - Clause-Based Open Information Extraction
CoNLL-2005 Shared Task: Semantic Role Labeling: Data & Software  
CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages  
Conversation AI data sets  
Corpus of Contemporary American English  
Europarl Parallel Corpus  
Free language lessons for computers: some Google NLP data sets  
Freebase Annotations of the ClueWeb Corpora, v1 (FACC1)  
GeoNames   - geographical information
Global Terrorism Database (GTD)  
Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)  
Google WebTreebank (English Web Treebank)   - over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure.
Groningen Meaning Bank   - A free semantically annotated corpus that anyone can edit.
Internet Argument Corpus  
MCTest   - a freely available set of 660 stories and associated questions intended for research on the machine comprehension of text.
MedTag - Annotated Corpora   - multiple corpora developed at the NCBI and used for research on gene tagging, part of speech tagging, sentence segmentation, and other analysis of medical text, primarily from MEDLINE.
Microsoft Concept Graph  
Microsoft Research Paraphrase Phrase Tables  
MRC Psycholinguistic Database Machine Usable Dictionary  
NIH Word Sense Disambiguation (WSD) Test Collection  
NIST TAC Knowledge Base Population (KBP2014) Entity Linking Track  
Open Academic Graph  
PPDB: The Paraphrase Database  
Relation extraction corpus   - manually annotated by humans (provided by Google). See also Google's blog entry.
SensEval & SemEval data  
Syntactic Ngrams over Time  
Textual Entailment Resource Pool  
The Language Goldmine   - links to hundreds of linguistic databases and datasets.
The LDC Corpus Catalog (top ten datasets)   - the LDC's catalog contains hundreds of corpora of language data, including TIPSTER and the Google n-gram collection.
The Stanford Natural Language Inference (SNLI) Corpus  
Timebank and other TimeML corpora  
United Nations parallel corpora  
Universal Dependencies  
Visual Genome   - an ongoing effort to connect structured image concepts to language
Webis-QSpell-17   - the Webis Query Spelling Corpus 2017 contains 54,772 web queries that were manually spell-checked; alternative spelling variants are provided for 9,171 of them.
WOCHAT data   - chat sessions