Question Answering (QA)

30M Factoid Question-Answer Corpus

AI2 Science Questions v2 (May 2017)

Amazon QA data - Question and Answer data from Amazon, totaling around 1.4 million answered questions.

Cornell NLVR - a language grounding dataset. It contains 92,244 pairs of natural language statements grounded in synthetic images. The task is to determine whether a sentence is true or false about an image.

DeepMind QA dataset

Jeopardy data

MS MARCO - A Reading Comprehension Dataset for the Artificial Intelligence research community.

NewsQA - a machine reading comprehension data set similar to SQuAD.

NLIWOD - A collection of tools, utilities, datasets and approaches for realizing natural language interfaces for the Web of Data. The current focus is on Question Answering (QA) utilities.

Question answering dataset featured in "Teaching Machines to Read and Comprehend"

QuestionBank

Quora data set of about 400K question pairs

SearchQA - A Q&A dataset augmented with context from search engine snippets.

Ubuntu dialogs (IRC chats)

WebQuestions - A set of roughly 6K questions intended to be answerable using Freebase, a large knowledge graph. See also a CodaLab worksheet.

WikiQA - a data set with answer-bearing sentences