Data Sets and State-of-the-art (SOTA) | searchivarius.org

Audio and Speech


Duplicate Detection & Record Linkage

Generic IR


Learning to Rank

Question Answering (QA)

Social Networks

State-of-the-art results (SOTA)

Text Mining and NLP
Question Answering (QA), Catalogs/lists, Sentiment Analysis and Opinion Mining...

User behavior




100+ Interesting Data Sets for Statistics (Robert Seaton)
1940 USA census  
6 Dataset Lists Curated by Data Scientists  
A comprehensive list of data sets for machine learning  
Allen Brain Observatory   - standardized in vivo survey of physiological activity in the mouse visual cortex.
Data Depot   - DataDepot is a set of tools for collaboratively uploading, sharing, and analyzing data. You can use DataDepot to track personal data, to explore public data, and to engage with scientific data.
Datasets for Data Mining, Analytics and Knowledge Discovery  
LinkData.org   - a data publishing community/hub website.
Linked Data @ VU  
Mathematical Retrieval Project  
Million Song Dataset  
Nomao datasets   - Data deduplication, learning to rank, online reviews, recommendations, text generation, voting networks.
Open speech corpora list (Josh Meyer)
Pizza&Chili Corpus (Gonzalo Navarro, Paolo Ferragina)
Publicly Available Large Data Sets for DB Research (Daniel Lemire)
RedditSota   - State-of-the-art results for all machine learning problems
Research Pipeline Data sets  
Teens and Online Privacy   - 2012 survey with questions about teens' attitudes towards privacy and their information management practices.
Time series data: classification and clustering datasets   - A very diverse set of clustering/classification data.
UCI Machine Learning Repository   - UC Irvine Machine Learning Repository
Yahoo Data Sets   - Includes n-grams and anonymized query logs.