To carry out a meaningful analysis, you need a dataset that fits the task. In one scenario you need finance-related data for financial analysis; in another you need text data for sentiment analysis or topic modelling. The same need comes up again every time you learn a new algorithm and implement it to build a model.
I have gathered some openly available data sources that should come in handy for anyone looking for public datasets. The list is not a Datasets-pedia, but it will surely help some of you.
So here is a list of public datasets for data science that can get you started:
Airbnb Dataset – Data behind the Inside Airbnb site sourced from publicly available information from the Airbnb site.
Adult Web – Collection of multiple Datasets of Adult Websites.
IMDB Dataset – A dataset of user-curated movie lists from IMDb.com.
Lending Club – Complete/Declined Loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information.
Prosper Loans Network Dataset – Directed network based on loans on the Prosper.com peer-to-peer lending site.
YouTube Dataset – Dataset of Spam Comment Activity.
BBC Datasets – Two text corpora consisting of news articles, particularly suited to evaluating cluster analysis techniques.
Stability Topic Corpora Dataset – Text corpora for benchmarking stability analysis in topic modelling.
Language Modeling Dataset – The WikiText language modelling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
Multi-View Twitter Datasets – Collection of Twitter datasets for evaluating multi-view analysis methods.
Yelp – Image Dataset
FBI – United States Crime Dataset
Academic Torrents – Huge Collection of Datasets
Public Health Data – Datasets from the Centers for Disease Control and Prevention, covering AIDS, births, and cancer.
Biological Dataset – Disease dataset
Cancer Data – National Cancer Institute Datasets
Quandl – Live Financial Dataset
Google Trends – Live Data from Google
World Bank Open Data – Global development data
Reddit Datasets – Various Datasets Updated Daily
Wikipedia – Data dump in Gigabytes
UC Irvine – UCI Machine Learning Repository
re3data.org – Registry of Research Data Repositories
IMF Datasets – International Monetary Fund, Macroeconomic & Financial Data
Labour Datasets – Datasets published by the Bureau of Labor Statistics, United States
Economic Data – Datasets published by the Bureau of Economic Analysis, United States
Data.Gov – United States Government Datasets
Data.Gov.uk – UK Government Datasets
Data.Gov.in – Indian Government Datasets
Open Data Monitor – European Government Data
Data.Europa.EU – The European Union Open Data Portal
Kaggle – Different types of Datasets
Google – Public Datasets published by Google
Datahub.io – Managed and Published collection of Datasets
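Many of the sources above offer their data as CSV downloads. As a first step after downloading one, a quick summary helps you check what you actually got. The snippet below is only a minimal sketch using Python's standard library: the inline sample rows, the column names (status, amount), and the summarize helper are all made up for illustration and do not come from any of the datasets listed here.

```python
import csv
import io
from collections import Counter

# Made-up placeholder rows standing in for a downloaded CSV file.
# The column names and values are hypothetical, not from a real dataset.
SAMPLE_CSV = """loan_id,status,amount
1,Current,5000
2,Late,12000
3,Fully Paid,7500
4,Current,3000
"""


def summarize(csv_file, category_column, numeric_column):
    """Return the row count, per-category counts, and mean of a numeric column."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    total = 0.0
    rows = 0
    for row in reader:
        counts[row[category_column]] += 1
        total += float(row[numeric_column])
        rows += 1
    mean = total / rows if rows else 0.0
    return rows, dict(counts), mean


# With a real download you would pass open("your_file.csv", newline="") instead.
rows, counts, mean = summarize(io.StringIO(SAMPLE_CSV), "status", "amount")
print(rows, counts, mean)
```

For a real file, swap the in-memory sample for the opened file and use that dataset's own column names.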
Personal Notes:
- This list covers a wide variety of datasets and is not domain-specific.
- If you think I missed some useful or important links, let me know in the comments.