Category: Wiki

11 Data Mining Algorithms

1. Regression & Classification

  • Linear
  • Multivariate Linear
  • Logistic
  • Softmax
  • Vectorization
  • Gradient Calculation
  • Stochastic Gradient Descent (SGD)
  • Optimizers and Objectives
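
As a rough illustration of how several items in this group fit together (a linear model, gradient calculation, and stochastic gradient descent), here is a minimal sketch in plain Python. The function name `sgd_linear_regression`, the learning rate, the epoch count, and the toy data are all assumptions made for illustration; they are not part of the original list.

```python
import random

def sgd_linear_regression(xs, ys, lr=0.05, epochs=1000, seed=0):
    """Fit y = w*x + b (approximately) by SGD on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b

# Hypothetical noise-free toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = sgd_linear_regression(xs, ys)
print(round(w, 3), round(b, 3))
```

On this toy data the fitted `w` and `b` should end up close to 2 and 1. With noisy real data, the learning rate and number of epochs would need tuning.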

2. Regularization

  • Ridge regression

3. Clustering

  • k-Means
  • EM Algorithms

4. Unsupervised Learning

  • Autoencoders
  • PCA Whitening
  • Sparse Coding

5. Neural Networks

  • Perceptrons
  • Backpropagation
  • Restricted Boltzmann Machines
  • Learning Vector Quantization

6. Deep Learning

  • Stacked Autoencoders
  • Convolutional Neural Networks (Feature Extraction, Pooling)
  • Deep Boltzmann Machines
  • Deep Belief Networks

7. Decision Trees

  • ID3
  • C4.5
  • CART (Classification and regression tree)
  • Random Forests

8. Bayesian

  • Naïve Bayes
  • Gaussian Naïve Bayes
  • Bayesian Networks
  • Conditional Random Fields
  • Hidden Markov Models

9. Others

  • Support Vector Machines
  • Evolutionary Methods
  • Reinforcement Learning

10. Dimensionality Reduction

  • PCA

11. Ensemble Methods

  • Boosting
  • Bagging
  • AdaBoost

Public Datasets for doing Data Analysis

Public Datasets for Data Science

To carry out an impactful analysis, you need a dataset that fits the problem. In one scenario you need finance data for financial analysis; in another you need text data for sentiment analysis or topic modelling. The same need comes up again and again while learning new algorithms and implementing them to build models.

I have gathered some openly available data sources that should come in handy for anyone looking for public datasets. The list is not a complete "Datasets-pedia", but it will surely help some of you.

So here is the list of some Public Datasets for Data Science that can get you started:

Airbnb Dataset – Data behind the Inside Airbnb site sourced from publicly available information from the Airbnb site.

Adult Web – Collection of multiple Datasets of Adult Websites.

IMDB Dataset – A dataset of user-curated movie lists from IMDb.com.

Quantopian Data – Quantopian provides free access to many data sets, including US equity pricing and corporate fundamental data.

Lending Club – Complete/Declined Loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information.

Prosper Loans Network Dataset – Directed network based on loans on the Prosper.com peer-to-peer lending site.

YouTube Dataset – Dataset of Spam Comment Activity.

BBC Datasets – Two text corpora consisting of news articles, particularly suited to evaluating cluster analysis techniques.

Stability Topic Corpora Dataset – Text corpora for benchmarking stability analysis in topic modelling.

Language Modeling Dataset – The WikiText language modelling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.

Multi-View Twitter Datasets – Collection of Twitter datasets for evaluating multi-view analysis methods.

Yelp – Image Dataset

FBI – United States Crime Dataset

Academic Torrents – Huge Collection of Datasets

Public Health Data – Datasets from the Centers for Disease Control and Prevention covering AIDS, births, and cancer.

Hospital Compare Datasets – Quality of care at over 4,000 Medicare-certified hospitals

Biological Dataset – Disease dataset

Cancer Data – National Cancer Institute Datasets

Quandl – Live Financial Dataset

Google Trends – Live Data from Google

World Bank Open Data – Global development data

Reddit Datasets – Various Datasets Updated Daily

Wikipedia – Data dump in Gigabytes

UC Irvine – UCI Machine Learning Repository

re3data.org – Registry of Research Data Repositories

IMF Datasets – International Monetary Fund, Macroeconomic & Financial Data

Labour Datasets – Datasets published by the Bureau of Labor Statistics, United States

Economic Data – Datasets published by the Bureau of Economic Analysis, United States

Data.Gov – United States Government Datasets

Data.Gov.uk – UK Government Datasets

Data.Gov.in – Indian Government Datasets

Open Data Monitor – European Government Data

Data.Europa.EU (The European Union Open Data Portal)

Kaggle – Different types of Datasets

Google – Public Datasets published by Google

Datahub.io – Managed and Published collection of Datasets

Gapminder World – Demographic Datasets

538 – Datasets related to Opinion poll analysis, Politics, Economics, and Sports

Open Data Network

AWS Public Data Sets – Public Datasets published by Amazon Web Services

Awesome datasets – List of datasets, hosted on GitHub

Data.world – Free, Downloadable datasets.

Personal Notes:

  • This list covers a wide variety of datasets and is not domain specific.
  • If you think I have missed some useful or important links, let me know in the comments.

 

Data Science Resources

This post is for folks who have just started, or are about to start, learning Data Science. Data science is a very wide field, so plan your journey accordingly. Start with one language, stick with it, and try to understand the basic concepts; they will take you a long way. I have assembled a few resources, especially for the beginner level.

Some Data Science Resources to get you started

Books:

Think Stats: Probability and Statistics for Programmers

Think Bayes: Bayesian Statistics Made Simple

Think Complexity

StatSoft Statistics Textbook 

 

Git & GitHub:

15 minutes to Git

How to Use Git and GitHub 

 

Python:

Intro to Data Processing with Python

Scipy Lecture Notes

Pandas Boot camp

 

Big Data Resources:

Big Data University

Data Visualization:

D3.js Tutorial

Machine Learning Roadmaps:

MetAcademy

Deep Learning: (Not for beginners)

Deep Learning Tutorials

Deep Learning Course Stanford / OpenClassroom

 

Personal Notes:

  • Do not try to learn everything from day 1.
  • Start with the language basics, then learn how to analyse data.
  • Data visualisation will come once you are comfortable with the programming language.
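
To make the second note a little more concrete, here is a small sketch of what a first data-analysis exercise might look like using only the Python standard library. The inline CSV sample, the column names (`city`, `price`), and the helper `mean_price_by_city` are made up for illustration; a real exercise would load one of the public datasets listed earlier instead.

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical sample data standing in for a downloaded CSV file
RAW = """city,price
Paris,120
Paris,80
Berlin,60
Berlin,100
Berlin,70
"""

def mean_price_by_city(csv_text):
    """Group rows by city and return the mean price for each group."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["city"]].append(float(row["price"]))
    return {city: statistics.mean(prices) for city, prices in groups.items()}

summary = mean_price_by_city(RAW)
for city, avg in sorted(summary.items()):
    print(f"{city}: {avg:.2f}")
```

Reading, grouping, and summarising like this is the kind of basic step worth being comfortable with before moving on to visualisation or modelling.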