
Python Libraries for Data Science

Data Analysis – Machine Learning

Pandas (data preparation)

Pandas helps you carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R. It covers practical, real-world data analysis: reading and writing data, data alignment, reshaping, slicing, fancy indexing and subsetting, size mutability, merging and joining, hierarchical axis indexing, and time-series functionality.
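
As a minimal sketch of that workflow (the file names sales.csv and targets.csv and their columns are made up purely for illustration):

    import pandas as pd

    # "sales.csv", "targets.csv" and their columns are hypothetical examples
    df = pd.read_csv("sales.csv", parse_dates=["date"])

    # Slicing, fancy indexing and subsetting
    recent = df[df["date"] >= "2017-01-01"]

    # Hierarchical axis indexing via groupby
    by_region = recent.groupby(["region", "product"])["revenue"].sum()

    # Merging and joining with another frame
    targets = pd.read_csv("targets.csv")
    merged = recent.merge(targets, on="region", how="left")

    # Time-series functionality: resample daily revenue to monthly totals
    monthly = recent.set_index("date")["revenue"].resample("M").sum()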

See More: Pandas Documentation

Scikit-learn (Machine Learning)

  • Simple and efficient tools for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
  • Built on NumPy, SciPy, and Matplotlib.
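
A minimal sketch using scikit-learn's bundled iris data, chaining preprocessing and classification into a single pipeline:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Preprocessing and classification chained into one estimator
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))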

See More: Scikit-learn Documentation

Gensim (Topic Modelling)

Scalable statistical semantics: analyse plain-text documents for semantic structure and retrieve semantically similar documents.
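
A tiny sketch of the typical Gensim workflow; the toy documents below are far too small for a real topic model and only illustrate the API shape:

    from gensim import corpora, models

    # Toy corpus; real topic modelling needs far more text
    documents = [
        "human machine interface for lab computer applications",
        "a survey of user opinion of computer system response time",
        "graph minors a survey",
    ]
    texts = [doc.lower().split() for doc in documents]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(text) for text in texts]

    # Train a small LDA topic model and inspect the topics
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)
    for topic in lda.print_topics():
        print(topic)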

See More: Gensim Documentation 

NLTK (Natural Language Processing)

Text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, plus wrappers for industrial-strength NLP libraries. Useful for working with corpora, categorising text, and analysing linguistic structure.
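
A short sketch of tokenization, tagging, and stemming with NLTK (the required models have to be downloaded once with nltk.download):

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer

    # One-time downloads: nltk.download("punkt") and
    # nltk.download("averaged_perceptron_tagger")

    sentence = "NLTK makes it easy to tokenize, stem and tag text."
    tokens = word_tokenize(sentence)                    # tokenization
    tags = nltk.pos_tag(tokens)                         # part-of-speech tagging
    stems = [PorterStemmer().stem(t) for t in tokens]   # stemming

    print(tokens)
    print(tags)
    print(stems)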

See More: NLTK Documentation

PyTables (hierarchical datasets)

A package for managing hierarchical datasets, designed to efficiently cope with large amounts of data. It is built on top of the HDF5 library and the NumPy package, and features a fast, object-oriented interface that makes it extremely easy to interactively save and retrieve large amounts of data.
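
A minimal sketch of storing and slicing an array with PyTables (store.h5 and the group/array names are arbitrary examples):

    import numpy as np
    import tables

    data = np.random.rand(1000000)

    # Store the array under a hierarchical group in an HDF5 file
    with tables.open_file("store.h5", mode="w") as h5:
        group = h5.create_group("/", "experiments", "Example group")
        h5.create_array(group, "run1", data, "Random measurements")

    # Read back only a slice instead of loading everything into memory
    with tables.open_file("store.h5", mode="r") as h5:
        print(h5.root.experiments.run1[:10])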

See More: PyTables Documentation

Deep Learning

Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence.

See More: Deep Learning Documentation

Data Visualization

Seaborn 

Seaborn is a Python visualisation library based on Matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
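
For instance, a single high-level call produces a grouped statistical plot from one of Seaborn's bundled example datasets:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # One of Seaborn's bundled example datasets
    tips = sns.load_dataset("tips")

    # High-level statistical plot: tip distribution per day, split by smoker
    sns.boxplot(x="day", y="tip", hue="smoker", data=tips)
    plt.title("Tips by day")
    plt.show()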

See More: Seaborn Documentation

Matplotlib

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. It can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and graphical user interface toolkits.
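
A small sketch of a script-style plot saved as a hardcopy figure:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 200)
    plt.plot(x, np.sin(x), label="sin(x)")
    plt.plot(x, np.cos(x), "--", label="cos(x)")
    plt.xlabel("x")
    plt.ylabel("value")
    plt.legend()

    plt.savefig("waves.png", dpi=300)  # hardcopy output
    plt.show()                         # interactive window / notebook output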

See More: Matplotlib Documentation

Bokeh

Bokeh is a Python interactive visualisation library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.
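
A minimal sketch of an interactive Bokeh plot rendered to a standalone HTML page (the file name and data are illustrative):

    from bokeh.plotting import figure, output_file, show

    output_file("interactive.html")  # render to a standalone HTML page

    p = figure(title="Simple interactive line",
               x_axis_label="x", y_axis_label="y",
               tools="pan,wheel_zoom,box_zoom,reset,hover")
    p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
    p.circle([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], size=8)

    show(p)  # opens the plot in the browser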

See More: Bokeh Documentation

SciPy (scientific computing)

SciPy is a Python library used for scientific and technical computing.

SciPy contains modules for optimisation, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.

SciPy builds on the NumPy array object and is part of the NumPy stack which includes tools like Matplotlib, pandas and SymPy.
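
Two small examples of those modules in action, numerical integration and one-dimensional optimisation:

    import numpy as np
    from scipy import integrate, optimize

    # Numerical integration: integrate sin(x) from 0 to pi (exact answer: 2)
    area, error = integrate.quad(np.sin, 0, np.pi)
    print("integral:", area)

    # Optimisation: minimise a simple one-dimensional function
    result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)
    print("minimum at:", result.x)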

See More: SciPy Documentation

Big Data/Distributed Computing

Hdfs3

hdfs3 is a lightweight Python wrapper around libhdfs3 for interacting with the Hadoop Distributed File System (HDFS).
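
A minimal usage sketch; the NameNode host, port, and paths below are placeholders for your own cluster:

    from hdfs3 import HDFileSystem

    # Host, port and paths are assumptions; point them at your own NameNode
    hdfs = HDFileSystem(host="namenode.example.com", port=8020)

    print(hdfs.ls("/user/data"))                      # list a directory

    with hdfs.open("/user/data/logs.txt", "rb") as f:
        print(f.read(1024))                           # read the first bytes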

See More: Hdfs3 Documentation

Luigi

Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualisation, handling failures, command line integration, and much more.
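
A toy sketch of a Luigi task with an output target and a run step (input.txt is a hypothetical input file):

    import luigi

    class CountLines(luigi.Task):
        """Toy batch job: count the lines of a (hypothetical) input file."""
        path = luigi.Parameter(default="input.txt")

        def output(self):
            return luigi.LocalTarget(self.path + ".count")

        def run(self):
            with open(self.path) as src, self.output().open("w") as dst:
                dst.write(str(sum(1 for _ in src)))

    if __name__ == "__main__":
        # local_scheduler=True runs in-process; the central scheduler adds the web UI
        luigi.build([CountLines()], local_scheduler=True)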

See More: Luigi Documentation

H5py

It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorised and tagged however you want. H5py uses straightforward NumPy and Python metaphors, like dictionary and NumPy array syntax. For example, you can iterate over data sets in a file, or check out the .shape or .dtype attributes of datasets.
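
A short sketch of that dictionary-and-array style of access (measurements.h5 and the dataset names are illustrative):

    import numpy as np
    import h5py

    # Write: thousands of datasets can live in a single file
    with h5py.File("measurements.h5", "w") as f:
        f.create_dataset("experiment/run1", data=np.random.rand(10000))
        f.create_dataset("experiment/run2", data=np.arange(100, dtype="i8"))

    # Read back with dictionary and NumPy-style access
    with h5py.File("measurements.h5", "r") as f:
        run1 = f["experiment/run1"]
        print(run1.shape, run1.dtype)   # metadata without loading the data
        print(run1[:5])                 # slicing reads only what you ask for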

See More: H5py Documentation

Pymongo

PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python.
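
A minimal sketch, assuming a MongoDB server is reachable at the default localhost address (the database and collection names are made up):

    from pymongo import MongoClient

    # Connection string, database and collection names are illustrative
    client = MongoClient("mongodb://localhost:27017/")
    db = client["analytics"]

    db.users.insert_one({"name": "Ada", "signups": 3})
    print(db.users.find_one({"name": "Ada"}))
    print(db.users.count_documents({"signups": {"$gte": 1}}))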

See More: PyMongo Documentation

Dask

Dask is a flexible parallel computing library for analytic computing. Dask has two main components:

  • Dynamic task scheduling optimised for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimised for interactive computational workloads.
  • “Big Data” collections like parallel arrays, data frames, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
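
A small sketch of the dataframe collection; the CSV file pattern and column names are assumptions for illustration:

    import dask.dataframe as dd

    # Lazily read many CSV files as one larger-than-memory dataframe
    # (the file pattern and column names are illustrative)
    df = dd.read_csv("logs/2017-*.csv")

    # Familiar pandas-style operations build a task graph...
    daily_mean = df.groupby("day")["response_time"].mean()

    # ...which the scheduler only executes on .compute()
    print(daily_mean.compute())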

See More: Dask Documentation

Dask.distributed

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters. Distributed serves to complement the existing PyData analysis stack to meet the following needs: low latency, peer-to-peer data sharing, complex scheduling, pure Python, data locality, familiar APIs, and easy setup.
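
A minimal sketch that starts a local cluster and submits work to it:

    from dask.distributed import Client

    # With no address, Client() starts a local cluster of worker processes
    client = Client()

    def square(x):
        return x ** 2

    # Submit work to the cluster and gather the result
    futures = client.map(square, range(10))
    total = client.submit(sum, futures)
    print(total.result())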

See More: Dask.distributed Documentation

Security

  • cryptography
  • pyOpenSSL
  • passlib
  • requests-oauthlib
  • ecdsa
  • pycrypto
  • oauthlib
  • oauth2client
  • wincertstore
  • rsa
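
As a small illustration of the cryptography package from this list, symmetric (secret-key) encryption with Fernet:

    from cryptography.fernet import Fernet

    # Keep the key safe; whoever holds it can decrypt the data
    key = Fernet.generate_key()
    fernet = Fernet(key)

    token = fernet.encrypt(b"sensitive analysis results")
    print(token)
    print(fernet.decrypt(token))   # b'sensitive analysis results'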

So these are some of the Python libraries for data science, data analysis, machine learning, security, and distributed computing.

If you think I missed something, let me know in the comments.

Introduction to iPython Notebook


iPython Notebooks are the best way to showcase your analysis: with their help you can tell stories with your code by embedding different types of visualizations, images, and text. These notebooks are also the simplest way to share your whole code history with your team-mates, just like a blog.

As the name suggests, is iPython only for the Python language?

The answer is no. You can do your analysis or write your code in other popular languages like Julia, Ruby, JavaScript, C#, R, Scala, Cython, Jython, Perl, PHP, Bash, Prolog, Java, C, C++, and many more.

Make sure you install the specific kernel for the particular programming language. By default, the IPython kernel comes preinstalled.

 

Is it iPython Notebook or Jupyter Notebook?

The answer is both. This project was called iPython when it was developed, and it was later merged under a parent project named Jupyter, so that the notebook would not be seen as being only for Python. So in some cases you’ll find people referring to Jupyter notebooks as iPython notebooks. For those who have just started using (or are about to use) the notebook, both are the same thing, so don’t get confused.

Try without installing

Online demo of the Jupyter notebook (try the code in Python, Haskell, R, or Scala).

Installing iPython Notebook

The simplest installation is with the Anaconda Python distribution, available for Windows, Mac, and Ubuntu.

 

Sharing the iPython notebooks

Embedding inside a webpage

  • First download the notebook in .ipynb format.
  • Open the downloaded file .ipynb in notepad (or any other text editor).
  • Select all (Ctrl+A) and copy (Ctrl+C) the contents of the file.
  • Go to https://gist.github.com/
    • Enter the file name with extension & description.
    • Paste the contents that you copied from the .ipynb file into the gist.
  • Click create public gist.
  • Copy the embed code, for example: <script src="https://gist.github.com/AnuragSinghChaudhary/6097a6a447f26d1256fc.js"></script>
  • Paste this code inside the HTML of any web page, and your Python notebook will be embedded inside that page.
  • You'll then be able to see the embedded iPython notebook on that web page.

Personal Notes:

  • I’m using iPython notebooks for all my analysis practice.
  • I have written this post in the context of data science.
  • iPython notebooks can be used in a wide variety of contexts with other programming languages.

Public Datasets for doing Data Analysis

Public Datasets for Data Science

To carry out an impactful analysis you need a dataset of your choice. In one scenario you might need a finance dataset to carry out financial analysis; in another you might need a text dataset for sentiment analysis or topic modelling; there are many such scenarios. The same situation keeps repeating while you learn new algorithms and implement them to create models.

I have tried to gather some openly available data sources that will come in handy for most people looking for datasets. The list is not a Datasets-pedia, but it will surely help some of you.

So here is the list of some Public Datasets for Data Science that can get you started:

Airbnb Dataset – Data behind the Inside Airbnb site sourced from publicly available information from the Airbnb site.

Adult Web – Collection of multiple Datasets of Adult Websites.

IMDB Dataset – A dataset of user-curated movie lists from IMDb.com.

Quantopian Data – Quantopian provides free access to many data sets, including US equity pricing, corporate fundamental data.

Lending Club – Complete/Declined Loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information.

Prosper Loans Network Dataset – Directed network based on loans on the Prosper.com peer-to-peer lending site.

YouTube Dataset – Dataset of Spam Comment Activity.

BBC Datasets – Two text corpora consisting of news articles, particularly suited to evaluating cluster analysis techniques.

Stability Topic Corpora Dataset – Text corpora for benchmarking stability analysis in topic modelling.

Language Modeling Dataset – The WikiText language modelling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.

Multi-View Twitter Datasets – Collection of Twitter datasets for evaluating multi-view analysis methods.

Yelp – Image Dataset

FBI – United States Crime Dataset

Academic Torrents – Huge Collection of Datasets

Public Health Data – Datasets from the Centers for Disease Control and Prevention: AIDS, births, cancer.

Hospital Compare Datasets – Quality of care at over 4,000 Medicare-certified hospitals

Biological Dataset – Disease dataset

Cancer Data: National Cancer Institute Datasets

Quandl – Live Financial Datasets

Google Trends – Live Data from Google

World Bank Open Data – Global development data

Reddit Datasets  – Various Datasets Updated Daily

Wikipedia – Data dump in Gigabytes

UC Irvine – UCI Machine Learning Repository

re3data.org  – Registry of Research Data Repositories

IMF Datasets – International Monetary Fund, MacroEconomic & Financial Data

Labour Datasets – Datasets published by the Bureau of Labor Statistics, United States

Economic Data – Datasets published by Bureau of Economic Analysis, United States

Data.Gov – United States Government Datasets

Data.Gov.uk – UK Government Datasets

Data.Gov.in – INDIAN Government Datasets

Open Data Monitor – Europe Government Data

Data.Europa.EU (The European Union Open Data Portal)

Kaggle – Different types of Datasets

Google – Public Datasets published by Google

Datahub.io – Managed and Published collection of Datasets

Gapminder World  – Demographic Datasets

538 – Datasets related to Opinion poll analysis, Politics, Economics, and Sports

Open Data Network

AWS Public Data Sets – Public Datasets published by Amazon Web Services

Awesome datasets – List of datasets, hosted on GitHub

Data.world – Free, Downloadable datasets.

Personal Notes:

  • This list has the wide variety of datasets, not domain specific.
  • If you think I missed some useful/important links, let me know in the comments.

 

Data Science Resources

Data Science Resources

This post is for folks who have just started, or are about to start, learning data science. As data science is a very wide field, plan your journey accordingly. Start with one language, stick with it, and try to understand the basic concepts; that will take you a long way. I have assembled a few resources, especially for the beginner level.

Some Data Science Resources to get you started

Books:

Think Stats: Probability and Statistics for Programmers

Think Bayes: Bayesian Statistics Made Simple

Think Complexity

StatSoft Statistics Textbook 

 

Git & GitHub:

15 minutes to Git

How to Use Git and GitHub 

 

Python:

Intro to Data Processing with Python

Scipy Lecture Notes

Pandas Boot camp

 

Big Data Resources: 
Big Data University
 
Data Visualization:
D3.js Tutorial 
 
Machine Learning Roadmaps:
MetAcademy

Deep Learning: (Not for beginners)
Deep Learning Tutorials

Deep Learning Course Stanford / OpenClassroom

 

Personal Notes:

  • Do not try to Learn Everything from Day 1.
  • Start with the language basics, learn how to analyse data.
  • The data visualisation part will come when you are comfortable with the programming language.