How should I integrate PySpark with Jupyter Notebook on Ubuntu 16.04?

 

These are the prerequisites and steps for a fully working Apache Spark (PySpark) setup with Jupyter, i.e. how to integrate Jupyter Notebook and PySpark.

Step 1: – Download and Install.

  1. Download and install Anaconda. (Anaconda ships with many packages such as Jupyter, ipython and python3, so there is no need to install these packages explicitly.)
  2. Download and install Java if it is not already installed (Spark runs on the JVM).
    To check whether Java is installed, run one of these commands in a terminal: $ java -version or $ which java (the latter returns the path of the java executable).
  3. Download Spark, untar it, move it to your desired location, and preferably rename the folder to spark.
  4. Get some data (in CSV format) to check that Apache Spark works properly.

Step 2: – Setting up Environment Variables.

  • Copy the path of your preferred installation, then open /etc/environment using nano or your favorite text editor. Note that an environment variable is set to the path of the folder, not the executable file.
    $ sudo nano /etc/environment
  • JAVA_HOME="/usr/lib/jvm/java-8-oracle"
  • PATH=/path/of/Anaconda/bin:$PATH   # (the Anaconda bin directory contains jupyter, ipython, python3)
    To see PATH: echo $PATH
    Note again: executables are searched for and run in the order in which their directories appear in the output of echo $PATH.
  • Reload the environment variable file by running this command:
    source /etc/environment
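
Before moving on, it can help to confirm that the variables from Step 2 are actually visible. The following is a minimal sketch using only the Python standard library; the helper name check_prerequisites is made up for this post, and the default list of executables simply mirrors the tools mentioned above.

```python
import os
import shutil


def check_prerequisites(executables=("java", "jupyter", "python3")):
    """Map each executable name to the path found on PATH, or None if missing.

    check_prerequisites is a hypothetical helper for this walkthrough.
    """
    return {name: shutil.which(name) for name in executables}


if __name__ == "__main__":
    # JAVA_HOME should point at the Java folder, not at the java executable.
    print("JAVA_HOME =", os.environ.get("JAVA_HOME", "<not set>"))
    for name, path in check_prerequisites().items():
        print(f"{name:8s} -> {path or 'NOT FOUND on PATH'}")
```

If jupyter or python3 resolves to a path outside the Anaconda bin directory, another installation appears earlier in PATH.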

Step 3: – Configure Apache Spark file spark-env.sh in conf folder

  • cd /path/of/your/spark/folder/spark/conf/
  • cp spark-env.sh.template spark-env.sh
  • nano spark-env.sh
  • Add these lines:
    export PYSPARK_PYTHON=/Path/of/anaconda//bin/python3
    export PYSPARK_DRIVER_PYTHON=/Path/of/anaconda//bin/jupyter
    JAVA_HOME=/path/of/java/usr/lib/jvm/java8-oracle

Step 4: – Configure Apache Spark pyspark file in bin folder

  • Go to line 85 and add this:
    export PYSPARK_DRIVER_PYTHON=jupyter
  • Go to line 86 and add this:
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
  • Save all.

Step 5: – To launch pyspark in Jupyter, which is a web-browser-based version of IPython, use:

    PYSPARK_DRIVER_PYTHON_OPTS="notebook" /path/of/spark//spark-1/bin/pyspark
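
If you would rather not edit bin/pyspark, the same three variables can be set for a single launch. This is only a sketch under assumptions: the function names are invented for this post, the paths passed in are placeholders, and it assumes spark-env.sh does not set conflicting values.

```python
import os
import subprocess


def pyspark_notebook_env(anaconda_bin, base_env=None):
    """Return an environment in which bin/pyspark starts a Jupyter notebook driver.

    Hypothetical helper; anaconda_bin is the Anaconda bin folder from Step 2.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = os.path.join(anaconda_bin, "python3")
    env["PYSPARK_DRIVER_PYTHON"] = os.path.join(anaconda_bin, "jupyter")
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env


def launch_pyspark_notebook(spark_home, anaconda_bin):
    """Run <spark_home>/bin/pyspark with the notebook settings.

    Blocks until the Jupyter server exits and returns its exit code.
    """
    pyspark = os.path.join(spark_home, "bin", "pyspark")
    return subprocess.call([pyspark], env=pyspark_notebook_env(anaconda_bin))


# Example call (placeholder paths, adjust to your installation):
# launch_pyspark_notebook("/path/of/spark/spark-1", "/path/of/anaconda/bin")
```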




CAP Theorem: Big Data

When you start working with NoSQL databases, sooner or later you will come across the CAP theorem. The theorem, published by Eric Brewer in 2000, describes a trade-off that applies to any distributed system.

In a distributed database system C stands for Consistency, A stands for Availability and P stands for Partition Tolerance.

  • Consistency – Every node in the system has the same view of the data.
  • Availability – Users can read and write from any node.
  • Partition Tolerance – The system keeps operating even if a node or server fails or the network connection between nodes breaks.

So, as the CAP theorem illustration shows, when selecting your NoSQL database you can choose only two of the three characteristics, i.e. A-P, A-C or C-P.

A-P Databases: Voldemort, Cassandra, Riak, CouchDB, Dynamo

A-C Databases: PostgreSQL, MySQL

C-P Databases: HBase, MongoDB, Redis, Big Table
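
The grouping above can be kept as a small lookup table. This is purely an illustrative sketch: the names CAP_GROUPS and databases_for are made up here, and the table only restates the lists in this post.

```python
# Each key is the pair of CAP characteristics a database favours.
CAP_GROUPS = {
    frozenset("AP"): ["Voldemort", "Cassandra", "Riak", "CouchDB", "Dynamo"],
    frozenset("AC"): ["PostgreSQL", "MySQL"],
    frozenset("CP"): ["HBase", "MongoDB", "Redis", "Big Table"],
}


def databases_for(first, second):
    """Return the databases listed for a pair such as ("A", "P"); order does not matter."""
    return CAP_GROUPS[frozenset((first.upper(), second.upper()))]


print(databases_for("c", "p"))
```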

CAP theorem
Credits: Cloudacademy.com

NLTK – Natural Language Processing in Python

In our day-to-day life we generate a lot of data such as tweets, Facebook posts, comments, blog posts and articles. This data is generally written in natural language and falls into the category of semi-structured and unstructured data. Processing such natural language data (unstructured plain text) is called Natural Language Processing (NLP).

The Natural Language Toolkit (NLTK) is a Python library for NLP that deals with natural language such as plain text, words and sentences.

Building blocks of NLTK

  • Tokenizers – Separate the text into words and sentences.

word tokenizer – separates by word

sentence tokenizer – separates by sentence

  • Corpora – Bodies of text, such as any written speech or news article.
  • Lexicon – A dictionary giving the meaning of words, which can differ depending on the context in which they are used.

Let’s understand how NLTK works. Consider a sample_text such as:

sample_text = "Hey jimmy, How are you? Today is my birthday. let's go for lunch today. I'm throwing a party at the Hard rock cafe"
If we would like to separate every sentence, we could do that with ordinary programming, using a condition such as starting a new sentence after every full stop. In some scenarios that condition fails: if the text contains Ms., everything after Ms. is treated as a new sentence.
sample_text = "Ms. Margaret, How are you? Today is my birthday. let's go for lunch today. I'm throwing a party at the Hard rock cafe"
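
To see the problem concretely, here is a minimal sketch of the "new sentence after every full stop" rule in plain Python. The function naive_sentence_split is written just for this post and is not part of NLTK.

```python
def naive_sentence_split(text):
    """Start a new sentence after every '.', '?' or '!' (the naive rule)."""
    sentences, current = [], ""
    for ch in text:
        current += ch
        if ch in ".?!":
            sentences.append(current.strip())
            current = ""
    if current.strip():  # keep trailing text that has no end mark
        sentences.append(current.strip())
    return sentences


sample_text = "Ms. Margaret, How are you? Today is my birthday."
print(naive_sentence_split(sample_text))
# → ['Ms.', 'Margaret, How are you?', 'Today is my birthday.']
```

The title Ms. ends up as a sentence of its own, which is exactly the failure described above.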

 

So NLTK comes to the rescue and separates the body of text (corpus) into sentences and words.
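
A rough sketch of what that can look like follows. It makes two assumptions so that it runs without downloading any NLTK data: the abbreviation "ms" is registered by hand on a Punkt sentence tokenizer, and words are split with the Treebank word tokenizer. The more common route is nltk.sent_tokenize and nltk.word_tokenize, which rely on a pretrained model that must be downloaded once.

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

sample_text = ("Ms. Margaret, How are you? Today is my birthday. "
               "let's go for lunch today. I'm throwing a party at the Hard rock cafe.")

# Assumption: we tell Punkt about the abbreviation ourselves
# (lower-case, without the trailing full stop).
params = PunktParameters()
params.abbrev_types = {"ms"}
sentence_tokenizer = PunktSentenceTokenizer(params)

# Sentence tokenizer: separate by sentence.
sentences = sentence_tokenizer.tokenize(sample_text)

# Word tokenizer: separate each sentence by word.
word_tokenizer = TreebankWordTokenizer()
words = [word_tokenizer.tokenize(sentence) for sentence in sentences]

for sentence, tokens in zip(sentences, words):
    print(sentence)
    print(tokens)
```

Here "Ms." stays attached to the sentence it belongs to instead of being cut off after the full stop.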

 


Spark Standalone Installation – Install Spark to Local Cluster

 

Apache Spark can easily be deployed in standalone mode; all you need to do is install Spark to a local cluster. First download the pre-built Spark and extract it. Then open your terminal, navigate to the extracted Spark directory, and run start-master.sh from sbin, followed by start-slave.sh with the master Spark URL, which is shown at localhost:8080. You have now started a cluster manually.

After that, you can start spark-shell (for Scala), pyspark (for Python) or sparkR (for R) from bin.

  1. Download pre-built Spark.
  2. Extract the downloaded Spark build (you can extract it either from the terminal or manually).
  3. From your terminal, navigate to the extracted folder and start the master from sbin. Command: sbin/start-master.sh
  4. After the master, start the slave, passing the master Spark URL, which you’ll get in your browser at localhost:8080. Command: sbin/start-slave.sh <URL>
  5. After performing steps 3 and 4, you have successfully started the cluster manually.
  6. Now you’ll be able to start applications like spark-shell, pyspark and sparkR (for Scala, Python and R) from bin. Command: bin/spark-shell
  7. Start writing your code or application.
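
The commands from steps 3, 4 and 6 can be collected in one place. This is only a sketch: standalone_commands is a name invented for this post, the paths and the URL are placeholders, and passing --master to spark-shell is my addition so that the shell attaches to the cluster you just started rather than running locally.

```python
import os


def standalone_commands(spark_home, master_url):
    """Return the commands for steps 3, 4 and 6 as argument lists, in order."""
    sbin = os.path.join(spark_home, "sbin")
    return [
        [os.path.join(sbin, "start-master.sh")],
        [os.path.join(sbin, "start-slave.sh"), master_url],
        # Assumption: --master points the shell at the standalone cluster.
        [os.path.join(spark_home, "bin", "spark-shell"), "--master", master_url],
    ]


if __name__ == "__main__":
    # Placeholder values: copy the real URL from the page at localhost:8080.
    for cmd in standalone_commands("/path/to/spark", "spark://localhost:7077"):
        print(" ".join(cmd))
```

The sketch only prints the commands; run them yourself in the terminal, or pass each list to subprocess.check_call.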

 

 Screenshots of Standalone Mode

[Screenshots: Install Spark to Local Cluster, 2.53.05 PM to 4.37.28 PM]


For a quick basic tutorial, refer to the official guide.


Introduction iPython Notebook

iPython Notebooks are the best way to showcase your analysis. With the help of iPython notebooks you can tell stories with your code by embedding different types of visualizations, images and text. They are also the simplest way to share your whole code history with your teammates, just like a blog.

As the name suggests, is iPython only for the Python language?

The answer is no. You can do your analysis or write your code in other popular languages like Julia, Ruby, JavaScript, C#, R, Scala, Cython, Jython, Perl, PHP, Bash, Prolog, Java, C, C++ and many more.

Make sure you install the kernel for the particular programming language you want to use. The iPython kernel is preinstalled by default.

 

Is it iPython Notebook or jupyter Notebook?

The answer is both. The project was called iPython when it was developed, and later it was merged under a parent project named Jupyter Notebook, so that it would no longer be seen as a notebook only for Python. So in some cases you’ll find people referring to Jupyter notebooks as iPython notebooks. For those who have just started using the notebook, or are about to: both are the same thing, so don’t get confused.

Try without installing

Online demo of Jupyter notebook (try the code in Python, Haskell, R or Scala).

Installing iPython Notebook

The simplest installation is with the Anaconda Python distribution, available for Windows, Mac and Ubuntu.

 

Sharing the iPython notebooks

Embedding inside a webpage

  • First download the notebook in .ipynb format.
  • Open the downloaded .ipynb file in Notepad (or any other text editor).
  • Select all (Ctrl+A) the contents of the file and copy them.
  • Go to https://gist.github.com/
    • Enter the file name with its extension and a description.
    • Paste the contents that you copied from the .ipynb file into the gist.
  • Click create public gist.
  • Copy the embed code, for example: <script src="https://gist.github.com/AnuragSinghChaudhary/6097a6a447f26d1256fc.js"></script>
  • Paste this code into the HTML of any web page, and your notebook will be embedded inside that page.
  • You’ll be able to see the embedded iPython notebook in the web page, as in the example.
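
The embed code follows a simple pattern, so it can also be built from the gist's user name and ID. A small sketch, where gist_embed_tag is a helper name made up for this post and the URL pattern is taken from the example above:

```python
from html import escape


def gist_embed_tag(username, gist_id):
    """Build the <script> tag that embeds a public gist in a web page."""
    src = f"https://gist.github.com/{username}/{gist_id}.js"
    # Escape the URL so it is safe inside the HTML attribute.
    return f'<script src="{escape(src, quote=True)}"></script>'


print(gist_embed_tag("AnuragSinghChaudhary", "6097a6a447f26d1256fc"))
# → <script src="https://gist.github.com/AnuragSinghChaudhary/6097a6a447f26d1256fc.js"></script>
```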

Personal Notes:

  • I’m using iPython notebooks for all my analysis practice.
  • I have written this post in the context of data science.
  • iPython notebooks can be used in a wide variety of contexts with other programming languages.

Public Datasets for doing Data Analysis

To carry out an impactful analysis you need a dataset that fits your purpose. In one scenario you need a finance dataset to carry out financial analysis; in another you need a text dataset to carry out sentiment analysis or topic modelling. The same situation keeps repeating while you learn new algorithms and implement them to create models.

I have gathered some openly available data sources that should come in handy for anyone looking for datasets. The list is not a Datasets-pedia, but it will surely help some of you.

So here is a list of some public datasets for data science that can get you started:

Airbnb Dataset – Data behind the Inside Airbnb site sourced from publicly available information from the Airbnb site.

Adult Web – Collection of multiple Datasets of Adult Websites.

IMDB Dataset – A dataset of user-curated movie lists from IMDb.com.

Quantopian Data – Quantopian provides free access to many data sets, including US equity pricing, corporate fundamental data.

Lending Club – Complete/Declined Loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information.

Prosper Loans Network Dataset – Directed network based on loans on the Prosper.com peer-to-peer lending site.

YouTube Dataset – Dataset of Spam Comment Activity.

BBC Datasets – Two text corpora consisting of news articles, particularly suited to evaluating cluster analysis techniques.

Stability Topic Corpora Dataset – Text corpora for benchmarking stability analysis in topic modelling.

Language Modeling Dataset – The WikiText language modelling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.

Multi-View Twitter Datasets – Collection of Twitter datasets for evaluating multi-view analysis methods.

Yelp – Image Dataset

FBI – United States Crime Dataset

Academic Torrents – Huge Collection of Datasets

Public Health Data –  Datasets from Centers for Disease Control and Prevention, Aids, Birth, Cancer.

Hospital Compare Datasets – Quality of care at over 4,000 Medicare-certified hospitals

Biological Dataset – Disease dataset

Cancer Data: National Cancer Institute Datasets

Quandl – Live Financial Dataset

Google Trends – Live Data from Google

World Bank Open Data – Global development data

Reddit Datasets  – Various Datasets Updated Daily

Wikipedia – Data dump in Gigabytes

UC Irvine – UCI Machine Learning Repository

re3data.org  – Registry of Research Data Repositories

IMF Datasets – International Monetary Fund, MacroEconomic & Financial Data

Labour Datasets – Datasets published by the Bureau of Labor Statistics, United States

Economic Data – Datasets published by Bureau of Economic Analysis, United States

Data.Gov – United States Government Datasets

Data.Gov.uk – UK Government Datasets

Data.Gov.in – Indian Government Datasets

Open Data Monitor – Europe Government Data

Data.Europa.EU (The European Union Open Data Portal)

Kaggle – Different types of Datasets

Google – Public Datasets published by Google

Datahub.io – Managed and Published collection of Datasets

Gapminder World  – Demographic Datasets

538 – Datasets related to Opinion poll analysis, Politics, Economics, and Sports

Open Data Network

AWS Public Data Sets – Public Datasets published by Amazon Web Services

Awesome datasets – List of datasets, hosted on GitHub

Data.world – Free, Downloadable datasets.

Personal Notes:

  • This list has a wide variety of datasets and is not domain specific.
  • If you think I missed some useful or important links, let me know in the comments.

 


Data Science Resources

This post is for folks who have just started, or are about to start, learning Data Science. As data science is a very wide field, plan your journey accordingly. Start with one language, stick with it, and try to understand the basic concepts; that will take you a long way. I have assembled a few resources, especially for beginners.

Some Data Science Resources to get you started

Books:

Think Stats: Probability and Statistics for Programmers

Think Bayes: Bayesian Statistics Made Simple

Think Complexity

StatSoft Statistics Textbook 

 

Git & GitHub:

15 minutes to Git

How to Use Git and GitHub 

 

Python:

Intro to Data Processing with Python

Scipy Lecture Notes

Pandas Boot camp

 

Big Data Resources:

Big Data University

 

Data Visualization:

D3.js Tutorial

 

Machine Learning Roadmaps:

MetAcademy

 

Deep Learning: (Not for beginners)

Deep Learning Tutorials

Deep Learning Course Stanford / OpenClassroom

 

Personal Notes:

  • Do not try to learn everything from day 1.
  • Start with the language basics, then learn how to analyse data.
  • The data visualisation part will come once you are comfortable with the programming language.