How should I integrate PySpark with Jupyter Notebook on Ubuntu 16.04?
Prerequisites for a fully working Apache Spark (PySpark) setup with Jupyter, i.e. how to integrate Jupyter Notebook and PySpark.
Step 1: – Download and Install.
- Download and install Anaconda. (Anaconda ships with many packages, including Jupyter, IPython and Python 3, so there is no need to install these separately.)
- Download and install Java if it is not already installed (Spark runs on the JVM).
To check whether Java is installed, run one of these commands in a terminal: $ java -version or $ which java (the latter returns the path of the java executable).
- Download Spark, untar it, move it to your desired location, and preferably rename the folder to spark.
- Have some data (in CSV format) ready to check that Apache Spark is working properly.
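The checks in Step 1 can be sketched as a small script. This is only an illustration: check_tool is a hypothetical helper name, and the list of tools (java, python3, jupyter) is taken from the prerequisites above.

```shell
# Report whether a command is on PATH and where it resolves to.
# check_tool is a hypothetical helper, not part of Anaconda or Spark.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found"
    fi
}

# Check the tools that Step 1 expects to be installed.
for tool in java python3 jupyter; do
    check_tool "$tool"
done
```

Any tool reported as "not found" is either not installed or not yet on the PATH, which Step 2 addresses for the Anaconda tools.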
Step 2: – Setting up the Environment Variable.
- Copy the path of your preferred installation and then open nano or your favorite text editor. Note that when setting an environment variable you give the path of the folder, not of the executable file.
$ sudo nano /etc/environment
- PATH=/path/of/Anaconda/bin:$PATH # (the Anaconda bin directory contains jupyter, ipython and python3)
To see the PATH: $ echo $PATH
Note again: executables are searched for, and run, in the order in which their directories appear in the output of echo $PATH.
- Reload the environment variable file by running this command
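The reload command itself is missing from this post. A minimal sketch, assuming /etc/environment contains only simple KEY=value lines that a shell can read, is to source the file in the current shell and then inspect the result. The path_entries helper is a hypothetical name used here only to show the lookup order.

```shell
# Assumption: /etc/environment holds only simple KEY=value lines.
# Sourcing it affects the current shell session only.
if [ -f /etc/environment ]; then
    . /etc/environment
fi

# Print PATH-style entries one per line, in the order the shell searches them.
# With no argument, the current PATH is used.
path_entries() {
    printf '%s\n' "${1:-$PATH}" | tr ':' '\n'
}

path_entries
```

If the Anaconda bin directory is listed first, its jupyter, ipython and python3 are the ones that run. Since /etc/environment is read at login, logging out and back in is what applies the change to new sessions.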
Step 3: – Configure the Apache Spark file spark-env.sh in the conf folder
- cd /path/of/your/spark/folder/spark/conf/
- cp spark-env.sh.template spark-env.sh
- nano spark-env.sh
- Add this line:
JAVA_HOME=/path/of/java/usr/lib/jvm/java-8-oracle
Step 4: – Configure the Apache Spark pyspark file in the bin folder
- Go to line 85 and add this
- Go to line 86 and add this
- Save all
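The text to add at lines 85 and 86 is missing from this post. A commonly used pair of settings, offered here as an assumption and not as the author's exact lines, tells the pyspark script to use Jupyter as the driver's Python front end. PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS are environment variables that Spark's bin/pyspark script reads.

```shell
# Assumed additions to bin/pyspark; the original lines are not shown in this post.
# Use the jupyter executable (found via PATH from Step 2) as the driver front end.
export PYSPARK_DRIVER_PYTHON=jupyter
# Start the notebook server when pyspark is launched.
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
```

The same two variables could instead be exported in the shell before launching, which avoids editing the script; the line numbers above may also differ between Spark versions.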
Step 5: – To launch pyspark in Jupyter, which is a web-browser-based version of IPython, use:
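The launch command did not survive in this post. Assuming the earlier steps are in place, a minimal sketch is to run the pyspark script from the Spark folder's bin directory. The launch_pyspark function below is a hypothetical wrapper that first checks the script exists, and the folder path in the usage line is the placeholder from Step 3.

```shell
# Hypothetical wrapper: run bin/pyspark from the given Spark folder,
# or report an error if the script is not there.
launch_pyspark() {
    spark_home="$1"
    if [ -x "$spark_home/bin/pyspark" ]; then
        "$spark_home/bin/pyspark"
    else
        echo "pyspark not found under $spark_home/bin" >&2
        return 1
    fi
}

# Usage (placeholder path, adjust to your installation):
#   launch_pyspark /path/of/your/spark/folder/spark
```

If the driver settings from Step 4 are active, this should start a Jupyter server, and the CSV data from Step 1 can then be loaded in a notebook to confirm that Spark works.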