Prerequisites for getting Apache Spark (pyspark) fully working with Jupyter, i.e. how to integrate a Jupyter notebook with pyspark?
Step 1: – Download and install.
- Download and install Anaconda. (Anaconda comes with lots of packages like Jupyter, ipython, python3 and many more, so there is no need to install these packages explicitly.)
- Download and install Java if it is not already installed (Spark needs the JVM to run).
To check whether Java is installed, run in a terminal: $ java -version or $ which java (which prints the path of the java executable).
- Download Spark, untar it, move it to your desired location, and preferably rename the folder to spark.
- Some data (in CSV format) to verify that Apache Spark is working properly (see the snippet at the end of Step 5).
Step 2: – Set up the environment variables.
- Copy the path of your preferred installation and then open /etc/environment using nano or your favorite text editor. Note: when setting an environment variable, give the path of the folder, not of the executable file.
$ sudo nano /etc/environment
- Add these lines (adjusting the paths to your system):
JAVA_HOME="/usr/lib/jvm/java-8-oracle"
PATH=/path/of/Anaconda/bin:$PATH   # (the Anaconda bin directory contains jupyter, ipython, python3)
To see the current PATH: $ echo $PATH
Note again: executables are searched for and run in the order in which their directories appear in the output of echo $PATH.
- Reload the environment variable file by running:
source /etc/environment
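To confirm the new variables are visible, a quick sanity check (a sketch, assuming the paths above) can be run from python3 in the Anaconda installation. Note that /etc/environment is normally applied at login, so you may need to log out and back in, or export the variables in your current shell, before they show up:

import os
# JAVA_HOME should print the JVM folder set in /etc/environment
print(os.environ.get("JAVA_HOME"))
# The first PATH entry should be Anaconda's bin directory, so that
# jupyter, ipython and python3 resolve there before the system copies
print(os.environ.get("PATH", "").split(":")[0])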
Step 3: – Configure the Apache Spark file spark-env.sh in the conf folder.
- cd /path/of/your/spark/folder/spark/conf/
- cp spark-env.sh.template spark-env.sh
- nano spark-env.sh
- Add these lines:
export PYSPARK_PYTHON=/path/of/anaconda/bin/python3
export PYSPARK_DRIVER_PYTHON=/path/of/anaconda/bin/jupyter
JAVA_HOME=/path/of/java   # e.g. /usr/lib/jvm/java-8-oracle
Step 4: – Configure the Apache Spark pyspark file in the bin folder.
- Go to line 85 and add this:
export PYSPARK_DRIVER_PYTHON="jupyter"
- Go to line 86 and add this:
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
- Save all the files.
Step 5: – To launch pyspark in Jupyter, which is a web-browser-based version of IPython, use:
PYSPARK_DRIVER_PYTHON_OPTS="notebook" /path/of/your/spark/folder/spark/bin/pyspark
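Once the notebook opens in the browser, you can use the CSV data from Step 1 to confirm that everything is wired up. Below is a minimal sketch for the first notebook cell; sc (the SparkContext) is created automatically by bin/pyspark, and data.csv is only a placeholder name for whichever CSV file you downloaded:

# Run in the first cell of the notebook opened in Step 5.
# sc (the SparkContext) is already provided by bin/pyspark.
# "data.csv" is a placeholder for the CSV file from Step 1.
lines = sc.textFile("data.csv")
print(lines.count())                              # number of rows in the file
print(lines.first())                              # header / first record
rows = lines.map(lambda line: line.split(","))    # split each row into columns
print(rows.take(2))                               # first two parsed rows

If the row count and the first records print without errors, Spark and Jupyter are talking to each other correctly.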