How do I integrate PySpark with Jupyter Notebook on Ubuntu 16.04?


This guide lists the prerequisites and the setup steps needed to run Apache Spark (pyspark) from a Jupyter notebook.

Step 1: – Download and install.

  1. Download and install Anaconda. (Anaconda ships with Jupyter, IPython, Python 3 and many other packages, so there is no need to install them separately.)
  2. Download and install Java if it is not already installed (Spark runs on the JVM).
    To check whether Java is installed, run either of these commands in a terminal:
    $ java -version
    $ which java    (prints the path of the java executable)
  3. Download Spark, untar it, move it to your desired location, and preferably rename the folder to spark.
  4. Have some data in CSV format ready, to check that Apache Spark works properly.
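
The installation checks in the list above can also be run from Python. The following is only a minimal sketch: the function name find_tools and the tuple REQUIRED_TOOLS are made up for this illustration, and it assumes the programs are expected under the names java, python3 and jupyter.

```python
import shutil

# Programs that Step 1 expects to be installed (names are an assumption;
# adjust the tuple to match your setup).
REQUIRED_TOOLS = ("java", "python3", "jupyter")


def find_tools(names=REQUIRED_TOOLS):
    """Map each program name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print("{:10s} {}".format(name, path or "NOT FOUND on PATH"))
```

A program reported as not found is either not installed or installed in a folder that is not yet on PATH, which is what Step 2 takes care of.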

Step 2: – Set up the environment variables.

  • Copy the path of your preferred installation, then open /etc/environment using nano or your favorite text editor. Note that these variables take the path of a folder, not the path of the executable file.
    $ sudo nano /etc/environment
  • JAVA_HOME="/usr/lib/jvm/java-8-oracle"
  • PATH=/path/of/Anaconda/bin:$PATH   # (the Anaconda bin directory contains jupyter, ipython, python3)
    To see PATH: echo $PATH
    Note again: an executable is searched for in the PATH folders in the order shown in the output of echo $PATH, and the first match is the one that runs.
  • Reload the environment variable file by running this command:
    source /etc/environment
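
Because the order of the folders in PATH decides which executable runs, it can help to see every folder that provides a given program. This is a rough sketch, not part of the original setup; the function name providers is hypothetical, and empty PATH entries are simply skipped.

```python
import os


def providers(program, path=None):
    """Return every PATH folder containing an executable `program`, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, program)
        # Skip empty entries; keep folders holding an executable file of that name.
        if directory and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    print("JAVA_HOME =", os.environ.get("JAVA_HOME", "<not set>"))
    for program in ("python3", "jupyter"):
        print(program, "->", providers(program))
```

If the Anaconda bin folder is not the first entry listed for python3 and jupyter, another installation earlier in PATH is the one that will be used.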

Step 3: – Configure the Apache Spark file in the conf folder

  • cd /path/of/your/spark/folder/spark/conf/
  • cp
  • nano
  • Add these lines:
    export PYSPARK_PYTHON=/Path/of/anaconda//bin/python3
    export PYSPARK_DRIVER_PYTHON=/Path/of/anaconda//bin/jupyter
    JAVA_HOME=/path/of/java/usr/lib/jvm/java8-oracle

Step 4: – Configure the Apache Spark pyspark file in the bin folder

  • Go to line 85 and add this:
    export PYSPARK_DRIVER_PYTHON=jupyter
  • Go to line 86 and add this:
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
  • Save all files.
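
The two driver variables added to the pyspark script can, as an alternative, be supplied from the environment of the process that starts bin/pyspark. The sketch below illustrates that idea under assumptions: the function name build_launch is hypothetical, and the Spark folder path in the commented usage is a placeholder.

```python
import os


def build_launch(spark_home, base_env=None):
    """Return (command, environment) for starting bin/pyspark with Jupyter as the driver."""
    # Copy the environment so the caller's own variables are left unchanged.
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    command = [os.path.join(spark_home, "bin", "pyspark")]
    return command, env


# Example use (uncomment to actually launch; the path is a placeholder):
# import subprocess
# command, env = build_launch("/path/of/your/spark/folder/spark")
# subprocess.call(command, env=env)
```

Building the command and environment in a separate function keeps the launch itself optional, so the values can be inspected before anything is started.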

Step 5: – Launch pyspark in Jupyter

  • To launch pyspark in Jupyter, which is a web-browser-based version of IPython, use:
    PYSPARK_DRIVER_PYTHON_OPTS="notebook" /path/of/spark//spark-1/bin/pyspark
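
Once the notebook opens, the CSV data from Step 1 can serve as a quick check that Spark responds. This is a minimal sketch, assuming the notebook was started through bin/pyspark so that the SparkContext variable sc is already defined, as it is in the regular pyspark shell. The function name smoke_test and the CSV path are hypothetical.

```python
def smoke_test(sc, csv_path):
    """Read a CSV file as plain text with Spark and return (line_count, first_line)."""
    lines = sc.textFile(csv_path)  # RDD with one element per line of the file
    return lines.count(), lines.first()


# In a notebook cell (the path is a placeholder for your own CSV file):
# count, header = smoke_test(sc, "/path/to/your/data.csv")
# print(count, header)
```

If the call returns a line count and the first line of the file, the notebook is talking to a working Spark installation. A NameError for sc suggests the notebook was not started through bin/pyspark.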


