Aman Ranjan Verma

Summary

The article provides a step-by-step guide for installing PySpark and Jupyter Notebook on a local macOS system.

Abstract

The article titled "Get started with PySpark on Jupyter Notebook: 3 easy steps" explains how to install PySpark and Jupyter Notebook on a local macOS system. It is organized into three steps: creating a Python virtual environment, installing Jupyter Notebook, and installing PySpark, and it also covers testing the installation and starting PySpark on Jupyter Notebook. The author emphasizes that Java is a prerequisite for PySpark and shows how to install it via Homebrew, a package manager for macOS, to simplify the process. The article concludes by recommending a virtual environment to segregate the different versions of libraries used in Python.

Opinions

  • The author suggests that installing PySpark and Jupyter Notebook in a virtual environment is a good practice to segregate different versions of libraries used in Python.
  • The author emphasizes the importance of installing Java as a prerequisite to using PySpark.
  • The author recommends using Homebrew to simplify the installation process of Java on macOS.
  • The author concludes that PySpark is a powerful tool for data analysis and processing, and using it in combination with Jupyter notebooks makes it even more user-friendly and interactive.

Get started with PySpark on Jupyter Notebook: 3 easy steps

Install PySpark and Jupyter Notebook on your local macOS machine and start using them in 3 easy steps

When I worked on my previous MacBook, I messed up my local setup by installing multiple versions of Spark, Python, Java, and notebooks. I had a really tough time debugging and running my Spark applications locally. That experience was the motivation behind this article.

I promised myself that the next time I formatted my MacBook or replaced it with a new one, I would do everything systematically. Now that I have a new MacBook Pro M1, this is how I installed PySpark and Jupyter Notebook locally in 3 easy steps. The three steps are:

  • Create a Python virtual environment
  • Install Jupyter Notebook
  • Install PySpark

Step 1: Create a Python virtual environment

Check for python3: I have Python 3.9.6; any other Python 3.x version should work.

% python3
Python 3.9.6 (default, Oct 18 2022, 12:41:40)
[Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Create a virtual environment named pyspark. The first command below creates the environment; you can verify it by running ls and finding a directory named pyspark. Then activate the environment by running the third command.

% python3 -m venv pyspark

% ls
pyspark

% source pyspark/bin/activate
(pyspark) % 
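
If you want to be sure the environment is active, the interpreter paths give it away. A quick check you can run from the activated environment (a minimal sketch, nothing PySpark-specific):

import sys

# When the env is active, both paths should point inside the pyspark
# directory created above, e.g. .../pyspark/bin/python3.
print(sys.executable)
print(sys.prefix)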

Step 2: Install Jupyter Notebook

In the same tab where you activated the env, run these two commands.

(pyspark) % python3 -m pip install --upgrade pip
(pyspark) % python3 -m pip install jupyter

You can test the Notebook installation by running this command; it will open the Notebook in your default browser.

(pyspark) % jupyter notebook
[Screenshot: the Jupyter Notebook home page in the browser]

You can stop the Notebook server with Ctrl+C, followed by y and Return.
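
You can also verify the install from Python itself. A small check (assuming the notebook package follows the usual convention of exposing a version string):

# Confirm the Notebook package is importable and print its version.
import notebook
print(notebook.__version__)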

Step 3: Install PySpark

Java is a prerequisite for PySpark. In a different terminal tab, check whether you have Java installed.

Java is not installed: if you see the prompt below, follow the Java installation guide in the next section.

% java -version
The operation couldn’t be completed. Unable to locate a Java Runtime.
Please visit http://www.java.com for information on installing Java.

Java is installed: if you see a version string like the one below, skip the Java installation guide.

% java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)

Prerequisites

Install Homebrew

There are multiple ways to install Java; we will take the easy route and use Homebrew (a package manager for macOS).

Check whether brew is installed by running brew help. If it is not, install it with the command below. At the end of the install log, follow the instructions under ==> Next steps: and run the echo commands.

% /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

==> Next steps:
- Run these three commands in your terminal to add Homebrew to your PATH:
    echo '# Set PATH, MANPATH, etc., for Homebrew.' >> /Users/amanranjanverma/.zprofile
    echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> /Users/amanranjanverma/.zprofile
    eval "$(/opt/homebrew/bin/brew shellenv)"
- Run brew help to get started
- Further documentation:
    https://docs.brew.sh

Install Java

Run brew install openjdk@11. At the end of the install log, go through the caveats printed under ==> openjdk@11.

% brew install openjdk@11

==> openjdk@11
For the system Java wrappers to find this JDK, symlink it with
  sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk

openjdk@11 is keg-only, which means it was not symlinked into /opt/homebrew,
because this is an alternate version of another formula.

If you need to have openjdk@11 first in your PATH, run:
  echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc

For compilers to find openjdk@11 you may need to set:
  export CPPFLAGS="-I/opt/homebrew/opt/openjdk@11/include"
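
Once Java is installed and on your PATH (per the caveats above), you can confirm that PySpark will be able to find it. A minimal sketch from Python:

import subprocess

# Note: `java -version` writes to stderr, not stdout.
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr.strip())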

Install PySpark

Come back to the tab where you activated the environment and run this command.

(pyspark) % python3 -m pip install pyspark

Successfully installed py4j-0.10.9.5 pyspark-3.3.1
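
You can verify the installation right away; the version printed should match the pip log above:

# Quick import check for PySpark.
import pyspark
print(pyspark.__version__)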

Start using PySpark on Notebook

(pyspark) % jupyter notebook

You can create a new notebook with the Python 3 kernel and use this code to test the installation.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, max

# Create (or reuse) a local SparkSession.
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

# Build the DataFrame and inspect the schema and rows.
schema = ["employee_name", "department", "state", "salary", "age", "bonus"]
df = spark.createDataFrame(data=simpleData, schema=schema)
df.printSchema()
df.show(truncate=False)
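
The col, sum, avg, and max imports come in handy if you want to push the test a little further. A small aggregation sketch over the same DataFrame:

# Aggregate salaries per department using the functions imported above.
df.groupBy("department") \
  .agg(sum("salary").alias("total_salary"),
       avg("salary").alias("avg_salary"),
       max("bonus").alias("max_bonus")) \
  .where(col("total_salary") > 100000) \
  .show(truncate=False)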

Why did we install PySpark in a virtual environment?

This segregates all the different versions of libraries we use in Python. If you mess things up, you can simply delete the environment directory (pyspark in this case) and start fresh. There is one overhead, though: every time you want to use a notebook or the PySpark shell, you have to activate the environment first, and once you are done you can deactivate it with the deactivate command.
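
The day-to-day workflow then looks like this:

% source pyspark/bin/activate
(pyspark) % jupyter notebook
(pyspark) % deactivate
%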

You can also work in the PySpark shell instead of a notebook.
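
A minimal session sketch: run pyspark from the activated environment, and the shell starts with a SparkSession already bound to the name spark, so you can query right away.

(pyspark) % pyspark
>>> spark.range(5).show()
>>> exit()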

In conclusion, PySpark is a powerful tool for data analysis and processing, and using it in combination with Jupyter notebooks makes it even more user-friendly and interactive. By following the three simple steps outlined in this blog, you can easily get started with PySpark on Jupyter notebooks and begin harnessing its capabilities for your own data projects. Whether you are a data scientist, data engineer, or simply someone interested in analyzing and manipulating large datasets, PySpark is a valuable tool to have in your toolkit. So why wait? Follow these steps and get started with PySpark on Jupyter notebooks today!

If you are looking to prepare for a Data Engineering interview, do check out my interview blog series.
