Get started with PySpark on Jupyter Notebook: 3 easy steps
Install PySpark and Jupyter Notebook on your local macOS machine and start using them in 3 easy steps

Earlier, on my previous MacBook, I messed up my local setup by installing multiple versions of Spark, Python, Java, and notebooks, and I had a really tough time debugging and running my Spark applications locally. That was the motivation behind this article.
I promised myself that the next time I formatted my MacBook or replaced it with a new one, I would set everything up in a systematic manner. Now that I have a new MacBook Pro M1, this is how I installed PySpark and Jupyter Notebook locally in 3 easy steps. The three steps are:
- Create a Python virtual environment
- Install Jupyter Notebook
- Install PySpark
Step 1: Create a Python virtual environment
Check for python3: I have Python 3.9.6, but any other Python 3.x.x version should work.
% python3
Python 3.9.6 (default, Oct 18 2022, 12:41:40)
[Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Create a virtual environment named pyspark. The 1st cmd creates the environment; you can check this by running ls and finding a directory named pyspark. Next, activate your env by running the 3rd cmd.
% python3 -m venv pyspark
% ls
pyspark
% source pyspark/bin/activate
(pyspark) %
Step 2: Install Jupyter Notebook
In the same tab where you activated the env, run these two commands.
(pyspark) % python3 -m pip install --upgrade pip
(pyspark) % python3 -m pip install jupyter
You can test the notebook installation by running this cmd. This will open the notebook in your default browser.
(pyspark) % jupyter notebook

You can close the notebook server with ctrl+c, followed by Y and return.
Step 3: Install PySpark
Java is a prerequisite to using PySpark. In a different tab of the terminal, check if you have Java installed.
If Java is not installed, you will see a prompt like this; follow the Java installation guide below.
% java -version
The operation couldn’t be completed. Unable to locate a Java Runtime.
Please visit http://www.java.com for information on installing Java.
If Java is already installed, you will see something like this and can skip the Java installation guide.
% java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
Prerequisites
Install Homebrew
There are multiple ways to install Java; we will follow the easy route and use Homebrew (the package manager for macOS).
Check if brew is installed with brew help. If not, install it by running this cmd. At the end of the run log, follow the instructions under ==> Next steps: and run the echo cmds.
% /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
==> Next steps:
- Run these three commands in your terminal to add Homebrew to your PATH:
echo '# Set PATH, MANPATH, etc., for Homebrew.' >> /Users/amanranjanverma/.zprofile
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> /Users/amanranjanverma/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"
- Run brew help to get started
- Further documentation:
https://docs.brew.sh
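After running the echo cmds, you can sanity-check the install by asking brew for its version in a new terminal tab (the exact version printed will depend on when you install it):
% brew --version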
Install Java
Run the cmd brew install openjdk@11. At the end of the run log, go through the instructions under ==> openjdk@11.
% brew install openjdk@11
==> openjdk@11
For the system Java wrappers to find this JDK, symlink it with
sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
openjdk@11 is keg-only, which means it was not symlinked into /opt/homebrew,
because this is an alternate version of another formula.
If you need to have openjdk@11 first in your PATH, run:
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
For compilers to find openjdk@11 you may need to set:
export CPPFLAGS="-I/opt/homebrew/opt/openjdk@11/include"
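To apply those caveats, you can run the two commands from the output above and then re-check the Java version. This is just the consolidated sequence for an Apple Silicon Mac with the default Homebrew prefix (/opt/homebrew); adjust the paths if yours differ.
% sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
% echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
% source ~/.zshrc
% java -version
openjdk version "11.0.17" 2022-10-18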
Install PySpark
Come back to the previous tab where you activated the env and run this cmd.
(pyspark) % python3 -m pip install pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.1
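Before opening a notebook, you can do a quick sanity check from the same env; the version printed should match the one pip just reported.
(pyspark) % python3 -c "import pyspark; print(pyspark.__version__)"
3.3.1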
Start using PySpark on Notebook
(pyspark) % jupyter notebook
You can create a new notebook with the Python 3 kernel and use this code to test the installation.
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, max

# Create (or reuse) a local SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Sample employee data
simpleData = [("James", "Sales", "NY", 90000, 34, 10000),
    ("Michael", "Sales", "NY", 86000, 56, 20000),
    ("Robert", "Sales", "CA", 81000, 30, 23000),
    ("Maria", "Finance", "CA", 90000, 24, 23000),
    ("Raman", "Finance", "CA", 99000, 40, 24000),
    ("Scott", "Finance", "NY", 83000, 36, 19000),
    ("Jen", "Finance", "NY", 79000, 53, 15000),
    ("Jeff", "Marketing", "CA", 80000, 25, 18000),
    ("Kumar", "Marketing", "NY", 91000, 50, 21000)
]
schema = ["employee_name", "department", "state", "salary", "age", "bonus"]

# Build a DataFrame and inspect it
df = spark.createDataFrame(data=simpleData, schema=schema)
df.printSchema()
df.show(truncate=False)
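If the schema and the rows print correctly, the setup is working. As a quick follow-up, here is a small aggregation that actually uses the functions imported above; it is only an illustrative sketch on the same sample data.
# Average salary plus total and maximum bonus per department
df.groupBy("department") \
    .agg(avg("salary").alias("avg_salary"),
         sum("bonus").alias("total_bonus"),
         max("bonus").alias("max_bonus")) \
    .show(truncate=False)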


Why did we install PySpark in a virtual environment?
This is to segregate the different versions of libraries that we use in Python. If you mess things up, you can just delete the env directory (in this case pyspark) and start fresh. There is one overhead though: every time you want to use a notebook or the PySpark shell, you will have to activate the env, and once you are done with your work, you can deactivate it with the cmd % deactivate.
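So a typical working session looks like this (assuming you are in the directory where the pyspark env was created):
% source pyspark/bin/activate
(pyspark) % jupyter notebook    # work in the notebook, then stop the server with ctrl+c
(pyspark) % deactivate
%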
You can also work in the PySpark shell like this:
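With the env active, the pyspark command drops you into an interactive shell where a SparkSession is already available as spark (startup banner omitted here):
(pyspark) % pyspark
>>> spark.range(5).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+
>>> exit()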

In conclusion, PySpark is a powerful tool for data analysis and processing, and using it in combination with Jupyter notebooks makes it even more user-friendly and interactive. By following the three simple steps outlined in this blog, you can easily get started with PySpark on Jupyter notebooks and begin harnessing its capabilities for your own data projects. Whether you are a data scientist, data engineer, or simply someone interested in analyzing and manipulating large datasets, PySpark is a valuable tool to have in your toolkit. So why wait? Follow these steps and get started with PySpark on Jupyter notebooks today!
Additionally, you might be interested in learning:

If you are looking to prepare for a Data Engineering interview, do check out my interview blog series: