Install PySpark 3 on Google Colab the Easy Way
The manual method (the not-so-easy way) and the automated method (the easy way) for PySpark setup on Google Colab
This tutorial shows how to set up a Spark environment on Google Colab. It covers both the manual method (the not-so-easy way) and the automated method (the easy way).
Resources for this post:
- Video tutorial for this post on YouTube
- Python code is at the end of the post
- More video tutorials on Google Colab
- More blog posts on Google Colab tutorials
Let’s get started!
Method 1: Manual Installation — the Not-so-easy Way
First, let’s install Spark on Google Colab manually.
Step 1.1: Install Java, because Spark runs on the Java Virtual Machine (JVM).
# Install Java (OpenJDK 8, headless)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
Step 1.2: Download the latest version of Apache Spark by following the steps below:
- Go to https://spark.apache.org/downloads.html
- Choose a Spark release version and a package type. The default is the latest version. When this tutorial was published, the latest Spark release was 3.2.1, and the package type was Pre-built for Apache Hadoop 3.3 and later.
- Click the link for downloading Spark (the blue link on spark-3.2.1-bin-hadoop3.2.tgz), and you will be directed to a new web page.
- Copy the first link on the web page (https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz), which is below the sentence “We suggest the following site for your download:”.
- Download Spark from the copied link.
- Unzip the file
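The file name and link in the steps above follow a visible pattern, so they can be assembled from the release version and the package suffix. The sketch below is only an illustration: the helper name spark_download_info is hypothetical, the URL pattern is inferred from the single 3.2.1 link above and is assumed to hold for other releases, and /content is assumed to be the Colab working directory. The base URL is left as a parameter because older releases may be moved off the main download site.

```python
def spark_download_info(version, package,
                        base_url="https://dlcdn.apache.org/spark",
                        install_dir="/content"):
    """Assemble the download URL, archive name, and unpacked folder path.

    Hypothetical helper: the pattern is inferred from the 3.2.1 link in
    this tutorial and is an assumption for any other release.
    """
    folder = f"spark-{version}-bin-{package}"   # e.g. spark-3.2.1-bin-hadoop3.2
    archive = f"{folder}.tgz"                   # file that wget downloads
    return {
        "url": f"{base_url}/spark-{version}/{archive}",
        "archive": archive,
        "spark_home": f"{install_dir}/{folder}",  # folder created by tar xf
    }


info = spark_download_info("3.2.1", "hadoop3.2")
print(info["url"])
print(info["spark_home"])
```

For 3.2.1 this reproduces the link copied above, and the spark_home value matches the folder used as SPARK_HOME in Step 1.3.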
# Download Spark
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
# Unzip the file
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
Step 1.3: Set up the environment for Spark.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"
Step 1.4: Install and import the library for locating Spark.
# Install library for finding Spark
!pip install -q findspark
# Import the library
import findspark
# Initiate findspark
findspark.init()
# Check the location for Spark
findspark.find()
Output:
/content/spark-3.2.1-bin-hadoop3.2
Step 1.5: Start a Spark session, and check the session information.
# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark
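If the session fails to start, one thing worth checking is whether the JAVA_HOME and SPARK_HOME values from Step 1.3 point to directories that actually exist. The sketch below is one way to do that. The helper name missing_spark_paths is hypothetical and is not part of Spark or findspark.

```python
import os


def missing_spark_paths(env=os.environ, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or not an existing directory.

    Hypothetical helper for troubleshooting; `env` can be any mapping,
    which makes the check easy to try with made-up values.
    """
    problems = []
    for name in names:
        path = env.get(name)
        # Flag the variable if it is missing, empty, or not a real directory
        if not path or not os.path.isdir(path):
            problems.append(name)
    return problems


# An empty list means both variables point to existing directories
print(missing_spark_paths())
```

A non-empty result usually means a typo in the path or a Spark version in the folder name that differs from the one that was downloaded.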
Step 1.6: Test if Spark is installed successfully by importing a Spark library.
# Import a Spark function from library
from pyspark.sql.functions import col
There is no error message after running the code, indicating that Spark is successfully installed.
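A clean import only shows that the package can be found. As a slightly stronger check, the sketch below runs a tiny DataFrame through a filter that uses col. The function name run_smoke_test and the sample rows are made up for illustration, and the function expects the spark session created in Step 1.5.

```python
def run_smoke_test(spark):
    """Build a tiny DataFrame, filter it with col(), and return the matching names.

    Hypothetical helper; `spark` is an existing SparkSession.
    """
    # Imported inside the function so that defining it does not require PySpark
    from pyspark.sql.functions import col

    # Three made-up rows with two columns
    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 28), ("Cara", 41)],
        ["name", "age"],
    )
    # Keep rows where age > 30 and collect the names back to the driver
    rows = df.filter(col("age") > 30).select("name").collect()
    return sorted(row["name"] for row in rows)
```

Calling run_smoke_test(spark) in the notebook should return ['Alice', 'Cara'] if the session can actually execute a job.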
Method 2: Automatic Installation — the Easy Way
The second method of installing PySpark on Google Colab is to use pip install.
# Install pyspark
!pip install pyspark
After installation, we can create a Spark session and check its information.
# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark
We can also test the installation by importing a Spark library.
# Import a Spark function from library
from pyspark.sql.functions import col
Which method to use?
You might wonder which method to use for your project. I suggest using pip install (the easy way) in most cases, and considering the manual method only if you would like to customize certain settings for the installation.
More tutorials are available on GrabNGoInfo YouTube Channel and GrabNGoInfo.com.






