Amy @GrabNGoInfo

Summary

The article provides a comprehensive guide on setting up PySpark 3 on Google Colab, detailing both manual and automated installation methods.

Abstract

The tutorial covers two methods for installing PySpark on Google Colab: a manual approach that involves downloading Java, Apache Spark, setting up the environment, and configuring Spark settings; and an automated approach using pip to install PySpark directly. The manual method is described as "not-so-easy" and includes steps for downloading Java, the Spark package, unzipping the file, setting environment variables, and initializing a Spark session. The automated method is deemed "the easy way" and simply requires running a pip install command. The article also provides resources such as video tutorials, links to the author's Medium posts, and a YouTube channel for further learning. The author suggests using the pip install method for most cases unless specific customization is needed.

Opinions

  • The author, Amy, recommends the pip install method for its simplicity and suitability for most projects.
  • The manual installation method is considered more complex and is suggested for those who need to customize their Spark setup.
  • Amy provides a referral link to join Medium, indicating a preference for readers to support her work through membership.
  • The inclusion of multimedia resources like YouTube tutorials suggests the author values diverse learning mediums and encourages readers to engage with content beyond the written article.
  • The author emphasizes the ease of the automated installation method, potentially to encourage readers who may be intimidated by the manual process.

Install PySpark 3 on Google Colab the Easy Way

The manual method (the not-so-easy way) and the automated method (the easy way) for PySpark setup on Google Colab

Photo by Dawid Zawiła on Unsplash

This tutorial shows how to set up a Spark environment on Google Colab. It covers both the manual method (the not-so-easy way) and the automated method (the easy way).

Let’s get started!

Method 1: Manual Installation — the Not-so-easy Way

First, let’s install Spark on Google Colab manually.

Step 1.1: Install Java, because Spark runs on the Java Virtual Machine (JVM).

# Install a headless OpenJDK 8 build, which provides the JVM
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
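
Because the install command above discards its output, it can be worth confirming that Java is actually available before moving on. The sketch below is one way to do that, assuming the java executable ends up on the PATH after the apt-get step. The helper name java_version_banner is made up for this illustration and is not part of any library.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the text printed by `java -version`, or None if java is not on the PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


print(java_version_banner())
```

If this prints None, the Java installation did not succeed and the later steps will fail when Spark tries to start the JVM.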

Step 1.2: Download the latest version of Apache Spark by following the steps below:

  1. Go to https://spark.apache.org/downloads.html
  2. Choose a Spark release version and a package type. The default is the latest version. At the time this tutorial was published, the latest Spark release was 3.2.1, and the default package type was Pre-built for Apache Hadoop 3.3 and later.
  3. Click the link for downloading Spark (the blue link on spark-3.2.1-bin-hadoop3.2.tgz), and you will be directed to a new web page.
  4. Copy the first link on the web page (https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz), which is below the sentence “We suggest the following site for your download:”.
  5. Download Spark from the copied link.
  6. Unzip the file.

# Download Spark
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
# Unzip the file
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
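
The download link is tied to one specific release, so it helps to see how it is put together. The sketch below builds the file name and URL from a Spark version and the Hadoop label used in the file name. The function name spark_download_urls is hypothetical. The archive.apache.org address is an assumption based on how Apache usually keeps older releases once they leave the main download site, so treat it as a fallback to try rather than a guaranteed location.

```python
def spark_download_urls(spark_version, hadoop_label):
    """Build the tarball name plus the primary and archive download URLs."""
    file_name = f"spark-{spark_version}-bin-hadoop{hadoop_label}.tgz"
    # Link format copied from the download page in step 4
    primary = f"https://dlcdn.apache.org/spark/spark-{spark_version}/{file_name}"
    # Assumed fallback location for releases that are no longer on the main site
    archive = f"https://archive.apache.org/dist/spark/spark-{spark_version}/{file_name}"
    return file_name, primary, archive


file_name, primary, archive = spark_download_urls("3.2.1", "3.2")
print(file_name)
print(primary)
print(archive)
```

If wget with the first link produces no file, trying the archive link with the same file name is a reasonable next step.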

Step 1.3: Set up the environment for Spark.

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = '/content/spark-3.2.1-bin-hadoop3.2'
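
A typo in either path only shows up later, when the Spark session fails to start. As a small safeguard, the sketch below checks that both variables point to existing directories. The helper name missing_home_dirs is made up for this example.

```python
import os


def missing_home_dirs(names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to an existing directory."""
    problems = []
    for name in names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            problems.append(name)
    return problems


# An empty list means both directories were found
print(missing_home_dirs())
```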

Step 1.4: Install and import the library for locating Spark.

# Install library for finding Spark
!pip install -q findspark
# Import the library
import findspark
# Initialize findspark so that pyspark can be imported
findspark.init()
# Check the location for Spark
findspark.find()

Output:

/content/spark-3.2.1-bin-hadoop3.2
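
Roughly speaking, findspark makes the pyspark package that ships inside the Spark folder importable by adding the right directories to Python's search path. The sketch below is a simplified approximation of that idea, not findspark's actual code. The helper name pyspark_import_paths is hypothetical, and the folder layout (python and python/lib/py4j-*-src.zip) is an assumption based on the standard Spark download.

```python
import glob
import os


def pyspark_import_paths(spark_home):
    """List the directories and zip files that would need to be on sys.path."""
    python_dir = os.path.join(spark_home, "python")
    # Spark bundles py4j, the Python-to-JVM bridge, as a zip file (assumed location)
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


print(pyspark_import_paths("/content/spark-3.2.1-bin-hadoop3.2"))
```

This is why the manual method needs findspark, while the pip method does not: pip places pyspark directly where Python already looks for packages.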

Step 1.5: Start a Spark session, and check the session information.

# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark

SparkSession Information (image from GrabNGoInfo.com)

Step 1.6: Test whether Spark was installed successfully by importing a function from the PySpark library.

# Import a function from the PySpark library
from pyspark.sql.functions import col

The code runs without an error message, which indicates that Spark was installed successfully.
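
An import only shows that the Python package can be found. For a slightly stronger check, we can run a tiny job that actually uses the JVM. The sketch below is one way to do this. The function name spark_smoke_test and the sample data are made up for illustration, and the function returns None instead of failing when PySpark or Java is not available.

```python
import importlib.util
import shutil


def spark_smoke_test():
    """Run a tiny Spark job and return the row count, or None if Spark cannot run here."""
    if importlib.util.find_spec("pyspark") is None or shutil.which("java") is None:
        return None
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Reuses the existing session if one was already created
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])
    # Keep the rows with id greater than 1 and count them
    return df.filter(col("id") > 1).count()


print(spark_smoke_test())
```

On a working setup, the filter keeps two of the three rows, so the function returns 2.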

Method 2: Automatic Installation — the Easy Way

The second method of installing PySpark on Google Colab is to use pip install.

# Install pyspark
!pip install pyspark
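
To see which version pip installed, we can query the package metadata with the Python standard library. This is a minimal sketch, and the helper name installed_version is made up for this example. If a project needs a specific release, pip's usual pinning syntax should apply here too, for example pip install pyspark==3.2.1.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package as a string, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version("pyspark"))
```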

After installation, we can create a Spark session and check its information.

# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark

We can also test the installation by importing a function from the PySpark library.

# Import a function from the PySpark library
from pyspark.sql.functions import col

Which method to use?

You might wonder which method to use for your project. I suggest using pip install (the easy way) in most cases, and considering the manual method only if you need to customize the installation, for example by choosing a specific Spark release or package type.

More tutorials are available on the GrabNGoInfo YouTube channel and GrabNGoInfo.com.
