Bogdan Cojocar

Summary

The provided content is a step-by-step guide on how to read JSON files from Amazon S3 using PySpark within a Jupyter notebook environment.

Abstract

The article outlines a concise tutorial for users with prerequisite knowledge of PySpark and Jupyter. It begins by instructing the reader to ensure the Hadoop AWS package is available upon loading Spark, followed by integrating PySpark with Jupyter notebooks. The guide then details the process of accessing AWS credentials using the configparser package to read from the standard AWS credentials file. Subsequently, it explains how to pass these credentials to the Hadoop configuration within the Spark session. The final step demonstrates how to read JSON data from an S3 bucket and display it using PySpark's DataFrame API. Additionally, the article references another tutorial for reading parquet data from S3 using the S3A protocol.

Opinions

  • The author assumes the reader has prior experience with PySpark and Jupyter, suggesting a target audience of data professionals or enthusiasts.
  • The guide emphasizes the importance of proper configuration for Hadoop AWS integration, indicating a common pitfall for users.
  • The use of configparser to manage AWS credentials implies a preference for leveraging existing AWS tooling for security and convenience.
  • By providing code snippets for each step, the author conveys a practical approach to problem-solving, catering to readers who prefer a hands-on learning style.
  • The mention of a related tutorial on reading parquet data suggests that the author values comprehensive learning resources and cross-referencing related topics for a more holistic understanding.

How to read JSON files from S3 using PySpark and the Jupyter notebook

This is a quick step-by-step tutorial on how to read JSON files from S3 using PySpark in a Jupyter notebook.

This guide assumes that PySpark and Jupyter are installed on your system. Please follow this Medium post on how to install and configure them.

Step 1

First, we need to make sure the hadoop-aws package is available when we load Spark:

import os
# Set this before Spark starts so the hadoop-aws package is fetched when the JVM launches.
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
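
The hadoop-aws version generally has to match the Hadoop version your Spark distribution was built against, so 2.7.3 may not be right for every setup. A minimal sketch of building the same argument string from a version value, where `build_submit_args` is a hypothetical helper name and not part of PySpark:

```python
import os

def build_submit_args(hadoop_aws_version):
    """Return a PYSPARK_SUBMIT_ARGS value that pulls in hadoop-aws."""
    package = "org.apache.hadoop:hadoop-aws:" + hadoop_aws_version
    # The trailing "pyspark-shell" token is needed when this variable is set by hand.
    return "--packages=" + package + " pyspark-shell"

# 2.7.3 is the version used in this guide; adjust it to your Hadoop build (assumption).
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args("2.7.3")
```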

Step 2

Next, we need to make PySpark available in the Jupyter notebook:

import findspark
findspark.init()
from pyspark.sql import SparkSession

Step 3

We need the AWS credentials in order to access the S3 bucket. We can use the configparser package to read them from the standard AWS credentials file.

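import configparser
# Profile name inside ~/.aws/credentials; "default" is assumed here, change it to match yours.
aws_profile = "default"
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")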
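
Note that `config.read` silently skips a file it cannot find, so a wrong path only shows up later as a missing-section error. A minimal sketch of wrapping the lookup in a function with an explicit profile check; `read_aws_credentials` is a hypothetical helper, and the demo runs against a throwaway file with fake values rather than your real credentials:

```python
import configparser
import os
import tempfile

def read_aws_credentials(path, profile="default"):
    """Return (access_key_id, secret_access_key) for one profile of an AWS-style credentials file."""
    config = configparser.ConfigParser()
    # read() ignores missing files, so the profile is checked explicitly below.
    config.read(os.path.expanduser(path))
    if not config.has_section(profile):
        raise KeyError("profile %r not found in %s" % (profile, path))
    return (config.get(profile, "aws_access_key_id"),
            config.get(profile, "aws_secret_access_key"))

# Demo on a temporary file with fake values, so nothing real is read.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "credentials")
    with open(sample, "w") as f:
        f.write("[default]\n"
                "aws_access_key_id = FAKE_ID\n"
                "aws_secret_access_key = FAKE_SECRET\n")
    access_id, access_key = read_aws_credentials(sample)
```

In the notebook you would call it as `read_aws_credentials("~/.aws/credentials", aws_profile)`.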

Step 4

We can start the Spark session and pass the AWS credentials to the Hadoop configuration:

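# Start (or reuse) the Spark session
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# Hand the S3 filesystem implementation and the credentials to Hadoop
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)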
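
If you prefer to keep the property names in one place, the three settings can be collected in a dict and applied in a loop. This is only a sketch of that idea: `s3n_settings` and `apply_settings` are hypothetical helper names, and the values below are placeholders:

```python
def s3n_settings(access_id, access_key):
    """Collect the three s3n Hadoop properties used in this step in one dict."""
    return {
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": access_id,
        "fs.s3n.awsSecretAccessKey": access_key,
    }

def apply_settings(hadoop_conf, settings):
    """Call hadoop_conf.set(key, value) for every entry."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)

# With a live session this would be:
# apply_settings(spark.sparkContext._jsc.hadoopConfiguration(),
#                s3n_settings(access_id, access_key))
settings = s3n_settings("FAKE_ID", "FAKE_SECRET")  # placeholder values
```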

Step 5

Finally, we can read the data and display it:

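# The path needs the bucket name followed by the object key; both are placeholders here.
df = spark.read.json("s3n://your_bucket/your_file.json")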
df.show()
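
By default `spark.read.json` expects JSON Lines input, meaning one complete JSON object per line. A minimal sketch of producing such a file with the standard library, using made-up records and a hypothetical file name:

```python
import json
import os
import tempfile

records = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]  # made-up sample data

def to_json_lines(rows):
    """Serialize each row as one compact JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows) + "\n"

path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w") as f:
    f.write(to_json_lines(records))

# Each line parses on its own, which is what the default reader relies on.
with open(path) as f:
    parsed = [json.loads(line) for line in f]
```

Once uploaded to the bucket, a file like this can be read with the call above. For a single pretty-printed document that spans several lines, `spark.read.json(path, multiLine=True)` is the alternative (available in Spark 2.2 and later).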

Another tutorial on reading parquet data on S3A with Spark can be found here.
