How to read JSON files from S3 using PySpark and the Jupyter notebook

Summary

This is a step-by-step guide to reading JSON files from Amazon S3 with PySpark in a Jupyter notebook.

Abstract

The article outlines a concise tutorial for users with prerequisite knowledge of PySpark and Jupyter. It begins by instructing the reader to ensure the Hadoop AWS package is available upon loading Spark, followed by integrating PySpark with Jupyter notebooks. The guide then details the process of accessing AWS credentials using the configparser package to read from the standard AWS credentials file. Subsequently, it explains how to pass these credentials to the Hadoop configuration within the Spark session. The final step demonstrates how to read JSON data from an S3 bucket and display it using PySpark's DataFrame API. Additionally, the article references another tutorial for reading parquet data from S3 using the S3A protocol.

Opinions

The author assumes the reader has prior experience with PySpark and Jupyter, suggesting a target audience of data professionals or enthusiasts.
The guide emphasizes the importance of proper configuration for Hadoop AWS integration, indicating a common pitfall for users.
The use of configparser to manage AWS credentials implies a preference for leveraging existing AWS tooling for security and convenience.
By providing code snippets for each step, the author conveys a practical approach to problem-solving, catering to readers who prefer a hands-on learning style.
The mention of a related tutorial on reading parquet data suggests that the author values comprehensive learning resources and cross-referencing related topics for a more holistic understanding.

Step 3

We need AWS credentials to access the S3 bucket. We can use the configparser package to read them from the standard AWS credentials file, ~/.aws/credentials.

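import configparser
import os  # needed for os.path.expanduser below

# Assumption: the profile to read. "default" is the usual section name
# in the AWS credentials file; change it to match your own setup.
aws_profile = "default"

# Parse the INI-style credentials file from the user's home directory.
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))

access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")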
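One thing to watch for in the step above is that config.read() silently skips files it cannot find, so a wrong path only surfaces later as a missing-section error. The following is a minimal sketch of a wrapper that checks for this. The function name load_aws_credentials, the sample file, and the placeholder values EXAMPLE_ID and EXAMPLE_SECRET are assumptions made for illustration and are not part of the original tutorial.

```python
import configparser
import os
import tempfile


def load_aws_credentials(path, profile="default"):
    """Return (access_key_id, secret_access_key) for one profile.

    Hypothetical helper: the name and error handling are illustrative.
    """
    config = configparser.ConfigParser()
    # config.read() returns the list of files it managed to parse and
    # silently skips missing ones, so an empty list means nothing was loaded.
    if not config.read(os.path.expanduser(path)):
        raise FileNotFoundError(f"no credentials file at {path}")
    if not config.has_section(profile):
        raise KeyError(f"profile {profile!r} not found in {path}")
    return (
        config.get(profile, "aws_access_key_id"),
        config.get(profile, "aws_secret_access_key"),
    )


# Demonstration against a throwaway file with placeholder values,
# so the sketch runs without touching real credentials.
sample = (
    "[default]\n"
    "aws_access_key_id = EXAMPLE_ID\n"
    "aws_secret_access_key = EXAMPLE_SECRET\n"
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "credentials")
    with open(sample_path, "w") as fh:
        fh.write(sample)
    access_id, access_key = load_aws_credentials(sample_path)
```

In a notebook the same helper could be pointed at "~/.aws/credentials" with whichever profile name is in use.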

Step 4

We can start the Spark session and pass the AWS credentials to the Hadoop configuration:

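from pyspark.sql import SparkSession

# Assumption: no session exists yet. getOrCreate() reuses one if it does.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Access the underlying Hadoop configuration through the Java context.
hadoop_conf = sc._jsc.hadoopConfiguration()

# Register the s3n filesystem and hand it the credentials read in Step 3.
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)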
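The three settings above can also be collected into a dictionary and applied in a loop, which makes it easier to switch between the s3n scheme used here and the S3A protocol mentioned in the summary. This is a minimal sketch and not the author's method: the helper names s3_hadoop_settings and apply_settings and the FakeHadoopConf stand-in are hypothetical, and the S3A keys (fs.s3a.access.key, fs.s3a.secret.key) are the property names used by Hadoop's S3A connector rather than anything taken from this tutorial.

```python
def s3_hadoop_settings(access_id, access_key, scheme="s3n"):
    """Build the Hadoop properties for one S3 filesystem scheme.

    Hypothetical helper: returns a plain dict so it can be inspected
    before anything is written to a live configuration.
    """
    if scheme == "s3n":
        # Same keys as in Step 4 of the tutorial.
        return {
            "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
            "fs.s3n.awsAccessKeyId": access_id,
            "fs.s3n.awsSecretAccessKey": access_key,
        }
    if scheme == "s3a":
        # Assumption: property names from Hadoop's S3A connector.
        return {
            "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
            "fs.s3a.access.key": access_id,
            "fs.s3a.secret.key": access_key,
        }
    raise ValueError(f"unsupported scheme: {scheme!r}")


def apply_settings(hadoop_conf, settings):
    """Call hadoop_conf.set(key, value) for every entry."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)


class FakeHadoopConf:
    """Stand-in with the same set() method, so the sketch runs without Spark."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value


conf = FakeHadoopConf()
apply_settings(conf, s3_hadoop_settings("EXAMPLE_ID", "EXAMPLE_SECRET"))
```

With a live session, the same dictionary could be applied to sc._jsc.hadoopConfiguration() in place of the stand-in, before calling spark.read.json on an s3n:// path.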
