Summary

This web content provides a concise guide on setting up a Jupyter Notebook with Amazon EMR to run Apache Spark jobs efficiently for data analysis.

Abstract

The article titled "Setup Jupyter Notebook with EMR to run spark job in 5 minutes" is a tutorial aimed at quickly enabling users to perform ad-hoc data analysis using Apache Spark on AWS EMR through Jupyter Notebooks. It begins by introducing the Jupyter Notebook as a server-client application for running notebook documents via a web browser, Apache Spark as a powerful engine for data processing, and AWS EMR as a tool for big data analysis on AWS. The guide emphasizes the synergy between these tools to provide an efficient environment for data science workloads. The step-by-step instructions cover logging in as an IAM user, searching for EMR on the AWS console, creating a new notebook, setting up a cluster, and finally running a sample Spark program to calculate Pi. The process is designed to be straightforward, allowing users to get started with minimal configuration.

Opinions

The tutorial promotes the use of Jupyter Notebooks with AWS EMR and Apache Spark for their combined efficiency in data analysis tasks.
It suggests that using an IAM user is a best practice for security and management purposes, as opposed to using the root account.
The guide conveys that AWS EMR simplifies the process of setting up a Spark cluster, making it accessible to users who may not have extensive experience with cluster computing.
The inclusion of a sample Spark program to calculate Pi demonstrates the practicality of using Jupyter Notebooks for real-world data science problems.
The article implies that the integration of these technologies provides a user-friendly platform for running Spark jobs, which can be particularly beneficial for ad-hoc data analysis.

Setup Jupyter Notebook with EMR to run spark job in 5 minutes

We often want to do ad-hoc data analysis using Spark. This tutorial shows how to quickly get started with a Jupyter notebook on EMR to run Spark jobs.

Let’s look briefly at what Jupyter Notebook, Apache Spark and AWS EMR are:

The Jupyter Notebook App is a server-client application that allows editing and running notebook documents via a web browser. It can be run on a local desktop with no internet access, or installed on a remote server and accessed through the internet.

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Amazon EMR (previously known as Amazon Elastic MapReduce) is an Amazon Web Services (AWS) tool for big data processing and analysis. Amazon markets EMR as an expandable, low-configuration service that provides an alternative to running on-premises cluster computing.

Together, Jupyter Notebook, Apache Spark and AWS EMR give users an efficient and quick way to do data analysis and run data science workloads.

Let’s get started:

1. Open the sign-in page for your IAM user.

The URL has the following form:

https://<<account-ID-or-alias>>.signin.aws.amazon.com/console

You can find the sign-in URL for an account on the Dashboard page in the IAM console.

Follow the steps below as an IAM user, not as the root account.

If you have not yet created an IAM user for your account, follow the steps at this link: https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started_create-admin-group.html

2. Log in.

3. Search for EMR in the search bar at the top left.

4. Navigate to Notebooks and choose Create notebook.

5. Fill in the following properties:

Name: Any name you wish

Cluster: Choose Create a cluster and keep the default options

Press the Create button.

6. Your notebook and the underlying EMR cluster are now being set up.

7. Once they are ready, click “Open in JupyterLab”. You will see a brand-new notebook.

8. Double-click <>.ipynb and select “PySpark” as the kernel.

9. The notebook is ready to use.

10. Let's run our first Spark program in it:

Below is the code for calculating pi:

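from random import random
from operator import add

# Number of Spark partitions (the "pi [partitions]" argument, set directly here)
# and the total number of random points to sample.
partitions = 1
n = 100000 * partitions

def f(_):
    # Pick a random point in the square [-1, 1] x [-1, 1];
    # return 1 if it lands inside the unit circle, otherwise 0.
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

# `spark` is the SparkSession that the PySpark kernel is expected to provide.
# Sample n points in parallel and count how many land inside the circle.
count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))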

Copy and paste it into the notebook and run it.

Voilà! We just successfully ran our first program in the notebook.
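
If you are curious why the Spark program above prints something close to pi, here is a minimal sketch of the same sampling idea in plain Python, with no Spark or EMR involved. The function name estimate_pi and the seed parameter are made up for this illustration and are not part of the Spark code. A random point in the square from -1 to 1 lands inside the unit circle with probability pi/4, so four times the fraction of hits approximates pi.

```python
import math
import random


def estimate_pi(n, seed=None):
    """Estimate pi from n random points in the square [-1, 1] x [-1, 1].

    Hypothetical helper for illustration; it mirrors the map/reduce logic
    of the Spark program but runs in a single local loop.
    """
    rng = random.Random(seed)  # seeded generator so runs can be repeated
    inside = 0
    for _ in range(n):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        if x ** 2 + y ** 2 <= 1:  # the point falls inside the unit circle
            inside += 1
    # area(circle) / area(square) = pi / 4, so pi is roughly 4 * inside / n
    return 4.0 * inside / n


if __name__ == "__main__":
    # More samples generally give a closer estimate.
    for n in (1_000, 100_000):
        estimate = estimate_pi(n, seed=42)
        error = abs(estimate - math.pi)
        print("n=%d: Pi is roughly %f (error %.4f)" % (n, estimate, error))
```

In the Spark version, the loop body is the function f, map applies it to every element of the range, and reduce(add) plays the role of the inside counter, with the work split across the partitions.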