This document describes a data modeling project for a music streaming startup called Sparkify, using Postgres to create a database optimized for song play analysis.
Abstract
The project involves creating a Postgres database with a star schema for song play analysis, using data from two directories: JSON files for log data and JSON files for song metadata. The data modeling process includes creating fact and dimension tables, such as songplays, users, songs, artists, and time. The ETL pipeline extracts data from the JSON files, transforms it into the required format, and loads it into the respective tables in the Postgres database. The document also provides a brief overview of data modeling and the star schema.
Opinions
The project aims to create a database optimized for song play analysis, which will help Sparkify analyze user activity data.
The star schema is chosen for its simplicity and fast aggregation for analytics.
The ETL pipeline is designed to extract data from JSON files, transform it into the required format, and load it into the respective tables in the Postgres database.
The songplays table records song plays associated with log data, while the users, songs, artists, and time tables provide additional information about users, songs, artists, and timestamps.
The document provides examples of the JSON data used in the project and SQL queries for inserting data into the tables.
The project is part of the Data Engineering Nanodegree program and will be used for upcoming projects.
The GitHub repo for the project is provided for reference.
This is the first project in Data Engineering Nanodegree. For this project, we have a music streaming startup called Sparkify who wants to analyze the data they’ve been collecting on their app regarding:
songs
user activity
This data currently resides in two directories:
JSON files for log data — folder name is 11 for November 2018
Our task is to model user activity data to create a Postgres database optimized for song play analysis. We will do this by creating a database schema and ETL pipeline.
To follow along in upcoming sections, please follow its GitHub repo for instructions.
What is Data Modeling?
Data modeling is a high level abstraction that organizes data and how they relate to each other. — Wikipedia
Ever planned a trip or recorded payments in an Excel sheet? That is data modeling. Think about the process you went through when creating the first few rows. You decided what columns to keep. Maybe the number of columns changed when you recorded more observations and thought of another important feature. But you follow certain rules when creating the Excel sheet.
Same is the case with databases. In database design, what you record will eventually end up as a database for an information system. So, data modeling is also called database modeling.
Here we will be performing data modeling in Postgres for Sparkify. We start with creating facts and dimension tables for a star schema. We will use the files in song_data and log_data directories for it.
Data
The data used in this project will be used for upcoming projects also. So, it is better to understand what it represents.
Song Dataset
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song’s track ID. For example, here are file paths to two files in this dataset.
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
Log Dataset
The second dataset consists of log files in JSON format generated by this event simulator based on the songs in the dataset above. These simulate app activity logs from Sparkify app based on specified configurations.
The log files in the dataset we’ll be working with are partitioned by year and month. For example, here are file paths to two files in this dataset.
And below is an example of what the data in a log file, 2018–11–12-events.json, looks like.
If you would like to preview the JSON data within log_data directory, you will need to create a Pandas df to read the data.
Note: Make sure to set the argument lines = True in pd.read_json() . This will read every line of the JSON as a new row.
Schema for Song Play Analysis
Using the song and log datasets, we’ll create a star schema optimized for queries on song play analysis. But first let discuss in brief about star schema.
Star Schema
A star schema is the simplest style of data mart schema. The star schema consists of one or more fact tables referencing to any number of dimension tables. It has some advantages like fast aggregation for analytics, simple queries for JOINs, etc.
This includes the following tables:
Facts Table:
In data warehousing, a fact table consists of measurements, metrics or facts of a business process — Wikipedia
songplays — records in log data associated with song plays, i.e., records with page NextSong . This filter for page column specifies that user has played a song, like clicked on the nextsong button in the app.
A dimension table is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place and time. — Wikipedia
users — Followinginformation about users:
user_id, first_name, last_name, gender, level
2. songs — Following info aboutsongs:
song_id, title, artist_id, year, duration
3. artists — Artists information:
artist_id, name, location, latitude, longitude
4. time — Timestamp broken down into specific units:
start_time, hour, day, week, month, year, weekday
In order to create these tables, all we need to do is perform some transformation in the data which are already in song_data and log_data directory.
Process song data (song_data directory)
We will perform ETL on the files in song_data directory to create two dimensional tables: songs table and artists table.
This is what a songs file looks like:
For songs table, we’ll extract data for songs table by using only the columns corresponding to the songs table suggested in the star schema above. Similarly, we’ll select the appropriate columns for artists table.
Now insert the extract data into their respective tables.
Variables song_table_insert and artist_table_insert are SQL queries. These are given in sql_queries.py file.
Process log data (log_data directory)
We will perform ETL on the files in log_data directory to create the remaining two dimensional tables: time and users, as well as the songplays fact table.
This is what a single log file looks like:
For time table we have ts column in log files. We will parse it as a time stamp and use python’s datetime functions to create the remaining columns required for the table mentioned in the above schema.
For users table, we’ll extract the appropriate columns from log files as mentioned in the star schema above for users table.
For songplays table, we will require information from songs table, artists table and the original log files. Since the log files do not have song_id and artist_id, we need to use songs table and artists table for that. The song_select query finds the song_id and artist_id based on the title, artist_name, and duration of a song. For the remaining columns, we can select them from the log files.
Now insert the data into their respective tables.
Conclusion
That’s it. We created a Postgres database with the facts and dimension table for song_play analysis. We populated it with the entries from songs and events directory. Now our data is useful for some basic aggregation and analytics.
In the next blog, I will be discussing data modeling in Apache Cassandra — A NoSQL DBMS.