This context provides a tutorial on data modeling with Apache Cassandra, a NoSQL database management system, focusing on creating a database for a music streaming app to analyze user activity data.
Abstract
The content of this context revolves around data modeling with Apache Cassandra, a NoSQL database management system, and its advantages over relational database management systems for big data. It discusses the motivation behind using Apache Cassandra, specifically for a startup called Sparkify that wants to analyze data collected on songs and user activity on their new music streaming app. The tutorial covers setting up Apache Cassandra, creating a cluster, keyspace, and session, and modeling data based on the queries that will be performed. It also includes an ETL pipeline that migrates data from a directory of CSV files to the Apache Cassandra database. The tutorial emphasizes the importance of modeling data based on the queries that will be performed, as aggregation like GROUP BY and JOIN are highly discouraged in Cassandra. The tutorial concludes by mentioning that the entire events_data directory has been pre-processed and three tables have been created in the Apache Cassandra database.
Bullet points
Apache Cassandra is a NoSQL database management system that is built for high availability, scalability, and fault-tolerant systems.
Sparkify wants to analyze data collected on songs and user activity on their new music streaming app.
Apache Cassandra is used to create a database for this analysis.
The tutorial covers setting up Apache Cassandra, creating a cluster, keyspace, and session.
Data modeling in Apache Cassandra is based on the queries that will be performed.
Aggregation like GROUP BY and JOIN are highly discouraged in Cassandra.
The tutorial includes an ETL pipeline that migrates data from a directory of CSV files to the Apache Cassandra database.
The entire events_data directory has been pre-processed and three tables have been created in the Apache Cassandra database.
The tutorial concludes by mentioning that the next topic will be Cloud Data Warehousing.
In the previous post, I talked about data modeling with Postgres and how to create an ETL pipeline that migrates data from a directory of CSV files to a Postgres database.
In this post, I will dive into data modeling with Apache Cassandra, a NoSQL database management system.
So why do we need NoSQL anyway?
There are some drawbacks of the relational database management systems when it comes to big data because you need:
high availability
horizontal scaling
fast throughputs: reads and writes
flexibility
Apache Cassandra is built for high availability, scalability, and fault-tolerant systems.
Introduction
A startup called Sparkify wants to analyze the data they’ve been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app.
They’d like a data engineer to create an Apache Cassandra database which can create queries on song play data to answer the questions. Our role is to create a database for this analysis. We’ll be able to test the database by running queries given by the analytics team at Sparkify.
To follow along with the upcoming instruction, please view it’s GitHub repo.
Starting with Apache Cassandra
Before we perform ETL and create an Apache Cassandra database, we need to set it up for running on our machine. This is a three-step process:
Create a Cluster
This will make a connection to a Cassandra instance on our local machine. To establish a connection and begin executing queries, we create a session .
Create a Keyspace
This is similar to creating a database in Postgres where we specify the host and user privileges.
Set Keyspace
This is very similar to what we used when we connected to the Postgres database in the previous project and got a cursor to it like this:
Data Modeling
In Apache Cassandra, we model our data based on the queries we will perform. Aggregation like GROUP BY, JOIN are highly discouraged in Cassandra. This is because we shouldn’t scan the entire data because it is distributed on multiple nodes. It will slow down our system because sending all of that data from multiple nodes to a single machine will crash it.
Now we will create tables for the following queries:
Part 1: Give me the artist, song title and song’s length in the music app history that was heard during sessionId = 338, and itemInSession = 4
This creates the first table and we set the partition key and clustering column based on the last part of the question. These two columns together form the primary key. Here is an introduction on primary key, partition key and clustering columns.
Transformation and ETL part
Here we read every line of the CSV file and extract appropriate fields to INSERT data into the table discography.
Let’s see if the ETL works correctly.
In the same way, we create tables to optimize for the other two queries.
Part 2: Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182
This creates the second table and we set the Partition Key as userId, sessionIdand clustering column from the condition set in the question.
Transformation and ETL part
Here we read every line of the CSV file and extract appropriate fields to INSERT into the table user_artist_relation.
Let’s see if the ETL works correctly.
Part 3: Give me every user name (first and last) in my music app history who listened to the song ‘All Hands Against His Own’
This creates the third table and we set the Primary Key as the two columns: songand userId.
Transformation and ETL part
Here we read every line of the CSV file and extract appropriate fields to INSERT into the table user_info.
Let’s see if the ETL works correctly.
Conclusion
We pre-processed the entire events_data directory and created three tables in the Apache Cassandra database. As mentioned earlier, we model our tables based on the queries we will perform on them. So, to ensure maximum availability and fast throughputs, we make separate tables for different types of queries.
If you want to learn more about the Postgres data modeling prior to this, you can read more about it here. Next we will discuss on Cloud Data Warehousing.