Four Data Engineering Projects That Look Great on your CV
Data pipelines that will turn you into a decorated data professional
In this story, I would like to talk about data engineering career paths and data projects that look great on any CV. If you are an aspiring data practitioner who is not only willing to learn new tools and techniques but also aiming to build your own portfolio of data projects, this article is for you. During my 15+ years in data and analytics, I have seen both good and bad CVs showcasing data engineering skills. The data engineering projects you were involved in or responsible for are the ultimate compass: they tell a recruiter about your experience, how good you are and why they should hire you. This story is about how to present your data engineering experience in a CV and deliver the sense of professionalism and confidence that earns buy-in.
Starting a new data engineering project is always challenging, as data engineering is probably the most difficult role in the data space. You need to be a software engineer to build data pipelines, a data analyst to communicate efficiently with analytics teams using SQL and, in the end, an experienced data platform architect to manage all the required infrastructure resources. It is definitely worth the effort to start learning it! Data engineering was the fastest-growing occupation according to DICE research conducted in 2023, something I covered in a previous article [1].
If you are learning data engineering, this article might be useful for you because choosing the right set of projects helps you develop the required skills faster. In the article linked above I wrote about the skills that look great on your CV; now it is time to talk about the data engineering project portfolio. It is easy to get stuck choosing the right tools or finding the right data for your project. Let's see what we can do to make your CV look more professional data-wise, and maybe create a mini data engineering roadmap of things to learn this year.
I noticed that beginner and even intermediate-level users often struggle in these three main areas while contemplating a new data engineering project:
- Choosing the right dataset for your project
- Picking the right tools to work with data
- Data pipeline orchestration
When you are thinking of starting a new data engineering project, I recommend treating it as a data platform you are building from scratch (or at least a part of one). Imagine that you have just been hired and onboarded. Think of the questions you would want to ask your new employer about the data stack and the business and functional requirements, and try to envision all the potential pitfalls and challenges you might face during the data platform design. In any data platform, the data flow is predictable.
Ultimately there are a few main things to focus on while architecting the data platform [2]:
- Data sources (APIs, relational and non-relational databases, integrations, event streams)
- Data ingestion — this is the complicated part; more ideas can be found below.
- Data storage — mostly for lake house architecture types.
- Data orchestration — data pipeline triggers, flow charts and directed acyclic graphs (DAGs)
- Data resources — provisioning and management.
- Data warehousing — tools and techniques.
- Business intelligence (BI) — reporting and dashboards.
- Machine learning and MLOps
Each of these areas should be mentioned in your data engineering CV to make it look complete. It is also a great idea to create some sort of visual to present the final deliverable, such as a website or a simple dashboard, e.g. a Looker Studio dashboard that we can share, ideally supplied with comments and annotations.
Data sources
It all starts with data sources. Indeed, any data pipeline starts somewhere. Whenever we move data from point A to point B, there is a data pipeline [3]. A data pipeline has three major parts: a source, a processing step (or steps) and a destination. For example, we can extract data from an external API (the source) and then load it into the data warehouse (the destination). This is the most common type of data pipeline, where the source and destination are different.
A simple data connector built for a MySQL database can demonstrate that you know how to extract data, which is the basic part of the ETL technique. Very often data engineers are tasked with extracting huge amounts of data from databases (or any other data source). It is crucial to convey that you know how to do this efficiently in code, e.g. using a Python generator and yield. The code snippet below shows how to process data in chunks using a generator:
```python
# Create a file first: ./very_big_file.csv as:
# transaction_id,user_id,total_cost,dt
# 1,John,10.99,2023-04-15
# 2,Mary,4.99,2023-04-12

# example.py
def etl(item):
    # Do some ETL here, e.g. mask personal data
    return item.replace("John", '****')


# Create a generator
def batch_read_file(file_object, batch_size=1024):
    """Lazy function (generator) to read a file in chunks.
    Default chunk size: 1024 bytes."""
    while True:
        data = file_object.read(batch_size)
        if not data:
            break
        yield data


# ...and read the file in chunks
with open('very_big_file.csv') as f:
    for batch in batch_read_file(f):
        print(etl(batch))

# In the command line run:
# python example.py
```
So the project can look like this:
The tutorial explains how to:
- export data from MySQL efficiently with a stream and save it locally as a CSV file
- export data from MySQL and pipe that stream to GCP's Cloud Storage or AWS S3
This is one of the most common data pipeline designs, and MySQL is probably the most popular relational database; it can be easily deployed locally or in the cloud. You can follow one of my previous stories to set it up.
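To make this concrete, here is a minimal sketch of such a streaming export in Python. It assumes a hypothetical transactions table, placeholder credentials and the pymysql client; the original tutorial may use a different driver or stream API, so treat this as an illustration rather than the exact code:

```python
# Minimal sketch: stream rows out of MySQL into a local CSV file.
# Assumptions: a `mydatabase.transactions` table and placeholder credentials.
import csv

import pymysql
import pymysql.cursors

connection = pymysql.connect(
    host='localhost',
    user='root',
    password='password',                    # placeholder credentials
    database='mydatabase',                  # hypothetical database name
    cursorclass=pymysql.cursors.SSCursor,   # server-side cursor: rows are streamed, not buffered in memory
)

try:
    with connection.cursor() as cursor, open('transactions.csv', 'w', newline='') as f:
        cursor.execute("SELECT transaction_id, user_id, total_cost, dt FROM transactions")
        writer = csv.writer(f)
        writer.writerow([column[0] for column in cursor.description])  # header row
        while True:
            rows = cursor.fetchmany(1000)   # pull rows in small batches instead of all at once
            if not rows:
                break
            writer.writerows(rows)
finally:
    connection.close()
```

The same pattern extends to the second step: instead of writing to a local file, each batch can be uploaded to GCP's Cloud Storage or AWS S3 using the respective client library.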
Data ingestion, streams and data warehousing
The way we load data into the data warehouse, or into the landing area of our data lake, is probably the most difficult part of any data engineering project, as it involves a producer. Basically, we need something to generate data: it can be a microservice or a containerised application (e.g. Docker), or it can be a simple AWS Lambda function sending event data. For example, consider the code snippet below. It mocks some fake data and forwards it into a Kinesis stream [4].
```python
# Make sure boto3 is installed locally, i.e. pip install boto3
import json
import random

import boto3

kinesis_client = boto3.client('kinesis', region_name='eu-west-1')

# Constants:
STREAM_NAME = "your-data-stream-staging"


def lambda_handler(event, context):
    processed = 0
    print(STREAM_NAME)
    try:
        print('Trying to send events to Kinesis...')
        for i in range(0, 5):
            data = get_data()
            print(i, " : ", data)
            kinesis_client.put_record(
                StreamName=STREAM_NAME,
                Data=json.dumps(data),
                PartitionKey="partitionkey")
            processed += 1
    except Exception as e:
        print(e)
    message = 'Successfully processed {} events.'.format(processed)
    return {
        'statusCode': 200,
        'body': {'lambdaResult': message}
    }
```
We would like to add a helper function to generate some random event data. For instance:
```python
# Helpers:
from datetime import datetime


def get_data():
    return {
        'event_time': datetime.now().isoformat(),
        'event_name': random.choice(['JOIN', 'LEAVE', 'OPEN_CHAT', 'SUBSCRIBE', 'SEND_MESSAGE']),
        'user': round(random.random() * 100)
    }
```
Here we go! This is our second data engineering project that demonstrates the following:
- we know how to mock data using Python
- we are familiar with serverless architecture
- we are confident with event streams
- we know basic ETL techniques
There is an Infrastructure as Code (IaC) part of this project too, and I think it is one of the main things I would look into when hiring mid-level data engineers.
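As a small illustration of the IaC piece, here is a minimal sketch using AWS CDK v2 for Python to provision the Kinesis stream our Lambda writes to. The stack and resource names are placeholders, and the original project might use raw CloudFormation or Terraform instead:

```python
# Minimal IaC sketch with AWS CDK v2 (pip install aws-cdk-lib constructs).
# Names below are placeholders, not the exact resources from the tutorial.
from aws_cdk import App, Duration, Stack
from aws_cdk import aws_kinesis as kinesis
from constructs import Construct


class StreamingStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # The data stream our Lambda producer writes events to
        kinesis.Stream(
            self, "EventsStream",
            stream_name="your-data-stream-staging",
            shard_count=1,
            retention_period=Duration.hours(24),
        )


app = App()
StreamingStack(app, "streaming-data-stack-staging")
app.synth()
```

Running `cdk deploy` against this app creates the stream, which keeps the infrastructure versioned alongside the pipeline code.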
Data storage, machine learning and orchestration
Here I would demonstrate that I am familiar with different data platform architecture types, e.g. data warehouse, lake house, data lake and data mesh. I remember a couple of years ago the internet was boiling with "Hadoop is dead" type stories, and there was a noticeable shift towards data warehouse architecture. In 2024 everyone seems to be obsessed with real-time data streaming and scalability, suggesting that Spark and Kafka will soon become the benchmark leaders. Indeed, processing huge amounts of data in the data lake can be far more efficient with distributed computing.
All we need is to demonstrate that we are familiar with Spark (PySpark), for example, and that we have experience working with cloud storage providers. The main three are AWS, Google Cloud and Azure.
So our third project might look like this [5]:
In this tutorial, we will use the public Movielens datasets with movie ratings to build a recommendation system. The steps would be the following:
- preserve the movie data in cloud storage, e.g. AWS S3; the raw ratings file looks like this:
```
# user_ratedmovies-timestamp.dat
userID	movieID	rating	timestamp
75	3	1	1162160236000
75	32	4.5	1162160624000
75	110	4	1162161008000
```
- transform the datasets into a conformed model using AWS Glue (see the PySpark sketch after this list)
- orchestrate the pipeline using AWS Step Functions
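As an illustration of the transform step, here is a minimal PySpark sketch that reads the raw ratings file from cloud storage and writes a conformed Parquet dataset back to the lake. The bucket and paths are assumptions, and in the tutorial this logic is wired up through AWS Glue rather than a standalone Spark job:

```python
# Minimal PySpark sketch: raw .dat file in, conformed Parquet out.
# The s3://my-movielens-bucket paths are placeholders (s3:// works on EMR/Glue;
# plain Spark would need s3a:// plus the hadoop-aws package).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("movielens-ratings").getOrCreate()

ratings = (
    spark.read
    .option("sep", "\t")        # the ratings file is tab-separated
    .option("header", True)
    .csv("s3://my-movielens-bucket/raw/user_ratedmovies-timestamp.dat")
    .withColumn("rating", F.col("rating").cast("double"))
    # the timestamp column is in milliseconds since epoch
    .withColumn("rated_at", (F.col("timestamp").cast("long") / 1000).cast("timestamp"))
)

# Persist the conformed dataset back to the lake as Parquet
(
    ratings
    .select("userID", "movieID", "rating", "rated_at")
    .write
    .mode("overwrite")
    .parquet("s3://my-movielens-bucket/conformed/user_ratings/")
)

ratings.show(5)
```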
This is a good project not only to demonstrate knowledge of data platform and data pipeline design, but also to learn infrastructure as code (AWS CloudFormation) and machine learning (ML) techniques with AWS Personalize.
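To give a flavour of the orchestration piece, a tiny boto3 sketch that kicks off a (hypothetical) Step Functions state machine for this pipeline could look like this; the ARN and input payload are placeholders:

```python
# Minimal sketch: trigger a Step Functions execution with boto3.
# The state machine ARN and input below are placeholders.
import json

import boto3

sfn_client = boto3.client('stepfunctions', region_name='eu-west-1')

response = sfn_client.start_execution(
    stateMachineArn='arn:aws:states:eu-west-1:123456789012:stateMachine:movielens-pipeline',
    name='movielens-run-001',   # optional, but must be unique per execution
    input=json.dumps({'dataset': 'user_ratedmovies-timestamp.dat'}),
)
print(response['executionArn'])
```

In a setup like this, the state machine itself would typically be defined in the CloudFormation templates mentioned above, so this call is only the trigger at the end of the chain.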
Data warehousing and machine learning
Machine learning is an important part of data engineering. MLOps and skills required to manage machine learning pipelines are essential for data engineers. So why not demonstrate that we are confident with both?
A good example of such a project might include the analysis of user behaviour data to predict users' propensity to churn. This is not a trivial task per se, and building a project like this requires a good understanding of marketing event data and the basic concepts of user retention. So if you are capable of finishing it, you'll have it all! We can focus on dataset preparation and model training using standard SQL, which demonstrates good knowledge of SQL techniques.
Indeed, retention is an important business metric that helps us understand the mechanics of user behaviour. It provides a high-level overview of how successful our App is at retaining users by answering one simple question: is our App good enough at retaining users? It is a well-known fact that it is cheaper to retain an existing user than to acquire a new one, so building a data pipeline like this might be a great way to learn these concepts.
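Before we can score users, a model needs to be trained. A minimal sketch using the google-cloud-bigquery Python client and BigQuery ML's logistic regression could look like the snippet below; the sample_churn_model.churn_training table and its churned label column are assumptions, and the real feature preparation is where most of the SQL work happens:

```python
# Minimal sketch: train a churn classifier with BigQuery ML from Python.
# Assumes a sample_churn_model.churn_training table with a `churned` label column.
from google.cloud import bigquery

client = bigquery.Client()

train_model_sql = """
CREATE OR REPLACE MODEL sample_churn_model.churn_model
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned']
) AS
SELECT * FROM sample_churn_model.churn_training
"""

client.query(train_model_sql).result()  # blocks until the training job completes
print("churn_model trained")
```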
Once the model is trained, we can generate churn predictions with BigQuery ML like so:
```sql
SELECT
    user_pseudo_id,
    churned,
    predicted_churned,
    predicted_churned_probs[OFFSET(0)].prob AS probability_churned
FROM
    ML.PREDICT(MODEL sample_churn_model.churn_model,
        (SELECT * FROM sample_churn_model.churn))  -- can be replaced with a proper test dataset
ORDER BY 3 DESC
;
```
It will forecast the probability (propensity to churn): the closer this probability is to 1, the more likely it is that this user will not return to the App, according to the model's prediction.
Acting on machine learning (ML) model output to retain users has proved extremely useful and might help to gain a competitive advantage in a fast-changing market environment.
That’s why it is a great candidate to include in our CV!
Conclusion
Starting a new data engineering project is always challenging, and in this story I tried to focus on ideas that might look great on any CV. I often find myself struggling with the choice of dataset. When I get stuck, I simply start mocking the data myself; this should not be a blocker. Alternatively, we can use datasets that are publicly available for free, e.g. Movielens and Google Analytics events. They are great for staging purposes due to their sizes. Choosing the right tools for your data engineering projects depends on functional and business requirements. Here I would recommend coming up with a scenario and playing with your imagination. That's why I love tech: everything is entirely possible!
In this article, I shared four data engineering projects that cover crucial areas of data platform design: data pipeline orchestration, data warehousing, data ingestion and machine learning. These are real projects I was involved in, and I have translated them into tutorials. I hope you find them useful.
Recommended read
[1] https://towardsdatascience.com/how-to-become-a-data-engineer-c0319cb226c2
[2] https://towardsdatascience.com/data-platform-architecture-types-f255ac6e0b7
[3] https://towardsdatascience.com/data-pipeline-design-patterns-100afa4b93e3
[5] https://readmedium.com/orchestrate-machine-learning-pipelines-with-aws-step-functions-d8216a899bd5
[6] https://readmedium.com/python-for-data-engineers-f3d5db59b6dd