Day 27 of 30 days of Data Engineering Series with Projects

Welcome back peeps to Day 27 of Data Engineering Series with Projects!

In this we will cover —

Power BI

Which chart to use and When?

Power BI — Data Analysis Expressions

Joins

Data Profiling

Pre-requisite to Day 27 is to complete Day 1–26( link below):

Day 1 : What’s Data Engineering, Why Data Engineering, Data Engineers — ML Engineers — Data Scientists, Purpose and Scope

Day 2 : Complete Python for Data Engineering — Part 1

Day 3 : Complete Advanced Python for Data Engineering — Part 2

Day 4: Techniques to write efficient and Optimized Code

Day 5 : SQL

Day 6 : Advanced SQL

Day 7 : BigQuery and SQL vs NOSQL databases

Day 8 : Advanced Functions

Day 9 : Query Optimizations

Day 10 : MySQL and PostgreSQL

Day 11: Shell scripting and Linux “touch” command

Day 12 : Map Reduce, Data Warehouse, Data Lakes

Day 13: Pandas, Pandas, Data Cleaning and processing, Outlier Detection, Noisy Data, Missing Data, Pandas Functions, Aggregate Functions, Joins

Day 14 : Numpy

Day 15 : Advanced Pandas Techniques

Day 16 : Data Pre-processing, Handling missing values, Data Cleaning, Mean/mode/median Imputation, Hot Deck Imputation, Rescale Data, Binarize Data, Regression Imputation, Stochastic regression imputation, Feature Scaling

Day 17 : Data Augmentation, Read and Process Large Datasets

Day 18 : Data Visualization basics, Data Visualization Projects, Data Visualization using Plotly and Bokeh, Data Profiling, Summary Functions, Indexing, Grouping, Linear Regression, Multi Linear Regression, Polynomial Regression, Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression, Feature Engineering, GroupBy Features, Categorical and Numerical Features, Missing Value Analysis, Fill the missing Values, Unique Value Analysis, Univariate Analysis, Bivariate Analysis, Multivariate Analysis, Correlation Analysis, Spearman’s ρ, Pearson’s r, Kendall’s τ, Cramér’s V (φc), Phik (φk)

Day 19 : MySQL and PostgreSQL

Day 20 : ETL ( Extract, Tranform and Load) basics, Why ETL is important?, How ETL works, ETL Tools

Day 21 : Structured Data, Semi Structured Data, Unstructured Data, Data Warehouse, Data Mart, Data Lake

Day 22 :Big Data, Types of Big Data, Big data tools, SQL and NoSQL Databases, Hadoop, Hadoop HDFS, Hadoop Yarn

Day 23: Batch Processing, Stream Processing, Apache Spark, Apache Spark Commands, Apache Kafka, How Apache Kafka works

Day 24 : Hive, Zookeper, Pig, Cassandra, Sqoop

Day 25: Docker, Docker vs Virtual Machines, Most important Docker commands, Kubernetes, Snowflake

Day 26 : Data Pipelines, Transformation, Processing, Workflow, Monitoring, Airflow, DAG

Day 27 : Power BI, Which chart to use and When?, Power BI — Data Analysis Expressions, Joins, Data Profiling

Projects Videos —

Subscribe today!

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

www.youtube.com

Tech Newsletter —

If you are interested, you can join my newsletter through which I send tech interview tips, techniques, patterns, hacks — Software Development, ML, Data Science, Startups and Technology projects to more than 30K readers. You can subscribe to Ignito:

Ignito

Data Science, ML, AI and more… Click to read Ignito, by Naina Chaturvedi, a Substack publication with hundreds of…

naina0405.substack.com

System Design Case Studies — In Depth

Design Instagram

Design Netflix

Design Reddit

Design Amazon

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Amazon Prime Video

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Mega Compilation : Solved System Design Case studies

Let’s get started!

Data Analysis Expressions (DAX): DAX is a formula language used in Power BI to create custom calculations and aggregations. It is used to create calculated tables and columns, and can also be used to define custom aggregations and filters.
Joins: Power BI allows you to join data from multiple tables, which can be done in the Query Editor. The most common types of joins include inner join, left join, right join, and full outer join. The choice of join type depends on the specific needs of the analysis and the relationships between the tables.
Data Profiling: Data profiling is the process of analyzing the data to understand its structure, quality, and content. Power BI offers a Data Profiling feature, which provides a quick way to understand the data by displaying statistics such as the number of rows, columns, null values, and data types. It also allows you to identify outliers and anomalies in the data, which can be used to improve the quality of the data.

Code Implementation —

import pandas as pd
import numpy as np

# Sample data for demonstration
data1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                      'Name': ['John', 'Jane', 'Alice', 'Bob'],
                      'Age': [25, 30, 35, 40]})

data2 = pd.DataFrame({'ID': [1, 2, 3, 5],
                      'Salary': [5000, 6000, 7000, 8000]})

# Perform join operation
merged_data = pd.merge(data1, data2, on='ID', how='inner')

# Perform data profiling
data_stats = merged_data.describe()

# Create calculated column using DAX
merged_data['Full Name'] = merged_data['Name'] + ' ' + merged_data['Age'].astype(str)

# Display the merged data with calculated column and data statistics
print("Merged Data:")
print(merged_data)

print("\nData Statistics:")
print(data_stats)

Snippet —

Pre-requisite —

Before starting, go through this post to understand charts/plots and which chart to use and when.

Power BI is a tool for data and analytics reporting which lets users create report insights using different visualizations and dashboards.

Power BI is a UI based tool which allows to integrate with any data source and lets you work with large amount of data.

There are three main views in Power BI —

Data View — datasets associated with the reports are examined.

Model View — lets you establish relationship between different data.

Report View — lets you visualize data and create the reports and dashboards

Upload the data in the Power BI —

Once the data is uploaded, understand which chart best represents your data.

To present your data, there are four basic presentation types :

Composition : To show part-to-whole relationship of the data variables

Distribution : To show the spread of the data values

Relationship : To establish relationship between the different data variables

Comparison : To compare one value with the other ( i.e two or more data variables)

Line Chart —

Line chart are used to show trends over the period time or categories i.e to show changes in one variable value relative to another..

Example :

Column Chart —

Column charts are used to show to show comparison between different variables or multiple categories over time. It’s plotted using vertical bars.

Example :

Stacked Column Chart —

Stacked Column Chart is used to show relative percentage of multiple data categories or variables in stacked columns. It’s plotted using vertical bars.

Example :

Pie Chart —

Pie charts are used to show data as a percentage of a whole i.e to let user compare the relationship between different categories/dimension in some context.

Example :

Donut Chart —

Just like pie chart but with a hole in the centre; donut chart is used to visualize the categories as arcs.

Example :

Area Chart —

Area Charts are used to present the accumulative value changes over time and draw attention to the total value across a trend.

Example :

Bar Chart —

Bar charts are used to show to show values across different data variables/categories where values are represented on the x-axis and categories on the y-axis.

Example :

Scatter Plot —

Scatter plot are used to show distribution, correlation analysis and clustering trends.

Example :

TreeMap —

It is used to visualize the categories with the colored rectangles which have size wrt to their values.

Maps —

It is used to map quantitative and categorical information for the different locations.

Data Analysis Expressions ( DAX)

It’s a language in Power BI which lets you create queries/calculations and do data analysis.

For aggregations, it uses below functions —

sum() : To compute the sum of a specific Column.
min() : To compute minimum value of each Column
max() : To compute maximum value of each Column
median() : To compute median of each column
count() : To count elements by elements.

For logical functions it uses IF clause.

For text Functions, it uses —

LOWER() : to convert text into lowercase

UPPER() : to convert text into uppercase

REPLACE(): To replace the old text with new text

Code Implementation —

import pandas as pd

# Sample data for demonstration
data = pd.DataFrame({'Name': ['John', 'Jane', 'Alice', 'Bob'],
                     'Age': [25, 30, 35, 40],
                     'Salary': [5000, 6000, 7000, 8000]})

# Aggregation functions
sum_salary = data['Salary'].sum()
min_age = data['Age'].min()
max_age = data['Age'].max()
median_salary = data['Salary'].median()
count_name = data['Name'].count()

print("Sum of Salary:", sum_salary)
print("Minimum Age:", min_age)
print("Maximum Age:", max_age)
print("Median Salary:", median_salary)
print("Count of Name:", count_name)

# Logical function (IF clause)
data['Salary Category'] = data['Salary'].apply(lambda x: 'High' if x > 6000 else 'Low')

print("\nData with Salary Category:")
print(data)

# Text functions
data['Name Lower'] = data['Name'].str.lower()
data['Name Upper'] = data['Name'].str.upper()
data['Name Replace'] = data['Name'].str.replace('a', 'e')

print("\nData with Text Functions:")
print(data)

Snippet —

JOINS

Before we start with joins, one must understand the points explained below —

Primary Keys : These are fields/keys in a table which uniquely identifies each record. These are used to join tables.

Foreign Keys : These are fields/keys in a table which the references the primary key of the other table.

Relationships : Relations between the tables can be one to one, one to many and many to many. In one to one relationships, a record in one table is uniquely related to exactly one record in the other table. On the other side, One to many relationships, a record in one table can be related to one or more records in the other table. Lastly, in many to many one or more records in one table are related to one or more records in the other table.

What is a Join?

A join is nothing but a construct used to combine rows from two or more tables based on a related/common column between them. It matches the related columns values in two or more tables.

INNER JOIN: Select records that have matching values in both tables.

LEFT JOIN: Select records from the first (left-most) table with matching right table records.

RIGHT JOIN: Select records from the second (right-most) table with matching left table records.

CROSS JOIN : Select records in the first table multiplied by the records in the second table.

FULL JOIN: Selects all records that match either left or right table records.

Format —

SELECT column_names

FROM table1 JOIN table2

ON column_name1 = column_name2

WHERE condition

Code Implementation —

-- Sample tables for demonstration
CREATE TABLE Customers (
  CustomerID INT PRIMARY KEY,
  CustomerName VARCHAR(50),
  City VARCHAR(50)
);

CREATE TABLE Orders (
  OrderID INT PRIMARY KEY,
  OrderDate DATE,
  CustomerID INT,
  FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

-- INNER JOIN: Select records that have matching values in both tables
SELECT Customers.CustomerID, Customers.CustomerName, Orders.OrderID, Orders.OrderDate
FROM Customers
INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

-- LEFT JOIN: Select records from the first (left-most) table with matching right table records
SELECT Customers.CustomerID, Customers.CustomerName, Orders.OrderID, Orders.OrderDate
FROM Customers
LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

-- RIGHT JOIN: Select records from the second (right-most) table with matching left table records
SELECT Customers.CustomerID, Customers.CustomerName, Orders.OrderID, Orders.OrderDate
FROM Customers
RIGHT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

-- CROSS JOIN: Select records in the first table multiplied by the records in the second table
SELECT Customers.CustomerID, Customers.CustomerName, Orders.OrderID, Orders.OrderDate
FROM Customers
CROSS JOIN Orders;

-- FULL JOIN: Selects all records that match either left or right table records
SELECT Customers.CustomerID, Customers.CustomerName, Orders.OrderID, Orders.OrderDate
FROM Customers
FULL JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

-- Cleanup: Drop the sample tables
DROP TABLE Orders;
DROP TABLE Customers;

Snippet —

Data Profiling

You can also profile your data in order to get information about your data in Power BI.

Check the Column distribution and profile to see summary statistics of the data.

Some of the most important PowerBI commands —

The “Get Data” command: This command is used to import data into Power BI from various sources such as Excel, CSV, SQL Server, and more.
The “Transform Data” command: This command, also known as the Query Editor, allows you to clean and transform your data before it is used in your Power BI report. This includes tasks such as filtering, pivoting, splitting columns, and changing data types.
The “Visualizations” command: This command allows you to create and customize various types of charts and visuals such as bar charts, line charts, pie charts, and more.
The “Measures” command: This command is used to create calculations and aggregations using DAX (Data Analysis Expressions) formulas. These measures can be used in visuals and tables to show the summarized data.
The “Format” command: This command allows you to customize the look and feel of your visuals, including things like color, font size, and axis labels.
The “Page layout” command: This command allows you to arrange and organize your visuals on a Power BI report page.
The “Publish” command: This command allows you to publish your Power BI report to the Power BI service, where it can be shared and accessed by others.
“Data Modeling” command : This command allows you to create and manage relationships between tables, create calculated columns and tables, and create hierarchies.
“Analyze” command : This command allows you to perform advanced analytics tasks such as forecasting, clustering, and key drivers analysis.
“Dashboard” command : This command allows you to create, manage, and share interactive dashboards that display multiple visuals and reports on a single screen.

A project video covering Power BI, Power BI — Data Analysis Expressions, Joins, Data Profiling coming soon ( subscribe today) —

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

www.youtube.com

That’s it for now.

Find Day 28 Below :

Day 28 of 30 days of Data Engineering Series with Projects

Welcome back peeps to Day 28 of Data Engineering Series with Projects!

medium.com

Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time

Stay Tuned!!

All the Complete System Design Series Parts —

1. System design basics

2. Horizontal and vertical scaling

3. Load balancing and Message queues

4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture

5. Caching, Indexing, Proxies

6. Networking, How Browsers work, Content Network Delivery ( CDN)

7. Database Sharding, CAP Theorem, Database schema Design

8. Concurrency, API, Components + OOP + Abstraction

9. Estimation and Planning, Performance

10. Map Reduce, Patterns and Microservices

11. SQL vs NoSQL and Cloud

12. Most Popular System Design Questions

Github —

Complete-System-Design/README.md at main · Coder-World04/Complete-System-Design

This repository contains everything you need to become proficient in System Design Topics you should know in System…

github.com

For Python Projects —

Complete Python And Projects — Mega Compilation

Everything that you need to know in Python with Projects…

medium.com

Analyzing Video using Python, OpenCV and NumPy

With Code Implementation…

medium.datadriveninvestor.com

For complete 60 days of Data Science and ML : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Connect the ML dots…

medium.com

Follow for more updates. Stay tuned and keep coding!

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Build Machine Learning Pipelines( With Code) — Part 1

Complete implementation…

medium.datadriveninvestor.com

Recurrent Neural Network with Keras

Recurrent Neural Network with Keras

Project Implementation and cheatsheet…

medium.datadriveninvestor.com

Clustering Geolocation Data in Python using DBSCAN and K-Means

Clustering Geolocation Data in Python using DBSCAN and K-Means

Project Implementation…

medium.datadriveninvestor.com

Facial Expression Recognition using Keras

Facial Expression Recognition using Keras

Project Implementation…

medium.datadriveninvestor.com

Hyperparameter Tuning with Keras Tuner

Hyperparameter Tuning with Keras Tuner

Project Implementation….

medium.datadriveninvestor.com

Custom Layers in Keras

Custom Layers in Keras

Code implementation …

medium.datadriveninvestor.com

Day 27 of 30 days of Data Engineering Series with Projects

Power BI

Which chart to use and When?

Power BI — Data Analysis Expressions

Joins

Data Profiling

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

Tech Newsletter —

Ignito

Data Science, ML, AI and more… Click to read Ignito, by Naina Chaturvedi, a Substack publication with hundreds of…

System Design Case Studies — In Depth

Design Instagram

Design Netflix

Design Reddit

Design Amazon

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Amazon Prime Video

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Most Popular System Design Questions

Mega Compilation : Solved System Design Case studies

Let’s get started!

TreeMap —

Maps —

Data Analysis Expressions ( DAX)

JOINS

What is a Join?

Data Profiling

A project video covering Power BI, Power BI — Data Analysis Expressions, Joins, Data Profiling coming soon ( subscribe today) —

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

That’s it for now.

Find Day 28 Below :

Day 28 of 30 days of Data Engineering Series with Projects

Welcome back peeps to Day 28 of Data Engineering Series with Projects!

Read more —

All the Complete System Design Series Parts —

Github —

Complete-System-Design/README.md at main · Coder-World04/Complete-System-Design

This repository contains everything you need to become proficient in System Design Topics you should know in System…

For Python Projects —

Complete Python And Projects — Mega Compilation

Everything that you need to know in Python with Projects…

Analyzing Video using Python, OpenCV and NumPy

With Code Implementation…

Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Connect the ML dots…

For other projects, tune to —

Build Machine Learning Pipelines( With Code) — Part 1

Complete implementation…

Recurrent Neural Network with Keras

Project Implementation and cheatsheet…

Clustering Geolocation Data in Python using DBSCAN and K-Means

Project Implementation…

Facial Expression Recognition using Keras

Project Implementation…

Hyperparameter Tuning with Keras Tuner

Project Implementation….

Custom Layers in Keras

Code implementation …