avatarNaina Chaturvedi

Summary

Day 27 of the "30 days of Data Engineering Series with Projects" covers Power BI, including when to use different charts, Data Analysis Expressions (DAX), joins, and data profiling, with prerequisites and project videos for hands-on learning.

Abstract

The 30-day series on data engineering continues with a focus on Power BI for Day 27. The content delves into the appropriate use of various charts within Power BI, the application of DAX for custom calculations, the implementation of different types of joins to combine data from multiple sources, and the process of data profiling to understand data structure and quality. Pre-requisites for Day 27 include completing Days 1 through 26, ensuring a solid foundation before tackling the advanced topics presented. The article also teases upcoming project videos that will provide practical exercises to reinforce the concepts covered. The comprehensive approach aims to equip readers with the skills necessary to effectively analyze and visualize data using Power BI.

Opinions

  • The author emphasizes the importance of understanding which chart to use and when, suggesting that the correct visualization is crucial for data analysis.
  • There is a clear emphasis on the practical application of knowledge, with the inclusion of project videos and hands-on exercises to complement the theoretical learning.
  • The article suggests that proficiency in Power BI, including the use of DAX and joins, is a valuable skill for data engineers.
  • The structured approach of the series, with each day building on the previous, indicates the author's belief in incremental learning for complex subjects like data engineering.
  • By providing a mega compilation of solved system design case studies, the author seems to value a repository of resources for easy access and review.
  • The mention of subscribing to the newly launched YouTube channel "Ignito" for more content suggests the author's commitment to continuous education and community engagement in the tech field.

Day 27 of 30 days of Data Engineering Series with Projects

Pic credits MSLearn

Welcome back peeps to Day 27 of Data Engineering Series with Projects!

In this we will cover —

Power BI

Which chart to use and When?

Power BI — Data Analysis Expressions

Joins

Data Profiling

Pre-requisite to Day 27 is to complete Day 1–26( link below):

Day 1 : What’s Data Engineering, Why Data Engineering, Data Engineers — ML Engineers — Data Scientists, Purpose and Scope

Day 2 : Complete Python for Data Engineering — Part 1

Day 3 : Complete Advanced Python for Data Engineering — Part 2

Day 4: Techniques to write efficient and Optimized Code

Day 5 : SQL

Day 6 : Advanced SQL

Day 7 : BigQuery and SQL vs NOSQL databases

Day 8 : Advanced Functions

Day 9 : Query Optimizations

Day 10 : MySQL and PostgreSQL

Day 11: Shell scripting and Linux “touch” command

Day 12 : Map Reduce, Data Warehouse, Data Lakes

Day 13: Pandas, Pandas, Data Cleaning and processing, Outlier Detection, Noisy Data, Missing Data, Pandas Functions, Aggregate Functions, Joins

Day 14 : Numpy

Day 15 : Advanced Pandas Techniques

Day 16 : Data Pre-processing, Handling missing values, Data Cleaning, Mean/mode/median Imputation, Hot Deck Imputation, Rescale Data, Binarize Data, Regression Imputation, Stochastic regression imputation, Feature Scaling

Day 17 : Data Augmentation, Read and Process Large Datasets

Day 18 : Data Visualization basics, Data Visualization Projects, Data Visualization using Plotly and Bokeh, Data Profiling, Summary Functions, Indexing, Grouping, Linear Regression, Multi Linear Regression, Polynomial Regression, Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression, Feature Engineering, GroupBy Features, Categorical and Numerical Features, Missing Value Analysis, Fill the missing Values, Unique Value Analysis, Univariate Analysis, Bivariate Analysis, Multivariate Analysis, Correlation Analysis, Spearman’s ρ, Pearson’s r, Kendall’s τ, Cramér’s V (φc), Phik (φk)

Day 19 : MySQL and PostgreSQL

Day 20 : ETL ( Extract, Tranform and Load) basics, Why ETL is important?, How ETL works, ETL Tools

Day 21 : Structured Data, Semi Structured Data, Unstructured Data, Data Warehouse, Data Mart, Data Lake

Day 22 :Big Data, Types of Big Data, Big data tools, SQL and NoSQL Databases, Hadoop, Hadoop HDFS, Hadoop Yarn

Day 23: Batch Processing, Stream Processing, Apache Spark, Apache Spark Commands, Apache Kafka, How Apache Kafka works

Day 24 : Hive, Zookeper, Pig, Cassandra, Sqoop

Day 25: Docker, Docker vs Virtual Machines, Most important Docker commands, Kubernetes, Snowflake

Day 26 : Data Pipelines, Transformation, Processing, Workflow, Monitoring, Airflow, DAG

Day 27 : Power BI, Which chart to use and When?, Power BI — Data Analysis Expressions, Joins, Data Profiling

Projects Videos —

All the projects, data structures, SQL, algorithms, system design, Data Science and ML , Data Analytics, Data Engineering, , Implemented Data Science and ML projects, Implemented Data Engineering Projects, Implemented Deep Learning Projects, Implemented Machine Learning Ops Projects, Implemented Time Series Analysis and Forecasting Projects, Implemented Applied Machine Learning Projects, Implemented Tensorflow and Keras Projects, Implemented PyTorch Projects, Implemented Scikit Learn Projects, Implemented Big Data Projects, Implemented Cloud Machine Learning Projects, Implemented Neural Networks Projects, Implemented OpenCV Projects,Complete ML Research Papers Summarized, Implemented Data Analytics projects, Implemented Data Visualization Projects, Implemented Data Mining Projects, Implemented Natural Leaning Processing Projects, MLOps and Deep Learning, Applied Machine Learning with Projects Series, PyTorch with Projects Series, Tensorflow and Keras with Projects Series, Scikit Learn Series with Projects, Time Series Analysis and Forecasting with Projects Series, ML System Design Case Studies Series videos will be published on our youtube channel ( just launched).

Subscribe today!

Tech Newsletter —

If you are interested, you can join my newsletter through which I send tech interview tips, techniques, patterns, hacks — Software Development, ML, Data Science, Startups and Technology projects to more than 30K readers. You can subscribe to Ignito:

System Design Case Studies — In Depth

Design Instagram

Design Netflix

Design Reddit

Design Amazon

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Amazon Prime Video

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Most Popular System Design Questions

Mega Compilation : Solved System Design Case studies

Let’s get started!

  1. Data Analysis Expressions (DAX): DAX is a formula language used in Power BI to create custom calculations and aggregations. It is used to create calculated tables and columns, and can also be used to define custom aggregations and filters.
  2. Joins: Power BI allows you to join data from multiple tables, which can be done in the Query Editor. The most common types of joins include inner join, left join, right join, and full outer join. The choice of join type depends on the specific needs of the analysis and the relationships between the tables.
  3. Data Profiling: Data profiling is the process of analyzing the data to understand its structure, quality, and content. Power BI offers a Data Profiling feature, which provides a quick way to understand the data by displaying statistics such as the number of rows, columns, null values, and data types. It also allows you to identify outliers and anomalies in the data, which can be used to improve the quality of the data.

Code Implementation —

import pandas as pd
import numpy as np

# Sample data for demonstration
data1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                      'Name': ['John', 'Jane', 'Alice', 'Bob'],
                      'Age': [25, 30, 35, 40]})

data2 = pd.DataFrame({'ID': [1, 2, 3, 5],
                      'Salary': [5000, 6000, 7000, 8000]})

# Perform join operation
merged_data = pd.merge(data1, data2, on='ID', how='inner')

# Perform data profiling
data_stats = merged_data.describe()

# Create calculated column using DAX
merged_data['Full Name'] = merged_data['Name'] + ' ' + merged_data['Age'].astype(str)

# Display the merged data with calculated column and data statistics
print("Merged Data:")
print(merged_data)

print("\nData Statistics:")
print(data_stats)

Snippet —

Pre-requisite —

Before starting, go through this post to understand charts/plots and which chart to use and when.

Power BI is a tool for data and analytics reporting which lets users create report insights using different visualizations and dashboards.

Pic credits : cynotech

Power BI is a UI based tool which allows to integrate with any data source and lets you work with large amount of data.

There are three main views in Power BI —

Data View — datasets associated with the reports are examined.

Model View — lets you establish relationship between different data.

Report View — lets you visualize data and create the reports and dashboards

Upload the data in the Power BI

Pic credits : Pinterest

Once the data is uploaded, understand which chart best represents your data.

To present your data, there are four basic presentation types :

Composition : To show part-to-whole relationship of the data variables

Distribution : To show the spread of the data values

Relationship : To establish relationship between the different data variables

Comparison : To compare one value with the other ( i.e two or more data variables)

Line Chart —

Line chart are used to show trends over the period time or categories i.e to show changes in one variable value relative to another..

Example :

Pic credits : Google dev

Column Chart —

Column charts are used to show to show comparison between different variables or multiple categories over time. It’s plotted using vertical bars.

Example :

Pic credits : excelest

Stacked Column Chart —

Stacked Column Chart is used to show relative percentage of multiple data categories or variables in stacked columns. It’s plotted using vertical bars.

Example :

Pic credits : OriginalLabs

Pie Chart —

Pie charts are used to show data as a percentage of a whole i.e to let user compare the relationship between different categories/dimension in some context.

Example :

Pic credits : ResearchGate

Donut Chart —

Just like pie chart but with a hole in the centre; donut chart is used to visualize the categories as arcs.

Example :

Pic credits : fintrain

Area Chart —

Area Charts are used to present the accumulative value changes over time and draw attention to the total value across a trend.

Example :

Pic credits : Anycharts

Bar Chart —

Bar charts are used to show to show values across different data variables/categories where values are represented on the x-axis and categories on the y-axis.

Example :

Pic credits : displayr

Scatter Plot —

Scatter plot are used to show distribution, correlation analysis and clustering trends.

Example :

Pic credits : DataViz

TreeMap —

It is used to visualize the categories with the colored rectangles which have size wrt to their values.

Pic credits : MS Learn

Maps —

It is used to map quantitative and categorical information for the different locations.

Pic credits : SQLshack

Data Analysis Expressions ( DAX)

It’s a language in Power BI which lets you create queries/calculations and do data analysis.

For aggregations, it uses below functions —

  • sum() : To compute the sum of a specific Column.
  • min() : To compute minimum value of each Column
  • max() : To compute maximum value of each Column
  • median() : To compute median of each column
  • count() : To count elements by elements.

For logical functions it uses IF clause.

For text Functions, it uses —

LOWER() : to convert text into lowercase

UPPER() : to convert text into uppercase

REPLACE(): To replace the old text with new text

Code Implementation —

import pandas as pd

# Sample data for demonstration
data = pd.DataFrame({'Name': ['John', 'Jane', 'Alice', 'Bob'],
                     'Age': [25, 30, 35, 40],
                     'Salary': [5000, 6000, 7000, 8000]})

# Aggregation functions
sum_salary = data['Salary'].sum()
min_age = data['Age'].min()
max_age = data['Age'].max()
median_salary = data['Salary'].median()
count_name = data['Name'].count()

print("Sum of Salary:", sum_salary)
print("Minimum Age:", min_age)
print("Maximum Age:", max_age)
print("Median Salary:", median_salary)
print("Count of Name:", count_name)

# Logical function (IF clause)
data['Salary Category'] = data['Salary'].apply(lambda x: 'High' if x > 6000 else 'Low')

print("\nData with Salary Category:")
print(data)

# Text functions
data['Name Lower'] = data['Name'].str.lower()
data['Name Upper'] = data['Name'].str.upper()
data['Name Replace'] = data['Name'].str.replace('a', 'e')

print("\nData with Text Functions:")
print(data)

Snippet —

JOINS

Before we start with joins, one must understand the points explained below —

Primary Keys : These are fields/keys in a table which uniquely identifies each record. These are used to join tables.

Foreign Keys : These are fields/keys in a table which the references the primary key of the other table.

Relationships : Relations between the tables can be one to one, one to many and many to many. In one to one relationships, a record in one table is uniquely related to exactly one record in the other table. On the other side, One to many relationships, a record in one table can be related to one or more records in the other table. Lastly, in many to many one or more records in one table are related to one or more records in the other table.

What is a Join?

A join is nothing but a construct used to combine rows from two or more tables based on a related/common column between them. It matches the related columns values in two or more tables.

Pic credits : Alphacode

INNER JOIN: Select records that have matching values in both tables.

LEFT JOIN: Select records from the first (left-most) table with matching right table records.

RIGHT JOIN: Select records from the second (right-most) table with matching left table records.

CROSS JOIN : Select records in the first table multiplied by the records in the second table.

FULL JOIN: Selects all records that match either left or right table records.

Format —

SELECT column_names

FROM table1 JOIN table2

ON column_name1 = column_name2

WHERE condition

Code Implementation —

-- Sample tables for demonstration
CREATE TABLE Customers (
  CustomerID INT PRIMARY KEY,
  CustomerName VARCHAR(50),
  City VARCHAR(50)
);

CREATE TABLE Orders (
  OrderID INT PRIMARY KEY,
  OrderDate DATE,
  CustomerID INT,
  FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

-- INNER JOIN: Select records that have matching values in both tables
SELECT Customers.CustomerID, Customers.CustomerName, Orders.OrderID, Orders.OrderDate
FROM Customers
INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

-- LEFT JOIN: Select records from the first (left-most) table with matching right table records
SELECT Customers.CustomerID, Customers.CustomerName, Orders.OrderID, Orders.OrderDate
FROM Customers
LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

-- RIGHT JOIN: Select records from the second (right-most) table with matching left table records
SELECT Customers.CustomerID, Customers.CustomerName, Orders.OrderID, Orders.OrderDate
FROM Customers
RIGHT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

-- CROSS JOIN: Select records in the first table multiplied by the records in the second table
SELECT Customers.CustomerID, Customers.CustomerName, Orders.OrderID, Orders.OrderDate
FROM Customers
CROSS JOIN Orders;

-- FULL JOIN: Selects all records that match either left or right table records
SELECT Customers.CustomerID, Customers.CustomerName, Orders.OrderID, Orders.OrderDate
FROM Customers
FULL JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

-- Cleanup: Drop the sample tables
DROP TABLE Orders;
DROP TABLE Customers;

Snippet —

Data Profiling

You can also profile your data in order to get information about your data in Power BI.

Check the Column distribution and profile to see summary statistics of the data.

Pic credits : Power BI docs

Some of the most important PowerBI commands —

  1. The “Get Data” command: This command is used to import data into Power BI from various sources such as Excel, CSV, SQL Server, and more.
  2. The “Transform Data” command: This command, also known as the Query Editor, allows you to clean and transform your data before it is used in your Power BI report. This includes tasks such as filtering, pivoting, splitting columns, and changing data types.
  3. The “Visualizations” command: This command allows you to create and customize various types of charts and visuals such as bar charts, line charts, pie charts, and more.
  4. The “Measures” command: This command is used to create calculations and aggregations using DAX (Data Analysis Expressions) formulas. These measures can be used in visuals and tables to show the summarized data.
  5. The “Format” command: This command allows you to customize the look and feel of your visuals, including things like color, font size, and axis labels.
  6. The “Page layout” command: This command allows you to arrange and organize your visuals on a Power BI report page.
  7. The “Publish” command: This command allows you to publish your Power BI report to the Power BI service, where it can be shared and accessed by others.
  8. “Data Modeling” command : This command allows you to create and manage relationships between tables, create calculated columns and tables, and create hierarchies.
  9. “Analyze” command : This command allows you to perform advanced analytics tasks such as forecasting, clustering, and key drivers analysis.
  10. “Dashboard” command : This command allows you to create, manage, and share interactive dashboards that display multiple visuals and reports on a single screen.

A project video covering Power BI, Power BI — Data Analysis Expressions, Joins, Data Profiling coming soon ( subscribe today) —

That’s it for now.

Find Day 28 Below :

Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time

Stay Tuned!!

Read more —

All the Complete System Design Series Parts —

1. System design basics

2. Horizontal and vertical scaling

3. Load balancing and Message queues

4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture

5. Caching, Indexing, Proxies

6. Networking, How Browsers work, Content Network Delivery ( CDN)

7. Database Sharding, CAP Theorem, Database schema Design

8. Concurrency, API, Components + OOP + Abstraction

9. Estimation and Planning, Performance

10. Map Reduce, Patterns and Microservices

11. SQL vs NoSQL and Cloud

12. Most Popular System Design Questions

Github —

For Python Projects —

For complete 60 days of Data Science and ML : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Follow for more updates. Stay tuned and keep coding!

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Recurrent Neural Network with Keras

Clustering Geolocation Data in Python using DBSCAN and K-Means

Facial Expression Recognition using Keras

Hyperparameter Tuning with Keras Tuner

Custom Layers in Keras

Data Science
Machine Learning
Tech
Programming
Software Development
Recommended from ReadMedium