avatarNaina Chaturvedi

Summary

Day 15 of the "30 days of Data Engineering Series with Projects" focuses on advanced Pandas techniques for data preprocessing, including data splitting, binning, mean imputation, interpolation, and data combination methods.

Abstract

The article is part of a comprehensive series aimed at enhancing data engineering skills through practical projects. On Day 15, the focus is on advanced data manipulation techniques using the Pandas library in Python. It covers essential preprocessing steps such as splitting datasets into training and testing sets, categorizing continuous data through binning, handling missing values with mean imputation and interpolation, and merging datasets using concatenation and join operations. The day's curriculum builds upon the foundational knowledge from previous days, emphasizing the importance of data cleaning and preparation in the machine learning pipeline. The article provides code snippets and examples using the Iris dataset to illustrate these concepts, ensuring readers can apply these techniques to real-world data.

Opinions

  • The author emphasizes the necessity of mastering Pandas for effective data engineering and machine learning tasks.
  • Practical implementation is encouraged, with the author suggesting readers load their own data to practice the techniques discussed.
  • The article promotes the use of the author's YouTube channel and newsletter for additional learning resources and project implementations.
  • There is an acknowledgment of the messy nature of real-world data, highlighting the importance of preprocessing skills.
  • The author provides a roadmap for learning, suggesting that readers should complete previous days' content to fully grasp the advanced techniques covered on Day 15.
  • The inclusion of system design case studies and other related series suggests a holistic approach to learning data engineering and system design together.
  • The author's approach to education includes a combination of theoretical knowledge and hands-on practice, as evidenced by the provision of code examples and encouragement to subscribe to channels offering video tutorials.

Day 15 of 30 days of Data Engineering Series with Projects

Welcome back peeps to Day 15 of Data Engineering Series with Projects!

In this we will cover —

Advanced Pandas Techniques

Split data using pandas

Binning Data

Mean Imputation and Interpolate method

Combining Data using Concat and Join

Pre-requisite to Day 15 is to complete Day 1–14( link below):

Day 1 : What’s Data Engineering, Why Data Engineering, Data Engineers — ML Engineers — Data Scientists, Purpose and Scope

Day 2 : Complete Python for Data Engineering — Part 1

Day 3 : Complete Advanced Python for Data Engineering — Part 2

Day 4: Techniques to write efficient and Optimized Code

Day 5 : SQL

Day 6 : Advanced SQL

Day 7 : BigQuery and SQL vs NOSQL databases

Day 8 : Advanced Functions

Day 9 : Query Optimizations

Day 10 : MySQL and PostgreSQL

Day 11: Shell scripting and Linux “touch” command

Day 12 : Map Reduce, Data Warehouse, Data Lakes

Day 13: Pandas, Pandas, Data Cleaning and processing, Outlier Detection, Noisy Data, Missing Data, Pandas Functions, Aggregate Functions, Joins

Day 14 : Numpy

Day 15 : Advanced Pandas Techniques

Projects Videos —

All the projects, data structures, SQL, algorithms, system design, Data Science and ML , Data Analytics, Data Engineering, , Implemented Data Science and ML projects, Implemented Data Engineering Projects, Implemented Deep Learning Projects, Implemented Machine Learning Ops Projects, Implemented Time Series Analysis and Forecasting Projects, Implemented Applied Machine Learning Projects, Implemented Tensorflow and Keras Projects, Implemented PyTorch Projects, Implemented Scikit Learn Projects, Implemented Big Data Projects, Implemented Cloud Machine Learning Projects, Implemented Neural Networks Projects, Implemented OpenCV Projects,Complete ML Research Papers Summarized, Implemented Data Analytics projects, Implemented Data Visualization Projects, Implemented Data Mining Projects, Implemented Natural Leaning Processing Projects, MLOps and Deep Learning, Applied Machine Learning with Projects Series, PyTorch with Projects Series, Tensorflow and Keras with Projects Series, Scikit Learn Series with Projects, Time Series Analysis and Forecasting with Projects Series, ML System Design Case Studies Series videos will be published on our youtube channel ( just launched).

Subscribe today!

Tech Newsletter —

If you are interested, you can join my newsletter through which I send tech interview tips, techniques, patterns, hacks — Software Development, ML, Data Science, Startups and Technology projects to more than 30K readers. You can subscribe to Ignito:

System Design Case Studies — In Depth

Design Instagram

Design Netflix

Design Reddit

Design Amazon

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Amazon Prime Video

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Most Popular System Design Questions

Mega Compilation : Solved System Design Case studies

This is Day 15 of 30 days of Data Engineering Series where we will be covering —

Advanced Pandas Techniques

Let’s get started!

Real world data is messy ( it contains null values, noises, missing values etc) and in most of the cases when we start the building the ML model we need to clean, format and pre process the data.

  • Splitting data using pandas can be done using the train_test_split function from the sklearn.model_selection module. This function allows you to split a dataset into training and testing sets.
  • Binning data involves grouping a set of continuous or numerical data into a smaller number of discrete “bins” or ranges. This can be done using the cut or qcut functions in pandas, which allow you to specify the number of bins and the labels for each bin.
  • Mean imputation is a method of replacing missing values with the mean value of the dataset. This can be done using the fillna function in pandas and passing in the mean value of the dataset.
  • Interpolation is a method of estimating missing values by taking the average of the values on either side of the missing data point. This can be done using the interpolate function in pandas, which can use various interpolation methods such as linear or polynomial.
  • Combining data using pandas can be done using the concat function to concatenate multiple dataframes along a particular axis, or the join function to join two dataframes on a common column or index.

Code Implementation —

import pandas as pd
import numpy as np

# Splitting Data using pandas

# Creating a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Mike', 'Emily', 'David'],
    'Age': [25, 32, 45, 28, 36],
    'City': ['New York', 'London', 'Paris', 'Sydney', 'Tokyo']
}

df = pd.DataFrame(data)

# Splitting the DataFrame into two based on condition
df1 = df[df['Age'] < 35]
df2 = df[df['Age'] >= 35]

# Displaying the split DataFrames
print("Split DataFrames:")
print(df1)
print(df2)

# Binning Data

# Creating a sample DataFrame
data = {
    'Height': [160, 172, 183, 155, 168, 178, 162, 170],
    'Category': ['Short', 'Average', 'Tall', 'Short', 'Average', 'Tall', 'Short', 'Average']
}

df = pd.DataFrame(data)

# Defining the bin edges
bins = [150, 165, 180, np.inf]

# Applying binning using cut() function
df['Height Category'] = pd.cut(df['Height'], bins=bins, labels=['Short', 'Average', 'Tall'])

# Displaying the binned DataFrame
print("\nBinned DataFrame:")
print(df)

# Mean Imputation and Interpolate Method

# Creating a sample DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, 7, np.nan, 9],
    'C': [np.nan, 10, 11, 12, 13]
}

df = pd.DataFrame(data)

# Mean imputation using fillna() function
df_mean_imputed = df.fillna(df.mean())

# Interpolation using interpolate() function
df_interpolated = df.interpolate()

# Displaying the mean imputed and interpolated DataFrames
print("\nMean Imputed DataFrame:")
print(df_mean_imputed)
print("\nInterpolated DataFrame:")
print(df_interpolated)

# Combining Data using Concat and Join

# Creating sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})

df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

# Concatenating DataFrames vertically using concat() function
df_concat_vertical = pd.concat([df1, df2], axis=0)

# Concatenating DataFrames horizontally using concat() function
df_concat_horizontal = pd.concat([df1, df2], axis=1)

# Joining DataFrames using join() function
df_join = df1.join(df2)

# Displaying the concatenated and joined DataFrames
print("\nConcatenated DataFrame (Vertical):")
print(df_concat_vertical)
print("\nConcatenated DataFrame (Horizontal):")
print(df_concat_horizontal)
print("\nJoined DataFrame:")
print(df_join)

Snippet —

Before you start implementing these techniques, load the data of your choice in your working environment. For this post, I’m using Iris data.

Start with importing the necessary libraries such as Pandas and Numpy and loading your data set. Once done, dive in the techniques below —

1. Split data using pandas

In the code below, we are splitting the data into a random sample of rows and removing them from the original data after dropping index values.

iris_data_new= df.copy()
df1=iris_data_new.sample(frac=0.75,random_state=0)
iris_data_new=iris_data_new.drop(df1.index)
df2=iris_data_new.sample(frac=0.25,random_state=0)
iris_data_new=iris_data_new.drop(df2.index)
print(df1.shape)

Output —

(112, 5)

2. Binning Data

Binning is a technique to group/bin your data into multiple buckets which is very helpful if you dealing with continuous numeric data. In pandas you can bin the data using functions cut and cut. First check the shape of your data i.e no of rows and columns.

print(iris_data.shape)

Output —

(150, 5)

Then bin your data using qcut as shown below —

pd.qcut(df['sepal_width'],q=5).value_counts()

Output —

(2.7, 3.0]      50
(1.999, 2.7]    33
(3.1, 3.4]      31
(3.4, 4.4]      24
(3.0, 3.1]      12
Name: sepal_width, dtype: int64

3. Slicing using loc and iloc functions

You can do position based and label based slicing using iloc and loc functions respectively.

iris_data.loc[100:105, 'petal_length':'species']

Output —

iris_data.iloc[:4]

Output —

4. Mean Imputation and Interpolate method

Mean Imputation is a technique in which the missing value is replaced by the mean of available data in the chosen column.

First see if your data has missing values or not.

iris_data.isnull().sum()

Output —

Then calculate the mean and replace the missing value —

iris_data['sepal_width'].mean()

Output —

3.0516778523489942

Replace the missing value

iris_data['sepal_width'].fillna(iris_data['sepal_width'].mean(), inplace=True)
iris_data.isnull().sum()

Interpolate method —

iris_data['sepal_width'].fillna(iris_data['sepal_width'].interpolate(), inplace=True)

5. Combining Data using Concat and Join

Just like in numpy, pd.concat() function is used for concatenation of Series or DataFrame objects in pandas.

df4=pd.concat([df1,df2],axis=0)
print(df4)

Output —

Joins —

Merging and joining the data is one of the most important skill in the data science. Understanding and Implementing it right is crucial in order to analyze data well.

In this we will implement —

  • Inner Join : keep rows from both the tables/data frames based on the specified merge condition.
  • Full Join : keep all the rows form left table and right table with matched rows wherever possible and NaN’s elsewhere.
  • Left Join : keep all the rows form left table and wherever there are missing values in the right table, put it as NaN’s, based on the specified merge condition.
  • Right Join : keep all the rows form right table and wherever there are missing values in the left table, put it as NaN’s, based on the specified merge condition.
Source and credits: Stack Overflow
#Inner Join
df5=pd.merge(df1,df2,on='sepal_length')
print(df5)
#Full outer join
df6=pd.merge(df1,df2,how='outer')
print(df6)
#Left Join
df7=pd.merge(df1,df2,how='left')
print(df7)
#Right Join
df8=pd.merge(df1,df2,how='right')
print(df8)

Complete Code —

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Paris', 'London', 'Sydney'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Shape of the DataFrame
print("\nShape of DataFrame:", df.shape)

# Head of the DataFrame
print("\nFirst 2 rows of DataFrame:")
print(df.head(2))

# Tail of the DataFrame
print("\nLast 2 rows of DataFrame:")
print(df.tail(2))

# Accessing column names
print("\nColumn names:")
print(df.columns)

# Accessing a specific column
print("\nValues in 'Name' column:")
print(df['Name'])

# Summary statistics of the DataFrame
print("\nSummary statistics:")
print(df.describe())

# Sorting the DataFrame by a column
print("\nSorted DataFrame by 'Age' column:")
print(df.sort_values('Age'))

# Filtering the DataFrame based on a condition
print("\nFiltered DataFrame where Age > 30:")
filtered_df = df[df['Age'] > 30]
print(filtered_df)

# Grouping the DataFrame by a column and calculating mean
print("\nMean salary by city:")
grouped_df = df.groupby('City')['Salary'].mean()
print(grouped_df)

# Adding a new column to the DataFrame
df['Bonus'] = [2000, 2500, 3000, 1500]
print("\nUpdated DataFrame with 'Bonus' column:")
print(df)


# Create a sample DataFrame
data = {
    'Name': ['John', 'Alice', 'Bob', 'Emily'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Paris', 'London', 'Sydney'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Dropping columns from the DataFrame
columns_to_drop = ['Age', 'Salary']
df_dropped = df.drop(columns_to_drop, axis=1)
print("\nDataFrame after dropping columns:")
print(df_dropped)

# Renaming columns in the DataFrame
column_mapping = {'Name': 'Full Name', 'City': 'Location'}
df_renamed = df.rename(columns=column_mapping)
print("\nDataFrame after renaming columns:")
print(df_renamed)

# Merging two DataFrames based on a common column
data2 = {
    'Name': ['John', 'Alice', 'Bob', 'Dave'],
    'Experience': [2, 5, 7, 4]
}
df2 = pd.DataFrame(data2)

merged_df = pd.merge(df, df2, on='Name')
print("\nMerged DataFrame:")
print(merged_df)

# Handling missing values in the DataFrame
data_with_missing = {
    'Name': ['John', 'Alice', 'Bob', 'Emily'],
    'Age': [25, None, 35, 28],
    'City': ['New York', 'Paris', None, 'Sydney'],
    'Salary': [50000, None, 70000, 55000]
}
df_with_missing = pd.DataFrame(data_with_missing)

print("\nDataFrame with missing values:")
print(df_with_missing)

# Checking for missing values
print("\nMissing values in DataFrame:")
print(df_with_missing.isnull())

# Dropping rows with missing values
df_dropped_missing = df_with_missing.dropna()
print("\nDataFrame after dropping missing values:")
print(df_dropped_missing)

# Filling missing values with a specific value
df_filled = df_with_missing.fillna(0)
print("\nDataFrame after filling missing values:")
print(df_filled)

# Applying a function to a column in the DataFrame
df['Salary_With_Bonus'] = df['Salary'].apply(lambda x: x + 1000)
print("\nUpdated DataFrame with 'Salary_With_Bonus' column:")
print(df)

Snippet —

Code Implementation —

import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('sales_data.csv')

# Display the first 5 rows of the DataFrame
print("First 5 rows of the DataFrame:")
print(df.head())

# Shape of the DataFrame
print("\nShape of DataFrame:", df.shape)

# Summary statistics of the DataFrame
print("\nSummary statistics:")
print(df.describe())

# Accessing column names
print("\nColumn names:")
print(df.columns)

# Accessing a specific column
print("\nValues in 'Product_Category' column:")
print(df['Product_Category'])

# Filtering the DataFrame based on a condition
print("\nFiltered DataFrame where Total_Sales > 1000:")
filtered_df = df[df['Total_Sales'] > 1000]
print(filtered_df)

# Sorting the DataFrame by a column
print("\nSorted DataFrame by 'Date' column:")
sorted_df = df.sort_values('Date')
print(sorted_df)

# Grouping the DataFrame by a column and calculating sum
print("\nTotal Sales by Product Category:")
grouped_df = df.groupby('Product_Category')['Total_Sales'].sum()
print(grouped_df)

# Adding a new column to the DataFrame
df['Profit'] = df['Total_Sales'] - df['Total_Cost']
print("\nUpdated DataFrame with 'Profit' column:")
print(df)

# Dropping columns from the DataFrame
columns_to_drop = ['Customer_Name', 'Total_Cost']
df_dropped = df.drop(columns_to_drop, axis=1)
print("\nDataFrame after dropping columns:")
print(df_dropped)

# Handling missing values in the DataFrame
print("\nMissing values in DataFrame:")
print(df.isnull().sum())

# Filling missing values with a specific value
df_filled = df.fillna(0)
print("\nDataFrame after filling missing values:")
print(df_filled)

# Saving the updated DataFrame to a new CSV file
df.to_csv('updated_sales_data.csv', index=False)
print("\nUpdated DataFrame saved to 'updated_sales_data.csv'")

Snippet —

That’s it for now.

Find Day 16 Below —

Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time

Stay Tuned!!

Read more —

All the Complete System Design Series Parts —

1. System design basics

2. Horizontal and vertical scaling

3. Load balancing and Message queues

4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture

5. Caching, Indexing, Proxies

6. Networking, How Browsers work, Content Network Delivery ( CDN)

7. Database Sharding, CAP Theorem, Database schema Design

8. Concurrency, API, Components + OOP + Abstraction

9. Estimation and Planning, Performance

10. Map Reduce, Patterns and Microservices

11. SQL vs NoSQL and Cloud

12. Most Popular System Design Questions

Github —

Keep learning and coding ;)

Day 5 coming soon!

For Python Projects —

For complete 60 days of Data Science and ML : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Follow for more updates. Stay tuned and keep coding! Disclosure: Some of the links are affiliates.

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Recurrent Neural Network with Keras

Clustering Geolocation Data in Python using DBSCAN and K-Means

Facial Expression Recognition using Keras

Hyperparameter Tuning with Keras Tuner

Custom Layers in Keras

Data Science
Machine Learning
Tech
Programming
Software Development
Recommended from ReadMedium