Naina Chaturvedi

Summary

The content provides a comprehensive guide to Day 2 of the 30-day Machine Learning Operations (MLOps) series, covering Python basics, Pandas and NumPy data manipulation, data preprocessing, and an introduction to MLOps concepts and tools.

Abstract

The web content outlines the curriculum for the second day of a 30-day MLOps learning series, which focuses on essential Python tools and libraries for data manipulation and preprocessing. It emphasizes the importance of understanding Pandas for data analysis, highlighting key operations such as indexing, filtering, merging, and aggregating data within DataFrames. The article also introduces the reader to the broader context of MLOps, including data cleaning, handling missing data, noise reduction, and outlier detection, while providing resources for further learning in data science, machine learning, and related programming techniques. The content serves as a practical roadmap for individuals looking to enhance their skills in the field of MLOps, with a strong emphasis on hands-on project work and the utilization of Jupyter notebooks for implementation.

Opinions

  • The author believes in the significance of a strong foundation in Python, Pandas, and NumPy for practitioners in the field of MLOps.
  • There is an emphasis on the practical application of skills through projects, suggesting that hands-on experience is crucial for learning MLOps.
  • The content suggests that understanding and effectively managing data is a fundamental aspect of MLOps.
  • The author values the use of Jupyter notebooks as a tool for learning and implementing machine learning operations.
  • There is a clear endorsement of continuous learning and skill development, as evidenced by the inclusion of various resources and follow-up materials for further study.
  • The author conveys the idea that staying updated with the latest research papers and industry practices is important for MLOps professionals.
  • The content reflects a belief in the importance of data cleaning and preprocessing as critical steps in the machine learning pipeline.
  • The inclusion of multiple coding examples and links to additional learning materials indicates the author's commitment to providing comprehensive educational content.

Day 2 of 30 days of Machine Learning Ops

With examples and projects…

Pic credits: freecodecamp

Welcome back peeps to the Day 2 of 30 days of MLOps. You can find Day 1 below —

On Day 2, we will be covering —

1. Python

2. Pandas and Numpy — Part 1


Our whole syllabus for 30 days of MLOps is as follows —

1. MLOps Basics and Principles

What is MLOps?

Purpose

What’s important?

2. Data

Complete Python with projects

Pandas and Numpy

Exploratory Data Analysis

Data preprocessing ( Collecting, Labeling and Validating data)

Data Labelling and Advanced Data Labeling Methods

Data Splitting

Feature Engineering

Data Augmentation

3. Modeling

Model Training and Evaluation

Model Baselines

Model Tuning and Optimization

Model Review and governance

Automated Model retraining

Model Deployment and monitoring

Model Inference and Serving

Model Resource Management Techniques

Model Analysis

High-Performance Modeling

4. Developing

End-to-End ML Workflow Cycle

ML workflows

MLOps Logging and Documentation

MLOps Makefile

ML Lake

ML Pipelines and toolkits

MLOps tools and Frameworks

5. Testing and Reproducibility

Git

Versioning

Docker

6. Production

Continuous Integration

Continuous Delivery and Deployment

Monitoring and Logging

Feature Stores

MLOps architecture and Infrastructure Stack

Model Serving Patterns and Infrastructures

Model fairness, explainability issues, and mitigating bottlenecks

7. MLOps (Amazing) Papers

Some amazing MLOps research papers that I have read over the years to help you boot up to the industry standards and what’s next in this field.

Let’s get started with Day 2!

In Pandas, we will be covering the most important functions —

Pandas Series and DataFrame

Data pre-processing, Data Cleaning, Missing data, Noise and outlier detection

Indexing

Filtering

Merging

Join

Hierarchical Indexing

Transforming Data Frame

Aggregate Functions

Let’s dive in!!

Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool. It is built on top of NumPy and provides an efficient implementation of a DataFrame.

DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

Pandas —

It’s an open source package for the Python programming language for data manipulation, analysis and ML tasks

It is built on top of another package named NumPy, which provides support for mathematical computations and multi-dimensional arrays.


Pandas Series and DataFrame

Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). A Series exposes both its values and the index labels associated with them.

Pic credits : Codebug

Pandas DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled axes (rows and columns), i.e. data is aligned in a tabular fashion in rows and columns.

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows -

To create Pandas Series —

pd.Series(data, index=index)

Example -

import pandas as pd

s = pd.Series([1, 1.5, 1.75])

Pandas DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

To create Pandas DataFrame —

pd.DataFrame(data, index=index)

Example -

pd.DataFrame({"Age": [21, 11, 13]}, index=["Allen", "Rob", "Harold"])

A Pandas Index can be thought of either as an immutable array or as an ordered set.

Example -

i = pd.Index([2, 3, 5, 7, 11])
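As a small sketch of what "immutable array" and "ordered set" mean in practice, the snippet here tries set operations and an in-place assignment on an Index. The second index `j` is made up purely for illustration.

```python
import pandas as pd

i = pd.Index([2, 3, 5, 7, 11])
j = pd.Index([1, 3, 5, 7, 9])  # hypothetical second index, for illustration only

# An Index supports set-style operations.
common = i.intersection(j)
combined = i.union(j)

# An Index is immutable: item assignment raises a TypeError.
try:
    i[0] = 1
    mutated = True
except TypeError:
    mutated = False

print(common.tolist(), combined.tolist(), mutated)
```

Immutability is what makes it safe to share one Index between several Series and DataFrames.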

Data Processing

It’s the process of converting data into a usable and desired form. Data processing starts with data in its raw form and converts it into a more readable format (image, graph, table, vector file, audio, charts etc.)


Three types of Data Processing : Manual data processing, Mechanical data processing and Electronic data processing

Various tools —

Calculation and Analysis tools — Excel and Calculators — tools that help in applying relevant formulas to process the whole data

Statistical Tools — SAS

Database tools — Oracle, MongoDB, Hadoop etc. that help in processing large amounts of data

Data Cleaning

Data Cleaning is the process of correcting or removing incorrect, incomplete, or duplicate data within a given dataset. Proper data cleaning can make or break your project. Hence, data science professionals usually spend a very large portion of their time on Data Cleaning.

Pic credits : code2tut

The golden rule is — Better data beats fancier algorithms

Ask Questions -

Completeness: Does the given data include all required information?

Validity: Does the given data correspond with business rules and/or restrictions?

Uniformity: Is the given data specified using consistent units of measurement?

Consistency: Is the given data consistent across your datasets?

Accuracy: Is the given data close to the true values?

Data Cleaning is an important process and it starts with removing unwanted samples/observations in the given dataset
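A minimal sketch of that first step, assuming the "unwanted observations" are exact duplicate rows. The toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the second "Kate" row is an exact duplicate.
raw = pd.DataFrame({
    "Name": ["Josh", "Kate", "Kate", "Tim"],
    "Age": [24, 31, 31, 28],
})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(raw.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
cleaned = raw.drop_duplicates().reset_index(drop=True)
print(n_duplicates)
print(cleaned)
```

Irrelevant observations (rows that do not belong to the problem at hand) would be removed with an ordinary boolean filter instead.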

Missing Data

Missing data is data that was not captured for a variable in a given observation. If missing values are not handled properly, a data science professional may end up drawing inaccurate inferences from the data. Missing data also reduces the statistical power of the analysis, which can distort the validity of the results.

Pic credits : Kaggle

Hence, it is very important to handle missing data because any statistical results based on a dataset with non-random missing values could be biased and lead to inaccurate results in the end

Ways to Handle Missing Values

Drop missing values

Ignore tuples with missing values

Imputation etc
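The options above can be sketched in pandas as follows. The scores are made-up values, and mean imputation is only one of many possible imputation strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value per column.
scores = pd.DataFrame({
    "math": [90.0, np.nan, 70.0, 80.0],
    "physics": [85.0, 75.0, np.nan, 95.0],
})

# Count the missing values in each column.
missing_per_column = scores.isna().sum()

# Option 1: drop every row that contains a missing value.
dropped = scores.dropna()

# Option 2: impute, here by filling each gap with its column mean.
imputed = scores.fillna(scores.mean())

print(missing_per_column)
print(dropped)
print(imputed)
```

Dropping is simple but throws away two of the four rows here, which is why imputation is often preferred on small datasets.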

Noisy Data

Noise is unwanted or meaningless data (items, features or records) that does not help explain the feature itself or the relationship between feature and target. Noisy data in a data set can significantly hamper the extraction of meaningful information and cause algorithms to miss patterns in the data, which lowers classification accuracy and leads to poor predictions. Noise can take the form of anomalies in features or target, irrelevant or weak features, and noisy records.

Pic credits : i2tut

Therefore, it is important for any data scientist to identify and reduce noise before applying an algorithm to noisy data.

Techniques to handle Noisy data —

Binning

Regression

Clustering
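Of these three techniques, binning is the easiest to sketch in pandas. The example assumes equal-frequency bins and smoothing by bin means, which is one of several binning variants. The prices are made-up values.

```python
import pandas as pd

# Hypothetical sorted prices to be smoothed.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency binning: three bins holding three values each.
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")

print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))
```

Each value is pulled towards its neighbours, so small random fluctuations inside a bin disappear while the overall trend is kept.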

Outlier Detection

An outlier is an observation that diverges from the overall pattern of a sample. Outliers are extreme values that deviate from the other observations in the data; they may indicate variability in a measurement or an experimental error. The presence of outliers in a classification or regression dataset can result in a poor fit and lower predictive modeling performance. Outliers can also skew statistical measures and data distributions, giving a misleading representation of the underlying data and relationships. Outlier detection is the technique of detecting, and often subsequently excluding, outliers from a given set of data.

Pic credits : Geeksforgeeks

Types of Outliers —

Global Outliers : a data point whose value is far outside the entirety of the data set

Contextual Outliers : a data point whose value deviates significantly from the rest of the data points in the same context

Collective Outliers : a group of data points that, taken together, deviate significantly from the entire data set
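One common way to flag global outliers is the interquartile range (IQR) rule. That choice is an assumption here, since the article does not prescribe a method, and the 1.5 multiplier is only a convention. The salaries are made-up values.

```python
import pandas as pd

# Hypothetical salaries with one extreme value.
salaries = pd.Series([12000, 23000, 18000, 3900000, 19000, 24000])

# The IQR is the spread of the middle 50% of the data.
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (salaries < lower) | (salaries > upper)

outliers = salaries[is_outlier]
kept = salaries[~is_outlier]
print(outliers)
```

Contextual and collective outliers need more than a single threshold, for example a comparison within each group or time window.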

One hot encoding

One hot encoding is used for treating categorical variables. One hot encoding creates new (binary) columns, indicating the presence of each possible value from the original data

Pic credits : codeio

It simply creates additional features based on the number of unique values in the categorical feature

Older versions of scikit-learn's OneHotEncoder accepted only integer-coded categories, so string values had to be label encoded first. Since scikit-learn 0.20, OneHotEncoder accepts string categories directly.

One hot encoding makes our training data more useful and expressive, and it can be rescaled easily
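pandas can do this directly with get_dummies(). A minimal sketch follows; the toy column is hypothetical, and dtype=int is chosen only so the indicator columns show up as 0/1 rather than True/False.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"Department": ["IT", "Finance", "IT", "HR"]})

# One new binary column is created per unique category.
encoded = pd.get_dummies(df, columns=["Department"], dtype=int)
print(encoded)
```

Every row has exactly one 1, in the column that matches its original category.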

Label Encoding

Label Encoding is used to handle categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering.

Sklearn provides a method for encoding the categories of categorical features into numeric values

Label encoder encodes labels with a value between 0 and n-1, where n is the number of distinct labels

It can be implemented by importing the LabelEncoder class from the preprocessing module of the sklearn package.
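To keep the example dependency-free, here is a pandas-only sketch of the same idea using categorical codes; it is a stand-in, not the sklearn API. LabelEncoder().fit_transform() from sklearn.preprocessing assigns integers to the sorted labels in the same way. The department names are made up.

```python
import pandas as pd

# Hypothetical categorical values.
departments = pd.Series(["IT", "Finance", "Supply Chain", "Finance"])

# Converting to a categorical sorts the unique labels alphabetically
# and assigns each one an integer code from 0 to n-1.
cat = departments.astype("category")
codes = cat.cat.codes

# Mapping from integer code back to the original label.
mapping = dict(enumerate(cat.cat.categories))

print(codes.tolist())
print(mapping)
```

Keep the mapping around: it is needed to translate model predictions back into readable labels.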

Pandas Series

Pandas Series is a one-dimensional labeled array capable of holding data of any type.

Pic credits : Codebug

s = pd.Series([100,290,40,199,76])
s

Output —

0    100
1    290
2     40
3    199
4     76
dtype: int64

To check the type —

type(s)

Output —

pandas.core.series.Series

Series.axes attribute returns a list of row axis labels of the given Series object.

s.axes

Output —

[RangeIndex(start=0, stop=5, step=1)]

Checking the DataType of the Series

s.dtype

Output —

dtype('int64')

Series.size — Size attribute returns the number of elements in the underlying data for the given series objects.

s.size

Output —

5

ndim attribute returns the number of dimensions of the underlying data, by definition it is 1 for series objects.

s.ndim

Output —

1

Series.values attribute returns the Series as an ndarray or ndarray-like, depending on the dtype.

s.values

Output —

array([100, 290,  40, 199,  76], dtype=int64)

We can also use strings (or other objects) as index labels.

s1 = pd.Series([1,2,4,5,6],index = ["First","Zero","Second","Third","Fourth"])
s1

Output —

First     1
Zero      2
Second    4
Third     5
Fourth    6
dtype: int64

If we use string-based index labels and call sort_index() on the Series, the elements are arranged alphabetically by label.

s1.sort_index()

Output —

First     1
Fourth    6
Second    4
Third     5
Zero      2
dtype: int64

Creating Series with Dictionaries

ages = {'Andrew':31,"Kate":45,"Matthew":26,"Helen":19}
new_ages = pd.Series(ages)
new_ages

Output —

Andrew     31
Kate       45
Matthew    26
Helen      19
dtype: int64

If we only want to select particular elements from the dictionary, we can pass their keys as the index.

pd.Series(ages,index =["Andrew","Helen"])

Output —

Andrew    31
Helen     19
dtype: int64

Creating Pandas Series by Numpy Arrays

import numpy as np

We can also create series using numpy.

n_one = np.array([1,2,3,4])
pd.Series(n_one)

Output —

0    1
1    2
2    3
3    4
dtype: int32

Merging Two Series (Concat)

s1 = pd.Series([2,3,55,2,6,44]) 
s2 = pd.Series([42,32,34,2,1,4,42])
pd.concat([s1,s2])

Output —

0       2
1       3
2      55
3       2
4       6
5      44
0      42
1      32
2      34
3       2
4       1
5       4
6      42
dtype: int64

We can use slicing and other selectors to pick specific elements from the Series.

l = pd.Series([11,12,13,14,15,16])
l[0:3]

Output —

0    11
1    12
2    13
dtype: int64

Pandas DataFrame

Pic credits : Golinux

Creating a DataFrame

names = {"Names":["Allen","Rob","Harold","Amy"],"Age":[21,11,13,15]}
# Creating a DataFrame using a Dictionary.
new_dic = pd.DataFrame(names)
new_dic["Age"]

Output —

0    21
1    11
2    13
3    15
Name: Age, dtype: int64

We can also Assign Column name —

var = [10,30,20,89,48,40]
df = pd.DataFrame(var,columns = ["Variables"])

We can also create DataFrames from Numpy —

arr = np.random.randint(10,size = (5,2))
arr

Output —

array([[5, 0],
       [6, 3],
       [8, 0],
       [2, 2],
       [8, 0]])

We can assign column names —

new_arr= pd.DataFrame(arr,columns = ["Var1","Var2"])

DataFrame.axes attribute returns a list containing the row axis labels and the column axis labels of the given DataFrame.

new_arr.axes

Output —

[RangeIndex(start=0, stop=5, step=1), Index(['Var1', 'Var2'], dtype='object')]

To determine shape —

new_arr.shape

Output —

(5, 2)

Checking the Dimension of the DataFrame

new_arr.ndim

Output —

2

Checking the total number of elements in the DataFrame

new_arr.size

Output —

10

Getting the Columns Names from the DataFrame

new_arr.columns

Output —

Index(['Var1', 'Var2'], dtype='object')

Index — The index (row labels) of the DataFrame. Its length tells us how many rows the DataFrame has.

new_arr.index

Output —

RangeIndex(start=0, stop=5, step=1)

Values — DataFrame.values attribute returns a NumPy representation of the given DataFrame.

new_arr.values

Output —

array([[5, 0],
       [6, 3],
       [8, 0],
       [2, 2],
       [8, 0]])

Accessing the columns and rows of the DataFrame

dfc = pd.DataFrame({"Name":["Josh","Rachel","Tim","Kate","Zach","Andrew"],"Age":[11,13,16,12,14,18],"Salary":[10000,23000,18000,3900000,19000,24000]})

A column can be accessed as an attribute or with square brackets:

dfc.Age

Output —

0    11
1    13
2    16
3    12
4    14
5    18
Name: Age, dtype: int64

Now if we want to access a specific row of that column —

dfc.loc[3, "Age"]

Output —

12

Filtering

Pic credits : SDENotes

employees = pd.DataFrame({"Name":["Josh","Mike","Julia","Sergio"],
                          "Department":["IT","Human Resources","Finance","Supply Chain"],
                          "Income":[4800,5200,6600,5700],
                          "Age":[24,28,33,41]})
employees

Now, if we want to check for a specific Department —

employees["Department"] == "IT"

Output —

0     True
1    False
2    False
3    False
Name: Department, dtype: bool

We can also use the loc[] indexer, which lets us combine a row condition with a column selection

employees.loc[employees["Department"] == "IT","Name"]

Output —

0    Josh
Name: Name, dtype: object

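Now if we want to select employees based on an arithmetic condition on their income

employees[employees["Income"] >5500]

This returns the rows for Julia and Sergio.

Conditions can be combined with | (or) and & (and); each condition needs its own parentheses.

employees[(employees["Age"]>30) | (employees["Department"] == "Human Resources")]

This returns the rows for Mike, Julia and Sergio.

To get opposite of a filter use ~(Tilde) sign —

employees[~(employees["Age"]<35)]

This returns the row for Sergio.

Filtering with the filter() function, which selects columns by label —

employees.filter(items=["Department","Name","Income"])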

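Adding Rows — concat()

DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so new rows are added with pd.concat() instead.

pd.concat([employees, pd.DataFrame([{"Name":"Romeo"}])], ignore_index=True)

The new row is added to the end of the DataFrame. Any column we do not supply is filled with NaN, so we usually pass all the values.

pd.concat([employees, pd.DataFrame([{"Name":"Romeo","Age":26,"Department":"IT","Income":5500}])], ignore_index=True)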

Removing Rows —

employees.drop(employees[employees["Age"]>30].index)

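This returns a new DataFrame without the employees older than 30 (Julia and Sergio). The original employees DataFrame is left unchanged.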

Joins

Used to merge DataFrames.

Pic credits : gfg

Inner Join :- Returns records that have matching values in both tables.

Left Join :- Returns all records from the left table, and the matched records from the right table.

Right Join :- Returns all records from the right table, and the matched records from the left table.

Full Join :- Returns all records from both tables, matched where possible.

Cross Join :- Returns all possible combinations of rows from two tables.

Inner Join —

c1 = pd.DataFrame({"Name":['Amy','Allen','Alice','Anderson','Amanda'],"Age":[21,22,26,29,32],"Roll Number":[12,19,29,10,8]})
c2 = pd.DataFrame({"Marks":[90,89,82,98,85],"Roll Number":[1,90,29,48,67]})

Use how="inner". pd.merge() matches rows on the columns the two frames share, here Roll Number.

pd.merge(c1,c2,how ="inner")

Note that pd.concat([c1,c2],join= "inner") is not a key-based join: it stacks the rows of both frames and keeps only the columns they have in common.

Full Join — Returns all records from both tables, with NaN where there is no match.

pd.merge(c1,c2,how ="outer")

Left Join — Returns all the rows from the left table, not just the rows in which the key columns match.

pd.merge(c1,c2,how ="left")

Right Join :- Returns all records from the right table, and the matched records from the left table.

pd.merge(c1,c2,how ="right")
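To see which rows each kind of join keeps, here is a small sketch with an explicit join key. The two frames are cut-down, hypothetical versions of the tables above; the indicator=True flag adds a _merge column that records where each row came from.

```python
import pandas as pd

# Hypothetical, cut-down student and marks tables.
students = pd.DataFrame({"Name": ["Amy", "Allen", "Alice"],
                         "Roll Number": [12, 19, 29]})
marks = pd.DataFrame({"Marks": [82, 98],
                      "Roll Number": [29, 48]})

# Inner join: only roll numbers present in both tables survive.
inner = pd.merge(students, marks, on="Roll Number", how="inner")

# Outer (full) join: every roll number from either table is kept.
outer = pd.merge(students, marks, on="Roll Number", how="outer", indicator=True)

print(inner)
print(outer)
```

Naming the key with on= makes the join explicit instead of relying on whichever column names happen to overlap.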

Aggregate Functions

  • sum() : To compute the sum of each column (or of one selected column)
  • min() : To compute minimum value of each column
  • max() : To compute maximum value of each column
  • std() : To compute standard deviation of each column
  • var() : To compute variance of each column
  • describe() : To compute a statistical summary
  • count() : To count the non-missing values in each column
  • value_counts() : To count how often each value occurs in a column
  • mean() : To compute mean of each column
  • median() : To compute median of each column

Implementation —

# Create the employees DataFrame
employees = pd.DataFrame({"Name":["A","B","C","D","E","F"],"Department":["Finance","Human Resources","Finance","Supply Chain","IT","Marketing"],"Income":[3000,6000,8000,5500,2300,4400],"Age":[20,25,30,40,21,42]})
employees.count()
employees["Department"].value_counts()
# numeric_only=True skips the text columns, which would otherwise raise a TypeError in pandas 2.x
employees.mean(numeric_only=True)
employees["Income"].sum()
employees["Age"].min()
employees["Age"].max()
employees["Age"].std()
employees.var(numeric_only=True)
employees.describe()
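Several of these aggregates can also be computed in one call with agg(). A minimal sketch, reusing the Income and Age values from the implementation above:

```python
import pandas as pd

employees = pd.DataFrame({
    "Income": [3000, 6000, 8000, 5500, 2300, 4400],
    "Age": [20, 25, 30, 40, 21, 42],
})

# A dict maps each column to the list of aggregates wanted for it.
summary = employees.agg({"Income": ["sum", "mean"], "Age": ["min", "max"]})
print(summary)
```

Cells for aggregates that were not requested for a column are filled with NaN.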

Transforming Data Frames

DataFrame.transform() applies a function to a DataFrame and returns a DataFrame of transformed values that has the same axis length as the original.

Syntax: df.transform(func, axis=0, *args, **kwargs)

where func is the function used to transform the data, axis is 0 to apply the function to each column and 1 to apply it to each row, *args are positional arguments and **kwargs are keyword arguments passed on to func.

Implementation —

import pandas as pd
  
df = pd.DataFrame({"x":[120, 40, 3, None, None,34], 
                   "y":[17, 12, None, 23, None,56], 
                   "z":[200, 216, 101, None, 8,78], 
                   "a":[114, 31, None, 12, 63,32]}) 
  
index_ = ['R1', 'R2', 'R3', 'R4', 'R5','R6']
df.index = index_
res = df.transform(func = ['log', 'exp'])
  
print(res)

Output —

Grouping

  • Split Object
  • Applying groupby Function

employees = pd.DataFrame({"Name":["A","B","C","D","E","F"], "Department":["Finance","Human Resources","Finance","Supply Chain","IT","Marketing"], "Income":[3000,6000,8000,5500,2300,4400], "Age":[20,25,30,40,21,42]})
emp = employees.groupby("Department")
# numeric_only=True leaves out the Name column, which has no mean
emp.mean(numeric_only=True)
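The split and apply steps above can be pushed a little further with named aggregation, where each output column gets its own source column and function. A sketch on the same employees data; the output column names are made up for illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Department": ["Finance", "Human Resources", "Finance",
                   "Supply Chain", "IT", "Marketing"],
    "Income": [3000, 6000, 8000, 5500, 2300, 4400],
    "Age": [20, 25, 30, 40, 21, 42],
})

# Split by Department, then compute one aggregate per output column.
by_dept = employees.groupby("Department").agg(
    headcount=("Name", "count"),
    avg_income=("Income", "mean"),
    max_age=("Age", "max"),
)
print(by_dept)
```

Finance is the only department with two employees, so it is the only row where the mean differs from a single person's income.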

Hierarchical indexing

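Hierarchical indexing (MultiIndex) is the technique in which an axis is indexed by more than one level. The set_index() function builds a hierarchical index when it is given a list of existing column names, and pd.MultiIndex.from_product() builds one directly from the level values.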

Implementation —

index = pd.MultiIndex.from_product([[2020, 2021], [3, 4]],
                                   names=['year', 'round'])
columns = pd.MultiIndex.from_product([['Claire', 'Kassi', 'Suer'], ['Engg', 'Maths']],
                                     names=['subject', 'class'])
data = np.round(np.random.randn(4, 6), 1)
data[:, ::3] *= 5
data += 19
df = pd.DataFrame(data, index=index, columns=columns)
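Since the text mentions set_index(), here is a small sketch of that route: two ordinary columns become the levels of a hierarchical row index. The scores are made-up values.

```python
import pandas as pd

# Hypothetical flat table with one row per (year, round).
results = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "round": [3, 4, 3, 4],
    "score": [18.5, 19.0, 20.5, 17.0],
})

# Passing a list of columns to set_index() creates a MultiIndex.
indexed = results.set_index(["year", "round"])

# Select every round of one year via the outer level...
rounds_2020 = indexed.loc[2020]

# ...or one exact cell with a tuple of labels.
single_score = indexed.loc[(2021, 4), "score"]

print(indexed)
print(single_score)
```

reset_index() reverses the operation and turns the index levels back into columns.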

Indexing data frames

Indexing means selecting all or particular rows and columns of data from a DataFrame. In pandas it is done mainly with two indexers:

.loc[] : label based

It accepts a scalar label, a list of labels, a slice of labels, a boolean array, etc.

.iloc[] : integer position based

.ix[] : mixed label and integer based. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use .loc[] or .iloc[] instead.

Implementation —

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4, 3),
                  index = ['a','b','c','d'], columns = ['X', 'Y', 'Z'])
print (df.loc['c']> 0)

Output (the values vary from run to run because the data is random) —

X    False
Y     True
Z     True
Name: c, dtype: bool

Implementation —

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['X', 'Y', 'Z', 'A'])

# Slicing through list of values
print (df.iloc[[1, 2, 3], [1, 3]])

Output —

          Y         A
1  0.566221  1.934828
2 -1.814986 -1.829436
3 -0.264360  0.860286
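The random data makes the two indexers hard to compare, so here is a sketch on a small fixed frame (made-up values) that selects the same cells once by label and once by position.

```python
import pandas as pd

# Hypothetical fixed data so the results are reproducible.
df = pd.DataFrame({"X": [1, 2, 3, 4], "Y": [10, 20, 30, 40]},
                  index=["a", "b", "c", "d"])

# .loc[] slices by label and includes both endpoints.
by_label = df.loc["b":"c", "Y"]

# .iloc[] slices by position and excludes the stop position.
by_position = df.iloc[1:3, 1]

print(by_label)
print(by_position)
```

The inclusive label slice versus the exclusive positional slice is the difference that most often causes off-by-one mistakes.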


Keep learning and coding :)
