Day 8 of 30 days of Data Analytics with Projects Series

Welcome back peeps. Happy to share that we have finished —

Finished Series —

15 days of Advanced SQL Series

30 days of Data Structures and Algorithms Series

14 System Design Case Studies Series

60 Days of Data Science and Machine Learning with projects Series

Complete System Design with most popular Questions Series

What’s covered in the Data Analytics Series till now —

Day 1 : Data Analytics basics and kickstart of Data analytics with projects series

Day 2: Business Understanding — Data Driven Decision Making, Descriptive Analysis, Predictive Analysis, Diagnostic Analysis, Prescriptive Analysis

Day 3 : Data Analytics Ecosystem — Data Life Cycle, Data Analysis complete process ( most important things)

Day 4 : Probability, Conditional Probability, Binomial Distribution, Probability Density Function, Sampling Distribution

Day 5 : Statistics

Day 6 : Basic and Advanced SQL

Day 7 : Data Collection, Data Cleaning and Python

Day 8 : Pandas and Numpy

In the last post we covered part 1 of Data Collection, Data Cleaning and Python. This is part 2 of Data cleaning post as day 8 of Data Analytics Series. In this we will cover —

Pandas

Numpy

Pandas and NumPy are two popular Python libraries used for data manipulation and analysis.

Projects Videos —

Subscribe today!

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

www.youtube.com

Pandas:

Data structures: Pandas provides two main data structures, the Series (1-dimensional) and DataFrame (2-dimensional), for storing and manipulating data.
Data cleaning and preprocessing: Pandas has many built-in functions for handling missing values, filtering, and transforming data.
Data manipulation: Pandas provides powerful methods for merging, grouping, and reshaping data.
Data visualization: Pandas has built-in support for creating various types of plots and visualizations using the matplotlib library.

Code Implementation —

import pandas as pd
import matplotlib.pyplot as plt

# Data Structures: Series and DataFrame
# Create a Series
series_data = pd.Series([10, 20, 30, 40, 50])
print("Series:")
print(series_data)

# Create a DataFrame
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print("\nDataFrame:")
print(df)

# Data Cleaning and Preprocessing
# Handling Missing Values
df['Age'].fillna(df['Age'].mean(), inplace=True)

# Filtering Data
filtered_df = df[df['Age'] > 30]

# Data Manipulation
# Merging DataFrames
data1 = {'Name': ['John', 'Alice'],
         'Salary': [5000, 6000]}
data2 = {'Name': ['Bob', 'Alice'],
         'Department': ['IT', 'Sales']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df1, df2, on='Name')

# Grouping Data
grouped_df = df.groupby('City')['Age'].mean()

# Data Visualization
df.plot(kind='bar', x='Name', y='Age', legend=False)
plt.xlabel('Name')
plt.ylabel('Age')
plt.title('Age Distribution')
plt.show()

NumPy:

N-dimensional arrays: NumPy provides the ndarray (n-dimensional array) object for efficient storage and manipulation of large arrays of homogeneous data (e.g., numbers).
Mathematical operations: NumPy provides many mathematical functions such as trigonometric, logarithmic, and linear algebra functions that can operate on entire arrays.
Broadcasting: NumPy allows you to perform operations on arrays of different shapes by automatically broadcasting smaller arrays to match the shape of larger arrays.
Interoperability: NumPy is designed to work seamlessly with other libraries, such as Pandas, and is often used as the underlying data structure for other scientific libraries.

Code Implementation —

import numpy as np

# N-dimensional arrays
# Create a 1-dimensional array
array1d = np.array([1, 2, 3, 4, 5])
print("1-dimensional array:")
print(array1d)

# Create a 2-dimensional array
array2d = np.array([[1, 2, 3], [4, 5, 6]])
print("\n2-dimensional array:")
print(array2d)

# Mathematical operations
# Trigonometric functions
sin_values = np.sin(array1d)
print("\nSin values:")
print(sin_values)

# Logarithmic functions
log_values = np.log(array1d)
print("\nLog values:")
print(log_values)

# Linear algebra functions
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])
matrix_product = np.dot(matrix1, matrix2)
print("\nMatrix product:")
print(matrix_product)

# Broadcasting
array3 = np.array([10, 20, 30])
array_broadcasted = array1d + array3
print("\nBroadcasted array:")
print(array_broadcasted)

# Interoperability
import pandas as pd

data = {'col1': [1, 2, 3, 4, 5],
        'col2': ['A', 'B', 'C', 'D', 'E']}
df = pd.DataFrame(data)

numpy_array = df['col1'].to_numpy()
print("\nNumPy array from DataFrame:")
print(numpy_array)

Output —

1-dimensional array:
[1 2 3 4 5]

2-dimensional array:
[[1 2 3]
 [4 5 6]]

Sin values:
[ 0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427]

Log values:
[0.         0.69314718 1.09861229 1.38629436 1.60943791]

Matrix product:
[[19 22]
 [43 50]]

Broadcasted array:
[11 22 33 14 25]

NumPy array from DataFrame:
[1 2 3 4 5]

Pandas and NumPy are often used together in data analysis tasks, where Pandas provides the high-level data manipulation functions and NumPy provides the low-level array operations.

Let’s get started.

Pandas

Pandas is a a fast, powerful, flexible and easy to use open source data analysis and manipulation tool. It’s a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame.

Pandas is an open source Python package written for the Python programming language for data manipulation, analysis and ML tasks

It is built on top of another package named Numpy, which provides support for mathematical computations and multi-dimensional arrays.

Here are some of the most important functions in Pandas:

read_csv() and read_excel(): These functions are used to read in data from a CSV file or Excel file, respectively.
head() and tail(): These functions are used to return the first or last n rows of a DataFrame, respectively.
info() and describe(): These functions are used to get information about the DataFrame, such as the number of rows and columns, the data types of each column, and summary statistics.
drop() and dropna(): These functions are used to remove rows or columns from a DataFrame or to remove rows or columns with missing values, respectively.
sort_values() and sort_index(): These functions are used to sort the DataFrame by one or more columns or by the index, respectively.
groupby(): This function is used to group the DataFrame by one or more columns, and then apply a function to each group.
merge() and join(): These functions are used to combine two DataFrames based on one or more columns, similar to a SQL JOIN operation.
apply(): This function is used to apply a function to each element of a DataFrame or a Series.
pivot_table() and crosstab(): These functions are used to create pivot tables, which are a way to summarize and aggregate data in a DataFrame.
to_csv() and to_excel(): These functions are used to save a DataFrame to a CSV file or Excel file, respectively.

Code Implementation —

import pandas as pd

# Read data from CSV file
df_csv = pd.read_csv('data.csv')

# Read data from Excel file
df_excel = pd.read_excel('data.xlsx')

# Return the first n rows of a DataFrame
first_rows = df_csv.head(5)
print("First 5 rows:")
print(first_rows)

# Return the last n rows of a DataFrame
last_rows = df_excel.tail(3)
print("\nLast 3 rows:")
print(last_rows)

# Get information about the DataFrame
df_info = df_csv.info()
print("\nDataFrame information:")
print(df_info)

# Get summary statistics of the DataFrame
df_desc = df_excel.describe()
print("\nSummary statistics:")
print(df_desc)

# Remove rows or columns from a DataFrame
df_dropped = df_csv.drop(columns=['column1', 'column2'])
print("\nDataFrame with columns dropped:")
print(df_dropped)

# Remove rows with missing values
df_droppedna = df_excel.dropna()
print("\nDataFrame with missing values dropped:")
print(df_droppedna)

# Sort the DataFrame by one or more columns
df_sorted = df_csv.sort_values(by='column1')
print("\nDataFrame sorted by column1:")
print(df_sorted)

# Sort the DataFrame by index
df_sorted_index = df_excel.sort_index()
print("\nDataFrame sorted by index:")
print(df_sorted_index)

# Group the DataFrame by one or more columns and apply a function
df_grouped = df_csv.groupby('column1').sum()
print("\nDataFrame grouped by column1:")
print(df_grouped)

# Merge two DataFrames based on one or more columns
merged_df = pd.merge(df_csv, df_excel, on='common_column')
print("\nMerged DataFrame:")
print(merged_df)

# Apply a function to each element of a DataFrame or a Series
df_applied = df_excel['column1'].apply(lambda x: x * 2)
print("\nApplied function to column1:")
print(df_applied)

# Create a pivot table
pivot_table = pd.pivot_table(df_csv, values='column1', index='column2', columns='column3', aggfunc='sum')
print("\nPivot table:")
print(pivot_table)

# Save DataFrame to a CSV file
df_csv.to_csv('output.csv', index=False)

# Save DataFrame to an Excel file
df_excel.to_excel('output.xlsx', index=False)

Numpy

Numpy is a python library for scientific computing — to work with multidimensional array objects and used to handle large amount of data. An array which is a grid of values and is indexed by a tuple of nonnegative integers is main data structure of the Numpy library.

ndarray is acronym of N-Dimensional Array. We have covered Flattening the arrays, Concatenation and Broadcasting etc in detail in the posts below.

We have covered Pandas and Numpy as the part of MLOps series. In Numpy, you must know —

Create Numpy Arrays

Zeros arrays

Ones arrays

Full arrays

Identity Matrix

reshape()

flatten() function

Concatenation

Broadcasting

Scalar Product

Code Implementation —

import numpy as np

# Create NumPy arrays
array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([[1, 2, 3], [4, 5, 6]])
array3 = np.array([1.1, 2.2, 3.3, 4.4, 5.5])

print("NumPy arrays:")
print("Array 1:", array1)
print("Array 2:")
print(array2)
print("Array 3:", array3)

# Zeros arrays
zeros_array = np.zeros((2, 3))
print("\nZeros array:")
print(zeros_array)

# Ones arrays
ones_array = np.ones((3, 2))
print("\nOnes array:")
print(ones_array)

# Full arrays
full_array = np.full((2, 2), 5)
print("\nFull array:")
print(full_array)

# Identity Matrix
identity_matrix = np.eye(3)
print("\nIdentity matrix:")
print(identity_matrix)

# Reshape array
reshaped_array = np.reshape(array1, (5, 1))
print("\nReshaped array:")
print(reshaped_array)

# Flatten array
flattened_array = array2.flatten()
print("\nFlattened array:")
print(flattened_array)

# Concatenation
concatenated_array = np.concatenate((array1, array3))
print("\nConcatenated array:")
print(concatenated_array)

# Broadcasting
broadcasted_array = array1 + 10
print("\nBroadcasted array:")
print(broadcasted_array)

# Scalar product
scalar_product = np.dot(array1, array3)
print("\nScalar product:")
print(scalar_product)

Output —

NumPy arrays:
Array 1: [1 2 3 4 5]
Array 2:
[[1 2 3]
 [4 5 6]]
Array 3: [1.1 2.2 3.3 4.4 5.5]

Zeros array:
[[0. 0. 0.]
 [0. 0. 0.]]

Ones array:
[[1. 1.]
 [1. 1.]
 [1. 1.]]

Full array:
[[5 5]
 [5 5]]

Identity matrix:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

Reshaped array:
[[1]
 [2]
 [3]
 [4]
 [5]]

Flattened array:
[1 2 3 4 5 6]

Concatenated array:
[1. 2. 3. 4. 5. 1.1 2.2 3.3 4.4 5.5]

Broadcasted array:
[11 12 13 14 15]

Scalar product:
66.0

Here are some of the most important functions in NumPy:

np.array(): This function is used to create a new NumPy array from a Python list or another array-like object.
np.zeros() and np.ones(): These functions are used to create new arrays filled with zeros or ones, respectively.
np.shape and np.ndim: These functions are used to get the shape (number of rows and columns) and dimension (number of axes) of an array, respectively.
np.arange(): This function is used to create a new array with a range of evenly spaced values.
np.linspace(): This function is used to create a new array with a specified number of evenly spaced values between two endpoints.
np.random.rand() and np.random.randn(): These functions are used to create new arrays with random values sampled from a uniform or normal distribution, respectively.
np.min() , np.max(), np.mean(), np.sum(): These functions are used to find the minimum, maximum, mean and sum of array elements
np.dot() and np.matmul(): These functions are used to perform matrix multiplication on two arrays.
np.transpose() and np.reshape(): These functions are used to change the shape of an array, transpose flips the array along its diagonal and reshape changes the shape of an array without changing its data
np.save() and np.load(): These functions are used to save an array to a binary file and load it again later.

Code Implementation —

import numpy as np

# np.array()
arr1 = np.array([1, 2, 3, 4, 5])
print("Array 1:", arr1)

arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("Array 2:")
print(arr2)

# np.zeros() and np.ones()
zeros_arr = np.zeros((3, 4))
print("Zeros array:")
print(zeros_arr)

ones_arr = np.ones((2, 2))
print("Ones array:")
print(ones_arr)

# np.shape and np.ndim
print("Shape of arr2:", np.shape(arr2))
print("Dimension of arr2:", np.ndim(arr2))

# np.arange()
arr3 = np.arange(1, 10, 2)
print("Array 3 (arange):")
print(arr3)

# np.linspace()
arr4 = np.linspace(0, 1, 5)
print("Array 4 (linspace):")
print(arr4)

# np.random.rand() and np.random.randn()
rand_arr = np.random.rand(2, 3)
print("Random array:")
print(rand_arr)

randn_arr = np.random.randn(2, 3)
print("Random normal array:")
print(randn_arr)

# np.min(), np.max(), np.mean(), np.sum()
print("Minimum value in arr1:", np.min(arr1))
print("Maximum value in arr1:", np.max(arr1))
print("Mean of arr1:", np.mean(arr1))
print("Sum of arr1:", np.sum(arr1))

# np.dot() and np.matmul()
arr5 = np.array([[1, 2], [3, 4]])
arr6 = np.array([[5, 6], [7, 8]])
dot_product = np.dot(arr5, arr6)
matmul_product = np.matmul(arr5, arr6)
print("Dot product:")
print(dot_product)
print("Matrix multiplication:")
print(matmul_product)

# np.transpose() and np.reshape()
transposed_arr2 = np.transpose(arr2)
reshaped_arr1 = np.reshape(arr1, (5, 1))
print("Transposed arr2:")
print(transposed_arr2)
print("Reshaped arr1:")
print(reshaped_arr1)

# np.save() and np.load()
np.save("saved_array", arr1)
loaded_arr = np.load("saved_array.npy")
print("Loaded array:")
print(loaded_arr)

Output —

Array 1: [1 2 3 4 5]
Array 2:
[[1 2 3]
 [4 5 6]]
Zeros array:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
Ones array:
[[1. 1.]
 [1. 1.]]
Shape of arr2: (2, 3)
Dimension of arr2: 2
Array 3 (arange):
[1 3 5 7 9]
Array 4 (linspace):
[0.   0.25 0.5  0.75 1.  ]
Random array:
[[0.53803657 0.80173244 0.31073361]
 [0.10778522 0.20801113 0.84439972]]
Random normal array:
[[ 1.07312543 -0.66341561  0.63715852]
 [ 0.28833208 -0.31339626  1.31250783]]
Minimum value in arr1: 1
Maximum value in arr1: 5
Mean of arr1: 3.0
Sum of arr1: 15
Dot product:
[[19 22]
 [43 50]]
Matrix multiplication:
[[19 22]
 [43 50]]
Transposed arr2:
[[1 4]
 [2 5]
 [3 6]]
Reshaped arr1:
[[1]
 [2]
 [3]
 [4]
 [5]]
Loaded array:
[1 2 3 4 5]

That’s it for now. Day 9 :

Day 9 of 30 days of Data Analytics with Projects Series

Welcome back peeps. This is Day 10 of 30 days of data analytics.

medium.com

Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time

Stay Tuned!!

11 most important System Design Base Concepts

1. System design basics

2. Horizontal and vertical scaling

3. Load balancing and Message queues

4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture

5. Caching, Indexing, Proxies

6. Networking, How Browsers work, Content Network Delivery ( CDN)

7. Database Sharding, CAP Theorem, Database schema Design

8. Concurrency, API, Components + OOP + Abstraction

9. Estimation and Planning, Performance

10. Map Reduce, Patterns and Microservices

11. SQL vs NoSQL and Cloud

12. Most Popular System Design Questions

13. System Design Template — How to solve any System Design Question

14. Quick RoundUp : Solved System Design Case Studies

System Design Case Studies — In Depth

Design Instagram

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Most Popular System Design Questions

Mega Compilation : Solved System Design Case studies

Complete Data Structures and Algorithm Series

Complexity Analysis

Backtracking

Sliding Window

Greedy Technique

Two pointer Technique

Arrays

Linked List

Strings

Stack

Queues

Hash Table/Hashing

Binary Search

1- D Dynamic Programming

Divide and Conquer Technique

Recursion

Some of the other best Series —

60 days of Data Science and ML Series with projects

30 Days of Natural Language Processing ( NLP) Series

30 days of Machine Learning Ops

30 days of Data Structures and Algorithms and System Design Simplified

60 Days of Deep Learning with Projects Series

30 days of Data Engineering with projects Series

Data Science and Machine Learning Research ( papers) Simplified **

100 days : Your Data Science and Machine Learning Degree Series with projects

23 Data Science Techniques You Should Know

Tech Interview Series — Curated List of coding questions

Complete System Design with most popular Questions Series

Complete Data Visualization and Pre-processing Series with projects

Complete Python Series with Projects

Complete Advanced Python Series with Projects

Kaggle Best Notebooks that will teach you the most

Complete Developers Guide to Git

Exceptional Github Repos — Part 1

Exceptional Github Repos — Part 2

All the Data Science and Machine Learning Resources

210 Machine Learning Projects

Tech Newsletter —

If you are interested, you can join my newsletter through which I send tech interview tips, techniques, patterns, hacks — Software Development, ML, Data Science, Startups and Technology projects to more than 30K readers. You can subscribe to Tech Brew :

Ignito

Data Science, ML, AI and more… Click to read Ignito, by Naina Chaturvedi, a Substack publication. Launched 7 months…

naina0405.substack.com

For Python Projects —

Complete Python And Projects — Mega Compilation

Everything that you need to know in Python with Projects…

medium.com

Analyzing Video using Python, OpenCV and NumPy

With Code Implementation…

medium.datadriveninvestor.com

For complete 60 days of Data Science and ML : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Connect the ML dots…

medium.com

Follow for more updates. Stay tuned and keep coding!

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Build Machine Learning Pipelines( With Code) — Part 1

Complete implementation…

medium.datadriveninvestor.com

Recurrent Neural Network with Keras

Recurrent Neural Network with Keras

Project Implementation and cheatsheet…

medium.datadriveninvestor.com

Clustering Geolocation Data in Python using DBSCAN and K-Means

Clustering Geolocation Data in Python using DBSCAN and K-Means

Project Implementation…

medium.datadriveninvestor.com

Facial Expression Recognition using Keras

Facial Expression Recognition using Keras

Project Implementation…

medium.datadriveninvestor.com

Hyperparameter Tuning with Keras Tuner

Hyperparameter Tuning with Keras Tuner

Project Implementation….

medium.datadriveninvestor.com

Custom Layers in Keras

Custom Layers in Keras

Code implementation …

medium.datadriveninvestor.com

Day 8 of 30 days of Data Analytics with Projects Series

Finished Series —

What’s covered in the Data Analytics Series till now —

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

Code Implementation —

Code Implementation —

Pandas

Code Implementation —

Numpy

Code Implementation —

Code Implementation —

That’s it for now. Day 9 :

Day 9 of 30 days of Data Analytics with Projects Series

Welcome back peeps. This is Day 10 of 30 days of data analytics.

Read More —

11 most important System Design Base Concepts

System Design Case Studies — In Depth

Complete Data Structures and Algorithm Series

Some of the other best Series —

Tech Newsletter —

Ignito

Data Science, ML, AI and more… Click to read Ignito, by Naina Chaturvedi, a Substack publication. Launched 7 months…

Complete Python And Projects — Mega Compilation

Everything that you need to know in Python with Projects…

Analyzing Video using Python, OpenCV and NumPy

With Code Implementation…

Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Connect the ML dots…

For other projects, tune to —

Build Machine Learning Pipelines( With Code) — Part 1

Complete implementation…

Recurrent Neural Network with Keras

Project Implementation and cheatsheet…

Clustering Geolocation Data in Python using DBSCAN and K-Means

Project Implementation…

Facial Expression Recognition using Keras

Project Implementation…

Hyperparameter Tuning with Keras Tuner

Project Implementation….

Custom Layers in Keras

Code implementation …