Day 3 of 30 days of Data Analytics with Projects Series

Welcome back peeps. Weekend is going great and I’m happy to share that we have just finished —

Finished Series —

15 days of Advanced SQL Series

30 days of Data Structures and Algorithms Series

14 System Design Case Studies Series

60 Days of Data Science and Machine Learning with projects Series

Complete System Design with most popular Questions Series


We are now starting a new series: 30 days of Data Analytics with Projects. This series will run in parallel with:

Ongoing Series —

30 days of Data Engineering Series

30 days of MLOps

30 days of Deep Learning Series

ML Research ( papers) Simplified

What’s covered till now —

Day 1 : Data Analytics basics and kickstart of Data analytics with projects series

Day 2: Business Understanding — Data Driven Decision Making, Descriptive Analysis, Predictive Analysis, Diagnostic Analysis, Prescriptive Analysis

Day 3 : Data Analytics Ecosystem — Data Life Cycle, Data Analysis complete process ( most important things)

In this post we will cover Data Analytics Ecosystem —

Data Life Cycle

Data Analysis complete process ( most important things)

Let’s dive in!

The Data Life Cycle refers to the stages that data goes through from its creation to its eventual archiving or deletion.

The stages of the data life cycle typically include:

Data Creation: This is the stage where data is initially collected, generated, or acquired.
Data Entry and Validation: This is the stage where the data is entered into the system and checked for accuracy and completeness.
Data Storage: This is the stage where the data is stored in a database or other type of storage system.
Data Processing: This is the stage where the data is processed, analyzed, and transformed into meaningful information.
Data Output and Distribution: This is the stage where the processed data is presented to users in the form of reports, dashboards, or other types of outputs.
Data Archiving and Backup: This is the stage where the data is backed up and archived for long-term retention.
Data Retention and Destruction: This is the stage where data is retained for a specified period of time and then securely destroyed when it is no longer needed.

Code Implementation for each stage —

import os
import shutil
import sqlite3

import pandas as pd

# Data Creation
data = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['John', 'Jane', 'Mark'],
    'Age': [25, 30, 35]
})

# Data Entry and Validation
# Placeholder addresses on the reserved example.com domain
data['Email'] = ['john@example.com', 'jane@example.com', 'mark@example.com']
data = data[['ID', 'Name', 'Email', 'Age']]  # Reorder columns if needed
data_valid = data.dropna()  # Remove any rows with missing values

# Data Storage
conn = sqlite3.connect('data.db')
data_valid.to_sql('customer_data', conn, if_exists='replace', index=False)
conn.close()

# Data Processing
conn = sqlite3.connect('data.db')
query = 'SELECT * FROM customer_data WHERE Age > 25'
processed_data = pd.read_sql_query(query, conn)
conn.close()

# Data Output and Distribution
processed_data.to_csv('output.csv', index=False)
processed_data.to_excel('output.xlsx', index=False)  # Needs an Excel writer engine such as openpyxl installed

# Data Archiving and Backup
shutil.copy('data.db', 'data_backup.db')

# Data Retention and Destruction
# Delete the data.db file when it is no longer needed
os.remove('data.db')

Code explanation —

Data Creation: We create a sample dataframe using pandas with columns for ID, Name, and Age.

Data Entry and Validation: We add an “Email” column to the dataframe and reorder the columns if necessary. We then perform data validation by removing any rows with missing values.

Data Storage: We establish a connection to an SQLite database and store the validated data in a table called “customer_data”.

Data Processing: We connect to the SQLite database again and execute a SQL query to select only the rows where the age is greater than 25. The result is stored in the “processed_data” dataframe.

Data Output and Distribution: We save the processed data as a CSV file and an Excel file for distribution to users.

Data Archiving and Backup: We create a backup copy of the database file by copying it to another file named “data_backup.db”.

Data Retention and Destruction: Finally, we delete the database file “data.db” when it is no longer needed.
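The code above deletes the whole database file at once. The "retained for a specified period of time" part of the last stage can also be applied row by row. Here is a minimal sketch of that idea, assuming a hypothetical created_at column (stored as an ISO date string) and a made-up 365-day retention period, neither of which is in the example above. Note that a plain SQL DELETE only removes the rows from the table; it is not a guarantee of secure destruction on disk.

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 365  # hypothetical retention period, chosen only for illustration


def purge_expired(conn, now, retention_days=RETENTION_DAYS):
    """Delete rows whose created_at date is older than the retention window."""
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d")
    # ISO dates (YYYY-MM-DD) sort the same way as text and as dates
    cur = conn.execute("DELETE FROM customer_data WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # number of rows removed


# In-memory database so the sketch leaves no files behind
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_data (ID INTEGER, Name TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO customer_data VALUES (?, ?, ?)",
    [(1, "John", "2020-01-15"), (2, "Jane", "2023-06-01"), (3, "Mark", "2023-12-20")],
)

removed = purge_expired(conn, now=datetime(2024, 1, 1))
remaining = [row[0] for row in conn.execute("SELECT ID FROM customer_data ORDER BY ID")]
print(f"Removed {removed} row(s); remaining IDs: {remaining}")
conn.close()
```

Passing now in as a parameter, instead of calling datetime.now() inside the function, keeps the example reproducible.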

The Data Analysis process is composed of several steps:

Define the problem or question: Identify the problem or question that needs to be answered, and define the objectives of the analysis.
Collect the data: Collect the data needed to answer the problem or question.
Clean and Prepare the data: Clean and prepare the data for analysis. This includes removing duplicates and handling missing values, outliers, etc.
Explore the data: Explore the data to understand the characteristics and patterns of the data.
Model the data: Develop models to answer the problem or question.
Evaluate the model: Evaluate the model’s performance and accuracy.
Communicate the results: Communicate the results of the analysis in a clear and concise way to the stakeholders.
Take action: Based on the results, take the appropriate actions to solve the problem or answer the question.
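The "Clean and Prepare" step in the list above can be sketched in a few lines of pandas. This is only an illustration on a tiny made-up dataset: the column names are hypothetical, and the 1.5 x IQR rule is one common convention for flagging outliers, not the only option.

```python
import pandas as pd

# Tiny made-up dataset: one duplicate row, one missing value, one extreme value
raw = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3, 4, 5, 6],
    "Sales": [100.0, 120.0, 120.0, None, 110.0, 95.0, 5000.0],
})

# 1. Remove exact duplicate rows
deduped = raw.drop_duplicates()

# 2. Drop rows where Sales is missing
complete = deduped.dropna(subset=["Sales"])

# 3. Keep only rows inside the 1.5 * IQR fences
q1 = complete["Sales"].quantile(0.25)
q3 = complete["Sales"].quantile(0.75)
iqr = q3 - q1
in_range = complete["Sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = complete[in_range].reset_index(drop=True)

print(cleaned)
```

Whether an outlier should be dropped, capped, or kept depends on the question being asked, so treat step 3 as a choice and not a default.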

Data Life Cycle

There are 6 steps in the data analytics lifecycle.

Objective: It consists of defining business objectives, gathering the required information, defining the analysis methods, and identifying the end result/goal.

# Example code for defining business objectives and analysis methods

business_objective = "Increase customer retention rate by 10% within the next quarter."

required_information = [
    "Customer churn data",
    "Customer engagement metrics",
    "Marketing campaign data",
    # Add more required information as needed
]

analysis_methods = [
    "Predictive modeling",
    "Segmentation analysis",
    "Correlation analysis",
    # Add more analysis methods as needed
]

end_result = "A set of actionable recommendations to improve customer retention."

# Print the defined business objectives, required information, analysis methods, and end result
print(f"Business Objective: {business_objective}")
print("Required Information:")
for info in required_information:
    print(f"- {info}")
print("Analysis Methods:")
for method in analysis_methods:
    print(f"- {method}")
print(f"End Result: {end_result}")

Understanding the data: It consists of collecting raw data from sources, analyzing the data requirements, and checking that you have the right data and understand its characteristics.

# Example code for understanding the data
import pandas as pd

# Assume data collection from various sources and analysis of data requirements have been done

# Check data characteristics
data = pd.read_csv('data.csv')  # Assuming data is collected from a CSV file
print("Data Characteristics:")
print(data.head())  # Print the first few rows of the data
data.info()  # Prints data types and non-null counts itself; it returns None, so it is not wrapped in print()

Prepare the data: It consists of cleaning, formatting, manipulating, and blending the data.

# Example code for data preparation
import pandas as pd

# Assumes `data` was loaded in the previous step and has 'Date', 'Sales', and 'CustomerID' columns

# Clean the data
cleaned_data = data.dropna().copy()  # Drop rows with missing values; copy so the edits below do not touch `data`

# Format and manipulate the data
cleaned_data['Date'] = pd.to_datetime(cleaned_data['Date'])  # Convert 'Date' column to datetime format
cleaned_data['Sales'] = cleaned_data['Sales'].astype(float)  # Convert 'Sales' column to float data type

# Blend the data
another_dataset = pd.read_csv('another_dataset.csv')  # Hypothetical second source that also has a 'CustomerID' column
blended_data = cleaned_data.merge(another_dataset, on='CustomerID', how='inner')  # Keep only customers present in both

# Print the prepared data
print(blended_data.head())

Exploratory Data Analysis: It consists of developing the methodology, determining the important variables and features, building visualizations, and preparing the model.

# Example code for exploratory data analysis
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Assumes `blended_data` from the previous step has 'Recency', 'Frequency', 'Monetary', and 'Churn' columns

# Develop the methodology
methodology = "Perform customer segmentation using RFM analysis and build a predictive churn model."

# Determine important variables and features
important_variables = ['Recency', 'Frequency', 'Monetary']

# Build visualizations
plt.scatter(blended_data['Recency'], blended_data['Monetary'])
plt.xlabel('Recency')
plt.ylabel('Monetary')
plt.title('Recency vs Monetary')
plt.show()

# Prepare the model
X = blended_data[important_variables]
y = blended_data['Churn']
model = RandomForestClassifier()
model.fit(X, y)
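The methodology above mentions RFM analysis, and the code expects Recency, Frequency, and Monetary columns to exist already. Here is a minimal sketch of how they might be derived, assuming a transaction log with hypothetical CustomerID, OrderDate, and Amount columns and a made-up snapshot date (none of these come from the dataset used above).

```python
import pandas as pd

# Made-up transaction log; the column names are assumptions for this sketch
transactions = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "OrderDate": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2023-11-20",
        "2024-02-10", "2024-02-25", "2024-03-10",
    ]),
    "Amount": [50.0, 70.0, 200.0, 20.0, 30.0, 25.0],
})

snapshot_date = pd.Timestamp("2024-03-15")  # the "today" used to measure recency

# One row per customer: last order date, number of orders, total spend
rfm = transactions.groupby("CustomerID").agg(
    LastOrder=("OrderDate", "max"),
    Frequency=("OrderDate", "count"),
    Monetary=("Amount", "sum"),
)

# Recency = days since the customer's most recent order
rfm["Recency"] = (snapshot_date - rfm["LastOrder"]).dt.days
rfm = rfm[["Recency", "Frequency", "Monetary"]]

print(rfm)
```

A lower Recency means the customer bought more recently. In this toy data, customer 2 has one large but old order, while customer 3 has several small recent ones.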

Modeling and validation: It consists of building the models, assessing them, evaluating the results, and reviewing the validation results.

# Example code for modeling and validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Assumes X and y were prepared in the previous step and 'Churn' is a binary (0/1) label

# Build the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Assess the model
y_pred = model.predict(X_test)

# Evaluate the results
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# Review the validation results
validation_results = {
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
}

# Print the validation results
print("Validation Results:")
for metric, value in validation_results.items():
    print(f"- {metric}: {value}")
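To see what the three metrics above actually measure, it helps to compute them by hand once. The sketch below uses plain Python on made-up labels (1 = churned, 0 = stayed). The helper name classification_metrics is hypothetical and is not part of scikit-learn.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayers correctly left alone
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # stayers wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners that were missed

    return {
        "Accuracy": (tp + tn) / len(pairs),
        # Of everyone flagged as churn, how many really churned?
        "Precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everyone who really churned, how many did we flag?
        "Recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for eight customers
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"- {name}: {value}")
```

Here the model flags 5 customers and 3 of them really churned (precision 0.6), while it catches 3 of the 4 actual churners (recall 0.75). For a retention campaign, recall often matters more, because a missed churner is a lost customer.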

Visualize and Communicate: It consists of preparing the storyline and dashboard, communicating the key insights, determining which analytics influence the business decision the most, and making recommendations.

# Assume the preparation of storyline, dashboard, and communication of insights have been done

# Prepare storyline and dashboard
storyline = "Based on the analysis, the following key insights were identified:"

key_insights = [
    "Customers who have a high recency value and low frequency and monetary values are at a higher risk of churn.",
    "The predictive churn model achieved an accuracy of 85% on the test data.",
    "The most influential variables for churn prediction are Recency, Frequency, and Monetary.",
]

dashboard = {
    'Key Insights': key_insights,
    'Recommendations': "To improve customer retention, targeted marketing campaigns should be launched for customers identified as high-risk based on the churn model.",
    # Add more dashboard components as needed
}

# Print the storyline and dashboard
print(storyline)
print("Key Insights:")
for insight in key_insights:
    print(f"- {insight}")
print("Recommendations:")
print(dashboard['Recommendations'])

Data Analysis complete Process ( Most important things)

While there are many steps and processes, there are 3 things that are most important (and that you should know):

How to extract data from sources and ingest it into data pipelines? (It's a part of our data engineering series.)
How to clean the data and prepare compelling visualizations and a storyline/dashboard?
How to take action once the insights have been communicated?

We covered the different data analysis types in our previous post.

In this series, we will cover all the important steps that you should know and most importantly how to prepare data pipelines, which chart to use when and how to prepare compelling storyboards/dashboards — all through projects.

That’s it for now. Day 4 -

Day 4 of 30 days of Data Analytics with Projects Series


Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time

Stay Tuned!!

