avatarNaina Chaturvedi

Summary

The website content outlines Day 7 of the "30 days of Data Analytics with Projects Series," focusing on data collection, data cleaning, and Python programming essentials, while also promoting related educational content and resources.

Abstract

The article is part of a comprehensive series aimed at providing insights into data analytics. It emphasizes the importance of data collection and cleaning, detailing methods for gathering data from various sources such as web scraping, APIs, files, and databases. The use of Python and its libraries, including Pandas and NumPy, is highlighted for data manipulation and cleaning tasks. The article also serves as a directory to a wealth of educational resources, including completed projects, tutorials on Python basics, system design concepts, and data structures and algorithms. Additionally, it invites readers to subscribe to a YouTube channel for video content related to the series and encourages engagement through comments and subscriptions.

Opinions

  • The author believes in the practical application of knowledge, emphasizing the importance of hands-on projects in data analytics education.
  • There is a clear endorsement of Python as the primary language for data analytics, with a comprehensive list of Python topics considered essential for data engineers.
  • The article promotes a systematic approach to learning, with structured series and case studies that build upon each other.
  • The author values community engagement and continuous learning, as evidenced by the invitation to subscribe to the newsletter and YouTube channel for ongoing tech insights.
  • There is an underlying opinion that mastering system design and data structures is crucial for aspiring data scientists and engineers.
  • The author seems to prioritize the sharing of knowledge and resources, providing links to a wide array of learning materials and completed projects.

Day 7 of 30 days of Data Analytics with Projects Series

Pic credits : springboard

Welcome back peeps. Happy to share that we have finished —

Finished Series —

15 days of Advanced SQL Series

30 days of Data Structures and Algorithms Series

14 System Design Case Studies Series

60 Days of Data Science and Machine Learning with projects Series

Complete System Design with most popular Questions Series

What’s covered in the Data Analytics Series till now—

Day 1 : Data Analytics basics and kickstart of Data analytics with projects series

Day 2: Business Understanding — Data Driven Decision Making, Descriptive Analysis, Predictive Analysis, Diagnostic Analysis, Prescriptive Analysis

Day 3 : Data Analytics Ecosystem — Data Life Cycle, Data Analysis complete process ( most important things)

Day 4 : Probability, Conditional Probability, Binomial Distribution, Probability Density Function, Sampling Distribution

Day 5 : Statistics

Day 6 : Basic and Advanced SQL

Day 7 : Data Collection, Data Cleaning and Python

Day 8 : Pandas and Numpy

In this post we will cover Data Collection and Cleaning for the day 7 of the Data Analytics Series.

Data Collection

Data Cleaning

Python

Projects Videos —

All the projects, data structures, SQL, algorithms, system design, Data Science and ML , Data Analytics, Data Engineering, , Implemented Data Science and ML projects, Implemented Data Engineering Projects, Implemented Deep Learning Projects, Implemented Machine Learning Ops Projects, Implemented Time Series Analysis and Forecasting Projects, Implemented Applied Machine Learning Projects, Implemented Tensorflow and Keras Projects, Implemented PyTorch Projects, Implemented Scikit Learn Projects, Implemented Big Data Projects, Implemented Cloud Machine Learning Projects, Implemented Neural Networks Projects, Implemented OpenCV Projects,Complete ML Research Papers Summarized, Implemented Data Analytics projects, Implemented Data Visualization Projects, Implemented Data Mining Projects, Implemented Natural Leaning Processing Projects, MLOps and Deep Learning, Applied Machine Learning with Projects Series, PyTorch with Projects Series, Tensorflow and Keras with Projects Series, Scikit Learn Series with Projects, Time Series Analysis and Forecasting with Projects Series, ML System Design Case Studies Series videos will be published on our youtube channel ( just launched).

Subscribe today!

Let’s get started!

Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.

  • The data collection process can be divided into two main types: primary data collection and secondary data collection. Primary data collection involves collecting data directly from the source, while secondary data collection involves collecting data that has already been collected by someone else.

Data cleaning is the process of identifying and correcting errors, inconsistencies, and outliers in a dataset. This process is important to ensure that the data is accurate, complete, and consistent.

  • This process can include tasks such as removing duplicate data, filling in missing values, and correcting data entry errors.

Data Collection

It is the first step of data analytics life cycle which is nothing but a process to gather data from the relevant sources. The data can structured, semi structured and unstructured as well as can be in different formats.

Data is collected after understanding the business requirements. Some of the sources can be —

  1. Customer forms
  2. Transaction Records
  3. Research
  4. Cookies and Website visits
  5. Social Media
  6. Surveys
  7. Loyalty programs
  8. Keywords and search results
  9. CRM and open source online repositories

There are several ways to collect data in Python, including:

Web scraping: This involves using Python libraries such as Beautiful Soup, Scrapy, and Selenium to extract data from websites. These libraries allow you to navigate through a website’s HTML structure and extract specific elements such as text, images, and links.

import requests
from bs4 import BeautifulSoup

# Send a GET request to the website
response = requests.get("https://example.com")

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Extract specific elements from the HTML
title = soup.title.text
paragraphs = soup.find_all("p")

# Print the extracted data
print("Title:", title)
for p in paragraphs:
    print("Paragraph:", p.text)

APIs: Many websites and online services provide APIs (Application Programming Interfaces) that allow you to access their data programmatically. Python libraries such as requests and json can be used to make API calls and process the resulting data.

import requests
import json

# Make an API request
response = requests.get("https://api.example.com/data")

# Process the JSON response
data = json.loads(response.text)
# Access specific data from the response
value = data["key"]

# Print the extracted data
print("Value:", value)

Reading from files: Python’s built-in open() function can be used to read data from various file formats such as CSV, Excel, and JSON. The pandas library also provides functions to read data from these file formats, which makes it easier to work with the data in Python.

import csv
import pandas as pd

# Read data from a CSV file using the csv module
with open("data.csv", "r") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

# Read data from an Excel file using pandas
data_frame = pd.read_excel("data.xlsx")
print(data_frame.head())

Database: Python libraries like sqlite3 and pyodbc can be used to connect to databases, and perform queries to extract data.

import sqlite3
import pyodbc

# Connect to a SQLite database
conn = sqlite3.connect("example.db")
cursor = conn.cursor()

# Execute a query
cursor.execute("SELECT * FROM table_name")
rows = cursor.fetchall()

# Print the retrieved data
for row in rows:
    print(row)

# Close the connection
conn.close()

Scipy’s io library: Scipy’s io library provides a way to read and write data in various file formats, such as Matlab, IDL, and Fortran.

from scipy import io

# Read data from a Matlab file
data = io.loadmat("data.mat")

# Access specific variables from the loaded data
variable1 = data["variable1"]
variable2 = data["variable2"]

# Write data to a Fortran file
io.savemat("output.dat", {"variable1": variable1, "variable2": variable2})

Collecting data through Surveys: Python libraries like pydata-google-auth, google-auth-oauthlib, google-auth-httplib2 can be used to collect data from Google Sheets and Google Forms.

from pydata_google_auth import get_user_credentials
import pandas as pd

# Authenticate and get user credentials
creds = get_user_credentials(["https://www.googleapis.com/auth/spreadsheets.readonly"])

# Read data from a Google Sheet
spreadsheet_id = "your_spreadsheet_id"
sheet_name = "Sheet1"
df = pd.read_excel(f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}")

# Print the retrieved data
print(df.head())

# Collect data from a Google Form (requires google-auth-oauthlib and google-auth-httplib2)
from googleapiclient.discovery import build

service = build("forms", "v1", credentials=creds)
response = service.forms().get(formId="your_form_id").execute()

# Process the response data
form_title = response["title"]
print("Form Title:", form_title)

Data Cleaning

It is the process of figuring out incomplete, missing, incorrect and inaccurate records in the data and then fixing/correcting them.

Pic credits : achorsoftware

It involves —

  1. Removal of unwanted and invalid information
  2. Fixing issues of unknown missing values
  3. Fixing problems with mislabeled features, classes
  4. Handling the outliers

It ensures —

  1. The data is valid and up to date.
  2. Data doesn’t contain any duplicates or missing values
  3. Data doesn’t contain any numerical outliers
  4. It helps define valid data labels for the data which is categorical in nature.

Here are some common data cleaning tasks and the corresponding Python methods:

Removing duplicate rows: The drop_duplicates() function in pandas can be used to remove duplicate rows from a DataFrame.

import pandas as pd

# Create a DataFrame with duplicate rows
data = {'col1': [1, 2, 3, 3, 4, 5, 5],
        'col2': ['A', 'B', 'C', 'C', 'D', 'E', 'E']}
df = pd.DataFrame(data)

# Remove duplicate rows
df = df.drop_duplicates()

# Print the DataFrame without duplicate rows
print(df)

Handling missing data: The fillna() function in pandas can be used to replace missing values with a specific value or using a forward or backward fill. The dropna() function can be used to remove rows or columns that contain missing values.

import pandas as pd

# Create a DataFrame with missing values
data = {'col1': [1, 2, None, 4, 5],
        'col2': ['A', 'B', None, 'D', 'E']}
df = pd.DataFrame(data)

# Replace missing values with a specific value
df_filled = df.fillna('Unknown')

# Remove rows with missing values
df_dropped = df.dropna()

# Print the DataFrames
print("DataFrame with filled missing values:")
print(df_filled)
print("\nDataFrame with dropped missing values:")
print(df_dropped)

Data conversion: The astype() function in pandas can be used to convert columns to a specific data type.

import pandas as pd

# Create a DataFrame with columns of different data types
data = {'col1': [1, 2, 3],
        'col2': ['A', 'B', 'C'],
        'col3': [True, False, True]}
df = pd.DataFrame(data)

# Convert col1 to float data type
df['col1'] = df['col1'].astype(float)

# Print the updated DataFrame
print(df.dtypes)

Outlier detection: The zscore() function in NumPy can be used to calculate the z-score of each value in a dataset, which can be used to identify outliers.

import numpy as np

# Create a NumPy array with values
data = np.array([10, 15, 20, 25, 100])

# Calculate the z-score for each value
z_scores = np.abs((data - np.mean(data)) / np.std(data))

# Identify outliers based on a threshold
threshold = 2.5
outliers = data[z_scores > threshold]

# Print the outliers
print(outliers)

Data normalization: The minmax_scale() function in SciPy can be used to normalize a dataset so that all values are between 0 and 1.

from scipy import stats

# Create a NumPy array with values
data = np.array([10, 20, 30, 40, 50])

# Normalize the data between 0 and 1
normalized_data = stats.minmax_scale(data)

# Print the normalized data
print(normalized_data)

Data transformation: The apply() function in pandas can be used to apply a custom function to each element in a DataFrame.

import pandas as pd

# Create a DataFrame with a column
data = {'col1': [1, 2, 3, 4, 5]}

df = pd.DataFrame(data)

# Square each value in col1 using a custom function
df['col1_squared'] = df['col1'].apply(lambda x: x**2)

# Print the transformed DataFrame
print(df)

Data validation: The isnull() function in pandas can be used to check for missing values in a DataFrame and the nunique() function can be used to check for duplicate values.

import pandas as pd

# Create a DataFrame with missing values and duplicate values
data = {'col1': [1, 2, None, 4, 5],
        'col2': ['A', 'B', None, 'D', 'E'],
        'col3': ['X', 'Y', 'Z', 'X', 'Y']}
df = pd.DataFrame(data)

# Check for missing values
missing_values = df.isnull().any()
print("Missing Values:")
print(missing_values)

# Check for duplicate values
duplicate_values = df.nunique()
print("\nDuplicate Values:")
print(duplicate_values)

Data cleaning using regular expressions: python re module can be used to perform data cleaning operations using regular expressions.

import re

# Define a string with dirty data
data = "abc123def456ghi"

# Remove all numbers from the string
clean_data = re.sub(r'\d+', '', data)

# Print the cleaned data
print(clean_data)

In the upcoming posts we will see in details how to perform data collection and cleaning using Python, Pandas and Numpy.

Python

Python is the language of data. In this series we will code in Python and having said that you need to know what is important to learn in Python.

These are the topics you must know in python to move ahead. As the part of Data Engineering Series we have covered all these in detail, so I have linked each of these topics.

1. Data types, strings, operators, and Chaining Comparison Operators with Logical Operators

2. Python Lists and Dictionaries, Sets, Tuples

3. Loops, Break and Continue Statements

4. Object-Oriented Programming — Class and attributes

5. Python strings in detail

6. Python F-String

7. Map, Classes, Functions and Arguments

8. First Class functions, Private Variables, Global and Non Local Variables, __import__ function

9. Magic Functions, Tuple Unpacking

10. Static Variables and Methods in Python

11. Lambda Functions, Magic methods

12. Inheritance and Polymorphism, Errors and Exception Handling

13. User-defined functions, Python garbage collection, debugger in Python

14. Iterators, Generators, and Decorators, Memoization using Decorators

15. Ordered and Defaultdict, Coroutine

16. Regular expression, Magic methods, Closures

17. ChainMap

18. Python Itertools

19. Advanced python constructs

20. Comprehensions, Named Tuple, Type hinting in Python

21. Efficient Code and Optimization techniques for Python

That’s it for now. Day 8: Coming Soon!

Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time

Stay Tuned!!

Read More —

11 most important System Design Base Concepts

1. System design basics

2. Horizontal and vertical scaling

3. Load balancing and Message queues

4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture

5. Caching, Indexing, Proxies

6. Networking, How Browsers work, Content Network Delivery ( CDN)

7. Database Sharding, CAP Theorem, Database schema Design

8. Concurrency, API, Components + OOP + Abstraction

9. Estimation and Planning, Performance

10. Map Reduce, Patterns and Microservices

11. SQL vs NoSQL and Cloud

12. Most Popular System Design Questions

13. System Design Template — How to solve any System Design Question

14. Quick RoundUp : Solved System Design Case Studies

System Design Case Studies — In Depth

Design Instagram

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Most Popular System Design Questions

Mega Compilation : Solved System Design Case studies

Complete Data Structures and Algorithm Series

Complexity Analysis

Backtracking

Sliding Window

Greedy Technique

Two pointer Technique

Arrays

Linked List

Strings

Stack

Queues

Hash Table/Hashing

Binary Search

1- D Dynamic Programming

Divide and Conquer Technique

Recursion

Some of the other best Series —

60 days of Data Science and ML Series with projects

30 Days of Natural Language Processing ( NLP) Series

30 days of Machine Learning Ops

30 days of Data Structures and Algorithms and System Design Simplified

60 Days of Deep Learning with Projects Series

30 days of Data Engineering with projects Series

Data Science and Machine Learning Research ( papers) Simplified **

100 days : Your Data Science and Machine Learning Degree Series with projects

23 Data Science Techniques You Should Know

Tech Interview Series — Curated List of coding questions

Complete System Design with most popular Questions Series

Complete Data Visualization and Pre-processing Series with projects

Complete Python Series with Projects

Complete Advanced Python Series with Projects

Kaggle Best Notebooks that will teach you the most

Complete Developers Guide to Git

Exceptional Github Repos — Part 1

Exceptional Github Repos — Part 2

All the Data Science and Machine Learning Resources

210 Machine Learning Projects

Tech Newsletter —

If you are interested, you can join my newsletter through which I send tech interview tips, techniques, patterns, hacks — Software Development, ML, Data Science, Startups and Technology projects to more than 30K readers. You can subscribe to Tech Brew :

For Python Projects —

For complete 60 days of Data Science and ML : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Follow for more updates. Stay tuned and keep coding!

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Recurrent Neural Network with Keras

Clustering Geolocation Data in Python using DBSCAN and K-Means

Facial Expression Recognition using Keras

Hyperparameter Tuning with Keras Tuner

Custom Layers in Keras

Data Science
Machine Learning
Tech
Programming
Artificial Intelligence
Recommended from ReadMedium