How to use SQL in Data Preparation
The Ultimate Guide to the Most Effective Techniques and Strategies

Here’s an open question: Ever wondered why “data preparation” takes up a whopping 80% of the effort in data science work?
Personally, I’d argue it’s more like 90% in many organizations. And hey, it’s no secret that SQL is a big deal in most of these places.
Getting the hang of SQL for data preparation? That’s a game-changer in the data science field.
I’m about to spill the beans on the SQL tricks you’ll need for each step of data preparation.
Let’s break it down: data cleaning, integration, transformation—they're all crucial in ensuring your data’s accurate, consistent, and primed for analysis.
Data Cleaning
Data cleaning, you know, is all about making your data spick and span by tossing out or fixing any incorrect or inconsistent bits. Let’s dive into some common tasks:
Here are some of them:
1. Eliminating Duplicates
Want to get rid of those pesky duplicate entries?
Use the “DISTINCT” keyword in your SQL query to grab only the unique records. Like this:
SELECT DISTINCT order_id FROM orders;And for zapping duplicates based on a particular attribute, “DISTINCT ON” is your friend. It fetches the first unique record for that attribute.
SELECT DISTINCT ON (customer_id) * FROM orders;2. Dealing with missing values
When you’re faced with missing values in SQL, there’s no need to panic.
You’ve got three handy functions at your disposal: COALESCE, NULLIF, and CASE.
Let’s break down how they can save your day.
- First up, we’ve got “COALESCE.” It picks the first value in a list that isn’t null.
So, if you’re not sure whether an order date is there or not, this query has your back, returning the current date if the order date is missing:
SELECT COALESCE(order_date, CURRENT_DATE) FROM orders;- Next, we’ve got "NULLIF.” It gives you null if two values are the same. Otherwise, it’s business as usual, returning the first value.
For instance, it’s great for situations where an order amount of 0 should really be considered null.
SELECT NULLIF(order_amount, 0) FROM orders;- Last but not least, we’ve got "CASE.” It’s perfect for giving missing values a new identity.
With this function, you can assign labels based on order amounts or even mark unknowns for missing data, like this:
SELECT CASE
WHEN order_amount > 1000 THEN 'High'
WHEN order_amount BETWEEN 500 AND 1000 THEN 'Medium'
WHEN order_amount < 500 THEN 'Low'
ELSE 'Unknown'
END AS order_level FROM orders;Learn more about “CASE WHEN”: The Ultimate Guide to Mastering “CASE WHEN” in SQL
3. Standardizing mismatched data types
Dealing with mismatched data types in SQL?
No worries, we’ve got just the tools you need: CAST, CONVERT, and FORMAT. These functions are the key to making your data types uniform.
- Let’s start with "CAST.” It’s like your direct approach to data type transformation.
Say you’ve got an order amount that’s a number, but you need it as an integer. That’s where CAST comes in, like this:
SELECT CAST(order_amount AS INTEGER) FROM orders;- Moving on to “CONVERT.” This one’s a bit sneakier, working indirectly. It changes the data type of a value based on the data type of another value.
For example, converting order dates to “varchar”, matching the format of the current date, goes like this:
SELECT CONVERT(order_date, CURRENT_DATE) FROM orders;- Finally, there’s “FORMAT.” This one’s all about styling your data. It lets you format a value in a specific pattern.
Need to get your order dates looking sharp in ‘YYYY-MM-DD’ format? Here’s how:
SELECT FORMAT(order_date, 'YYYY-MM-DD') FROM orders;4. Grouping and Filtering data
When you’re working with a ton of data in SQL and need to make sense of it, grouping and filtering are your go-to tools.
These operations help you slim down, summarize, or tidy up your data based on specific criteria.
- To kick things off, let’s talk about the “GROUP BY” clause. This is your ticket to clustering data by one or more attributes. It’s like organizing a closet by color or size.
Here, you can also throw in some aggregate functions for more insights. Check this out:
SELECT customer_id, SUM(order_amount) AS total_amount,
AVG(order_amount) AS average_amount
FROM orders
GROUP BY customer_id;This query neatly groups your orders by customer ID and tells you the total and average order amount per customer.
- Next up is the “HAVING” clause. It's kind of similar to a bouncer that only lets certain grouped data through based on your set conditions.
For instance, if you only want to see customers who’ve spent big bucks—over $5,000—your query would be:
SELECT customer_id, SUM(order_amount) AS total_amount
FROM orders
GROUP BY customer_id
HAVING SUM(order_amount) > 5000;- Last but not least, there’s the “WHERE” clause. This one’s all about zeroing in on specific data points. It’s like using a magnifying glass to find exactly what you’re looking for.
For example, if you want to see orders placed in January 2024, here’s how you do it:
SELECT *
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';🔌 Plug:
I would like to share my latest eBooks with you, which will be a very useful resource for you before any data science interview.

- eBook 1: Personal INTERVIEW Ready “SQL” CheatSheet
- eBook 2: Personal INTERVIEW Ready “Statistics” Cornell Notes
- Best Selling eBook: Top 50+ ChatGPT Personas for Custom Instructions
💡 Checkout for more such resources: https://codewarepam.gumroad.com/
Data Integration
Data integration is the process of combining data from different sources into a single, coherent data set.
It helps us increase the completeness and diversity of the data, as well as the scope and depth of the data analysis.
Here are some of the common data integration tasks:
Merging data with SQL joins and unions
Merging data in SQL is like piecing together a puzzle.
It’s all about combining different data sets, maybe to add depth to your data or to forge new connections and insights.
You’ve got two main tools for this: SQL joins and unions.
1. SQL Joins:
Let’s talk about “SQL joins” first.
Imagine you have two puzzles—one with order details and the other with customer information.
You want to see the full picture, right? Joins let you merge these based on a shared piece, like customer ID. Here’s a sneak peek at how it works:
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id;This query is like a handshake between the orders and customer tables, bringing together each order with its respective customer details.
2. Union vs. Union All:
Diving into the world of SQL, you’ll come across UNION and UNION ALL, two ways to merge results from different SELECT statements.
Here’s the deal: “UNION” is like a picky chef who only wants one of each ingredient; it gets rid of any duplicate rows. “UNION ALL,” on the other hand, is like a chef who believes more is merrier, keeping every single row, duplicates, and all.
For example, if you want a combined view of orders and returns, which are structured similarly, you’d go with something like this:
SELECT * FROM orders
UNION
SELECT * FROM returns;🔗 Learn more about joins and their types here: The Ultimate Guide to Mastering “JOINS” in SQL
Data Transformation
Data transformation involves tweaking the data’s format, structure, or even its actual content.
Why? So it fits like a glove for your next steps, whether that’s deep analysis or creating eye-catching visualizations.
It’s all about shaping that raw data into something more meaningful and user-friendly.
Some of the common data transformation tasks that SQL can perform are:
1. Creating new variables:
Ever played around with SQL to give your data some new flavors?
Well, creating new variables or columns by using arithmetic, logical, or string operations is like being a chef in your data kitchen.
You take existing ingredients (columns) and mix them up to create something new.
For instance, let’s say you want to classify order amounts. Check out this recipe:
SELECT order_amount, CASE
WHEN order_amount > 1000 THEN 'High'
WHEN order_amount BETWEEN 500 AND 1000 THEN 'Medium'
WHEN order_amount < 500 THEN 'Low'
ELSE 'Unknown'
END AS order_level FROM orders;This query whips up a new column called order_level.
It sorts each order into ‘High’, ‘Medium’, ‘Low’, or ‘Unknown’, depending on the order amount. It’s a simple way to add more insight to your data!
2. Aggregating data:
Aggregating data is like gathering apples from different trees and counting, weighing, or even finding the biggest one.
Functions like SUM, COUNT, AVG, MIN, MAX, and GROUP_CONCAT are your tools for this. They help you crunch numbers and get a clearer picture of your data.
For instance, imagine you want to know how much each customer has spent in total and on average. You’d use a query like this:
SELECT customer_id,
SUM(order_amount) AS total_amount,
AVG(order_amount) AS average_amount
FROM orders
GROUP BY customer_id;This query groups orders by customer ID, then adds up and averages the order amounts for each customer. It’s a neat way to get insights into customer spending habits.
3. Applying statistical and mathematical functions:
Functions like STDDEV, VARIANCE, ROUND, FLOOR, and CEIL come in handy for all sorts of number crunching.
For instance, if you want to tidy up your order amounts to look neater with just two decimal places, you’d go with something like this:
SELECT ROUND(order_amount, 2) FROM orders;This query uses the ROUND function to elegantly trim down your order amounts, ensuring everything is precise and easy on the eyes.
4. Sorting and ranking data
- Organizing your data can be a breeze with tools like ORDER BY, LIMIT, OFFSET, or RANK.
- Think of these as your data’s personal stylists. They help line up your data in just the right way, whether it’s by order, rank, or even slicing off just a portion you need. Take this example:
SELECT * FROM orders ORDER BY order_date DESC LIMIT 10;Here, it’s like telling your data to line up by order date, but only the latest and greatest, please—just the top 10.
This command sorts all your orders in reverse-chronological order and showcases only the ten most recent. Neat, right?
Wrapping it Up
Think of SQL as your power in the realm of data preparation. It’s packed with all sorts of tricks—from tidying up data to weaving it together in new ways.
But here’s the thing: mastering SQL is like learning a fine art.
“It gets better and more intuitive the more you play around with it.”
So, why not jump into your data pools and start experimenting? Good Luck!
Join my community of learners! Subscribe to my newsletter for more tips, tricks, and exclusive content on mastering Data Science & AI. — https://yourdataguide.substack.com/
⭐️ Visit My Gumroad Shop: https://codewarepam.gumroad.com/
Note: This article contains some plugins for my eBooks. If you like my content, please make sure to support me and appreciate my work. As MPP is not yet available in India, I make some cash through these.






