avatarNaina Chaturvedi

Summary

Day 9 of the 30 days of Data Engineering Series with Projects focuses on Query Optimizations in SQL, providing techniques and examples to improve SQL query performance and efficiency.

Abstract

The article is the ninth installment in a comprehensive series aimed at data engineering professionals. It emphasizes the importance of optimizing SQL queries to enhance performance, particularly in data-intensive environments. The author outlines several key techniques for query optimization, including the use of indexes, limiting the number of rows returned, selecting appropriate join types, using subqueries judiciously, avoiding the use of "SELECT *", and employing strategies such as partitioning and caching. The article also discusses the importance of writing efficient SQL code, such as avoiding multiple logical OR operators, unnecessary joins, and the overuse of DISTINCT and functions on indexed columns. Additionally, the author advocates for the use of views, constraints, and temporary tables to improve query execution times. The article concludes with a performance tuning section, stressing the need for continuous monitoring and optimization of SQL queries, and provides a preview of upcoming content in the series.

Opinions

  • The author believes that optimizing SQL queries is crucial for better execution plans and faster read/write operations.
  • There is an emphasis on the practical application of optimization techniques, suggesting that real-world performance can be significantly improved by following these guidelines.
  • The author suggests that understanding how data is stored and accessed is essential for writing optimized SQL queries.
  • The use of indexes is highly recommended for faster data retrieval, but should be balanced with the cost of maintaining these indexes.
  • The article implies that database constraints are not only useful for maintaining data integrity but also for query optimization.
  • The author encourages the use of temporary tables and views as tools to simplify complex queries and enhance performance.
  • There is a clear opinion that continuous learning and staying updated with system design and SQL optimization techniques are vital for data professionals.
  • The author's inclusion of a variety of resources, such as newsletters, YouTube channels, and GitHub repositories, indicates a belief in the value of community-driven learning and sharing of knowledge.

Day 9 of 30 days of Data Engineering Series with Projects

Query Optimizations…

Pic credits : dat.it

Welcome back peeps to Day 9 of Data Engineering Series with Projects!

Day 1 : What’s Data Engineering, Why Data Engineering, Data Engineers — ML Engineers — Data Scientists, Purpose and Scope

Day 2 : Complete Python for Data Engineering — Part 1

Day 3 : Complete Advanced Python for Data Engineering — Part 2

Day 4: Techniques to write efficient and Optimized Code

Day 5 : SQL

Day 6 : Advanced SQL

Day 7 : BigQuery and SQL vs NOSQL databases

Day 8 : Advanced Functions

Day 9 : Query Optimizations

Projects Videos —

All the projects, data structures, SQL, algorithms, system design, Data Science and ML , Data Analytics, Data Engineering, , Implemented Data Science and ML projects, Implemented Data Engineering Projects, Implemented Deep Learning Projects, Implemented Machine Learning Ops Projects, Implemented Time Series Analysis and Forecasting Projects, Implemented Applied Machine Learning Projects, Implemented Tensorflow and Keras Projects, Implemented PyTorch Projects, Implemented Scikit Learn Projects, Implemented Big Data Projects, Implemented Cloud Machine Learning Projects, Implemented Neural Networks Projects, Implemented OpenCV Projects,Complete ML Research Papers Summarized, Implemented Data Analytics projects, Implemented Data Visualization Projects, Implemented Data Mining Projects, Implemented Natural Leaning Processing Projects, MLOps and Deep Learning, Applied Machine Learning with Projects Series, PyTorch with Projects Series, Tensorflow and Keras with Projects Series, Scikit Learn Series with Projects, Time Series Analysis and Forecasting with Projects Series, ML System Design Case Studies Series videos will be published on our youtube channel ( just launched).

Subscribe today!

Tech Newsletter —

If you are interested, you can join my newsletter through which I send tech interview tips, techniques, patterns, hacks — Software Development, ML, Data Science, Startups and Technology projects to more than 30K readers. You can subscribe to Ignito:

System Design Case Studies — In Depth

Design Instagram

Design Netflix

Design Reddit

Design Amazon

Design Messenger App

Design Twitter

Design URL Shortener

Design Dropbox

Design Youtube

Design API Rate Limiter

Design Web Crawler

Design Amazon Prime Video

Design Facebook’s Newsfeed

Design Yelp

Design Uber

Design Tinder

Design Tiktok

Design Whatsapp

Most Popular System Design Questions

Mega Compilation : Solved System Design Case studies

Pre-requisite to Day 9 is to complete Day 1–8( link below):

Day 1 of 30 days of Data Engineering can be found below —

Day 2 of 30 days of Data Engineering can be found below —

Day 3 of 30 days of Data Engineering can be found below —

This is Day 9 of 30 days of Data Engineering Series where we will be covering —

Query Optimizations in SQL

Let’s get started!

Some of the most Query Optimizations techniques are —

  1. Use indexes: Indexes can greatly improve the performance of your queries by allowing the database to quickly locate the data it needs.
  2. Limit the number of rows returned: Use the “LIMIT” clause to limit the number of rows returned by a query. This can improve performance and reduce memory usage.
  3. Use the right join type: Use the appropriate join type (e.g. inner join, left join, right join) based on the specific needs of your query.
  4. Use subqueries wisely: Subqueries can be useful, but they can also slow down a query if used excessively. Use subqueries sparingly and make sure they are optimized.
  5. Avoid using “SELECT *”: Instead of using “SELECT *”, specify the specific columns you need in your query. This can improve performance and reduce memory usage.
  6. Avoid using functions on indexed columns: Avoid using functions on indexed columns in the WHERE clause.
  7. Use Temporary table: Use temporary tables to store intermediate results and then join them with other tables.
  8. Use Explain plan: Use the “EXPLAIN PLAN” statement to analyze the execution plan of a query and identify any potential performance bottlenecks.
  9. Use Partitioning: Use partitioning to divide large tables into smaller, more manageable pieces.
  10. Use caching: Use caching to store the results of frequently-run queries in memory, so they can be retrieved quickly without the need to re-run the query.
-- Use indexes
CREATE INDEX index_name ON table_name (column1, column2, ...);

-- Limit the number of rows returned
SELECT column1, column2, ...
FROM table_name
LIMIT 100;

-- Use the right join type
SELECT column1, column2, ...
FROM table1
INNER JOIN table2 ON table1.column = table2.column;

-- Use subqueries wisely
SELECT column1, column2, ...
FROM table1
WHERE column1 IN (SELECT column2 FROM table2 WHERE condition);

-- Avoid using "SELECT *"
SELECT column1, column2, ...
FROM table_name;

-- Avoid using functions on indexed columns
SELECT column1, column2, ...
FROM table_name
WHERE indexed_column = some_value;  -- Avoid using functions on indexed_column

-- Use Temporary table
CREATE TEMPORARY TABLE temp_table AS
SELECT column1, column2, ...
FROM table1
JOIN table2 ON table1.column = table2.column;

-- Use Explain plan
EXPLAIN PLAN FOR
SELECT column1, column2, ...
FROM table_name
WHERE condition;

-- Use Partitioning
CREATE TABLE table_name (
  column1 datatype,
  column2 datatype,
  ...
)
PARTITION BY RANGE (column1) (
  PARTITION partition_name1 VALUES LESS THAN (value1),
  PARTITION partition_name2 VALUES LESS THAN (value2),
  ...
);

-- Use caching (specific to the database system)
-- Example in MySQL: Use query cache
SET GLOBAL query_cache_size = 1000000;

Let’s deep dive in!

A SQL statement can be written in various ways. An optimized SQL query depends on how the data is stored.

An optimized SQL query helps develop a better execution plan, used to analyze the runtime duration of the queries, Indexes which help fasten the query execution and organize the tables in most optimized way to make read and write operations faster and efficient.

Query optimization is the process of implementing the most efficient and optimal techniques that can be used to improve query performance and the runtime ( as well as other metrics).

This also includes optimal utilization of the resources, use of indexes to make the execution faster and better analysis of the performance metrics.

In order to optimize queries, there are many techniques that one can use —

1. Don’t use logical operators like multiple OR

SQL server evaluates OR after analyzing each component/condition that OR is used with and thus results into low performance.

Instead spilt the query into various components/parts and use UNION operator.

Example —

SELECT * FROM employee

WHERE employee_name = ‘Siya’

UNION

SELECT * FROM employee

WHERE emp_city = ‘Newyork’

2. Use Indexes

Indexes are very useful as they help speed up the query execution and helps faster retrieval of the data.

Syntax —

CREATE INDEX index_name

ON table (column_1, column_2, …column_n)

There are two types of indexes —

  1. Implicit indexes — indexes that are created by databases internally to store, retrieve faster and efficiently.
  2. Composite indexes — indexes that are created by using multiple columns to uniquely identify the data points.

Syntax —

CREATE UNIQUE INDEX index_name

ON tableName (column1, column2, …)

Example —

CREATE INDEX emp_indx

ON employee(emp_name, salary, city)

To drop the indexes —

ALTER TABLE table

DROP INDEX index_name

3. Dont use to many JOINS

JOINS make two or more tables join together which leads to additional overload which in turn affects the efficiency of the SQL query.

Instead of using many joins in the same SQL statement, divide the SQL query into separate subqueries.

4. Avoid using DISTINCT

DISTINCT helps to remove the duplicates and select unique values of the columns. In order to remove duplicates, the processing time needed to process large amount of data is slow which makes the SQL query very inefficient.

5. Avoid Using * with SELECT

Instead of using * which retrieves all the data from the rows, use specific columns especially when dealing with large amount of data. Along with it use indexes which will help fasten the data retrieval process.

6. Use Constraints

In order to avoid ambiguity in the data, we use Constraints which helps us better manage our data. These are nothing but rules which help us maintain integrity of the data and are applied at two levels —

  1. Table level — applied to the whole table
  2. Column level — applied to the data which is stored in the columns

Format —

CREATE TABLE table_name

(

column1 datatype constraint1,

column2 datatype constraint2,

….

)

Example —

CREATE TABLE employee

(

emp_Id int(10) NOT NULL UNIQUE,

employee_name varchar(10),

email varchar(20)

)

In the above example we are using, NOT NULL and UNIQUE constraint.

7. Use Views

Views are helpful in creating complex queries and implementing readability in the SQL statements. Views also help in maintaining data integrity as it shows an accurate and consistent views of data in the database. Most importantly, this helps in providing better security to the data.

Syntax —

CREATE VIEW view_name AS SELECT t1, t2, t3…tn ( column_name) FROM table WHERE condition

To update a view —

CREATE OR REPLACE VIEW view_name AS SELECT t1, t2, t3…tn ( column_name) FROM table WHERE condition

Example —

CREATE VIEW [Salary_sum] AS SELECT employee_name, employee_salary FROM Employee WHERE employee_salary > (SELECT SUM(employee_salary) FROM Employee)

8. Use TOP and Limit

TOP sets the limit on the no of results returned upon running the query. To make sure you get what you want to retrieve, use Top and Limit to see the snapshot/sample of the results.

SELECT TOP 10

employee_name, employee_salary

FROM Employee

-- Enable query optimization techniques
SET OPTIMIZE_FOR_KNOWN_CONSTRAINTS = ON;
SET QUERY_OPTIMIZATION_COMPATIBILITY_LEVEL = 150;

-- Create necessary indexes
CREATE INDEX idx_column1 ON table_name (column1);
CREATE INDEX idx_column2 ON table_name (column2);

-- Limit the result set
SELECT column1, column2, ...
FROM table_name
WHERE condition
LIMIT 100;

-- Minimize data retrieval
SELECT column1, column2, ...
FROM table_name;

-- Avoid unnecessary subqueries
SELECT column1, column2, ...
FROM table_name
WHERE column1 IN (SELECT column1 FROM another_table);

-- Use appropriate join types
SELECT column1, column2, ...
FROM table1
INNER JOIN table2 ON table1.column1 = table2.column1;

-- Optimize query logic
SELECT column1, column2, ...
FROM table_name
WHERE column1 = value AND column2 = value;

-- Analyze and update statistics
EXECUTE sp_updatestats;

-- Use query hints or directives
SELECT column1, column2, ...
FROM table_name
WITH (INDEX(idx_column1));

-- Use temporary tables or table variables
DECLARE @tempTable TABLE (column1 datatype, column2 datatype, ...);
INSERT INTO @tempTable (column1, column2, ...)
SELECT column1, column2, ...
FROM table_name;

-- Monitor query performance
EXECUTE sp_whoisactive;

Code Implementation —

Performance tuning in SQL

It’s the process which is used to improve the performance of the SQL queries i.e reduce the amount of time needed to run the query.

Three things that severly affect the query run time —

  1. Aggregations — which requires a lot of computational overhead
  2. Joins — Increases query run time
  3. Size of the table — If the table size is large ( say ~ 1 million rows) then performance of the SQL query worsens.

So in order to optimize the performance —

  1. Reduce the table size — Filter the data and save it in a temp table and use it to run the queries.
  2. Use Limit — It helps limit the result samples.
  3. Don’t use multiple joins

4. SET STATISTICS TIME ON to track which query is performing poorly during both compilation and execution time.

Code Implementation —

That’s it for now. Day 10 Coming soon!

Let me know if you have questions in the comment section below. Subscribe/ Follow, Like/Clap as it would encourage me to write more in my free time

Stay Tuned!!

Read more —

All the Complete System Design Series Parts —

1. System design basics

2. Horizontal and vertical scaling

3. Load balancing and Message queues

4. High level design and low level design, Consistent Hashing, Monolithic and Microservices architecture

5. Caching, Indexing, Proxies

6. Networking, How Browsers work, Content Network Delivery ( CDN)

7. Database Sharding, CAP Theorem, Database schema Design

8. Concurrency, API, Components + OOP + Abstraction

9. Estimation and Planning, Performance

10. Map Reduce, Patterns and Microservices

11. SQL vs NoSQL and Cloud

12. Most Popular System Design Questions

Github —

Keep learning and coding ;)

Day 5 coming soon!

For Python Projects —

For complete 60 days of Data Science and ML : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML

Follow for more updates. Stay tuned and keep coding! Disclosure: Some of the links are affiliates.

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Recurrent Neural Network with Keras

Clustering Geolocation Data in Python using DBSCAN and K-Means

Facial Expression Recognition using Keras

Hyperparameter Tuning with Keras Tuner

Custom Layers in Keras

Data Science
Machine Learning
Tech
Programming
Artificial Intelligence
Recommended from ReadMedium