How did we reduce SQL read query from 13mins down to 5mins

Photo by Maksym Kaharlytskyi on Unsplash

Have you ever observed an extended execution time for a query while monitoring your database?

As the business expands, the volume of data will inevitably increase due to a growing number of transactions stored in the database.

In my experience, we have a daily cron job that will query the database to retrieve a list of ~250k users based on certain criteria, involving joins with multiple huge transaction tables. Over time, the runtime of this single database query has escalated from around 5 minutes to 13 minutes. I was assigned to look into this issue because it’s disrupting our business requirements.

Database indexing will be the first thing to come to mind.

What is a database index?

Indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed.

Indexing in PostgreSQL is a process that involves creating data structures that are optimized to efficiently search and retrieve data from tables.

Steps to troubleshoot:

Check on your database monitoring chart. Identify what is causing the bottleneck. It could be CPU utilization, network bottleneck, or memory usage is too high. In our case, we did not identify any particular bottleneck from the monitoring charts. Since we haven’t scaled the database vertically for more than a year while the business is growing rapidly, we decided to increase the instance size.
We scaled the database instance vertically. Most of the time, when the database is running low on free memory, it will start to slow down because there’s not enough space for the database engine to run the read/write query efficiently.
We included a dry run on the query. We scheduled to run the query once before the actual cronjob starts. This is to warm up the database so that the actual query run will be slightly faster because the indexes and memory are already computed during the dry run.

After all the steps we have taken, we couldn’t fix the root cause. Finally, we have decided to dig deep into the complicated query by running explain analyze to see what is going on.

Investigation

SQL query

SELECT * FROM (
	SELECT table_a
		WHERE status = 'SUCCESS'
		AND amount != 0
		AND DATE(collection_at) = '<some_date>'
		GROUP BY user_id
	UNION JOIN
	SELECT * FROM table_b
		WHERE status = 'SUCCESS'
		AND amount != 0
		AND DATE(collection_at) = '<some_date>'
		GROUP BY user_id
	UNION JOIN
	SELECT * FROM table_c
		WHERE status = 'SUCCESS'
		AND amount != 0
		AND DATE(collection_at) = '<some_date>'
		GROUP BY user_id
) 
WHERE unpaid = TRUE
GROUP BY user_id
ORDER BY collection_at DESC;

To find out the indexes of custom columns in different tables, we can use:

SELECT
    schemaname,
    tablename,
    indexname,
    indexdef
FROM
    pg_indexes
ORDER BY
    tablename,
    indexname;

The above shows that we have indeed created the indexes for the 3 columns being used in the WHERE clauses.

indexdef

CREATE INDEX table_a_index ON public.table_a USING btree (status, amount, collection_at)
CREATE INDEX table_b_index ON public.table_b USING btree (status, amount, collection_at)
CREATE INDEX table_c_index ON public.table_c USING btree (status, amount, collection_at)

However, during explain analyze on this query, we did not see that the indexes were being used.

Notice that in the WHERE clause we used DATE(collection_at) which is trimming out the date part of the DATETIME object stored in the table column, however, the index was created with the original DATETIME object.

This is the root cause that is causing the query to run so slowly because we thought it was using the index to perform the read, but it wasn’t. The reason is because the DATE(collection_at) being used in the WHERE clause wasn’t indexed.

It’s like we labeled the file storage with IDs but we are searching the file with names. We will end up having to open up and look through each file to match the name we are looking for.

How we fix this

To fix this quickly, we have created new indexes for DATE(collection_at) to accommodate the SQL query.

CREATE INDEX CONCURRENTLY table_a_collect_at_date ON table_a USING btree(DATE(collection_at));
CREATE INDEX CONCURRENTLY table_b_collect_at_date ON table_b USING btree(DATE(collection_at));
CREATE INDEX CONCURRENTLY table_c_collect_at_date ON table_c USING btree(DATE(collection_at));

We are maintaining the integrity of the existing SQL query as it has remained unchanged for a significant time. Consequently, it is advisable not to alter any established code that has been functioning.

Following the creation of new indexes, it is imperative to verify that the query is effectively utilizing the index. This can be achieved by rerunning the explain analyze command.

The caveat of creating indexes

Creating new indexes consumes disk space and can affect the performance of write operations, such as INSERT, UPDATE, or DELETE. When creating indexes, it’s crucial to ensure that your database has sufficient free storage to accommodate the additional index data.

It’s important to note that indexes may not always resolve all performance issues related to slow-running queries. It is essential to carefully weigh the trade-offs and selectively choose which columns to index. Exercise discipline in running explain analyze to assess whether the created index significantly improves read performance and justifies any overhead introduced to write operations.

What did we learn?

When a particular query is taking too long to run, it is advisable to spend some time learning what has gone wrong.
It is a good practice to run explain analyze to ensure that the query is using the indexes as intended, especially when reading from a huge table.
Always consider the trade-offs and carefully choose the columns to be indexed.