DynamoDB — How to query with many filter fields(Part 2)

This article is a follow up based on the previous article (DynamoDB — How to design your data relationship fit into “One Table” (Part 1)) to give solutions about querying DynamoDB table with very complicated filters. You need to understand basic usage about DynamoDB in order to read this article.

It continue uses the same business mode from the previous article. We are selling product to other businesses and there are 3 different subjects: Business, Customer, Order.

Business: it is a company who has business relationship with your company.
Customer: It is under business. Customer is the individual who has direct contact with your business. e.g. a customer from a coffee shop make an order for some chairs. One business may have more than one customers.
Order: is the oder made by each customer.

We have below data schema saved in the table, id is the partition key and type is the sort key.

Let’s rephrase our problem on searching order data:

There could be more than 10 or 20 fields for each order. And I have to support filtering data for different fields or combination of fields. e.g. querying the first 50 orders whose status is FINISHED and created in last month and finish two days ago and from Melbourne and using VISA card. It is hard to make all the filter fields in the sort key (since we can only have one sort key in one GSI). We have to put all these fields inside filter expression rather than key expression . When we send a query including limit + key expression + filter .

In order to make the query work for above scenario, we can query businessIdGsi like:

Key condition: id=002 and type between order.1643241600000 and order.1643235749026
Filter: status = FINISHED and createdTime between 2022-01-20T00:00:00.000Z and 2022-01-22:00:00.000Z and payMethod = "VISA" and location = "Melbourne"
Limit: 50

There is a very complicated filter expression in above query statement. Since we need to support pagination, the Limit tells how many items can retrieve from the table. However, DynamoDB has two operations about search, scan and query .

Query: it searches based on key condition, all search happens on index.
Scan: it does a in-memory scan on all items from the table which we should avoid as much as we can.

If you send above query to the table, even you are using query API, the filter part is still doing scan . That because DynamoDB does a query on the key condition first and performs a scan on the result of the query later. For example, if there are 100 items match to the key condition, limit will cut the top 50. Then scan will apply to the left 50 items which gives a result set less than 50. Bang!, it is a problem. Customer requests 50 items limit but we only response less than 50. In order to make it meets the requested limit, we have to send the query statement to the table again to fetch the missing limit, then concat the result set until it has either 50 items or no more data from table.

It also means for one request from clients, our backend may end up sending multiple queries to database. If you data is evenly distributed among different values in each field, this approach works fine however it is not possible in real world data. Usually we have far less declined order than finished order. Imagine you have 1 million orders for one business and there are only 20 declined orders. If clients send a request to get 50 declined order, you have to send many queries to database until it queries the entire partition which is the businessId. It will probably runs timeout based on the timeout value set on your API (which is 30 seconds in most cases).

How to solve the issue?

There are a few steps I did in Zeller to improve the performance and I am going to list them one by one in next section.

Retrieve More Data

Frontend usually has a limited number of rows in the UI so it sends a small int value limit in the request. In order to reduce the number of queries we send to database, one approach is to increase the limit to db to retrieve more data, then the scan will apply on a larger dataset size which will left more items to our backend application. The next token is basically a key schema (id + type) in our case, so it is not hard for application to cut the result set and calculate the next token based on the key from the last item.

This approach has a down side which will lead to retrieve more data from the table and since we are charged based on RCU, it means we have to pay more on the data responded from the table. In my application, I put some logic to decide whether it wants to query more from the database. e.g. it doesn’t if there is no filter. or it does if the filter includes a rare used value etc. It is up to each application to decide how much extra data it wants to retrieve.

Sparse Index

When querying from DynamoDB table, one query can only retrieve up to 1MB data. If we use retrieve more data approach in the previous section, we also want to filter out unnecessary fields from the data. This is how sparse index works. When we create the businessIdGsi and customerIdGsi , we can choose what fields we want to put in the index. Data like order may have many fields but we don’t want frontend to query all of them. DynamoDB gives us a good way to filter out these fields when you create the index. We can select the only queried fields as projection attributes when creating the GSI.

By doing that, you can limit the size of item saved in the index which makes one query response more items at a time since more data will fit into 1MB maximum size limit. After creating the sparse index, my data is improved from 600 within 1MB to 3000 within 1MB, nearly 5 times less than before.

Multiple Query in Parallel

Up until now, all queries run in sequence, when you send one query, you need to wait until sending the next query. This process also slows down the performance when you need to send many queries to retrieve enough items from db. One thought about it whether we can run multiple queries in parallel.

When review our key schema, we are using timestamp as the sort key in the table. All querying order requests include a time window, e.g. last week, last month etc. What if we split the sub-query to make each one query one day or one week?

Take this request as an example: query all declined orders from last week , we can send 7 queries in parallel one for each day without limit,

Day1: id=002 and type between order.1643241600000 and order.1643328000000
Day2: id=002 and type between order.1643155200000 and order.164324159999
...

once receiving items from all 7 queries, we oder the items in application level and cut them based on the requested limit. When doing load test on this approach I found my table is throttled very often. But I have enough RCU configuration on the table. After some reading I found DynamoDB has a hard limit on RCU in partition level which is 3000. That means no matter how much RCU you configure in the table, each partition inside the table can only consume 3000 at most. It is hard limit so we can’t increase this number. All sub-queries are running against the same partition key which is businessId that explains why it was throttled.

To make the queries run on different partitions, we need to make the data to be distributed into different partitions. I put an extra field date on order data which has the date value e.g. businesssId_2022-01-27 . It has businessId as prefix followed by the date. Then I create a new GIS dateGsihas date as the partition key and type as sort key. Then the above queries become to:

Day1: date=002_2022-01-26 and type between order.1643155200000 and order.164324159999
Day2: date=002_2022-01-27 and type between order.1643241600000 and order.1643328000000
...

as you can see, each query uses a different value in the partition key that get around the partition limit problem.

A good way to monitor whether you have a hot partition is to enable Contributor Insights metrics in your table and GSI. It gives your a diagram about how each partition value is used. You can use this diagram to identify whether you have a hot partition or not.

Generate Report

This multiple queries approach is particular useful to the cases that you want to generate a report. Take the order as an example, if you need to generate a monthly report about the orders, you will have to query all order data within the month period. By using the dateGsi with multiple queries, you can get all data from 30 partitions in a very fast way. Since it is to generate a report, you don’t want to use limit which means you are not retrieving extra data from the table so that you are not waste your money on AWS bill.

Conclusion

You need to be careful about using this approach since it means you have to retrieve more data from database which will increase your cost. More logic in application need to be implemented to decide which request should use this multiple-queries approach which one should not. That depends on your data model in your business. Basically you need to check whether the filter in the query may point to a rare value from your database. Do it precisely will save your cost and make you enjoy the benefits from DynamoDB.

Joey Zhao

A software developer who created a lot of bugs also fixed a lot of bugs ( so does everyone else). Working in a fintech company, Zeller, on payment and banking space.

Related Articles

DynamoDB — How to design your data relationship fit into “One Table” (Part 1) DynamoDB — Aggregate or Realtime calculating (Part 3)