
DynamoDB — Aggregate or Realtime Crunching (Part 3)

This article is a follow-up to the previous two articles and compares two solutions for heavy query scenarios. You need to understand basic DynamoDB usage in order to follow along.

Let's recap what we discussed in the previous articles (Part 1 and Part 2).

We are selling products to other businesses, and there are three different entities: Business, Customer and Order.

  • Business: a company that has a business relationship with your company.
  • Customer: an individual under a business who has direct contact with you, e.g. a customer from a coffee shop places an order for some chairs. One business may have more than one customer.
  • Order: an order made by a customer.

We have the data schema below saved in the table; id is the partition key and type is the sort key.
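As a rough illustration (the exact attribute names come from the earlier parts; the values below are made up), an order item keyed this way might look like:

{
  id: "order-0001",                // partition key
  type: "order.1643241600000",     // sort key: entity type plus created timestamp
  businessId: "002",
  customerId: "cust-001",          // hypothetical attribute for illustration
  status: "FINISHED",
  amount: 300
}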

Requirement

In this article, we are going to address how to generate order reports for customers by querying DynamoDB. Let's start with the business requirement: the business wants to show a dashboard that summarises order reports for its customers. This dashboard includes the following stats over a time period:

  • total number of orders per status, at hourly, daily and monthly granularity
  • total amount of orders per status, at hourly, daily and monthly granularity

An order may have two different statuses: FINISHED and DECLINED.

For example, our service may receive a request to generate monthly order reports for the last 3 months for business 002. The response our service returns looks like:

[{
  orderCount: 100
  orderAmount: $3000
  declinedOrderCount: 9
  declinedOrderAmount: $200
  period: 2022-01
  type: MONTHLY
},{
  orderCount: 200
  orderAmount: $1000
  declinedOrderCount: 11
  declinedOrderAmount: $400
  period: 2022-02
  type: MONTHLY
},{
  orderCount: 300
  orderAmount: $5000
  declinedOrderCount: 19
  declinedOrderAmount: $500
  period: 2022-03
  type: MONTHLY
}]

The response above gives the report values for the last 3 months. Because the request is for a monthly report, the response array only includes 3 elements. If users request a daily report, it will include 90 elements (say 30 days per month).

Looking at our database, one obvious solution is to query all orders from the table within the time window and aggregate the result at the application level. Our query could be:

GSI: businessIdGsi
businessId = 002 AND type between order.1643235749026 and order.1643241600000

Remember what businessIdGsi is? In Part 1, we defined this GSI with businessId as the partition key and type as the sort key so we could query business-related data. Can you see any problem with this design?
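Before we get to the problems, here is a rough sketch of issuing that query with the AWS SDK for JavaScript v3; the table name and the helper function are assumptions for illustration:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Query all orders of one business within a time window on businessIdGsi.
// "type" is a DynamoDB reserved word, so it needs an expression attribute name.
async function queryOrders(businessId: string, from: number, to: number) {
  const result = await ddb.send(new QueryCommand({
    TableName: "orders-table",                 // assumed table name
    IndexName: "businessIdGsi",
    KeyConditionExpression: "businessId = :b AND #t BETWEEN :from AND :to",
    ExpressionAttributeNames: { "#t": "type" },
    ExpressionAttributeValues: {
      ":b": businessId,
      ":from": `order.${from}`,
      ":to": `order.${to}`,
    },
  }));
  return result.Items ?? [];
}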

Problems

In reality, one business usually has thousands of orders per day. When you send this query to the database, a single query can return at most 1MB of data. Let's assume one order is 1KB, so one query can return about 1,024 orders, and suppose the business has 1,024 orders per day. To query all orders over the last 3 months, you would have to send 90 queries (assuming 30 days per month) in sequence, then aggregate the results to calculate the report for each order status and granularity.
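As a minimal sketch of that sequential pagination (reusing the assumed client and imports from the earlier snippet): each page's LastEvaluatedKey has to be fed into the next request, so the calls cannot overlap.

// Page through the businessIdGsi query one 1MB page at a time.
async function queryAllOrders(businessId: string, from: number, to: number) {
  const items: Record<string, any>[] = [];
  let lastKey: Record<string, any> | undefined;
  do {
    const page = await ddb.send(new QueryCommand({
      TableName: "orders-table",               // assumed table name
      IndexName: "businessIdGsi",
      KeyConditionExpression: "businessId = :b AND #t BETWEEN :from AND :to",
      ExpressionAttributeNames: { "#t": "type" },
      ExpressionAttributeValues: { ":b": businessId, ":from": `order.${from}`, ":to": `order.${to}` },
      ExclusiveStartKey: lastKey,              // page token from the previous query
    }));
    items.push(...(page.Items ?? []));
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
  return items;
}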

Data size problem

That doesn't sound like a great solution once the data grows in size or volume:

  • If they want a yearly order report, we have to send 365 queries to the database.
  • One business may have more than 1,024 orders per day.
  • One order payload may be larger than 1KB.

RCU Limitation

In addition, let's think about this from DynamoDB's perspective. One piece of guidance when designing a DynamoDB key schema is to avoid creating hot partitions. A partition is identified by the partition key used in the query. In our businessIdGsi index, the partition key is businessId, so all of the queries that generate the report hit the same partition because they belong to the same business. You can find the partition limits in the AWS documentation:

In this case, DynamoDB can deliver throughput up to the partition maximum of 3,000 RCUs or 1,000 WCUs to that single item’s primary key.

This 3,000 RCU limit is a hard per-partition limit that cannot be increased. If your queries hit this limit, the table will be throttled, which can fail the request.

Note: DynamoDB provides some flexibility in your per-partition throughput provisioning by providing burst capacity. Whenever you’re not fully using a partition’s throughput, DynamoDB reserves a portion of that unused capacity for later bursts of throughput to handle usage spikes.

DynamoDB currently retains up to 5 minutes (300 seconds) of unused read and write capacity. During an occasional burst of read or write activity, these extra capacity units can be consumed quickly — even faster than the per-second provisioned throughput capacity that you’ve defined for your table.

Burst capacity may help a little in this scenario, but we can't rely on it too much. When you receive more requests like this, you may eventually exhaust the burst capacity and throttle the table.

Timeout

The last problem with this design is the time your application needs to generate the report. When using the AWS SDK to query DynamoDB, you can configure a retry count and a timeout value. When a partition is throttled, the SDK itself retries in order to fetch the data for you. That sounds handy, but too many retries can take too long, and in API design 30 seconds is generally the maximum timeout we can support (in either API Gateway or AppSync).
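As a rough illustration of those knobs in the AWS SDK for JavaScript v3 (the specific numbers are arbitrary, not recommendations):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { NodeHttpHandler } from "@smithy/node-http-handler";

// Cap retries and per-request time so a throttled partition fails fast
// instead of eating the whole 30-second API timeout.
const client = new DynamoDBClient({
  maxAttempts: 3,                      // total attempts, including the first one
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 1000,           // ms to establish the connection
    requestTimeout: 3000,              // ms to wait for a response
  }),
});
// DynamoDBDocumentClient.from(client) would then wrap this configured client.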

Solution 1: Split Partition

This section gives solutions to the three problems above (data size, RCU limitation and timeout).

Data size

As mentioned in the previous article, a sparse index can be used here to reduce the size of the data stored in the GSI. I won't go into detail about what a sparse index is; please read the Part 2 article for more information. In short, we only keep the fields we need in the GSI: besides the key schema, the only additional fields required to generate the report are businessId, status and amount. After doing that, the size of one order item drops dramatically, from 1KB to only a few bytes. The benefit is that a single query can retrieve far more orders.

For example, if the order item is reduced from 1KB to 100 bytes, a 1MB page can hold roughly 10,000 orders, a tenfold improvement.
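One way to express that is in the projection when the GSI is created. Below is a sketch using UpdateTable; the table name is assumed, and a provisioned-mode table would also need throughput settings on the index:

import { DynamoDBClient, UpdateTableCommand } from "@aws-sdk/client-dynamodb";

// Create businessIdGsi projecting only the attributes the report needs.
async function addReportIndex() {
  await new DynamoDBClient({}).send(new UpdateTableCommand({
    TableName: "orders-table",                       // assumed table name
    AttributeDefinitions: [
      { AttributeName: "businessId", AttributeType: "S" },
      { AttributeName: "type", AttributeType: "S" },
    ],
    GlobalSecondaryIndexUpdates: [{
      Create: {
        IndexName: "businessIdGsi",
        KeySchema: [
          { AttributeName: "businessId", KeyType: "HASH" },
          { AttributeName: "type", KeyType: "RANGE" },
        ],
        Projection: {
          ProjectionType: "INCLUDE",
          NonKeyAttributes: ["status", "amount"],    // businessId and type are already index keys
        },
      },
    }],
  }));
}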

RCU Limitation

Now, let's talk about the fun part of this solution: how to solve the throttling issue. My idea is to avoid querying a single partition. Whenever we get a report request, we need to query all orders under one business within the time period. To avoid hitting one partition, we can assign a random number to each order and make it part of the partition key in the GSI. Since we still need to query by businessId, we create a new field in the table called reportKey whose value has the pattern businessId#randomNumber. To start with, the randomNumber can be drawn from a guessed range, e.g. 1 to 200. We write this field whenever we save an order. After that, our order data will have this field with values like:

reportKey
002#001
002#002
...
002#200
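A minimal sketch of assigning that shard suffix when an order is written (the shard count, table name and helper function are assumptions):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const SHARD_COUNT = 200;                             // the guessed range from above

// Pick a random shard and store businessId#randomNumber as reportKey,
// so orderReportGSI spreads one business across many partitions.
async function saveOrder(order: { id: string; businessId: string; status: string; amount: number; createdAt: number }) {
  const shard = String(Math.floor(Math.random() * SHARD_COUNT) + 1).padStart(3, "0");
  await ddb.send(new PutCommand({
    TableName: "orders-table",                       // assumed table name
    Item: {
      ...order,
      type: `order.${order.createdAt}`,              // sort key pattern used in the queries above
      reportKey: `${order.businessId}#${shard}`,     // e.g. "002#117"
    },
  }));
}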

Next we create a new GSI (orderReportGSI, PK: reportKey, SK: type). With this in hand, each report request is split into 200 queries, one per partition key:

Query orderReportGSI:
reportKey=002#001 AND type between order.1643235749026 and order.1643241600000
reportKey=002#002 AND type between order.1643235749026 and order.1643241600000
...
reportKey=002#200 AND type between order.1643235749026 and order.1643241600000

You may argue that whenever we receive a request from the frontend, we end up sending 200 queries to the database, and some of them may return empty results because the business may not have data in every partition.

I have to say that the range (1 to 200) is just a guess; you can choose any range that fits your requirement, such as 1 to 30 or 1 to 100. The total amount of data retrieved from the table will be the same. For an empty response, AWS charges us 0.5 RCU, which is very cheap.

The advantage of this design is that each query hits a different partition key, which avoids creating a hot partition. When saving orders into the new GSI (orderReportGSI), each item lands in its own shard, which also spreads the writes and the WCU load across partitions.

Timeout

Let's look at the timeout issue. In the original design, when we query businessIdGsi, the paginated queries have to run in sequence because each query needs the page token from the previous one. With this new design, the 200 queries can run in parallel. Of course, if one partition holds more than 1MB of data it still needs a follow-up query to fetch the next page, but since the data is already spread across 200 shards, the chance of needing a next page is roughly 200 times lower than before. And running queries in parallel is definitely faster than running them in sequence.
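Putting it together, a hedged sketch of the fan-out: one query per shard, fired in parallel and merged afterwards (it reuses the assumed ddb client and SHARD_COUNT from the earlier snippets):

// Query every reportKey shard in parallel and merge the results.
async function queryOrdersForReport(businessId: string, from: number, to: number) {
  const shardQueries = Array.from({ length: SHARD_COUNT }, (_, i) =>
    queryShard(`${businessId}#${String(i + 1).padStart(3, "0")}`, from, to)
  );
  const pages = await Promise.all(shardQueries);     // 200 partitions hit concurrently
  return pages.flat();
}

// One shard: still paginated, but a next page is now far less likely.
async function queryShard(reportKey: string, from: number, to: number) {
  const items: Record<string, any>[] = [];
  let lastKey: Record<string, any> | undefined;
  do {
    const page = await ddb.send(new QueryCommand({
      TableName: "orders-table",                     // assumed table name
      IndexName: "orderReportGSI",
      KeyConditionExpression: "reportKey = :rk AND #t BETWEEN :from AND :to",
      ExpressionAttributeNames: { "#t": "type" },
      ExpressionAttributeValues: { ":rk": reportKey, ":from": `order.${from}`, ":to": `order.${to}` },
      ExclusiveStartKey: lastKey,
    }));
    items.push(...(page.Items ?? []));
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
  return items;
}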

Solution 2: Aggregate

A different solution to support the report requirement is to pre-aggregate the report data in DynamoDB for the different granularities. We need the total number of orders and the total order amount at three granularities: hourly, daily and monthly. Pre-aggregating the data means we update these values in the database whenever we receive an order update request.

id                        type            orderCount   orderAmount
daily#2022-01-01          business#002    100          3000
daily#2022-01-02          business#002    200          4000
monthly#2022-01           business#002    3000         30000
monthly#2022-02           business#002    3300         34000
hourly#2022-01-01T10:00   business#002    20           120
...

As shown in the table above, we aggregate the order data at different granularities: daily, monthly and hourly. The granularity is encoded in the partition key; in the example above, daily#2022-01-01 represents the report for the day 2022-01-01. The sort key type holds the businessId, since all reports belong to one business. When receiving an order update request, our application derives the report periods from the timestamp in the order and updates these values in the database.
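A sketch of how one order could bump those aggregates (the table name is assumed); ADD makes the increments atomic, and each granularity gets its own item:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Increment the hourly, daily and monthly aggregate items for one order.
async function updateAggregates(businessId: string, orderAmount: number, orderTimestamp: number) {
  const iso = new Date(orderTimestamp).toISOString();
  const periods = [
    `hourly#${iso.slice(0, 13)}:00`,   // e.g. hourly#2022-01-01T10:00
    `daily#${iso.slice(0, 10)}`,       // e.g. daily#2022-01-01
    `monthly#${iso.slice(0, 7)}`,      // e.g. monthly#2022-01
  ];
  await Promise.all(periods.map((id) =>
    ddb.send(new UpdateCommand({
      TableName: "report-table",       // assumed table name
      Key: { id, type: `business#${businessId}` },
      UpdateExpression: "ADD orderCount :one, orderAmount :amount",
      ExpressionAttributeValues: { ":one": 1, ":amount": orderAmount },
    }))
  ));
}

Declined orders could increment declinedOrderCount and declinedOrderAmount in the same way.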

For example, to generate the monthly order reports for the last 3 months for business 002, we can send 3 queries to the database:

id = monthly#2022-01 and type = business#002
id = monthly#2022-02 and type = business#002
id = monthly#2022-03 and type = business#002

Each query returns one month of data; we aggregate at the application level and return the response to users. The same pattern applies to hourly and daily reports. We can add more fields to each item, such as declinedOrderCount and declinedOrderAmount.
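A sketch of those point reads as a single batch (table name assumed):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, BatchGetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Fetch the monthly aggregate items for one business in a single round trip,
// e.g. getMonthlyReports("002", ["2022-01", "2022-02", "2022-03"]).
async function getMonthlyReports(businessId: string, months: string[]) {
  const result = await ddb.send(new BatchGetCommand({
    RequestItems: {
      "report-table": {               // assumed table name
        Keys: months.map((m) => ({ id: `monthly#${m}`, type: `business#${businessId}` })),
      },
    },
  }));
  return result.Responses?.["report-table"] ?? [];
}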

This design improves query performance a lot because we only send a few queries to the database, and retrieving less data means we pay less on the AWS bill. Of course, we don't need a high RCU configuration either.

When it comes to implementation, we need to make sure these aggregated items are updated whenever the related order data changes. If your application has very complicated order-update logic, or many different services update orders, it is not easy to add the aggregation logic in all of those places. A good place for it is a DynamoDB stream handler: a Lambda function attached to the table's stream. Whenever data changes in the table, the handler is invoked and can update the aggregated items. This is more robust than updating the aggregates everywhere an order is updated.
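A minimal sketch of that stream handler (a Lambda subscribed to the table's stream; updateAggregates is the hypothetical helper sketched earlier, and the filtering logic is an assumption about how order items are identified):

import type { DynamoDBStreamEvent } from "aws-lambda";
import { unmarshall } from "@aws-sdk/util-dynamodb";

// Invoked for every change in the orders table; bumps aggregates for new orders.
export const handler = async (event: DynamoDBStreamEvent) => {
  for (const record of event.Records) {
    if (record.eventName !== "INSERT" || !record.dynamodb?.NewImage) continue;
    const item = unmarshall(record.dynamodb.NewImage as any);
    if (typeof item.type !== "string" || !item.type.startsWith("order.")) continue;
    const createdAt = Number(item.type.split(".")[1]);   // timestamp encoded in the sort key
    await updateAggregates(item.businessId, item.amount, createdAt);
  }
};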

Comparing the two solutions

I have given two solutions, Realtime Calculation and Aggregate. Each has its own pros and cons. Let's compare them and see which one you should consider using.

  • Requirement change: We always receive requirement changes from the business team when developing software. If we need to add more fields to the report data or change the report generation algorithm, realtime calculation is more flexible because we just update the calculation logic in application code to include the new data. For example, we may want to support a new order status, REFUNDED, and add refundOrderCount and refundOrderAmount to the report. Realtime calculation handles this change easily, but with the Aggregate mode we have to run a data migration to update the existing aggregated items in the table. Data migration in production is not easy, and we want to avoid it as much as possible. In this scenario, realtime calculation wins.
  • Timezone: The report covers the time window in the request. For a global business, requests may come from different countries and timezones, and users may want reports generated for their own timezone. With the Aggregate mode, all data is pre-aggregated in one particular timezone, so it is hard to support others, especially timezones offset by half an hour. In this scenario, realtime calculation wins.
  • Cost: DynamoDB charges based on the data written to and read from the database. With the Aggregate mode we read far less data than with realtime calculation, so in this scenario, Aggregate wins.
  • Performance: Based on the discussion above, the Aggregate mode definitely beats realtime calculation, since it reads less data and does less aggregation at request time.

Conclusion

In this article, I showed two solutions to a heavy query scenario in DynamoDB. Each has its own advantages and disadvantages, and there is no hard rule about which one is better; it depends entirely on your business scenario. Make sure you understand the pros and cons of both before making a decision.

Joey Zhao

A software developer who has created a lot of bugs and fixed a lot of bugs (so has everyone else). Working at a fintech company, Zeller, in the payments and banking space.

Related Articles

DynamoDB — How to design your data relationship fit into “One Table” (Part 1)

DynamoDB — How to query with many complicated filters (Part 2)

Dynamodb
Aws
Serverless
NoSQL