avatarMahesh Saini

Summary

The Grab Order Platform employs a distributed system architecture with separate OLTP and OLAP databases to efficiently process and analyze millions of daily orders.

Abstract

The Grab Order Platform is designed to handle a high volume of transactions, processing millions of GrabFood and GrabMart orders each day. It utilizes a distributed system with distinct databases for online transaction processing (OLTP) and online analytical processing (OLAP). The OLTP system, powered by Amazon DynamoDB, ensures stability and high availability for real-time order processing, while the OLAP system, using MySQL RDS, supports historical and statistical queries. The architecture is tailored to meet the goals of stability, scalability, cost-effectiveness, and data consistency, with a data ingestion pipeline ensuring eventual consistency between the OLTP and OLAP databases. The use of DynamoDB's Global Secondary Indexes (GSIs), sparse indexes, and Time To Live (TTL) features, along with MySQL partitioning and a Kafka-based data ingestion pipeline, demonstrates a sophisticated approach to managing large-scale data with varying access patterns and retention requirements.

Opinions

  • The design choice of using separate databases for OLTP and OLAP is seen as beneficial for stability, scalability, and cost reasons, allowing for tailored performance and reduced operational costs.
  • DynamoDB is favored for its adaptive capacity, high availability, and auto-scaling capabilities, which are crucial for handling high read and write QPS and hotkey traffic in real-time order processing.
  • The use of DynamoDB's GSIs and sparse indexes is considered an advanced and efficient mechanism for managing ongoing orders and reducing table size, leading to better performance and cost savings.
  • MySQL RDS is preferred over Amazon Aurora for OLAP queries due to its maturity and cost-effectiveness, as it meets the requirements for less critical historical and statistical queries without incurring additional charges for IOPS.
  • The data ingestion pipeline, built on Kafka streams, is critical for maintaining consistency between the OLTP and OLAP databases, with robust error handling and retry mechanisms to ensure data integrity.
  • The architecture reflects a thoughtful approach to data retention and cost management, with features like DynamoDB's TTL and RDS partitioning to automatically manage data lifecycle and reduce storage costs.
  • The overall system design is opinionated towards balancing performance, cost, and complexity to support Grab's evolving business needs while ensuring high availability and consistent user experience.

Grab’s Database Architecture to Handle Millions of Orders — System Design

Image from CMRIX

The Grab Order Platform is a distributed system that processes millions of GrabFood or GrabMart orders every day. This post aims to share the journey of how they designed the database solution that powers the order platform.

What we will cover

  • Introduction-Query Patterns
  • Database Design Goals
  • Design Solution
  • Architecture details — OLTP Database-DynamoDB — OLAP Database — Data Ingestion Pipeline
  • Conclusion

Introduction — Query Patterns

Here are some important query examples the Grab Order Platform supports. We can break down queries into two categories:

  • Transactional queries — Transactional queries are critical to online order creation and completion, including the write queries
  • Analytical queries — Analytical queries retrieve historical orders or order statistics on demand. Analytical queries are not essential to the on-call order processing.

a. Write queries — Create an order and update an order

b. Read queries — Get order by ID, Get ongoing orders by passenger ID, Get historical orders by various conditions, Get order statistics (for example, get the number of orders)

Database Design Goals

From the query and traffic patterns, they arrived at the following three design goals:

  1. Stability — The database solution must be able to handle high read and write QPS.
  2. Scalability and cost — The database solution must be able to support fast evolution of business requirements, given that they handle up to a million orders per month.
  3. Consistency — Strong consistency for transactional queries and eventual consistency for analytical queries.

Design Solution

  • The first design principle towards a stable and scalable database solution is to use different databases to serve transactional and analytical queries, also known as OLTP and OLAP queries. This table keeps data for only a short period of time.
  • Meanwhile, an OLAP database has the same set of data, but serves their historical and statistical queries. This database keeps data for a longer time.
  • What are the benefits of this design principle? From a stability point of view, we can choose different databases that can better fulfill their different query patterns and QPS requirements.
  • An OLTP database is the single source of truth for online order processing; any failure in the OLAP database will not affect online transactions.
  • From a scalability and cost point of view, we can choose a flexible database for OLAP to support their fast evolution of business requirements.
  • We can maintain less data in their OLTP database while keeping some older data in their OLAP database.
  • To ensure that the data in both databases are consistent, they introduced the second design principle — the data ingestion pipeline.
  • In the figure below, Order Platform writes data to the OLTP database to process online orders and asynchronously pushes the data into the data ingestion pipeline. The data ingestion pipeline ensures that the OLAP database data is eventually consistent.

Architecture details

1. OLTP Database — DynamoDB

  • There are two categories of OLTP queries, key-value queries (for example, load by order idID and batch queries (for example, Get ongoing orders by passenger ID).

DynamoDB

Scalable and highly available: DynamoDB has a mechanism called adaptive capacity to handle hotkey traffic. Internally, DynamoDB will distribute higher capacity to high-traffic partitions, and isolate frequently accessed items to a dedicated partition. This way, the hotkey can utilise the full capacity of an entire partition, which is up to 3000 read capacity units and 1000 write capacity units.

  • In each DynamoDB table, it has many items with attributes. In each item, it has a partition key and a sort key. The partition key is used for key-value queries, and the sort key is used for range queries.
  • In their case, the table contains multiple order items. The partition key is the order ID. We can easily support key-value queries by the partition key.
  • Batch queries like ‘Get ongoing orders by passenger ID’ are supported by DynamoDB Global Secondary Index (GSI). A GSI is like a normal DynamoDB table, which also has keys and attributes.
  • In their case, they have a GSI table where the partition key is the pax_id_gsi. The attribute pax_id_gsi is linked to the main table. It is eventually consistent with the main table that is maintained by DynamoDB. If the Order Platform queries ongoing orders for Alice, two items will be returned from the GSI table.
  • They also make use of an advanced feature of GSI named sparse index to support ongoing order queries. When they update order status from ongoing to completed, at the same time, they set the pax_id_gsi to empty, so that the linked item in the GSI will be automatically deleted by DynamoDB.
  • At any time, the GSI table only stores the ongoing orders. They use a sparse index mechanism to control thier table size for better performance and to be more cost effective.
  • The next problem is data retention. This is achieved with the DynamoDB Time To Live (TTL) feature. DynamoDB will auto-scan expired items and delete them.
  • But the challenge is that adding TTL to big tables will bring a heavy load to the background scanner and might result in an outage. Their solution is to only add a TTL attribute to the new items in the table. The retention period of their DynamoDB data is three months.
  • Costwise, DynamoDB is charged by storage size and the provision of the read-write capability. The provisioning capability is actually auto-scalable. The cost is on-demand. So it’s generally cheaper than RDS.

2. OLAP Database

  • They use MySQL RDS as the database to support historical and statistical OLAP queries.
  • Why not Aurora? They chose RDS mainly because it is a mature database solution. Even if Aurora can provide better high-availability, RDS is enough to support their less critical use cases. Costwise, Aurora charges by data storage and the number of requested Input/Output Operations per Second (IOPS). RDS charges only by data storage.
  • They use MySQL partitioning for data retention. The order table is partitioned by creation time monthly. Since the data access pattern is mostly by month, the partition key can reduce cross-partition queries. Partitions older than six months are dropped at the beginning of each month.

3. Data Ingestion Pipeline

  • A Kafka stream is used to process data in the data ingestion pipeline. They chose the Kafka stream because it has 99.95% SLA.
  • Even if Kafka can provide 99.95% SLA, there is still the chance of stream producer failures. When the producer fails, they will store the message in an Amazon Simple Queue Service (SQS) and retry. If the retry also fails, it will be moved to the SQS dead letter queue (DLQ), to be consumed at a later time.
  • On the stream consumer side, they use back-off retry at both stream and database levels to ensure consistency. In a worst-case scenario, we can rewind the stream events from Kafka.
  • It is important for the data ingestion pipeline to handle duplicate messages and out-of-order messages. Duplicate messages are handled by the database level unique key (for example, order ID + creation time).
  • For the out-of-order messages, they implemented the following two mechanisms:
  1. Version update: they only update the most recently updated data. The precision of the update time is in microseconds, which is enough for most of the use cases.
  2. Upsert: if the update events occur before the create events, they simulate an upsert operation.

Conclusion

  • In terms of stability, they use DynamoDB as the critical OLTP database to ensure high availability for online order processing.
  • Scalability wise, they use RDS as the OLAP database to support their quickly evolving business requirements by using a rich, multiple index.
  • Cost efficiency is achieved by data retention in both databases. For consistency, they built a single source of truth OLTP database and an OLAP database that is eventually consistent with the help of the data ingestion pipeline.

Don’t forget to hit the Clap and Follow buttons to help me write more articles like this.

NoSQL
Distributed Systems
System Design Interview
Dynamodb
Programming
Recommended from ReadMedium