Kafka best practices edition → How to design Kafka message key and why it is important in…

Summary

Effective design of Kafka message keys is crucial for deduplication, message ordering, and avoiding data skewing, which directly impacts application performance and scalability.

Abstract

In Apache Kafka, a message key is a critical component that determines how messages are distributed and consumed. A well-designed key helps deduplicate messages, ensures ordered consumption when necessary, and prevents data skew across partitions. Deduplication is achieved on compacted topics (cleanup.policy=compact): when two records share the same key, the older one is eventually removed, provided the broker's log cleaner is running (log.cleaner.enable=true). Ordered consumption relies on a consistent key, so that related messages land in one partition and are processed sequentially by a single consumer in the group. Data skew can be mitigated by designing keys that distribute data evenly across partitions, which enables horizontal scalability and good consumer performance. The design of the key should be tailored to the use case: for applications where message order is not a concern, a unique or null key helps distribute the load; for those requiring ordered processing, a generic grouping identifier is recommended as the key. Additionally, the record-level headers introduced in Kafka 0.11.0 provide a better way to include metadata without affecting the key's function.

Opinions

The key should be carefully designed to avoid data skewing, which can lead to uneven consumer workloads and reduced performance.
Using a consistent key for related messages is essential for applications that require ordered processing.
Adding metadata to the key, a practice before Kafka 0.11.0, is considered a poor design choice; instead, record-level headers should be used for metadata.
Setting the key to null or a unique value can enhance horizontal scalability by distributing messages across partitions, allowing for increased throughput by adding more consumers.
The use of compacted topics in conjunction with well-designed keys can improve deduplication and storage efficiency.

Kafka best practices edition → How to design Kafka message key and why it is important in determining application performance?

What is a Kafka Message: A record or unit of data within Kafka. Each message has a key and a value, and optionally headers. The key is commonly used for data about the message, and the value is the body of the message.

Message Key → Can be null or contain a value that says something about the data, such as a user or email ID, or a hash of the message.

Message Value → The actual data that needs to be sent to Kafka.

Why the “Key” value is important →

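Deduplication: On a compacted topic (cleanup.policy=compact), if two records have the same key, the older record is eventually removed by the log cleaner, which must be running on the broker (log.cleaner.enable=true). Only the latest value per key is retained, which gives deduplication by key.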
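The effect of compaction described above can be pictured with a small simulation. This is only a sketch of the end result (latest value per key), not Kafka's cleaner implementation; the class name `CompactionSketch`, the `compact` helper, and the sample keys are hypothetical. It also models the fact that a record with a null value acts as a tombstone that deletes the key.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of what a compacted topic eventually retains: the latest value per key. */
final class CompactionSketch {

    /** Each record is a {key, value} pair, listed in offset order. */
    static Map<String, String> compact(List<String[]> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : log) {
            String key = record[0];
            String value = record[1];
            // Remove any older entry for this key first.
            latest.remove(key);
            if (value != null) {
                // Keep only the newest value; a null value (tombstone) leaves the key deleted.
                latest.put(key, value);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> log = Arrays.asList(
            new String[] {"user-1", "address=A"},
            new String[] {"user-2", "address=B"},
            new String[] {"user-1", "address=C"});   // supersedes the first record
        // Prints the surviving key/value pairs.
        compact(log).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

In a real cluster the older records disappear only after the cleaner has processed the relevant log segments, so consumers may still see duplicates for a while.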

Ordering consumption: If the consumer must process messages in order, the key should identify the group of messages whose order matters. For example, if the key is an email ID, all messages produced for that email ID land in the same partition. Because a partition is read by only one consumer within a consumer group, that consumer sees those messages in the order they were written. If the key is null or a unique value such as a counter, messages are spread across partitions, and consumers cannot guarantee ordering.
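The routing behind this guarantee can be sketched as follows. This is an illustration under assumptions, not Kafka's code: `OrderingSketch`, `partitionFor`, and `route` are hypothetical names, and the hash is a stand-in (Kafka's default partitioner hashes the serialized key with murmur2). What matters is only that the same key always maps to the same partition.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OrderingSketch {

    /** Stand-in for key hashing; NOT Kafka's murmur2, but deterministic in the same way. */
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return (hash & 0x7fffffff) % numPartitions;   // force non-negative, then wrap
    }

    /** Appends each {key, event} record to the log of the partition chosen by its key. */
    static List<List<String>> route(List<String[]> records, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] record : records) {
            partitions.get(partitionFor(record[0], numPartitions)).add(record[0] + ":" + record[1]);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] {"customer-a", "created"},
            new String[] {"customer-b", "created"},
            new String[] {"customer-a", "updated"},
            new String[] {"customer-a", "deleted"});
        List<List<String>> partitions = route(records, 10);
        for (int p = 0; p < partitions.size(); p++) {
            if (!partitions.get(p).isEmpty()) {
                System.out.println("partition " + p + ": " + partitions.get(p));
            }
        }
    }
}
```

Because the partition is derived from the key and the partition count, changing the number of partitions on a topic can move a key to a different partition, so ordering across that change should not be assumed.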

Data Skewing: If we use the customer email ID as the key, all data from the same customer goes to the same partition, which can cause data skew. Suppose the application has only two active customers and the Kafka topic has 10 partitions: at most 2 partitions receive data, and the remaining 8 or more stay empty. This hurts consumer performance. Even if we scale out to 1 consumer per partition to increase throughput, at least 8 consumers sit idle while at most 2 work at full capacity, so we cannot achieve horizontal scalability.
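The two-customer scenario above can be simulated to make the skew visible. This is a rough sketch: `SkewSketch` and its helpers are hypothetical, the key names are made up, and the hash is a stand-in for the murmur2 hash that Kafka's default partitioner applies to keyed records.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SkewSketch {

    /** Stand-in for key hashing; Kafka's default partitioner uses murmur2 instead. */
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return (hash & 0x7fffffff) % numPartitions;
    }

    /** Counts how many records each partition would receive for the given keys. */
    static int[] loadPerPartition(List<String> keys, int numPartitions) {
        int[] load = new int[numPartitions];
        for (String key : keys) {
            load[partitionFor(key, numPartitions)]++;
        }
        return load;
    }

    /** Number of partitions that received at least one record. */
    static int busyPartitions(int[] load) {
        int busy = 0;
        for (int count : load) {
            if (count > 0) {
                busy++;
            }
        }
        return busy;
    }

    public static void main(String[] args) {
        List<String> twoCustomers = new ArrayList<>();
        List<String> uniqueKeys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            twoCustomers.add(i % 2 == 0 ? "customer-a" : "customer-b");
            uniqueKeys.add("order-" + i);
        }
        System.out.println("two customers: " + Arrays.toString(loadPerPartition(twoCustomers, 10)));
        System.out.println("unique keys:   " + Arrays.toString(loadPerPartition(uniqueKeys, 10)));
    }
}
```

With only two distinct keys, no more than two counters can ever be non-zero regardless of how many partitions or consumers are added, whereas per-record unique keys spread the load.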

How to design the key value → It depends on the use case at hand, as shown below.

If ordering consumption is not an issue, it is advisable to set the key to null or to some value that is unique to the record. This distributes the load across partitions, which in turn helps attain horizontal scalability: because messages are spread out, we can simply add consumers to process more partitions in parallel, increasing the application's throughput.

If ordering consumption is important, set the key to a generic grouping identifier, such as an email ID, customer ID, or organization ID, so that all records for a given email, customer, or organization land in the same partition.

Old habits die hard. When there was a need to add metadata to a record, people using Kafka before 0.11.0 used to add it to the Kafka “key”. This is a bad design for two reasons. On a compacted topic, the extra information in the key makes records incorrectly appear unique, so older records are never cleaned up. And if one of the metadata values is the same for most records, it can create data skew as described above. Record-level headers, introduced in Kafka 0.11.0, solve this by providing a way to send a list of headers with each record.

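// Attach metadata as a record header instead of encoding it in the key.
List<Header> headers = Arrays.asList(new RecordHeader("header_key", "header_value".getBytes(StandardCharsets.UTF_8)));
// Arguments: topic, partition (null lets the partitioner choose), key, value, headers.
ProducerRecord<String, String> record = new ProducerRecord<>("topic", null, "key", "value", headers);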
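To see why metadata in the key defeats compaction, the sketch below counts how many records a compacted topic would retain, using the fact that compaction keeps one record per distinct key. The class `KeyMetadataSketch`, the helper name, and the `customer-1|req-N` key format are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class KeyMetadataSketch {

    /** Compaction retains one record per distinct key, so this is the retained count. */
    static int retainedAfterCompaction(List<String> keys) {
        return new HashSet<>(keys).size();
    }

    public static void main(String[] args) {
        // Hypothetical keys with a request id packed into the key as metadata.
        List<String> keysWithMetadata = Arrays.asList(
            "customer-1|req-1", "customer-1|req-2", "customer-1|req-3");
        // The same records keyed by customer only; the request id would travel in a header.
        List<String> plainKeys = Arrays.asList("customer-1", "customer-1", "customer-1");

        System.out.println("metadata in key: " + retainedAfterCompaction(keysWithMetadata));
        System.out.println("plain key:       " + retainedAfterCompaction(plainKeys));
    }
}
```

Every update for the same customer looks like a new key in the first case, so nothing is ever superseded; with the plain key only the latest record survives.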