Kafka best practices edition → How to design a Kafka message key, and why does it matter for application performance?
What is a Kafka Message: A record or unit of data within Kafka. Each message has a key and a value, and optionally headers. The key usually carries identifying data about the message, and the value is the body of the message.
Message Key → Can be null, or contain a value that says something about the data, such as a user id, an email id, or a hash of the message.
Message Value → The actual data that needs to be sent to Kafka.
Why the “Key” is important →
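- Deduplication: If two messages have the same key, log compaction eventually removes the older one and keeps only the latest value for that key. This applies only to topics configured with cleanup.policy=compact, and the broker's log cleaner must be running (log.cleaner.enable=true). This helps to achieve deduplication.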
- Ordered consumption: Kafka guarantees ordering only within a partition. If the consumer must process messages in order, the key should identify the group of messages that have to stay in order. Say the key is an email id: all messages produced for that email id land in the same partition. Because a partition is read by only one consumer within a given group id, that consumer sees those messages in the order they were written. If the key is null, or a unique value such as a counter, messages are spread across partitions, in which case consumers cannot vouch for ordering.
- Data skew: If we use the customer email id as the key, all data from the same customer goes to the same partition, which can cause data skew. Say the application has only two active customers and the Kafka topic has 10 partitions: at most 2 partitions receive data, and the remaining 8 or more stay empty. This hurts consumer performance, because even if we add consumers to increase throughput, say 1 consumer per partition, at least 8 consumers sit idle while at most 2 work at full capacity, so we cannot achieve horizontal scalability.
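The ordering and skew points above can be sketched with a toy partitioner. This is a minimal illustration, not Kafka's implementation: the class KeyPartitionSketch, its method names, and the customer-a / customer-b keys are made up for the example, and Arrays.hashCode stands in for the murmur2 hash that Kafka's default partitioner applies to the serialized key.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of key-based partition assignment (not Kafka's real partitioner). */
class KeyPartitionSketch {

    /**
     * Maps a key to a partition. Assumption: Arrays.hashCode is a stand-in for
     * the murmur2 hash that Kafka applies to the serialized key bytes.
     */
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    /** Counts how many records land on each partition for the given keys. */
    static Map<Integer, Integer> distribution(List<String> keys, int numPartitions) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(partitionFor(key, numPartitions), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int partitions = 10;
        List<String> customerKeys = new ArrayList<>();
        List<String> uniqueKeys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            // Two hypothetical active customers versus one unique key per record.
            customerKeys.add(i % 2 == 0 ? "customer-a" : "customer-b");
            uniqueKeys.add("record-" + i);
        }
        // The same key always maps to the same partition, so at most 2 partitions are used.
        System.out.println("customer id as key:    " + distribution(customerKeys, partitions));
        // Unique keys spread across the partitions, at the cost of per-customer ordering.
        System.out.println("unique key per record: " + distribution(uniqueKeys, partitions));
    }
}
```

Running it prints one partition-to-count map per strategy. With the customer key the map has at most two entries regardless of how many partitions exist, which is the idle-consumer situation described above.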
How to design the key → It depends on the use case at hand, as shown below
- If ordered consumption is not a requirement, it is generally advisable to set the key to null or to some value that is unique to the record. That distributes the load across partitions, which in turn helps to attain horizontal scalability: because messages are spread out, adding consumers to read from more partitions increases application throughput and processing power.
- If ordered consumption is important, set the key to a generic grouping identifier, such as an email id, customer id, or organization id, so that all records for a given email, customer, or organization land in the same partition.
- Old habits die hard. When there was a need to attach metadata to a record, people using Kafka before 0.11.0 used to add it to the Kafka “key”. This is a bad design for two reasons. On a compacted topic, extra information in the key makes records incorrectly appear unique, so older records are never compacted away. And if one piece of that metadata is the same for most records, it can create data skew as mentioned above. Record-level headers were therefore introduced in Kafka 0.11.0, which provide a way to send a list of headers with each record.
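// Metadata goes into record headers, so the key stays a clean identifier.
List<Header> headers = Arrays.asList(new RecordHeader("header_key", "header_value".getBytes(StandardCharsets.UTF_8)));
// Arguments: topic, partition (null lets the partitioner decide), key, value, headers.
ProducerRecord<String, String> record = new ProducerRecord<>("topic", null, "key", "value", headers);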
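To see why metadata in the key defeats compaction, here is a small simulation. It is a sketch under stated assumptions rather than how the broker works internally: the CompactionSketch class, the Rec type, and the "customer-1:req-1" style keys are hypothetical, and the model ignores details such as segments, tombstones, and cleaner timing. It only captures the end result of compaction, which is that the latest value per key survives.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of log compaction: only the latest value per key survives. */
class CompactionSketch {

    /** Minimal key/value record; headers are left out because compaction looks only at the key. */
    static final class Rec {
        final String key;
        final String value;

        Rec(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Returns the latest value for each key, in order of first appearance. */
    static Map<String, String> compact(List<Rec> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Rec r : log) {
            latest.put(r.key, r.value); // a newer record overwrites the older one
        }
        return latest;
    }

    public static void main(String[] args) {
        // Key holds only the entity id; metadata would travel in headers instead.
        List<Rec> cleanKeys = Arrays.asList(
                new Rec("customer-1", "v1"),
                new Rec("customer-1", "v2"));

        // Key polluted with a hypothetical request id, so every record looks unique.
        List<Rec> pollutedKeys = Arrays.asList(
                new Rec("customer-1:req-1", "v1"),
                new Rec("customer-1:req-2", "v2"));

        System.out.println(compact(cleanKeys).size());    // prints 1
        System.out.println(compact(pollutedKeys).size()); // prints 2
    }
}
```

With the clean key, only the latest value for customer-1 remains. With the polluted key, both records are retained, which is the "incorrectly unique" problem described above.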





