huizhou92

Summary

The web content compares Redis streams with Kafka, highlighting their architectural differences and use cases for event processing and ordered data handling.

Abstract

The article titled "Redis stream vs Kafka, A clash of two kings" delves into the architectural nuances and practical applications of Redis streams and Kafka. Kafka is recognized for its ability to handle large-scale data processing, exemplified by LinkedIn's use with 60 clusters and 1100 brokers in 2015. It excels in partitioned, ordered, event processing, which is crucial for scenarios like search indexing and user behavior tracking where event order is paramount. Redis, with its "stream" data structure introduced in version 5.0, offers a different approach to event storage and processing. While Redis streams are similar to Kafka's partitions in being persistent and ordered, they differ significantly in their consumer group model, which aligns more with traditional job queue systems and lacks the strict ordering guarantees of Kafka. To emulate Kafka's stream processing capabilities in Redis, additional client library features are necessary, such as event partitioning, worker partition assignment, and ordered processing with acknowledgment. The article suggests that while Redis provides fast in-memory processing, it is not designed for long-term, unlimited data storage, unlike Kafka. This necessitates the use of external storage solutions like S3 for replaying events, adding complexity but potentially reducing costs. The ease of deployment and operation, as well as the versatility of Redis, make it an attractive alternative to Kafka for those willing to invest in building out the necessary stream processing features.

Opinions

  • Kafka is highly scalable and reliable for large-scale data processing and maintaining event order, which is critical for many applications.
  • Redis streams offer a lightweight alternative to Kafka for event processing, with the advantage of being part of the Redis in-memory data store.
  • The author suggests that Redis' consumer group model is less sophisticated than Kafka's, leading to a loss of ordering guarantees, which is a core requirement for stream processing.
  • To achieve Kafka-like stream processing with Redis, significant additional development is required to implement features such as event partitioning and worker coordination.
  • The article posits that Redis' in-memory nature makes it unsuitable for storing large amounts of data long-term, necessitating the use of external storage solutions.
  • Despite the additional complexity, using Redis for stream processing can be cost-effective and easier to manage compared to Kafka.
  • The Python library Runnel is highlighted as a solution that brings Kafka-like semantics to Redis, simplifying the development of stream processing applications on top of Redis.

Redis stream vs Kafka, A clash of two kings

Source: https://www.educba.com/redis-vs-kafka/

Kafka is known for solving large-scale data processing problems and is widely deployed in the infrastructure of many well-known companies. As early as 2015, LinkedIn ran 60 clusters with a total of 1,100 brokers, processing 13 million messages per second.

But it turns out that scale isn't the only thing Kafka excels at. The programming paradigm it advocates -- partitioned, ordered event processing -- is a good solution for many problems you may encounter. For example, if events represent rows to be indexed into a search database, it matters that the last modification is the last one indexed; otherwise, searches will return stale data indefinitely. Similarly, if events represent user behavior, processing the second event ("user account upgrade") may depend on the first ("user account creation").

This paradigm differs from traditional job queue systems, where events are popped from the queue by many workers simultaneously. That approach is simple and scalable, but it breaks any ordering guarantee.

Suppose you want ordered processing, but you don't want to use Kafka because of its reputation as a heavyweight system that is difficult or expensive to operate. How does Redis, with the "stream" data structure released in version 5.0, compare? Does it solve the same problems?

Kafka Architecture

Let's first take a look at the basic architecture of Kafka. The fundamental data structure is the topic: a time-ordered, append-only sequence of records. The benefits of this data structure are well described in Jay Kreps' classic blog post, The Log.

Topics are partitioned to allow them to scale: each partition can be hosted on a separate Kafka instance. Records in each partition are assigned consecutive IDs called offsets, which uniquely identify each record within the partition. Consumers process records sequentially, keeping track of the last offset they've seen. Since records are persisted in the topic, multiple consumers can process them independently.
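To make offsets and independent consumers concrete, here is a minimal in-memory sketch. It is a toy model, not Kafka's API: the `Partition` and `Consumer` classes and their method names are made up for illustration.

```python
class Partition:
    """A toy append-only log; offsets are simply list indexes."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # New records always go at the end; existing ones are never modified.
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        # Return (offset, record) pairs starting at the given offset.
        return list(enumerate(self._records[offset:], start=offset))


class Consumer:
    """Tracks its own position, so consumers are independent of each other."""

    def __init__(self, partition):
        self.partition = partition
        self.next_offset = 0

    def poll(self):
        batch = self.partition.read_from(self.next_offset)
        if batch:
            self.next_offset = batch[-1][0] + 1
        return [record for _, record in batch]


log = Partition()
for event in ["created", "upgraded", "deleted"]:
    log.append(event)

indexer, auditor = Consumer(log), Consumer(log)
first_read = indexer.poll()    # everything appended so far, in order
log.append("restored")
second_read = indexer.poll()   # only the record appended since the last poll
audit_read = auditor.poll()    # a second consumer starts from offset 0
```

Because each consumer only stores an offset, adding another reader costs nothing on the log side, which is the property the paragraph above relies on.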

In practice, you may want to distribute your processing across many machines. To achieve this, Kafka provides an abstraction called a "consumer group," which is a group of cooperating processes consuming data from a topic. Partitions of the topic are assigned to members of the group. When members join or leave the group, partitions must be reassigned so that each member gets a fair share. This process is called rebalancing.

Note that a partition is processed by only one member of the consumer group (although a member may be responsible for multiple partitions). This ensures strictly ordered processing within each partition. This toolkit is very useful: you can easily scale your processing by adding more workers, while Kafka handles the distributed coordination problems.
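The exclusive-ownership rule can be illustrated with a simple round-robin assignment. This is only a sketch of the idea, not Kafka's actual assignment or rebalancing protocol; the function name and worker names are hypothetical.

```python
def assign_partitions(partitions, members):
    """Round-robin assignment: every partition gets exactly one owner."""
    if not members:
        return {}
    ordered = sorted(members)
    assignment = {member: [] for member in ordered}
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


partitions = range(6)
before = assign_partitions(partitions, ["worker-a", "worker-b"])
# A third worker joins: recompute so each member gets a fair share.
after = assign_partitions(partitions, ["worker-a", "worker-b", "worker-c"])
```

With two workers each owns three partitions; after the third joins, each owns two. A real system also has to make sure the old owner stops processing a partition before the new owner starts, which is the hard part that Kafka handles for you.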

Redis Stream Data Structure

How does Redis' "stream" data structure compare? A Redis stream is conceptually equivalent to a single partition of a Kafka topic as described above, with some minor differences:

  • It's a persistent, ordered event store (similar to Kafka).
  • It has a configurable maximum length (as opposed to a retention period in Kafka).
  • Events store keys and values, akin to a Redis hash (as opposed to a single key and value in Kafka).

The most significant difference is that consumer groups in Redis are entirely different from those in Kafka. In Redis, a consumer group is a set of processes that read from the same stream. Redis ensures that each event is delivered to only one consumer within the group. For example, in the diagram below, Consumer 1 won't process '9'; it will skip it because Consumer 2 has already seen it. Consumer 1 will get the next event not yet seen by any other group member.

The role of groups in Redis is to parallelize the processing of a single stream. This resembles a traditional job queue. Unfortunately, it also means losing the ordering guarantee that lies at the core of stream processing.
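The delivery behavior described above can be mimicked with a small in-memory model. This is a toy illustration of job-queue semantics, not the Redis API; `ToyGroup` and the entry numbers are made up for the example.

```python
from collections import deque


class ToyGroup:
    """Hands each entry to exactly one consumer, like a job queue."""

    def __init__(self, entries):
        self._undelivered = deque(entries)
        self.delivered = {}  # consumer name -> entries it received

    def read(self, consumer):
        if not self._undelivered:
            return None
        entry = self._undelivered.popleft()
        self.delivered.setdefault(consumer, []).append(entry)
        return entry


group = ToyGroup([7, 8, 9, 10])
group.read("consumer-1")        # receives 7
group.read("consumer-2")        # receives 8
group.read("consumer-2")        # receives 9
got = group.read("consumer-1")  # receives 10; consumer-1 never sees 9
```

Since the two consumers run concurrently in a real deployment, nothing forces entry 9 to finish processing before entry 10 starts, which is exactly where the ordering guarantee is lost.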

Stream Processing as a Client Library

So, if Redis effectively only provides topics with job queue semantics, how can we build a stream processing engine on top of it? Well, if you want Kafka's features, you need to build them yourself. That means implementing:

  1. Event partitioning. You need to create N streams and treat each stream as a partition. Then, upon sending, you need to decide which partition should receive it, probably based on the event's hash value or one of its fields.
  2. A worker partition assignment system. To scale and support multiple workers, you need to create an algorithm to distribute partitions among them, ensuring each worker has an exclusive subset, akin to Kafka's "rebalancing" system.
  3. Ordered processing with acknowledgment. Each worker needs to iterate through each of its partitions, tracking its offset. Although Redis consumer groups have job queue semantics, they can help here. The trick is to create one group per partition and give each group exactly one consumer. Each partition is then processed sequentially, and you can leverage the built-in state tracking of consumer groups. Redis tracks not only offsets but also acknowledgments for each event, which is powerful.
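Steps 1 and 3 above can be sketched as follows. This is a minimal sketch under several assumptions, not Runnel's implementation: `client` is assumed to be a redis-py `redis.Redis` instance using its default response format, and the partition count, stream naming scheme, group naming scheme, and consumer name are all hypothetical. Step 2 (assigning partitions to workers) is left out.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition index with a hash that is stable across processes."""
    # Python's built-in hash() is salted per process for strings, so use hashlib.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def stream_name(topic, partition):
    # Hypothetical naming scheme: one Redis stream per partition.
    return f"{topic}.{partition}"


def send(client, topic, key, fields):
    """Append an event to the stream for its key's partition (XADD)."""
    return client.xadd(stream_name(topic, partition_for(key)), fields)


def consume_partition(client, topic, partition, handler, count=10):
    """Read one batch as the sole consumer of this partition's group, in order."""
    stream = stream_name(topic, partition)
    group = f"{stream}.group"  # hypothetical: one group per partition
    # ">" asks for entries never delivered to this group (XREADGROUP).
    response = client.xreadgroup(group, "worker", {stream: ">"}, count=count)
    for _stream, entries in response:
        for entry_id, data in entries:
            handler(data)
            client.xack(stream, group, entry_id)  # acknowledge after success
```

Each group has to exist before it is read from, for example via `client.xgroup_create(stream, group, id="0", mkstream=True)`. Because all events with the same key land in the same stream and each stream has a single reader, events for one key are handled in the order they were sent.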

This is the absolute minimum. If you want your solution to be robust, you should also consider error handling: rather than simply letting your workers crash, you may want a mechanism that forwards failed events to a "dead letter" stream and continues processing.
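One possible shape for such a dead-letter mechanism is sketched below. It assumes `client` is a redis-py `redis.Redis` instance; the function name, the `events.dead` stream name, and the extra fields attached to failed events are hypothetical choices, not a standard.

```python
def process_with_dead_letter(client, stream, group, entry_id, data, handler,
                             dead_letter_stream="events.dead"):
    """Run the handler; on failure, park the event instead of crashing the worker."""
    try:
        handler(data)
    except Exception as exc:
        # Keep the original payload plus enough context to debug or replay later.
        failed = dict(data)
        failed["error"] = repr(exc)
        failed["source_stream"] = stream
        failed["source_id"] = entry_id
        client.xadd(dead_letter_stream, failed)
    # Acknowledge either way so the partition keeps moving.
    client.xack(stream, group, entry_id)
```

Acknowledging a failed event trades strict completeness for liveness: the partition is never blocked by one bad event, but someone has to monitor and drain the dead-letter stream.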

The good news is -- if you're a Python enthusiast -- that these problems, and more, have been addressed in a newly released library called Runnel. If you want to benefit from Kafka-like semantics on Redis, you're welcome to check it out. Below is how it looks, essentially the same as one of the Kafka diagrams above.

Workers coordinate their ownership of partitions via locks implemented in Redis. They communicate with each other through a special "control" stream. For more information, including a detailed breakdown of the architecture and rebalancing algorithm, please refer to the Runnel documentation.

Summary

Is Redis a good choice for large-scale event processing? There's a fundamental trade-off: because everything is in memory, you get unparalleled processing speed, but it's not suitable for storing unlimited amounts of data. With Kafka, you might be willing to retain all your events indefinitely, but with Redis, you're definitely storing a fixed window of the most recent events -- just enough for your processors to have a comfortable buffer in case they slow down or crash. This means you might also want to use an external long-term event store, such as S3, to be able to replay them, which adds complexity to your architecture but reduces costs.
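The "fixed window" idea maps onto the `MAXLEN` option of `XADD`. The sketch below assumes `client` is a redis-py `redis.Redis` instance; the function name and the default window size are hypothetical, and archiving to long-term storage is not shown.

```python
def append_capped(client, stream, fields, max_entries=100_000):
    """XADD with MAXLEN so the stream holds only a recent window of events."""
    # approximate=True corresponds to "MAXLEN ~", which lets Redis trim lazily
    # and may keep slightly more than max_entries entries.
    return client.xadd(stream, fields, maxlen=max_entries, approximate=True)
```

The window should be sized so that a slow or restarting worker can catch up before its unread events are trimmed; anything older has to come from the external store.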

The primary motivation for looking into this is the ease and low cost of deploying and operating Redis. That's what makes it attractive compared to Kafka. Redis is also an impressive toolkit that has stood the test of time. It turns out that, with some effort, it can support the distributed stream processing paradigm as well.
