A Guide to Choosing the Right AWS Streaming Service: Kinesis vs MSK
Organizations with a relentless focus on their customers need the ability to collect a variety of data points, process and analyze them, and then act on that data as quickly as possible to improve their customers' experience. As a result, these organizations are moving away from traditional batch data flows and adopting event-driven data pipelines that enable real-time analytics. This is essential in a competitive, rapidly changing market, where we must be able to grow with the needs of our clients.
Building event-driven systems is no trivial task. One of the more complex decisions is the careful selection of a scalable, durable, highly available streaming platform that is able to best collect, store, and analyze events. While there are services from multiple cloud providers to choose from, this post focuses on the options within the AWS platform.
OK, what are my options?
There are two AWS services to choose from:
- Amazon Kinesis Data Streams (Kinesis)
- Amazon Managed Streaming for Apache Kafka (MSK)
Both services are publish-subscribe (pub-sub) systems, which means producers publish messages to Kinesis/MSK and consumers subscribe to Kinesis/MSK to read those messages. An inherent benefit of adopting pub-sub systems is the decoupling of message producers from message consumers. Producers can produce messages at an incredibly fast pace without worrying about the messages’ downstream consumption, while consumers gain the benefit of consuming messages at a pace that does not overwhelm their resources.
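To make that decoupling concrete, here is a toy in-memory sketch of a pub-sub log. This is an illustration only: `MiniPubSub` is a made-up class, and Kinesis and MSK are durable, replicated services rather than a Python list. The key idea it shows is that producers append and move on, while each consumer tracks its own read position and polls at its own pace.

```python
class MiniPubSub:
    """Toy in-memory pub-sub stream illustrating producer/consumer decoupling.

    A teaching sketch only: Kinesis and MSK provide durable, replicated,
    partitioned versions of this same append-only-log idea.
    """

    def __init__(self):
        self._log = []       # append-only message log
        self._offsets = {}   # independent read position per consumer

    def publish(self, message):
        # Producers append and return immediately; they never wait on consumers.
        self._log.append(message)

    def subscribe(self, consumer_id):
        # New consumers start reading from the beginning of the log.
        self._offsets.setdefault(consumer_id, 0)

    def poll(self, consumer_id, max_messages=10):
        # Each consumer reads at its own pace from its own offset.
        start = self._offsets[consumer_id]
        batch = self._log[start:start + max_messages]
        self._offsets[consumer_id] = start + len(batch)
        return batch
```

Note that a slow consumer polling small batches and a fast consumer draining the log do not affect each other or the producer, which is the decoupling benefit described above.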
Kinesis is AWS's principal service that provides powerful capabilities to collect, process, and analyze real-time streaming data. It is a fully managed service that lets you build streaming applications while abstracting away the underlying infrastructure. In the summer of 2019, AWS announced the release of Managed Streaming for Apache Kafka (MSK). Apache Kafka is a distributed streaming platform originally developed at LinkedIn and later open sourced through the Apache Software Foundation. MSK takes away the operational burden of managing an Apache Kafka cluster.
Both services can scale to process petabyte-scale data volumes from a large number of sources with millisecond latency, enabling real-time analytics. However, selecting the right one for your use case requires careful consideration of the effort involved in scaling each service, the ease of development with it, and the cost of adopting it in your architecture.
Noteworthy: The original creators of Kafka went on to found Confluent and built the Confluent Platform, a fully managed enterprise event streaming platform that can run on AWS, GCP, and Microsoft Azure. For the purpose of this post, we are limiting our comparison to AWS services only.
Tell me, which one is better?
There are four key considerations in determining which service better fits your use case. Let's get started!
1. Availability, Scalability, and Ease of Management
Kinesis synchronously replicates data across three availability zones, providing high availability and data durability by default. AWS manages the infrastructure, storage, networking, and configuration needed to collect and store streaming data. You start by picking a name for the stream and selecting the number of shards. One shard provides an ingest capacity of 1 MB/sec or 1,000 records/sec. Kinesis provides auto-scaling capabilities through APIs that can trigger scaling actions based on usage metrics, and AWS also provides utilities that can auto-scale a Kinesis stream based on record throughput. Kinesis streams remain fully functional during the scaling process, and producers and consumers can continue to read from and write to the streams during these operations.
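As a back-of-the-envelope illustration of those per-shard limits, the sketch below estimates a starting shard count for a workload. The `required_shards` helper is hypothetical, not an AWS API; the constants are the published per-shard limits (1 MB/sec or 1,000 records/sec ingest, 2 MB/sec egress), and actual resizing happens through the Kinesis scaling APIs.

```python
import math

def required_shards(write_mb_per_sec, records_per_sec, read_mb_per_sec=0.0):
    """Estimate a Kinesis shard count from published per-shard limits.

    Hypothetical helper for illustration: each shard ingests up to
    1 MB/sec or 1,000 records/sec, and serves reads up to 2 MB/sec.
    The binding constraint (write bytes, record count, or read bytes)
    determines the shard count.
    """
    by_write_bytes = math.ceil(write_mb_per_sec / 1.0)
    by_record_count = math.ceil(records_per_sec / 1000.0)
    by_read_bytes = math.ceil(read_mb_per_sec / 2.0)
    return max(1, by_write_bytes, by_record_count, by_read_bytes)
```

For example, 5 MB/sec of writes at 2,000 records/sec needs 5 shards (bytes are the binding constraint), while 0.5 MB/sec at 4,500 records/sec also needs 5 shards (record count binds instead).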
MSK requires a cluster sizing exercise prior to resource provisioning. This exercise must factor in your use case, availability, and scaling needs. AWS provides guidelines to size your cluster, but these tend to be a directional starting point; identifying the right cluster size is an iterative process that requires cluster management expertise. There are some MSK defaults you can rely on for high availability. For example, when you configure an MSK cluster, the brokers are spread across three availability zones by default. If one availability zone goes down, the system can recover with no data loss, as long as replication has been enabled. Using the MSK APIs you can scale out your cluster by adding more brokers. At the time of writing, MSK brokers cannot be scaled up to larger instance types.
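As one rough illustration of what that sizing exercise involves, the heuristic below estimates per-broker storage from ingest rate, retention, and replication factor. This is a hypothetical simplification, not AWS's official sizing guidance; a real exercise must also account for compression, growth headroom, and partition skew across brokers.

```python
def estimated_storage_per_broker_gb(ingest_mb_per_sec, retention_hours,
                                    replication_factor=3, broker_count=3):
    """Rough MSK storage sizing heuristic (illustrative, not AWS guidance).

    Total retained bytes = ingest rate * retention window * replication
    factor, then divided evenly across brokers. Real clusters need extra
    headroom and rarely distribute data perfectly evenly.
    """
    total_mb = ingest_mb_per_sec * 3600 * retention_hours * replication_factor
    total_gb = total_mb / 1024
    return total_gb / broker_count
```

For instance, 10 MB/sec retained for 24 hours with a replication factor of 3 across 3 brokers works out to roughly 844 GB of storage per broker, before any headroom.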
If you are starting with real-time streaming architectures and are working with low volumes of data, Kinesis is the preferred service because of its ease of use and minimal operational management. However, if you currently operate a Kafka cluster on premises, or have high-volume workloads and are evaluating your options in AWS, MSK may be the correct choice.
2. Integration with other AWS Services
Real-time stream processing is an essential component of streaming pipelines. Computations like filters, joins, type conversions, and aggregation windows are ubiquitous operations that derive insights from data. Your choice of platform affects the stream processors available to you. Kinesis tightly integrates with multiple AWS services, making processing real-time data simple and accessible. This integration with the AWS ecosystem drastically reduces the time it takes to set up your data pipeline and start serving value back to customers.
With Kinesis, you can build streaming applications using:
- Kinesis Data Firehose: To load data into S3, Redshift, or Amazon Elasticsearch Service.
- Kinesis Data Analytics: To build and deploy SQL or Flink applications.
- AWS Lambda: Serverless compute to perform custom stream processing.
- Amazon EMR: To process big data leveraging the Spark or Flink framework.
- EC2 / Fargate / EKS: To build custom streaming applications.
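For example, a minimal AWS Lambda handler for a Kinesis event source might look like the sketch below. Kinesis delivers each record's payload base64-encoded inside the Lambda event; the click-counting filter is a hypothetical example of custom processing, not a prescribed pattern.

```python
import base64
import json

def handler(event, context):
    """Minimal Lambda handler sketch for a Kinesis event source.

    Lambda receives batches of Kinesis records; each payload arrives
    base64-encoded under record["kinesis"]["data"]. The "click" filter
    below is a hypothetical business rule for illustration.
    """
    clicks = 0
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("type") == "click":  # hypothetical event type
            clicks += 1
    return {"clicks": clicks}
```

Because Lambda manages polling, batching, and retries for the Kinesis event source, the handler itself stays focused on the per-batch business logic.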
With MSK, you can build streaming applications using:
- Kinesis Data Analytics: To build and deploy SQL or Flink applications.
- Amazon EMR: To process big data leveraging the Spark or Flink framework.
- EC2 / Fargate / EKS: To build custom streaming applications. There are some very powerful, easy-to-use, open source streaming frameworks specific to Kafka, such as KSQL and the Kafka Streams API, that expedite pipeline development. However, these do require operational expertise to deploy in a scalable manner.
Since Kinesis is AWS’s principal streaming service, it is no surprise that it provides superior integration with other AWS services. If these AWS services are already part of your organization’s toolkit, your timeline to build and deploy a real-time analytics solution will be significantly shorter in comparison.
3. Price and Cost
As a fully managed streaming service, Kinesis uses a pay-as-you-go pricing model. Pricing is based on Shard-Hours and PUT Payload Units (billed in 25 KB increments). One shard provides ingest capacity of 1 MB/sec or 1,000 records/sec, and a Shard-Hour is the number of shards used by your stream, charged at an hourly rate. With MSK, you pay for the number of broker instances in your cluster and the storage volumes attached to them. Use these pricing examples to calculate the cost for your use case:
For low volume workloads (up to 10 MB/sec), Kinesis is cheaper to set up and operate compared to the fixed cost of setting up and operating an MSK cluster to process the same volume. The smallest recommended MSK cluster you can provision currently is a 3x m5.large cluster, which is more expensive than the minimum 1 shard stream you can create with Kinesis.
For medium volume workloads (up to 100 MB/sec) and high volume workloads (up to 1000 MB/sec and above), there are other factors that play a role in determining cost. Kafka configuration settings dictate the performance you gain from your MSK cluster — number of partitions, replication factor, compression type and security protocol are a few of many configurations to consider. The smaller you can optimize your MSK cluster, the less you pay for it. With Kinesis, there is little configuration needed in comparison, so the cost scales directly with the number of shards used.
It is important to consider the limitations of the Kinesis service, which can make it an expensive solution at scale. A shard in Kinesis supports consumers reading data at a maximum of 2 MB/sec. To gain 4 MB/sec read performance you would have to distribute your data across 2 shards, thereby doubling your cost. By default, that 2 MB/sec/shard output is shared between ALL the consumers of data in that shard. If you need to give each consumer its own 2 MB/sec of throughput, you have to use enhanced fan-out at an additional price, which makes the service very expensive. With MSK, you can configure your data to be retrieved at a lower per-unit price at higher throughput in comparison.
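To see how that read-side limit drives shard count (and therefore cost), consider the small sketch below. The helper name is illustrative; the 2 MB/sec figure is the published per-shard egress limit, shared across consumers by default and dedicated per consumer with enhanced fan-out (which is billed separately per consumer).

```python
import math

def shards_for_consumers(read_mb_per_sec_each, consumers, enhanced_fanout=False):
    """Shards needed so every consumer reaches its target read throughput.

    Illustrative helper: without enhanced fan-out, all consumers share
    a single 2 MB/sec pipe per shard; with it, each consumer gets a
    dedicated 2 MB/sec per shard (at additional per-consumer cost).
    """
    if enhanced_fanout:
        total_demand_per_shard_pipe = read_mb_per_sec_each
    else:
        # Consumers contend for the same 2 MB/sec, so demand adds up.
        total_demand_per_shard_pipe = read_mb_per_sec_each * consumers
    return max(1, math.ceil(total_demand_per_shard_pipe / 2.0))
```

For example, three consumers each needing 2 MB/sec require 3 shards on the shared read path, but only 1 shard with enhanced fan-out; the cost comparison then comes down to shard-hours versus the enhanced fan-out surcharge.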
Ultimately, true cost calculations must also include the cost of the teams needed to support and maintain the service, implement best practices, manage resources efficiently, and prioritize cost-optimization efforts while maintaining the elasticity to meet customer demand.
4. Message Delivery
Message delivery semantics are critical to building fault-tolerant streaming data pipelines. Both Kinesis and MSK are distributed streaming platforms where producers, consumers, and the streaming platform itself can fail independent of each other. Therefore, it is critical to understand the design implications of message delivery guarantees for a streaming platform to be able to build fault-tolerant real-time analytics applications. There are three message delivery semantics:
- At-least-once delivery — In the event a system fails or a network issue occurs, a message producer may continue to retry a message until it receives a successful acknowledgement. This can cause message duplication in the streaming platform. Hence, consumer applications must handle these scenarios by explicitly de-duplicating messages.
- At-most-once delivery — If message producers are configured not to retry messages, data loss can occur when the streaming platform fails to commit and acknowledge a message. Consumers are only guaranteed messages that were successfully written to the streaming platform. Data loss is a serious problem for most businesses, but its significance varies by use case, especially if the original message can be recovered or reproduced easily.
- Exactly-once delivery — The perfect world where a producer sends a message, it is written exactly once to the streaming platform and no duplication of messages or data loss occurs when a consumer reads that message.
Kinesis provides at-least-once message delivery, while MSK (Kafka) can provide exactly-once delivery. The amount of complexity you are willing to take on in building your application will help inform your decision. If you select Kinesis, your application must anticipate and handle duplicate records. If you select MSK (Kafka), it is important to read and understand its transactions API in order to take advantage of exactly-once delivery.
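For example, an at-least-once consumer might de-duplicate on a producer-supplied record ID, as in the sketch below. The `id` field and the in-memory set are illustrative assumptions; production systems typically persist seen IDs in a durable store (with a TTL) so de-duplication survives consumer restarts.

```python
def deduplicate(records, seen_ids):
    """Drop records whose IDs have already been processed.

    Sketch of the explicit de-duplication an at-least-once consumer
    (e.g. one reading from Kinesis) must perform. Assumes the producer
    embeds a unique "id" in every record; seen_ids is an in-memory set
    standing in for a durable, TTL-backed store.
    """
    fresh = []
    for record in records:
        if record["id"] not in seen_ids:
            seen_ids.add(record["id"])
            fresh.append(record)
    return fresh
```

A duplicate delivered in a later batch is filtered the same way, because `seen_ids` carries state across polls.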
Conclusion
There may be other service-specific features, such as producer and consumer libraries in your preferred programming language or ease of monitoring, that seem extremely appealing and make this decision a challenging one. That is why I recommend staying laser-focused on the objective that started you on this journey: improving customer experience. If you are looking for a solution that gets you started quickly, is fully managed, and does not require multidisciplinary expertise, consider Kinesis. If you have the resources to build and manage a highly configurable streaming platform and find value in adopting open source technology (Kafka), MSK will serve you better.