What Is Apache Kafka?
Apache Kafka is a popular open source platform for streaming, storing, and processing high volumes of data. Kafka was developed by a team of engineers at LinkedIn and open-sourced in 2011. Thousands of companies around the world, including Datadog, use Kafka. Businesses powered by Kafka typically generate large amounts of information that must be quickly understood and acted upon.
In the video below, we break down how Kafka works and how it’s able to provide you with a reliable, scalable, and highly efficient service for managing events. We also touch on some key resources for effectively monitoring your Kafka deployments via Datadog.
Kafka Architecture
Within Kafka, each unit of data in the stream is called a message. Messages could be clickstream data from a web app, point-of-sale data for a retail store, user data from a smart device, or any other events that underlie your business. Applications that send the message stream into Kafka are called producers. Kafka servers called brokers receive the stream and write the messages sequentially to immutable log files.
Messages with similar traits may be categorized into groups called topics. Applications called consumers subscribe to topics, and process the messages. You might be familiar with some of this terminology if you’ve used a traditional messaging system or publish-subscribe system.
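To make these roles concrete, here is a minimal sketch using the open source kafka-python client; the broker address and the “sales-events” topic name are placeholders, not values from this article.

```python
# Minimal producer/consumer sketch with kafka-python (pip install kafka-python).
# The broker address and topic name are illustrative placeholders.
from kafka import KafkaProducer, KafkaConsumer

# Producer: an application that sends messages into Kafka
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sales-events", value=b'{"item": "coffee", "amount": 4.50}')
producer.flush()  # block until the broker acknowledges the message

# Consumer: an application that subscribes to a topic and processes its messages
consumer = KafkaConsumer(
    "sales-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the log
)
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```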
What Are the Advantages of Kafka?
Kafka has some key advantages, primarily in its reliability, scalability, and speed. Below, we will explore each of these advantages.
Kafka Reliability
In a traditional messaging or pub-sub system, the producer sends a message to a queue where it waits for a consumer service to read it. The message is then removed from the queue. This design has some shortcomings. For example, there’s no way to recover messages if the consumer service fails. By contrast, Kafka is a persistent log-based message queue. In this type of system, brokers store incoming messages in the order they are received.
Kafka uses an offset to bookmark a consumer’s progress in processing data, so if the service fails it can easily come back to where it left off without duplicating any effort. As a consumer reads messages and its offset advances, messages that have already been read remain on disk, where other consumers can still access them for later analysis. Each consumer has its own offset, so multiple consumers can independently read from the same data stream. Kafka also uses a unique system of partitioning and replicating messages to support highly reliable data streaming.
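As a rough illustration of how offsets bookmark progress, the sketch below uses the kafka-python client; the group and topic names are hypothetical, and process() stands in for whatever your application does with each message.

```python
# Sketch: tracking progress with offsets in kafka-python.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sales-events",
    bootstrap_servers="localhost:9092",
    group_id="analytics-service",  # offsets are tracked per consumer group
    enable_auto_commit=False,      # commit explicitly, only after processing succeeds
)

for message in consumer:
    process(message.value)  # hypothetical processing step
    consumer.commit()       # bookmark progress; after a restart, this group
                            # resumes from its last committed offset
```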
Let’s look at a real-world example: a sales data stream for a retailer. The retailer’s point-of-sale machines generate an event for each transaction. Producer applications continuously feed these sales messages into Kafka. A Kafka broker writes the messages to a topic called “sales,” but instead of every sale being written to a single log file, the topic is partitioned based on a key such as the city in which the sale occurred. The retailer configures the producer to route messages according to this key, which ensures that the brokers write sales from the same city to the same partition.

Each partition is replicated across a cluster, or group, of multiple Kafka brokers. Within each partition, one broker acts as the leader and the remaining brokers are followers. The leader handles all the read and write requests for the partition, but if the leader goes down, a follower automatically takes over as the leader. With this fail-safe mechanism, you can reliably stream and store data without having to worry about routine outages.
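Here is a hedged sketch of how a producer might route sales by city, again using kafka-python; the city keys and the sale record are made up for illustration. Messages that share a key are always written to the same partition.

```python
# Sketch: keyed partitioning with kafka-python.
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,                                 # city name -> bytes
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # dict -> JSON bytes
)

sale = {"store_id": 42, "item": "espresso machine", "amount": 199.99}
producer.send("sales", key="paris", value=sale)  # every "paris" sale lands in one partition
producer.send("sales", key="tokyo", value=sale)  # "tokyo" sales may land in a different one
producer.flush()
```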
The consumer side of Kafka also has a fail-safe mechanism. You can create multiple instances of a consumer application to read messages from the same topic. Together, these instances make up a consumer group. Each partition is assigned to one consumer in the group. In our example, one consumer might process sales from City A and another consumer might process sales from City B. Businesses can also add extra consumers to a group. These excess consumers sit idle but will seamlessly take over data processing if any of the active consumers fails. Thanks to the offset, a fallback consumer will know where to start reading incoming messages.
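For illustration, a consumer group can be formed simply by starting several copies of the same consumer with a shared group ID; the sketch below assumes the “sales” topic and a hypothetical “inventory-service” group name.

```python
# Sketch: joining a consumer group with kafka-python.
# Run this script multiple times; Kafka assigns each instance a share of the
# topic's partitions and rebalances them if an instance fails.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sales",
    bootstrap_servers="localhost:9092",
    group_id="inventory-service",  # every instance uses the same group_id
)

for message in consumer:
    print(f"partition {message.partition}, offset {message.offset}: {message.value}")
```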
Kafka Scalability
Distributing a topic’s partitions across many brokers allows the topic to scale well beyond any single host. One cluster of Kafka brokers can host multiple topics, allowing you to scale several unique data streams. Developers can specify the number of partitions in each topic, and Kafka will automatically assign the partitions to existing brokers in the cluster, allowing easy scalability.
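For example, a topic’s partition count and replication factor can be set when the topic is created; here is a small sketch using kafka-python’s admin client, with illustrative numbers.

```python
# Sketch: creating a topic with a chosen number of partitions and replicas.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="sales", num_partitions=12, replication_factor=3)
])
```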
Kafka also enables consumer applications to process data at scale. Adding consumer instances to a group increases your processing capacity. Kafka brokers will automatically load-balance partitions among the consumer group, so a topic can be processed at scale. In addition, since multiple Kafka consumers can read data in parallel, you can quickly get different types of business insights. In our example from before, the retailer could use inventory management software and CRM software to process the same sales data at the same time.
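In practice, the inventory and CRM applications would simply read the same topic under different consumer group IDs. The sketch below is illustrative only; the two consumers would normally run as separate services.

```python
# Sketch: two independent consumer groups reading the same topic in parallel.
# Each group keeps its own offsets, so neither affects the other's progress.
from kafka import KafkaConsumer

inventory = KafkaConsumer("sales", bootstrap_servers="localhost:9092",
                          group_id="inventory-service")
crm = KafkaConsumer("sales", bootstrap_servers="localhost:9092",
                    group_id="crm-service")

inventory_batch = inventory.poll(timeout_ms=1000)  # reads at its own pace
crm_batch = crm.poll(timeout_ms=1000)              # with its own offsets
```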
Kafka Performance
Another advantage of Kafka is speed. Kafka can stream, store, and process millions of reads and writes every second. Kafka was designed for low latency, and can be optimized for throughput by batching and compressing messages. Additionally, the Kafka fail-safe mechanism mentioned before helps keep data pipelines running smoothly.
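As a hedged example of throughput tuning, a producer can be configured to batch and compress messages before sending them; the values below are illustrative starting points rather than recommendations.

```python
# Sketch: producer settings that trade a little latency for higher throughput.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    compression_type="gzip",  # compress each batch before it is sent
    batch_size=64 * 1024,     # accumulate up to 64 KB per partition batch
    linger_ms=20,             # wait up to 20 ms to fill a batch
    acks="all",               # wait for all in-sync replicas to acknowledge
)
```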
How to Monitor Kafka in Production
Kafka lends a good deal of flexibility to developers. For example, after a consumer application processes streams of data, you can feed that data back into Kafka for consumption by other applications. In other words, the consumer of one data stream becomes the producer of another data stream. Hundreds or even thousands of derivative data streams can build on each other.
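A minimal sketch of such a derivative stream, assuming a hypothetical enrich() function and topic names: a service consumes raw sales, transforms them, and produces the result to a new topic for downstream consumers.

```python
# Sketch: a consumer of one stream acting as the producer of another.
from kafka import KafkaConsumer, KafkaProducer
import json

consumer = KafkaConsumer("sales", bootstrap_servers="localhost:9092",
                         group_id="enrichment-service")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for message in consumer:
    enriched = enrich(json.loads(message.value))  # hypothetical enrichment step
    producer.send("sales-enriched", value=json.dumps(enriched).encode("utf-8"))
```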
If your business generates large volumes of data, you can use Kafka to unlock interesting real-time business insights from your data with very little overhead. Companies that adopt Kafka often end up creating complex data pipelines that connect multiple streams of data together. That is the power of Kafka, but this complexity can make it challenging to manage and monitor a Kafka deployment. If Kafka is the backbone of your business’s mission-critical, data-driven applications—as it often is—you need to continuously monitor your Kafka deployment to become aware of issues before your users are impacted.
At Datadog, we store petabytes of data in Kafka and stream hundreds of gigabits per second through our cluster. We’ve built a significant amount of tooling to improve the operability of our own Kafka infrastructure. You can read our blog post about Kafka-Kit to learn more. You can also read about the lessons we’ve learned running Kafka at scale and dive into our three-part guide to Kafka monitoring.
You can learn more about integrating Kafka with Datadog to monitor key metrics, logs, and traces from your environment in our documentation.