We have been working on a project evaluating different technologies for messaging and stream processing of data with constraints on message loss. As part of the research we evaluated Apache Kafka as a distributed, durable messaging system. We already introduced this technology developed at LinkedIn in a short post before, and we wanted to provide some more insight into it from the messaging point of view.

Kafka is a distributed system. When choosing which CAP properties to take, its designers prioritized consistency and high availability, since network partitioning in a datacenter is rare.

Partitions are replicated by a configurable replication factor. Every partition has its own Master node. If a partition Master goes down another replica will take its role. With replication, Kafka clients “are able to publish messages during failure and can choose between latency and durability as well as are able to retrieve messages in real time, even when there is failure”.

Kafka introduces some changes on how messaging was thought up to the moment. MQ solutions usually require us to register consumers, so that messages are delivered to them. Once the message is delivered, it is deleted. Kafka takes another approach. Messages are just persisted for a topic, and anyone can consume them. Messages are not deleted when delivered, but based on a time policy. This way the broker does not need to maintain metadata about consumers and consumers may request a message multiple times if needed, access older messages when connecting to a topic or retrieve messages in batches. This means Kafka is better decoupled from external systems, since does not need to maintain any information about them. Since messages are persisted, if Kafka has a good enough window of time (LinkedIn stores messages for a week) there won’t be any message loss.

Kafka achieves great performance even for terabytes of data using a simple storage (each topic has an ever-growing log which consists of a set of files,  where the messages are addressed by a log offset) and careful transfer (send and fetch operations are performed in a batch mode relying on filesystem cache and performing file to socket transfers).

Did you use Kafka for a project? Interested on working with Big data technologies? We want to hear from you! Drop us an email! If you want to dig deeper into this topic, we suggest the following links:

facebooktwittergoogle_plusredditlinkedinby feather

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>