Kafka is a persistent, efficient, distributed message queue intended primarily for tracking activity events generated on a website, such as keywords typed in a search query and ads presented.
There are several existing queuing solutions. The most popular are based on JMS, but their overhead degrades performance. Avoiding that overhead is the biggest benefit Kafka provides.
Kafka has the following three design principles:
- A very simple API for both producers and consumers
- Low overhead in network transfer as well as on-disk storage
- A scale-out architecture from the beginning
Kafka and Hadoop
Scalable persistence makes it possible to periodically offload snapshot data into an offline system for batch processing. A Hadoop-based consumer spawns many map tasks to pull data from the Kafka cluster in parallel, which provides extremely fast pull-based data loading into Hadoop. Hadoop also provides task management: if a task fails, it can be restarted without danger of duplicating data.
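The parallel pull and the duplicate-free restart can be illustrated with a small sketch. It is a toy model, not the actual Hadoop consumer: `PARTITIONS`, `pull_partition`, and `load_snapshot` are hypothetical names, the "cluster" is an in-memory dictionary, and threads stand in for map tasks. The assumption it illustrates is that a task's output and its new offset are committed together, so a failed task can simply be rerun from the last committed offset.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a Kafka cluster: partition id -> ordered message log.
PARTITIONS = {
    0: ["a0", "a1", "a2"],
    1: ["b0", "b1"],
    2: ["c0", "c1", "c2", "c3"],
}


def pull_partition(partition, start_offset):
    """One 'map task': read a partition from start_offset to the end of its log."""
    log = PARTITIONS[partition]
    return partition, log[start_offset:], len(log)  # len(log) is the next offset


def load_snapshot(committed_offsets, fail_once=frozenset()):
    """Run one task per partition in parallel and commit output and offset together."""
    loaded = []
    pending_failures = set(fail_once)
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        futures = [
            pool.submit(pull_partition, p, committed_offsets.get(p, 0))
            for p in sorted(PARTITIONS)
        ]
        for future in futures:
            partition, messages, next_offset = future.result()
            if partition in pending_failures:
                # Simulated task failure: the output is discarded and the offset is
                # left untouched, so the retry starts from the same position.
                pending_failures.discard(partition)
                partition, messages, next_offset = pull_partition(
                    partition, committed_offsets.get(partition, 0)
                )
            loaded.extend(messages)
            committed_offsets[partition] = next_offset
    return loaded


offsets = {}
first_load = load_snapshot(offsets, fail_once={1})  # partition 1 fails once
PARTITIONS[0].append("a3")                          # new activity arrives
second_load = load_snapshot(offsets)                # only the new message is pulled
```

In this sketch the second run returns only the message that arrived after the first snapshot, because each partition resumes from its committed offset.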
Kafka and ZooKeeper
ZooKeeper watchers can be registered for the following events:
- a new broker comes up
- a broker goes down
- a new topic is registered
- a broker gets registered for an existing topic
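The four events above can be pictured as child changes under a few registry paths. The sketch below is an in-memory toy, not the ZooKeeper client API: `ToyRegistry` and its methods are hypothetical, the paths `/brokers/ids` and `/brokers/topics` are assumed for illustration, and watchers stay registered permanently, which is a simplification.

```python
from collections import defaultdict


class ToyRegistry:
    """In-memory stand-in for a ZooKeeper-like tree with child watchers."""

    def __init__(self):
        self.children = defaultdict(set)   # parent path -> set of child names
        self.watchers = defaultdict(list)  # parent path -> list of callbacks

    def watch_children(self, path, callback):
        """Register a callback that fires when a child of `path` changes."""
        self.watchers[path].append(callback)

    def add_child(self, path, name):
        if name not in self.children[path]:
            self.children[path].add(name)
            self._notify(path, "added", name)

    def remove_child(self, path, name):
        if name in self.children[path]:
            self.children[path].remove(name)
            self._notify(path, "removed", name)

    def _notify(self, path, change, name):
        for callback in list(self.watchers[path]):
            callback(path, change, name)


events = []


def record(path, change, name):
    """Example watcher: a producer or consumer would rebalance here instead."""
    events.append((path, change, name))


registry = ToyRegistry()
registry.watch_children("/brokers/ids", record)
registry.watch_children("/brokers/topics", record)
registry.watch_children("/brokers/topics/clicks", record)

registry.add_child("/brokers/ids", "broker-1")            # a new broker comes up
registry.add_child("/brokers/topics", "clicks")           # a new topic is registered
registry.add_child("/brokers/topics/clicks", "broker-1")  # broker registers for a topic
registry.remove_child("/brokers/ids", "broker-1")         # a broker goes down
```

Each call at the bottom corresponds to one event in the list, and `events` ends up holding one entry per change in the order the changes happened.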
Internally, the producer maintains an elastic pool of connections to the brokers, one per broker.
- Automatic load balancing: Kafka provides built-in automatic load balancing between producers and brokers.
- Asynchronous send: Asynchronous, non-blocking operations are fundamental to scaling messaging systems.
- Mirroring: A mirror cluster can act as a consumer of one or many source clusters, making it possible to join data from multiple datacenters. This tier acts as a buffer between live activity and asynchronous processing.
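The load-balancing and asynchronous-send points can be sketched together. This is a minimal illustration under stated assumptions, not Kafka's producer: `ToyProducer` is a hypothetical class, one in-memory list per broker stands in for the per-broker connection, and the balancing policy (round-robin for unkeyed messages, a CRC32 hash for keyed ones) is an assumption chosen for the example.

```python
import itertools
import queue
import threading
import zlib


class ToyProducer:
    """Toy producer: send() only enqueues; a background thread does the delivery."""

    def __init__(self, brokers):
        self.brokers = list(brokers)
        # One list per broker stands in for one connection per broker.
        self.delivered = {broker: [] for broker in self.brokers}
        self._round_robin = itertools.cycle(self.brokers)
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _pick_broker(self, key):
        """Assumed policy: round-robin without a key, stable hash with a key."""
        if key is None:
            return next(self._round_robin)
        index = zlib.crc32(key.encode("utf-8")) % len(self.brokers)
        return self.brokers[index]

    def send(self, message, key=None):
        """Non-blocking: the caller returns as soon as the message is queued."""
        self._queue.put((self._pick_broker(key), message))

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel placed by close()
                return
            broker, message = item
            self.delivered[broker].append(message)

    def close(self):
        """Flush everything queued so far and stop the background thread."""
        self._queue.put(None)
        self._worker.join()


producer = ToyProducer(["broker-1", "broker-2"])
for i in range(4):
    producer.send(f"event-{i}")                    # spread evenly across brokers
producer.send("search:kafka", key="user-42")       # keyed messages share a broker
producer.send("search:hadoop", key="user-42")
producer.close()
```

Hashing the key keeps all messages for one key on the same broker, while unkeyed messages are spread evenly. A real producer would also batch queued messages before writing them to the network.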