Kafka

Kafka is a streaming platform: at its core, a queue with many additional features. Queues benefit infrastructure by decoupling services and by holding in-transit messages if a downstream service fails. Unlike a traditional queue, Kafka lets multiple consumers read the same data without needing a separate queue for each. This also means it can serve as the centralized feed for all data being produced.

Kafka is a publish-subscribe messaging system. Messages come from a set of producers and are stored in topics, each of which is split into partitions. Abstractly, each partition is an append-only log that consumers read starting from some offset.
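The append-only log idea can be sketched in a few lines of Python. This is a toy in-memory model, not Kafka's client API: the class name AppendOnlyLog and its append and read methods are hypothetical, chosen only to show that writes go to the end of the log and that reads start at an offset without removing anything.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyLog:
    """Toy in-memory stand-in for a single log (hypothetical, not Kafka's API)."""

    records: list = field(default_factory=list)

    def append(self, message):
        """Add a message to the end of the log and return its offset."""
        self.records.append(message)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records messages starting at offset.

        Reading never removes anything, so the same offset can be read again.
        """
        return self.records[offset:offset + max_records]


log = AppendOnlyLog()
first = log.append("signup:alice")
second = log.append("signup:bob")
third = log.append("login:alice")

from_start = log.read(0)        # every message, oldest first
from_second = log.read(second)  # skips the first message
```

Because a read is just a slice at an offset, a consumer that restarts can resume from the last offset it recorded, and a new consumer can replay the log from offset 0.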

Each topic can have multiple subscribers, each identified by its own client ID and an associated cursor recording how much of the topic it has consumed so far. Because the cursors are independent, every subscriber gets its own copy of the published messages.
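A minimal sketch of per-subscriber cursors, again as a toy in-memory model rather than real Kafka code. The Topic class, its publish and poll methods, and the client IDs "billing" and "analytics" are all made up for illustration.

```python
class Topic:
    """Toy topic that tracks one cursor per client ID (hypothetical model)."""

    def __init__(self):
        self.messages = []
        self.cursors = {}  # client ID -> offset of the next unread message

    def publish(self, message):
        """Append a message; existing cursors are not affected."""
        self.messages.append(message)

    def poll(self, client_id):
        """Return the messages this client has not seen yet and advance its cursor."""
        start = self.cursors.get(client_id, 0)
        batch = self.messages[start:]
        self.cursors[client_id] = len(self.messages)
        return batch


topic = Topic()
topic.publish("a")
topic.publish("b")

billing_first = topic.poll("billing")      # billing reads a and b
topic.publish("c")
billing_second = topic.poll("billing")     # billing reads only c
analytics_first = topic.poll("analytics")  # analytics starts from the beginning
```

Polling as "billing" moves only the "billing" cursor, so "analytics" still receives all three messages when it first polls.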

Kafka is for realtime streaming systems:

In addition to a Producer API and a Consumer API:

Guarantees

Infrastructure

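Kafka runs as a cluster that can span regions. Consumer groups are logical consumers that generalize the queuing and pub-sub models: within a group, the messages are divided among the members, as in a queue, while each group as a whole receives every message, as in pub-sub. This allows multiple subscribers and also lets processing scale horizontally.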
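How a consumer group combines the two models can be sketched by dividing a topic's partitions among group members. The function assign_partitions, the group names, and the member names below are hypothetical, and the round-robin rule is an assumption made for illustration. Real Kafka clients use configurable assignment strategies.

```python
def assign_partitions(partitions, members):
    """Give each partition to exactly one member of a group, round-robin.

    Illustrative only: round-robin is an assumed strategy, not a claim
    about what a particular Kafka client does by default.
    """
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        owner = members[index % len(members)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
groups = {
    "billing": ["billing-1", "billing-2"],  # two members share the work
    "analytics": ["analytics-1"],           # one member reads everything
}

# Each group gets its own independent assignment over all partitions.
plan = {name: assign_partitions(partitions, members)
        for name, members in groups.items()}
```

Within "billing" each partition has a single owner, so the work is split as in a queue. Across groups, both "billing" and "analytics" cover all four partitions, so each group sees every message, as in pub-sub. Adding a member to a group spreads the same partitions over more workers, which is where the horizontal scaling comes from.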