Apache Kafka
Kafka is not just a pub-sub system. It’s an event streaming platform.
- It collects, stores, and processes events in real time.
- Kafka supports distributed logging, pub-sub messaging, and stream processing (via Kafka Streams and ksqlDB).
- Kafka helps build event-driven architectures, where systems react to changes as they occur, not via polling or batch jobs.
- The Kafka ecosystem includes tools such as Kafka Connect, Kafka Streams, and Schema Registry.
- It is built for real-time workloads: low latency, high throughput, and fault tolerance.
Event
An event is something that happened: a change in state, recorded as both a notification and the associated state.
Examples:
- Thermostat reports temperature → Event
- Invoice becomes overdue → Event
- Mouse hovers over button → Event
- Microservice logs completion → Event
Every event consists of:
- Notification: “This thing happened.”
- State: A structured snapshot of the data related to that occurrence (e.g., JSON, Avro, Protobuf)
Kafka Representation:
Events are represented as key/value pairs (both are byte arrays under the hood):
- Key: Identifier (e.g., user ID, order ID) — helps with partitioning & ordering.
- Value: Actual event data (payload).
Serialization:
- Our applications use a structured format (e.g., JSON), which is serialized into byte arrays before being sent to Kafka.
- Deserialization happens when the data is read back.
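As a minimal sketch of this round trip in Python (standard-library json only; the event fields are illustrative):

```python
import json

# Serialize a structured event (a dict) into the byte array Kafka expects.
def serialize(event: dict) -> bytes:
    return json.dumps(event).encode("utf-8")

# Deserialize the byte array back into a dict when reading the event back.
def deserialize(raw: bytes) -> dict:
    return json.loads(raw.decode("utf-8"))

event = {"order_id": "o-42", "status": "PLACED"}
assert deserialize(serialize(event)) == event  # lossless round trip
```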
Topic
A topic is a named stream of events in Kafka. It acts like a channel or feed to which events are published. Producers send events to topics; consumers read from topics.
Producers:
Producers are clients/apps that write events to Kafka topics.
- Serialize data into byte arrays.
- Choose a topic (and optionally, a partition or key).
- Send the event using Kafka APIs (Java, Python, REST, etc.).
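For example, a sketch of producing a keyed JSON event with the kafka-python client (the broker address, topic name, and field names are assumptions for illustration):

```python
import json
from kafka import KafkaProducer

# Producer that serializes keys as UTF-8 and values as JSON byte arrays.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Events with the same key always land in the same partition.
producer.send("orders", key="user-123", value={"order_id": "o-42", "status": "PLACED"})
producer.flush()  # block until buffered events are actually delivered
```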
Consumers:
Consumers are clients/apps that read events from topics. They deserialize events and track their read position via offsets.
Organized into consumer groups:
- Kafka guarantees each message is read by only one consumer per group.
- Multiple consumers allow for parallel processing.
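A matching consumer sketch with kafka-python (the group id and topic are illustrative). Running two copies of this script with the same group_id splits the topic's partitions between them:

```python
import json
from kafka import KafkaConsumer

# All consumers sharing group_id="order-processors" divide the topic's
# partitions among themselves; each event is handled by exactly one of them.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",  # start from the beginning if no offset is stored
)

for message in consumer:
    # Offsets are committed automatically by default, tracking the read position.
    print(message.partition, message.offset, message.value)
```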
Topic Structure and Management
How Kafka topics are structured, and how they are created, configured, and managed.
Key Characteristics of a Topic
- Immutable Log: Messages in a topic are append-only; new events are added at the end and, once written, are never changed.
- Durable Storage: Messages are persisted on disk (7 days by default, configurable).
- Retention Policy: Time-based (e.g., 7 days) or size-based (e.g., 1 GB per partition)
- Topic-Level ACLs: Kafka allows authorization rules that restrict who can produce to or consume from a topic.
- Compacted Topics: Store only the latest event per key; useful for state storage (e.g., user profiles). Enabled with cleanup.policy=compact.
- Compaction vs Deletion (see the sketch after this list):
  - delete: Default; messages are removed after the retention period
  - compact: Only the latest message per key is retained
- Partitioned:
- Topics are split into partitions for scalability and parallelism.
- Each partition is an ordered sequence of events.
- Partitioning enables Kafka to scale horizontally.
- Parallelism: Multiple consumers can read in parallel.
- Ordering: Kafka guarantees message order within a partition (not across partitions).
- Key-based Routing: Events with the same key always go to the same partition.
- Replicated: Each partition can be replicated across multiple brokers, ensuring fault tolerance.
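To make the compaction behavior above concrete, a purely conceptual Python sketch (not how the broker is implemented): compaction eventually behaves like building a dict from an append-only log, keeping only the newest value per key.

```python
# Append-only log of (key, value) records, oldest first.
log = [
    ("user-1", {"plan": "free"}),
    ("user-2", {"plan": "pro"}),
    ("user-1", {"plan": "pro"}),  # supersedes the earlier user-1 record
]

# With cleanup.policy=compact, only the latest record per key survives.
compacted = {}
for key, value in log:
    compacted[key] = value

print(compacted)  # {'user-1': {'plan': 'pro'}, 'user-2': {'plan': 'pro'}}
```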
Kafka is designed for horizontal scalability:
- Topics are partitioned → Events are distributed across partitions.
- Each partition can be handled by a separate broker (Kafka server).
- Producers and consumers can operate in parallel: Each consumer in a consumer group can read from different partitions.
- Keys help route messages consistently to partitions, ensuring related data is processed together.
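A conceptual sketch of key-based routing (the real default partitioner hashes the key bytes with murmur2; a trivial stand-in hash is used here only to show the invariant):

```python
def pick_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in for murmur2(key) % num_partitions used by the default partitioner.
    return sum(key) % num_partitions

# The same key always maps to the same partition, so per-key ordering holds.
assert pick_partition(b"user-123", 3) == pick_partition(b"user-123", 3)
print(pick_partition(b"user-123", 3), pick_partition(b"user-456", 3))
```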
Real-World Use Cases for Topics
| Topic Name     | Event Type                 | Producer          | Consumer             |
|----------------|----------------------------|-------------------|----------------------|
| orders         | Order placed               | Web app           | Order processor      |
| user-activity  | Clicks, views, page visits | Frontend services | Analytics service    |
| payment-status | Payment updates            | Payment gateway   | Billing microservice |
Topic Configuration Options
When creating or managing topics, we can set:
- partitions: Number of partitions (default: 1)
- replication.factor: Number of replicas (usually ≥ 2 for high availability)
- retention.ms: How long to retain messages, in milliseconds
- cleanup.policy: delete (default) or compact
- min.insync.replicas: Minimum number of in-sync replicas required for a successful write
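These settings can also be applied programmatically; a sketch with kafka-python's admin client (topic name, broker address, and values are illustrative):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")  # assumed local broker

topic = NewTopic(
    name="orders",
    num_partitions=3,
    replication_factor=2,
    topic_configs={
        "retention.ms": "604800000",  # 7 days in milliseconds
        "cleanup.policy": "delete",
        "min.insync.replicas": "2",
    },
)

admin.create_topics([topic])
```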
| Task           | Command |
|----------------|---------|
| Create Topic   | kafka-topics.sh --create --topic my-topic --partitions 3 --replication-factor 2 --bootstrap-server localhost:9092 |
| Describe Topic | kafka-topics.sh --describe --topic my-topic --bootstrap-server localhost:9092 |
| Delete Topic   | kafka-topics.sh --delete --topic my-topic --bootstrap-server localhost:9092 |
Topic Naming Conventions (Best Practices)
- Use descriptive names: order-created, user-signup
- Use dash-separated lowercase words
- Avoid overly generic names like events, data, logs
- Design partitioning strategy based on consumer scaling and key usage
- Monitor topic size and performance regularly
Kafka Connect — Integrating with External Systems
- Kafka Connect is a framework for moving large amounts of data in and out of Kafka.
- Used to integrate Kafka with external systems (e.g., databases, cloud services).
- Two types of connectors:
- Source Connectors: Pull data into Kafka.
- Sink Connectors: Push data out of Kafka.
- Comes with many pre-built connectors (e.g., JDBC, Elasticsearch, S3).
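Connectors are typically registered through the Connect REST API. A hedged sketch in Python (connector name, database details, and the Connect worker address are all assumptions; the JDBC source connector shown is one of the common pre-built connectors):

```python
import requests

connector = {
    "name": "orders-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://localhost:5432/shop",
        "table.whitelist": "orders",
        "mode": "incrementing",            # track new rows by an incrementing column
        "incrementing.column.name": "id",
        "topic.prefix": "db-",             # rows land in the topic "db-orders"
    },
}

# Register the connector with a Connect worker assumed at localhost:8083.
resp = requests.post("http://localhost:8083/connectors", json=connector)
print(resp.status_code, resp.json())
```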