Architecture Design and High Availability Assurance of Kafka Message Queue

Problem Description
Kafka is a high-throughput, distributed message queue system widely used in real-time data stream processing. An interviewer may ask you to explain the core architectural components of Kafka (such as Broker, Topic, Partition, Producer/Consumer, etc.), and focus on explaining its High Availability implementation mechanisms, such as Replication, ISR (In-Sync Replicas), and Leader election strategies.

Solution Process

  1. Core Architectural Components

    • Broker: A single server node within a Kafka cluster, responsible for message storage and forwarding.
    • Topic: The logical categorization of messages. Producers send messages to a Topic, and consumers subscribe to messages from a Topic.
    • Partition: Each Topic can be divided into multiple Partitions; each Partition is an ordered, append-only log of messages. Partitioning lets a Topic scale horizontally and increases concurrent processing capacity.
    • Producer: A client that publishes messages to a Topic. It can specify which partition to send a message to (via key hashing or direct specification).
    • Consumer: A client that consumes messages from a Topic. Consumers are organized into Consumer Groups, and partitions are balanced among consumers within the same group.
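The key-to-partition routing described above can be sketched in a few lines. This is a simplified stand-in (Kafka's Java producer actually uses murmur2 hashing; the function name and md5 hash here are illustrative assumptions, not Kafka's API):

```python
import hashlib

def choose_partition(key, num_partitions, explicit_partition=None):
    """Pick a partition the way a producer might: explicit choice wins,
    then key hashing; with no key the client falls back to round-robin/sticky."""
    if explicit_partition is not None:
        return explicit_partition
    if key is not None:
        # Stable hash of the key: the same key always lands on the same
        # partition, which is what gives Kafka per-key ordering.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions
    return None  # no key: partitioner defers to round-robin/sticky assignment

# Same key -> same partition, so all events for one order stay ordered.
p1 = choose_partition("order-42", 3)
p2 = choose_partition("order-42", 3)
assert p1 == p2
```

Because ordering is only guaranteed within a partition, choosing a business key (like an order ID) as the message key is the usual way to keep related events in order.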
  2. High Availability Foundation: Replication Mechanism

    • Each Partition is configured with multiple Replicas, including one Leader and multiple Followers.
    • The Leader handles all read and write requests for its partition; Followers replicate data by continuously fetching (pulling) it from the Leader.
    • Replicas are distributed across different Brokers (via Broker configuration) to avoid single points of failure.
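The replica-placement idea above can be modeled with a simple round-robin assignment. This is a sketch of the principle only (Kafka's real assignment also randomizes the starting broker and supports rack awareness; `assign_replicas` is a hypothetical name):

```python
def assign_replicas(num_partitions, replication_factor, brokers):
    """Spread each partition's replicas across distinct brokers, round-robin,
    so losing one broker never takes out every copy of a partition."""
    assert replication_factor <= len(brokers), "need at least as many brokers as replicas"
    assignment = {}
    for p in range(num_partitions):
        # Shift the starting broker per partition so leaders spread out too.
        replicas = [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        assignment[p] = replicas  # replicas[0] acts as the preferred leader
    return assignment

layout = assign_replicas(3, 3, ["broker1", "broker2", "broker3"])
# Every partition ends up with one replica on each of the three brokers,
# and the preferred leaders rotate: P0 -> broker1, P1 -> broker2, P2 -> broker3.
```

Rotating the preferred leader across brokers is what balances read/write load, since only leaders serve client traffic.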
  3. ISR (In-Sync Replicas) Mechanism

    • The ISR is the set of replicas currently synchronized with the Leader (including the Leader itself).
    • A Follower stays in the ISR as long as it keeps fetching from the Leader and catching up; if it has not caught up within replica.lag.time.max.ms, it is removed from the ISR.
    • Producers can configure the acks parameter to control reliability:
      • acks=0: Does not wait for acknowledgment; messages may be lost.
      • acks=1: Waits only for Leader acknowledgment; Followers may not have synced.
      • acks=all (or -1): Waits for acknowledgment from all replicas in the current ISR; combined with min.insync.replicas, this prevents acknowledged messages from being lost.
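The three acks levels can be captured in a small decision function. This is a conceptual sketch, not broker code (`is_write_acked` and its parameters are assumptions for illustration; `acks` and `min.insync.replicas` are real Kafka configuration names):

```python
def is_write_acked(acks, isr_acks, isr_size, min_insync_replicas=1):
    """Decide whether a produce request can be acknowledged.
    isr_acks: how many ISR members (including the Leader) have the record."""
    if acks == "0":
        return True               # fire-and-forget: never waits, may lose data
    if acks == "1":
        return isr_acks >= 1      # the Leader has written it; Followers may lag
    if acks == "all":
        # The broker rejects the write outright if the ISR has shrunk
        # below min.insync.replicas (NotEnoughReplicas error).
        if isr_size < min_insync_replicas:
            raise RuntimeError("NotEnoughReplicas")
        return isr_acks >= isr_size  # every current ISR member has the record
    raise ValueError(f"unknown acks setting: {acks}")
```

Note the subtlety: acks=all waits for the *current* ISR, so with min.insync.replicas=1 an ISR that has shrunk to just the Leader still acknowledges writes; raising min.insync.replicas is what enforces a real durability floor.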
  4. Leader Election and Fault Recovery

    • When a Leader fails, the Controller (a Broker elected within the cluster) elects a new Leader from the ISR.
    • The election prefers replicas in the ISR. If the ISR is empty and unclean.leader.election.enable is set to true, an out-of-sync replica may be elected (potentially losing committed data).
    • Once the new Leader becomes active, other Followers synchronize data from it, restoring the ISR state.
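The election rule above reduces to "first live replica in the ISR, else (if unclean election is enabled) any live replica." A minimal sketch, with `elect_leader` as a hypothetical name for the Controller's decision:

```python
def elect_leader(replicas, isr, alive_brokers, unclean_enabled=False):
    """Pick a new leader for a partition.
    replicas: the replica list in preference order; isr/alive_brokers: sets."""
    for r in replicas:               # replica list order encodes preference
        if r in isr and r in alive_brokers:
            return r                 # clean election: no committed data lost
    if unclean_enabled:
        for r in replicas:
            if r in alive_brokers:
                return r             # unclean: replica may lag -> possible data loss
    return None                      # no candidate: partition goes offline

# Leader on b1 fails; b2 and b3 are in the ISR, so b2 takes over cleanly.
assert elect_leader(["b1", "b2", "b3"], {"b2", "b3"}, {"b2", "b3"}) == "b2"
```

The trade-off is availability versus consistency: with unclean election disabled and an empty ISR, the partition stays unavailable rather than serving stale data.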
  5. High Availability Design Summary

    • Data Persistence: Messages are appended sequentially to on-disk log segments, so acknowledged data survives broker restarts rather than living only in memory.
    • Partitioning and Load Balancing: Multiple partitions are distributed across different Brokers, achieving load balancing and fault isolation.
    • Consumer Offset Management: Consumers commit their consumption progress (Offset) to an internal Topic (__consumer_offsets), whose reliability is managed by Kafka.
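The offset-management point can be illustrated with a toy model of what __consumer_offsets stores: a mapping from (group, topic, partition) to the next offset to read. The `OffsetStore` class is an assumption for illustration, not a Kafka API:

```python
class OffsetStore:
    """Toy model of the __consumer_offsets topic: tracks each consumer
    group's committed position per (topic, partition)."""

    def __init__(self):
        self._offsets = {}

    def commit(self, group, topic, partition, offset):
        # Later commits overwrite earlier ones, just like compacted offsets.
        self._offsets[(group, topic, partition)] = offset

    def fetch(self, group, topic, partition, default=0):
        # A group with no committed offset starts from its reset policy.
        return self._offsets.get((group, topic, partition), default)

store = OffsetStore()
store.commit("billing", "order-events", 0, 128)
# After a crash/restart, the consumer asks where to resume:
resume_at = store.fetch("billing", "order-events", 0)  # -> 128
```

Because commits happen after processing, a consumer that crashes between processing and committing will re-read some messages on restart; this is the source of Kafka's default at-least-once delivery.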

Example Scenario
Assume a Topic named order-events has 3 partitions (P0, P1, P2), each configured with a replication factor of 3.

  • During normal operation, the Leader for P0 is on Broker1, with Followers on Broker2 and Broker3.
  • If Broker1 fails, the Controller elects Broker2 (from the ISR containing Broker2 and Broker3) as the new Leader, and service continues.
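The failover in this scenario can be traced with a few lines of simulation (broker names follow the scenario above; the election rule is simplified to "first surviving ISR member"):

```python
# P0 of order-events: broker1 is the leader, all three replicas are in sync.
replicas = ["broker1", "broker2", "broker3"]
isr = ["broker1", "broker2", "broker3"]

alive = {"broker2", "broker3"}   # broker1 has just failed

# The Controller picks the first ISR member that is still alive.
new_leader = next(b for b in isr if b in alive)  # -> "broker2"
# broker3 stays a follower and now fetches from broker2; producers and
# consumers fail over transparently after a metadata refresh.
```

With acks=all in effect before the failure, every acknowledged message already existed on broker2 and broker3, so no committed data is lost by the switch.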