Architecture Design and High Availability Assurance of Kafka Message Queue

Problem Description
Kafka is a high-throughput, distributed message queue system widely used in real-time data stream processing. An interviewer may ask you to explain the core architectural components of Kafka (such as Broker, Topic, Partition, Producer/Consumer, etc.), and focus on explaining its High Availability implementation mechanisms, such as Replication, ISR (In-Sync Replicas), and Leader election strategies.

Solution Process

  1. Core Architectural Components

    • Broker: A single server node within a Kafka cluster, responsible for message storage and forwarding.
    • Topic: The logical categorization of messages. Producers send messages to a Topic, and consumers subscribe to messages from a Topic.
    • Partition: Each Topic can be divided into multiple Partitions; each Partition is an ordered, append-only log of messages. Partitioning lets a Topic scale horizontally and increases concurrent processing capacity.
    • Producer: A client that publishes messages to a Topic. It can specify which partition to send a message to (via key hashing or direct specification).
    • Consumer: A client that consumes messages from a Topic. Consumers are organized into Consumer Groups, and partitions are balanced among consumers within the same group.
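The key-to-partition routing described above can be sketched in a few lines. This is a simplified stand-in (Kafka's Java producer actually uses murmur2 hashing; the function name and md5 hash here are illustrative assumptions, not Kafka's API):

```python
import hashlib

def choose_partition(key, num_partitions, explicit_partition=None):
    """Pick a partition the way a producer might: explicit choice wins,
    then key hashing; with no key the client falls back to round-robin/sticky."""
    if explicit_partition is not None:
        return explicit_partition
    if key is not None:
        # Stable hash of the key: the same key always lands on the same
        # partition, which is what gives Kafka per-key ordering.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions
    return None  # no key: partitioner defers to round-robin/sticky assignment

# Same key -> same partition, so all events for one order stay ordered.
p1 = choose_partition("order-42", 3)
p2 = choose_partition("order-42", 3)
assert p1 == p2
```

Because ordering is only guaranteed within a partition, choosing a business key (like an order ID) as the message key is the usual way to keep related events in order.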
  2. High Availability Foundation: Replication Mechanism

    • Each Partition is configured with multiple Replicas, including one Leader and multiple Followers.
    • The Leader handles all read and write requests for its partition; Followers replicate data by continuously fetching (pulling) it from the Leader.
    • Replicas are distributed across different Brokers (via Broker configuration) to avoid single points of failure.
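The replica-placement idea above can be modeled with a simple round-robin assignment. This is a sketch of the principle only (Kafka's real assignment also randomizes the starting broker and supports rack awareness; `assign_replicas` is a hypothetical name):

```python
def assign_replicas(num_partitions, replication_factor, brokers):
    """Spread each partition's replicas across distinct brokers, round-robin,
    so losing one broker never takes out every copy of a partition."""
    assert replication_factor <= len(brokers), "need at least as many brokers as replicas"
    assignment = {}
    for p in range(num_partitions):
        # Shift the starting broker per partition so leaders spread out too.
        replicas = [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        assignment[p] = replicas  # replicas[0] acts as the preferred leader
    return assignment

layout = assign_replicas(3, 3, ["broker1", "broker2", "broker3"])
# Every partition ends up with one replica on each of the three brokers,
# and the preferred leaders rotate: P0 -> broker1, P1 -> broker2, P2 -> broker3.
```

Rotating the preferred leader across brokers is what balances read/write load, since only leaders serve client traffic.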
  3. ISR (In-Sync Replicas) Mechanism

    • The ISR is the set of replicas currently synchronized with the Leader (including the Leader itself).
    • A Follower stays in the ISR as long as it keeps fetching from the Leader and catching up; if it has not caught up within replica.lag.time.max.ms, it is removed from the ISR.
    • Producers can configure the acks parameter to control reliability:
      • acks=0: Does not wait for acknowledgment; messages may be lost.
      • acks=1: Waits only for Leader acknowledgment; Followers may not have synced.
      • acks=all (or -1): Waits for acknowledgment from all replicas in the current ISR; combined with min.insync.replicas, this prevents acknowledged messages from being lost.
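The three acks levels can be captured in a small decision function. This is a conceptual sketch, not broker code (`is_write_acked` and its parameters are assumptions for illustration; `acks` and `min.insync.replicas` are real Kafka configuration names):

```python
def is_write_acked(acks, isr_acks, isr_size, min_insync_replicas=1):
    """Decide whether a produce request can be acknowledged.
    isr_acks: how many ISR members (including the Leader) have the record."""
    if acks == "0":
        return True               # fire-and-forget: never waits, may lose data
    if acks == "1":
        return isr_acks >= 1      # the Leader has written it; Followers may lag
    if acks == "all":
        # The broker rejects the write outright if the ISR has shrunk
        # below min.insync.replicas (NotEnoughReplicas error).
        if isr_size < min_insync_replicas:
            raise RuntimeError("NotEnoughReplicas")
        return isr_acks >= isr_size  # every current ISR member has the record
    raise ValueError(f"unknown acks setting: {acks}")
```

Note the subtlety: acks=all waits for the *current* ISR, so with min.insync.replicas=1 an ISR that has shrunk to just the Leader still acknowledges writes; raising min.insync.replicas is what enforces a real durability floor.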
  4. Leader Election and Fault Recovery

    • When a Leader fails, the Controller (a Broker elected within the cluster) elects a new Leader from the ISR.
    • The election prefers replicas in the ISR. If the ISR is empty and unclean.leader.election.enable is set to true, an out-of-sync replica may be elected (potentially losing committed data).
    • Once the new Leader becomes active, other Followers synchronize data from it, restoring the ISR state.
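The election rule above reduces to "first live replica in the ISR, else (if unclean election is enabled) any live replica." A minimal sketch, with `elect_leader` as a hypothetical name for the Controller's decision:

```python
def elect_leader(replicas, isr, alive_brokers, unclean_enabled=False):
    """Pick a new leader for a partition.
    replicas: the replica list in preference order; isr/alive_brokers: sets."""
    for r in replicas:               # replica list order encodes preference
        if r in isr and r in alive_brokers:
            return r                 # clean election: no committed data lost
    if unclean_enabled:
        for r in replicas:
            if r in alive_brokers:
                return r             # unclean: replica may lag -> possible data loss
    return None                      # no candidate: partition goes offline

# Leader on b1 fails; b2 and b3 are in the ISR, so b2 takes over cleanly.
assert elect_leader(["b1", "b2", "b3"], {"b2", "b3"}, {"b2", "b3"}) == "b2"
```

The trade-off is availability versus consistency: with unclean election disabled and an empty ISR, the partition stays unavailable rather than serving stale data.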
  5. High Availability Design Summary

    • Data Persistence: Messages are appended sequentially to on-disk log segments, so acknowledged data survives broker restarts rather than living only in memory.
    • Partitioning and Load Balancing: Multiple partitions are distributed across different Brokers, achieving load balancing and fault isolation.
    • Consumer Offset Management: Consumers commit their consumption progress (Offset) to an internal Topic (__consumer_offsets), whose reliability is managed by Kafka.
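The offset-management point can be illustrated with a toy model of what __consumer_offsets stores: a mapping from (group, topic, partition) to the next offset to read. The `OffsetStore` class is an assumption for illustration, not a Kafka API:

```python
class OffsetStore:
    """Toy model of the __consumer_offsets topic: tracks each consumer
    group's committed position per (topic, partition)."""

    def __init__(self):
        self._offsets = {}

    def commit(self, group, topic, partition, offset):
        # Later commits overwrite earlier ones, just like compacted offsets.
        self._offsets[(group, topic, partition)] = offset

    def fetch(self, group, topic, partition, default=0):
        # A group with no committed offset starts from its reset policy.
        return self._offsets.get((group, topic, partition), default)

store = OffsetStore()
store.commit("billing", "order-events", 0, 128)
# After a crash/restart, the consumer asks where to resume:
resume_at = store.fetch("billing", "order-events", 0)  # -> 128
```

Because commits happen after processing, a consumer that crashes between processing and committing will re-read some messages on restart; this is the source of Kafka's default at-least-once delivery.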

Example Scenario
Assume a Topic named order-events has 3 partitions (P0, P1, P2), each configured with a replication factor of 3.

  • During normal operation, the Leader for P0 is on Broker1, with Followers on Broker2 and Broker3.
  • If Broker1 fails, the Controller elects Broker2 (from the ISR containing Broker2 and Broker3) as the new Leader, and service continues.
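The failover in this scenario can be traced with a few lines of simulation (broker names follow the scenario above; the election rule is simplified to "first surviving ISR member"):

```python
# P0 of order-events: broker1 is the leader, all three replicas are in sync.
replicas = ["broker1", "broker2", "broker3"]
isr = ["broker1", "broker2", "broker3"]

alive = {"broker2", "broker3"}   # broker1 has just failed

# The Controller picks the first ISR member that is still alive.
new_leader = next(b for b in isr if b in alive)  # -> "broker2"
# broker3 stays a follower and now fetches from broker2; producers and
# consumers fail over transparently after a metadata refresh.
```

With acks=all in effect before the failure, every acknowledged message already existed on broker2 and broker3, so no committed data is lost by the switch.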