Database High Availability Architecture Design and Failover Mechanisms

Topic Description

High Availability (HA) is a core objective in database system design, referring to the system's ability to keep serving requests when individual components fail. Interviews often require designing a high-availability architecture and explaining the failure detection, switchover process, and data consistency guarantee mechanisms.


1. Core Components of High Availability Architecture

(1) Redundant Deployment

  • Master-Slave Replication: The master node (Primary) handles write operations, while slave nodes (Replicas) synchronize data and provide read services.
  • Multi-node Deployment: Deploy at least one master with multiple slaves (e.g., one master and two slaves) to avoid single points of failure.

(2) Failure Detection Mechanism

  • Heartbeat Detection: The monitoring node (or the cluster nodes themselves) exchanges heartbeat packets with the master at a fixed interval (e.g., once per second).
  • Timeout Judgment: If several consecutive probes go unanswered (e.g., three in a row), the master node is judged as failed.

(3) Automatic Failover

  • A switchover controller (e.g., Keepalived, Patroni) automatically promotes a new master node and notifies clients to connect to the new master.
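The heartbeat/timeout rule above can be sketched as a tiny failure detector. This is an illustrative model, not the internals of Keepalived or Patroni; the class name and probe format are hypothetical:

```python
class HeartbeatMonitor:
    """Minimal failure detector: declare the primary failed after
    `threshold` consecutive missed heartbeats (illustrative sketch)."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.missed = 0

    def record(self, heartbeat_received: bool) -> bool:
        """Feed one probe result; return True when failover should trigger."""
        if heartbeat_received:
            self.missed = 0          # any successful probe resets the counter
            return False
        self.missed += 1
        return self.missed >= self.threshold


monitor = HeartbeatMonitor(threshold=3)
probes = [True, False, False, True, False, False, False]
triggered_at = None
for i, ok in enumerate(probes):
    if monitor.record(ok):
        triggered_at = i
        break
print(triggered_at)  # → 6: failover fires on the third consecutive miss
```

Note that a single successful probe resets the counter, so transient packet loss does not trigger an unnecessary (and expensive) failover.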

2. Typical High Availability Architecture Solutions

Solution One: Master-Slave + VIP Floating

  • Steps:
    1. The master node is bound to a Virtual IP (VIP), and clients access through the VIP.
    2. Upon failure, the controller floats the VIP to a healthy slave node and resets the replication relationship.
  • Disadvantage: VIP switching may cause brief unavailability while ARP caches on the network update or clients reconnect.

Solution Two: Based on Consensus Algorithms (e.g., Paxos/Raft)

  • Examples: MySQL Group Replication, etcd
    1. Nodes form a cluster, and write operations require confirmation from a majority of nodes.
    2. When the master node fails, the remaining nodes vote to elect a new master, with data consistency guaranteed by the algorithm.
  • Advantage: Strong consistency, avoiding Split-Brain scenarios.
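The majority-confirmation rule is what rules out split-brain: two disjoint partitions can never both hold a strict majority. A minimal sketch of the commit condition (illustrative, not a real Raft implementation):

```python
def write_committed(acks: int, cluster_size: int) -> bool:
    """A write commits only when a strict majority of nodes acknowledge
    it, as in Raft/Paxos-style clusters (illustrative sketch)."""
    return acks >= cluster_size // 2 + 1


# In a 3-node cluster, 2 acks form a majority; 1 ack does not.
print(write_committed(2, 3))  # → True
print(write_committed(1, 3))  # → False
# In a 5-node cluster split 2/3 by a partition, only the 3-node side
# can commit, so the two sides can never diverge.
print(write_committed(2, 5), write_committed(3, 5))  # → False True
```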

3. Detailed Failover Process

Taking master-slave replication with a monitoring node as an example:

  1. Failure Detection:

    • The monitoring node checks the master node's liveness every second.
    • After 3 consecutive timeouts, the switchover process is triggered.
  2. Data Consistency Verification:

    • Compare the replication lag of slave nodes (e.g., by checking Seconds_Behind_Master via SHOW SLAVE STATUS).
    • Select the slave node with the smallest lag as the new master.
  3. Switchover Operations:

    • Isolate the Old Master: Forcefully shut down the old master node or revoke its write permissions to prevent data conflicts.
    • Promote the New Master: Execute STOP SLAVE; RESET SLAVE ALL; to terminate the replication relationship and enable write permissions.
    • Update Routing: Direct client connections to the new master (e.g., modify load balancer configurations).
  4. Recovery and Synchronization:

    • After the old master recovers, it rejoins the cluster as a slave node and synchronizes data from the new master.
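Step 2 above (selecting the replica with the smallest lag) can be sketched as follows. The replica list is simulated data standing in for rows from SHOW SLAVE STATUS; a lag of None models the NULL that MySQL reports when replication is broken:

```python
def pick_new_primary(replicas):
    """Choose the replica with the smallest replication lag as the new
    primary; replicas whose lag is unknown (None, i.e. NULL in
    SHOW SLAVE STATUS) are excluded (illustrative sketch)."""
    candidates = [r for r in replicas if r["seconds_behind_master"] is not None]
    if not candidates:
        raise RuntimeError("no eligible replica to promote")
    return min(candidates, key=lambda r: r["seconds_behind_master"])


replicas = [
    {"host": "replica-1", "seconds_behind_master": 2},
    {"host": "replica-2", "seconds_behind_master": 0},
    {"host": "replica-3", "seconds_behind_master": None},  # replication broken
]
print(pick_new_primary(replicas)["host"])  # → replica-2
```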

4. Key Issues and Solutions

(1) Split-Brain Problem

  • Cause: Network partitioning causes two nodes to simultaneously consider themselves as the master, leading to parallel writes and data conflicts.
  • Solution:
    • Arbitration Mechanism: Introduce a third-party arbitration node (e.g., Zookeeper); only nodes granted arbitration permission can become the master.
    • Fencing (Isolation): Forcefully power off or revoke network access to the old master node.
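One common way to implement fencing in software is a fencing token: the arbiter issues a strictly larger epoch number on each failover, and storage rejects writes carrying a stale token. A minimal sketch (class and field names are hypothetical):

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token.
    After each failover the arbiter issues a strictly larger token,
    so a deposed primary's in-flight writes are refused (sketch)."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False            # stale primary: write fenced off
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
print(store.write(1, "k", "from-old-primary"))  # → True
print(store.write(2, "k", "from-new-primary"))  # → True: new epoch
print(store.write(1, "k", "late-old-write"))    # → False: fenced
print(store.data["k"])                          # → from-new-primary
```

Tokens complement physical fencing (power-off, network isolation): even if the old master is still running, its writes can no longer take effect.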

(2) Data Loss Risk

  • Scenario: In asynchronous replication, if the master node crashes and unsynchronized data remains in the buffer, it may be lost.
  • Solution:
    • Semi-Synchronous Replication: Require at least one slave node to confirm before returning a write success (e.g., MySQL semi-synchronous replication).
    • Delayed Replication: Keep one replica intentionally behind (e.g., by 1 hour); data damaged by mistaken operations can be recovered from the delayed replica. (This is a complementary safeguard against logical errors rather than a fix for crash-time loss.)
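The semi-synchronous rule can be modeled as a simple commit decision. One detail worth knowing: if the wait for a replica ack times out, MySQL degrades to asynchronous replication rather than blocking writes. The function below is an illustrative sketch, not MySQL's actual code path:

```python
def semi_sync_commit(replica_acks: int, required: int = 1,
                     timed_out: bool = False) -> str:
    """Semi-sync rule: report success to the client only after `required`
    replicas acknowledge the transaction (MySQL's default is one ack).
    On timeout, degrade to async instead of blocking (sketch)."""
    if replica_acks >= required:
        return "committed"          # at least one replica holds the change
    if timed_out:
        return "committed-async"    # degraded: write may be lost on crash
    return "waiting"


print(semi_sync_commit(1))                   # → committed
print(semi_sync_commit(0))                   # → waiting
print(semi_sync_commit(0, timed_out=True))   # → committed-async
```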

5. Practical Case: MySQL High Availability Architecture

Architecture: MHA (Master High Availability)

  1. Components:
    • MHA Manager: The monitoring node that manages the switchover process.
    • MHA Node: An agent deployed on each MySQL node.
  2. Switchover Process:
    • The MHA Manager detects the master node failure and selects the slave node with the most recent data.
    • Fills in any unreplicated data from the old master's binary log (binlog), then promotes the new master.
    • Automatically modifies VIP configuration or notifies the application layer to update connections.
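The binlog gap-fill step can be illustrated with a simplified model where events are ordered integers; real MHA compares binlog/relay-log positions, so treat this purely as a sketch of the idea:

```python
def fill_binlog_gap(candidate_events, latest_events):
    """MHA-style gap fill (simplified): before promotion, apply to the
    candidate any events it is missing relative to the most up-to-date
    copy recovered from the binlog. Events modeled as integers."""
    applied = set(candidate_events)
    missing = [e for e in latest_events if e not in applied]
    return candidate_events + missing


candidate = [1, 2, 3]       # events already applied on the new primary
latest = [1, 2, 3, 4, 5]    # events recovered from the old master's binlog
print(fill_binlog_gap(candidate, latest))  # → [1, 2, 3, 4, 5]
```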

Summary

High availability architecture requires balancing consistency, availability, and fault tolerance:

  • Weak Consistency Scenarios: Master-slave asynchronous replication + VIP floating can be used, offering fast switchover but potential data loss.
  • Strong Consistency Scenarios: A Raft-based cluster solution is recommended, ensuring data safety but with higher complexity.

The core of failover lies in rapid detection, minimizing data loss, and avoiding split-brain, with design trade-offs driven by business needs.