Database High Availability Architecture Design and Failover Mechanisms
Topic Description
High Availability (HA) is a core objective in database system design, referring to the system's ability to continue providing service when individual components fail. Interviews often require designing a high-availability architecture and explaining the failure detection, switchover process, and data consistency guarantee mechanisms.
1. Core Components of High Availability Architecture
(1) Redundant Deployment
- Master-Slave Replication: The master node (Primary) handles write operations, while slave nodes (Replicas) synchronize data and provide read services.
- Multi-node Deployment: Deploy at least one master with multiple slaves (e.g., one master and two slaves) to avoid single points of failure.
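To make the read/write split concrete, here is a minimal routing sketch in Python; the host names and the naive statement classification are illustrative assumptions, not part of any particular database driver.

```python
import random

# Hypothetical topology: one primary for writes, two replicas for reads.
PRIMARY = {"host": "db-primary.internal", "port": 3306}
REPLICAS = [
    {"host": "db-replica-1.internal", "port": 3306},
    {"host": "db-replica-2.internal", "port": 3306},
]

def route(sql: str) -> dict:
    """Send writes to the primary and spread reads across the replicas."""
    is_write = sql.lstrip().split(None, 1)[0].upper() in {
        "INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"
    }
    return PRIMARY if is_write else random.choice(REPLICAS)

print(route("SELECT * FROM orders"))        # -> one of the replicas
print(route("UPDATE orders SET status=1"))  # -> the primary
```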
(2) Failure Detection Mechanism
- Heartbeat Detection: The master node regularly sends heartbeat packets to slave nodes or monitoring nodes (e.g., once per second).
- Timeout Judgment: If heartbeats time out several times in a row (e.g., 3 consecutive missed responses), the master node is judged to have failed.
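A minimal sketch of such a detection loop, assuming a plain TCP connect as the liveness probe and the thresholds mentioned above; production systems usually rely on the built-in checks of tools such as Keepalived or Patroni.

```python
import socket
import time

HEARTBEAT_INTERVAL = 1.0   # probe once per second
MAX_MISSES = 3             # three consecutive timeouts -> declare failure

def ping_primary(host: str = "db-primary.internal", port: int = 3306,
                 timeout: float = 1.0) -> bool:
    """Hypothetical liveness probe: can we open a TCP connection to the master?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor() -> None:
    misses = 0
    while True:
        if ping_primary():
            misses = 0                      # any success resets the counter
        else:
            misses += 1
            if misses >= MAX_MISSES:
                print("primary declared failed, triggering failover")
                break                       # hand off to the failover controller
        time.sleep(HEARTBEAT_INTERVAL)
```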
(3) Automatic Failover
- A switchover controller (e.g., Keepalived, Patroni) automatically promotes a new master node and notifies clients to connect to the new master.
2. Typical High Availability Architecture Solutions
Solution One: Master-Slave + VIP Floating
- Steps:
- The master node is bound to a Virtual IP (VIP), and clients access through the VIP.
- Upon failure, the controller floats the VIP to a healthy slave node and resets the replication relationship.
- Disadvantage: VIP switching may cause brief unavailability while ARP caches are updated and clients reconnect.
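For illustration, a sketch of how a newly promoted node might take over the VIP on Linux; the address, interface name, and the use of `ip`/`arping` are assumptions about the environment, and in practice Keepalived usually manages this step.

```python
import subprocess

VIP = "10.0.0.100/24"      # hypothetical virtual IP
IFACE = "eth0"             # hypothetical interface on the newly promoted master

def take_over_vip() -> None:
    # Bind the VIP to this node.
    subprocess.run(["ip", "addr", "add", VIP, "dev", IFACE], check=True)
    # Send gratuitous ARP so switches and clients refresh their ARP caches
    # and start sending VIP traffic to this node.
    subprocess.run(["arping", "-U", "-c", "3", "-I", IFACE,
                    VIP.split("/")[0]], check=True)
```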
Solution Two: Based on Consensus Algorithms (e.g., Paxos/Raft)
- Examples: MySQL Group Replication, etcd
- Nodes form a cluster, and write operations require confirmation from a majority of nodes.
- When the master node fails, the remaining nodes vote to elect a new master, with data consistency guaranteed by the algorithm.
- Advantage: Strong consistency, avoiding Split-Brain scenarios.
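The sketch below is a toy illustration of the majority rule only, not a real Paxos/Raft implementation: a write is acknowledged to the client only after more than half of the nodes have accepted it.

```python
class Node:
    """Toy cluster member that may be up or down."""
    def __init__(self, up: bool = True):
        self.up = up
        self.log = []

    def append(self, entry) -> bool:
        if self.up:
            self.log.append(entry)
            return True
        return False

def write_with_quorum(nodes, entry) -> bool:
    """Acknowledge the write only if more than half of the nodes confirm it."""
    acks = sum(1 for node in nodes if node.append(entry))
    return acks > len(nodes) // 2

cluster = [Node(), Node(), Node(up=False)]   # one node is down
print(write_with_quorum(cluster, "tx-42"))   # True: 2 of 3 acks is a majority
```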
3. Detailed Failover Process
Taking master-slave replication with a monitoring node as an example (a minimal controller sketch follows these steps):
Step 1: Failure Detection
- The monitoring node checks the master node's liveness every second.
- After 3 consecutive timeouts, the switchover process is triggered.
Step 2: Data Consistency Verification
- Compare the replication lag of the slave nodes (e.g., by checking Seconds_Behind_Master via SHOW SLAVE STATUS).
- Select the slave node with the smallest lag as the new master.
Step 3: Switchover Operations
- Isolate the Old Master: Forcefully shut down the old master node or revoke its write permissions to prevent data conflicts.
- Promote the New Master: Execute STOP SLAVE; RESET SLAVE ALL; on the chosen slave to terminate its replication relationship, then enable writes on it.
- Update Routing: Direct client connections to the new master (e.g., modify load balancer configurations).
Step 4: Recovery and Synchronization
- After the old master recovers, it rejoins the cluster as a slave node and synchronizes data from the new master.
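Below is a minimal sketch of Steps 2 and 3, assuming mysql-connector-python, hypothetical host names and credentials, and that fencing of the old master has already been done; real failover should be left to a tested tool (MHA, Orchestrator, Patroni, etc.).

```python
import mysql.connector

REPLICAS = ["db-replica-1.internal", "db-replica-2.internal"]
CREDS = {"user": "ha_admin", "password": "secret"}   # hypothetical credentials

def replication_lag(host: str) -> float:
    """Read Seconds_Behind_Master from SHOW SLAVE STATUS on one replica."""
    conn = mysql.connector.connect(host=host, **CREDS)
    try:
        cur = conn.cursor(dictionary=True)
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone() or {}
        lag = row.get("Seconds_Behind_Master")
        return float("inf") if lag is None else float(lag)
    finally:
        conn.close()

def promote(host: str) -> None:
    """Turn the chosen replica into a writable master."""
    conn = mysql.connector.connect(host=host, **CREDS)
    try:
        cur = conn.cursor()
        cur.execute("STOP SLAVE")                   # stop applying the old master's log
        cur.execute("RESET SLAVE ALL")              # drop the replication configuration
        cur.execute("SET GLOBAL read_only = OFF")   # allow writes on the new master
    finally:
        conn.close()

def failover() -> str:
    # Step 2: pick the replica with the smallest replication lag.
    new_master = min(REPLICAS, key=replication_lag)
    # Step 3: promote it (the old master is assumed to be fenced already).
    promote(new_master)
    return new_master
```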
4. Key Issues and Solutions
(1) Split-Brain Problem
- Cause: Network partitioning causes two nodes to simultaneously consider themselves as the master, leading to parallel writes and data conflicts.
- Solution:
- Arbitration Mechanism: Introduce a third-party arbitration node (e.g., ZooKeeper); only the node granted arbitration permission can become the master.
- Fencing (Isolation): Forcefully power off or revoke network access to the old master node.
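A toy sketch of the arbitration idea: a node may act as master only while it holds a lease granted by a third party, so two nodes can never hold the role at the same time. The Arbiter class and lease length are invented for illustration; real systems delegate this to ZooKeeper, etcd, or similar.

```python
import time

class Arbiter:
    """Toy third-party arbiter that grants a single, time-limited master lease."""
    def __init__(self, lease_seconds: float = 10.0):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, node_id: str) -> bool:
        now = time.monotonic()
        if self.holder is None or now >= self.expires_at or self.holder == node_id:
            self.holder = node_id
            self.expires_at = now + self.lease_seconds
            return True
        return False   # someone else holds the lease: do NOT become master

arbiter = Arbiter()
print(arbiter.acquire("node-a"))  # True : node-a may accept writes
print(arbiter.acquire("node-b"))  # False: node-b is blocked from becoming a second master
```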
(2) Data Loss Risk
- Scenario: With asynchronous replication, if the master node crashes before its latest transactions have been sent to any slave, those transactions may be lost.
- Solution:
- Semi-Synchronous Replication: Require at least one slave node to confirm before returning a write success (e.g., MySQL semi-synchronous replication).
- Delayed Replication: Configure a dedicated slave to apply changes with a delay (e.g., 1 hour); after a destructive or mistaken operation, the affected data can still be retrieved from the delayed replica.
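For reference, a sketch of how these two mitigations are commonly enabled on MySQL, using the classic (pre-8.0.26) plugin and variable names; the hosts, credentials, and the small helper function are assumptions.

```python
import mysql.connector

def run(host: str, statements, **creds) -> None:
    """Run a list of administrative statements on one server."""
    conn = mysql.connector.connect(host=host, **creds)
    try:
        cur = conn.cursor()
        for sql in statements:
            cur.execute(sql)
    finally:
        conn.close()

# Semi-synchronous replication: master waits for at least one replica ack.
run("db-primary.internal", [
    "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so'",
    "SET GLOBAL rpl_semi_sync_master_enabled = 1",
    "SET GLOBAL rpl_semi_sync_master_timeout = 1000",   # wait up to 1s for an ack
], user="ha_admin", password="secret")

run("db-replica-1.internal", [
    "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so'",
    "SET GLOBAL rpl_semi_sync_slave_enabled = 1",
    "STOP SLAVE IO_THREAD", "START SLAVE IO_THREAD",    # pick up the new setting
], user="ha_admin", password="secret")

# Delayed replication on a dedicated replica (applies changes 1 hour late).
run("db-replica-2.internal", [
    "STOP SLAVE SQL_THREAD",
    "CHANGE MASTER TO MASTER_DELAY = 3600",
    "START SLAVE SQL_THREAD",
], user="ha_admin", password="secret")
```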
5. Practical Case: MySQL High Availability Architecture
Architecture: MHA (Master High Availability)
- Components:
- MHA Manager: The monitoring node that manages the switchover process.
- MHA Node: An agent deployed on each MySQL node.
- Switchover Process:
- The MHA Manager detects the master node failure and selects the slave node with the most recent data.
- Fills in the transactions that were not yet replicated by replaying binary log (binlog) events (as sketched below), then promotes the new master.
- Automatically modifies VIP configuration or notifies the application layer to update connections.
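The sketch below shows only the general idea behind replaying the missing binlog events onto the most up-to-date replica, using the standard mysqlbinlog and mysql client tools. The file name, start position, host, and credentials are hypothetical; MHA automates this, including copying the log from the failed master.

```python
import subprocess

def replay_binlog_gap(binlog_file: str, start_pos: int, target_host: str) -> None:
    """Pipe the missing binlog events into the target server."""
    dump = subprocess.Popen(
        ["mysqlbinlog", f"--start-position={start_pos}", binlog_file],
        stdout=subprocess.PIPE,
    )
    subprocess.run(
        ["mysql", "-h", target_host, "-u", "ha_admin", "-psecret"],
        stdin=dump.stdout, check=True,
    )
    dump.stdout.close()
    dump.wait()

# Hypothetical file, position, and host.
replay_binlog_gap("/var/backup/binlog.000042", 120, "db-replica-1.internal")
```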
Summary
High availability architecture requires balancing consistency, availability, and fault tolerance:
- Weak Consistency Scenarios: Master-slave asynchronous replication + VIP floating can be used, offering fast switchover but potential data loss.
- Strong Consistency Scenarios: A Raft-based cluster solution is recommended, ensuring data safety but with higher complexity.
The core of failover lies in rapid detection, minimizing data loss, and avoiding split-brain, requiring design trade-offs based on business needs.