Database High Availability Architecture and Failure Recovery Mechanisms
Problem Description:
High Availability (HA) is a key metric of a database system's ability to provide continuous service. This topic requires you to understand how database systems minimize downtime through specific architectural designs, and how they automatically and rapidly detect and recover from faults such as hardware failures, network outages, or data corruption, thereby ensuring business continuity. The core is to master several mainstream high-availability architecture patterns and their underlying failover mechanisms.
Solution Process / Knowledge Explanation:
Step 1: Understand the Core Goals and Metrics of High Availability
- Goal: The fundamental goal of high availability is to eliminate "Single Points of Failure" (SPOF) in the system. The failure of any single component (e.g., server, disk, network switch) should not cause the entire database service to be unavailable for an extended period.
- Metrics: Availability is typically measured in "number of nines".
- 99.9%: Approximately 8.76 hours of annual downtime.
- 99.99%: Approximately 52.6 minutes of annual downtime.
- 99.999%: Approximately 5.26 minutes of annual downtime.
Database systems strive to achieve 99.99% or even 99.999% availability through high-availability architectures.
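These figures follow directly from the definition. A minimal sketch of the arithmetic in Python, assuming a 365-day year:

```python
# Annual downtime permitted by an availability target, assuming a 365-day year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def annual_downtime_minutes(availability: float) -> float:
    """Maximum yearly downtime, in minutes, for a given availability level."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60

for level in (0.999, 0.9999, 0.99999):
    print(f"{level:.3%} -> {annual_downtime_minutes(level):.1f} minutes/year")
# 99.900% -> 525.6 minutes/year (about 8.76 hours)
# 99.990% -> 52.6 minutes/year
# 99.999% -> 5.3 minutes/year
```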
Step 2: Master the Foundational Technical Components for Achieving High Availability
Before building complex architectures, it's essential to understand several foundational technologies:
- Data Redundancy: This is the cornerstone of high availability. Without data copies, a standby node cannot provide service if the primary node fails. Key technologies include:
- Master-Slave Replication: You are already familiar with its principle. A primary node (Master) handles write operations and propagates data changes asynchronously or synchronously to one or more replica nodes (Slaves). Slave nodes are primarily used for read scaling and backup. This is the most common foundation for high availability.
- Log Shipping: Periodically copying the primary database's transaction-log backups to a standby server and restoring them there. This redundancy method is simpler, but switchover is slower.
- Failure Detection: The system needs to detect node failures quickly.
- Mechanism: Typically implemented via a "Heartbeat" mechanism. The primary and standby nodes regularly (e.g., once per second) exchange small network packets (heartbeat packets). If a standby node does not receive a heartbeat from the primary within a predetermined window (e.g., 3 heartbeat intervals), it assumes the primary node "may" have failed; see the sketch after this list.
- Failover: The process of switching service traffic from a failed node to a healthy node after detecting a primary node failure.
- Manual Failover: Performed manually by operations staff who confirm the failure and execute the switch command. Recovery time is longer.
- Automatic Failover: The system automatically performs failure detection and switching. This is key to a high-availability architecture.
- Virtual IP (VIP) or Domain Name: Clients do not connect directly to the primary database's real IP address; instead, they connect through a virtual IP or a domain name that hides the backend servers. During failover, only this VIP needs to be moved ("drifted") to the new primary node, or the DNS record updated; clients do not need to change their configuration.
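To make the heartbeat mechanism concrete, here is a minimal Python sketch of the detection logic on a standby node; the one-second interval and three-miss threshold are the example values from the text, not fixed requirements:

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats (example value from the text)
MISSED_BEATS_ALLOWED = 3   # suspect a failure after 3 missed intervals

class HeartbeatMonitor:
    """Tracks the last heartbeat received from the primary on a standby node."""

    def __init__(self) -> None:
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self) -> None:
        # Called whenever a heartbeat packet arrives from the primary.
        self.last_heartbeat = time.monotonic()

    def primary_suspected_down(self) -> bool:
        # The primary "may" have failed; a real system still needs quorum
        # confirmation (see the Sentinel pattern in Step 3) before failing over.
        elapsed = time.monotonic() - self.last_heartbeat
        return elapsed > HEARTBEAT_INTERVAL * MISSED_BEATS_ALLOWED
```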
Step 3: Learn Mainstream High-Availability Architecture Patterns
Based on the above components, several typical architectures have emerged:
- Master-Slave Replication + Sentinel/Monitor
- Description: Builds upon master-slave replication by introducing one or more independent "Sentinel" processes. Sentinels are specifically responsible for monitoring the health status of primary and replica nodes.
- Workflow:
- Monitoring: Multiple Sentinel instances simultaneously monitor the primary node.
- Subjective Down: A Sentinel instance considers the primary node unresponsive and marks it as "subjectively down."
- Objective Down: When a sufficient number (e.g., a majority) of Sentinels consider the primary node down, it is marked as "objectively down," confirming a genuine failure.
- Election of a Leader Sentinel: The Sentinel cluster elects a leader to execute the failover.
- Failover: The leader Sentinel selects a healthy slave node based on rules (e.g., largest replication offset, highest priority) and promotes it to be the new primary node.
- Configuration Switch: Instructs other slave nodes to replicate from the new primary and notifies clients (by updating the VIP or publishing a message) to connect to the new primary.
- Representative: Redis Sentinel is a classic implementation of this pattern.
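The subjective/objective distinction above is essentially a quorum check. A simplified Python sketch of that voting logic (not Redis Sentinel's actual implementation; the three monitor names and the quorum of 2 are illustrative assumptions):

```python
# Simplified voting logic behind "subjective down" vs. "objective down".
# Each monitor forms a local opinion; only the combined vote triggers failover.

SENTINELS = ["sentinel-1", "sentinel-2", "sentinel-3"]  # hypothetical monitor names
QUORUM = 2                                              # votes needed for objective down

def objectively_down(local_opinions: dict[str, bool]) -> bool:
    """Objective down = enough sentinels independently consider the primary unreachable."""
    votes = sum(1 for s in SENTINELS if local_opinions.get(s, False))
    return votes >= QUORUM

# Example: two of the three sentinels have lost contact with the primary,
# so the failure is confirmed and a leader sentinel can start the failover.
opinions = {"sentinel-1": True, "sentinel-2": True, "sentinel-3": False}
print(objectively_down(opinions))  # True
```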
- Scheme Based on Cluster Coordination Services (e.g., ZooKeeper, etcd)
- Description: Uses a highly available distributed coordination service (e.g., ZooKeeper) to store cluster metadata (e.g., who is the current primary) and manage distributed locks. Database nodes interact with ZooKeeper as clients.
- Workflow:
- Upon startup, every database node connects to ZooKeeper and attempts to create a specific ephemeral node representing the "primary" role; the node that creates it first becomes the primary.
- Other nodes place a watch on this primary node. If the primary fails, its session with ZooKeeper expires and the ephemeral node it created disappears automatically.
- The watching slave nodes are notified immediately and compete to create the primary node themselves, completing automatic election and switchover.
- Advantage: Leverages ZooKeeper's own high availability and strong consistency, making it very reliable. Many distributed systems and middleware (e.g., Kafka) use this pattern.
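A minimal sketch of this election pattern using the kazoo ZooKeeper client; the connection string, znode path, and node name below are illustrative assumptions:

```python
# Leader election via an ephemeral znode, using the kazoo ZooKeeper client.
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")   # assumed local ZooKeeper ensemble
zk.start()

MY_NODE = b"db-node-2"                     # this database node's identity (hypothetical)
PRIMARY_PATH = "/cluster/primary"          # znode representing the primary role

def try_become_primary() -> None:
    """Attempt to claim the primary role; otherwise watch the current primary."""
    try:
        # Ephemeral: the znode disappears automatically if our session ends.
        zk.create(PRIMARY_PATH, MY_NODE, ephemeral=True, makepath=True)
        print("I am the primary")
    except NodeExistsError:
        # Someone else is primary; watch the znode so the election re-runs
        # the moment it vanishes (i.e., the primary's session expires).
        zk.exists(PRIMARY_PATH, watch=lambda event: try_become_primary())
        print("Standing by as a replica")

try_become_primary()
```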
- Shared-Storage Architecture (Shared-Disk)
- Description: The primary and standby nodes share the same set of data files stored on high-end storage devices (e.g., SAN). However, typically only the primary node can mount and read/write these files.
- Workflow:
- The primary node services normally, exclusively accessing the shared storage.
- When the primary node fails, high-availability management software (e.g., Pacemaker) first "fences" the failed node, ensuring the former primary can no longer access the shared storage.
- Then, the management software mounts the shared storage to the standby node and starts the database service on the standby node.
- Advantage: Only one copy of data exists, eliminating concerns about data synchronization delays.
- Disadvantage: The shared storage itself is a potential single point of failure (it needs its own high-availability measures) and is costly.
- Representative: Oracle RAC, a more advanced shared-everything variant in which multiple active nodes access the same storage.
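The critical detail in this workflow is ordering: the old primary must be fenced off before the standby touches the data. A Python sketch of the orchestration, with the helper functions as hypothetical stubs for what software such as Pacemaker actually drives:

```python
# Sketch of the shared-storage failover sequence. The helpers are hypothetical
# stand-ins for real fencing, mount, and service-management operations.

def fence_node(node: str) -> None:
    print(f"fencing {node}: cutting its access to the shared storage")

def mount_shared_storage(node: str) -> None:
    print(f"mounting the shared storage on {node}")

def start_database(node: str) -> None:
    print(f"starting the database service on {node}")

def fail_over(old_primary: str, standby: str) -> None:
    # Order matters: fence first, so two nodes can never write the same data files.
    fence_node(old_primary)
    mount_shared_storage(standby)
    start_database(standby)

fail_over("db-node-1", "db-node-2")
```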
- Multi-Master Replication / Distributed Consensus Protocols (e.g., Paxos, Raft)
- Description: This is a more modern, fully decentralized approach. The cluster has no statically designated "primary" and "slave" nodes; all nodes are peers and can accept client read and write requests (writes are routed internally to the current leader), and a consensus algorithm keeps the data consistent across replicas.
- Brief Analysis of the Raft Algorithm:
- Cluster nodes have three roles: Leader (equivalent to primary), Follower, Candidate (a temporary role during elections).
- Election Process: All nodes start as Followers. If no heartbeat is received from a Leader, a Follower waits for a random timeout period, then transitions to Candidate and initiates an election. The Candidate receiving votes from a majority (N/2+1) becomes the new Leader.
- Data Synchronization: All write requests are sent to the Leader. The Leader replicates operations as log entries to all Followers. Only after a majority of nodes have persisted the log does the Leader commit the entry and notify Followers to apply the change. This ensures data is not lost even if a minority of nodes fail.
- Advantage: Automatic failover, strong data consistency, no single point of failure.
- Representatives: etcd and Consul themselves use Raft; the cores of many NewSQL databases (e.g., TiDB, CockroachDB) are also built on such protocols.
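A compressed Python sketch of the election rule described above; real Raft also tracks terms, log indexes, and per-term vote records, which are omitted here:

```python
import random

CLUSTER_SIZE = 5
MAJORITY = CLUSTER_SIZE // 2 + 1   # N/2 + 1 votes are required to win

def election_timeout() -> float:
    # Randomized timeout (seconds) so followers rarely time out simultaneously
    # and split the vote; the 150-300 ms range is a common illustrative choice.
    return random.uniform(0.150, 0.300)

def election_result(votes_received: int) -> str:
    # A candidate becomes leader only with votes from a majority of the cluster;
    # the same majority rule governs when a log entry may be committed.
    return "leader" if votes_received >= MAJORITY else "follower"

print(f"wait {election_timeout():.3f}s before becoming a candidate")
print(election_result(3))  # "leader" in a 5-node cluster (3 >= 3)
```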
Step 4: Comparative Summary and Selection Considerations
The choice of architecture depends on business requirements:
- Consistency Requirements: Is strong consistency required (e.g., financial systems)? Raft/Paxos protocols guarantee it, while asynchronous master-slave replication may lose data.
- Performance Requirements: Synchronous replication ensures strong consistency but has higher latency; asynchronous replication has lower latency but may lead to inconsistency.
- Recovery Time Objective (RTO): Architectures with automatic switching (Sentinel, Raft) can complete failover in seconds, while manual switching or log shipping may take minutes.
- Risk of Data Loss (RPO): RPO ≈ 0 for synchronous replication, RPO > 0 for asynchronous replication.
- Cost and Complexity: Shared storage is costly; distributed database architectures based on Raft are complex but powerful.
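The RTO and RPO points above can be reasoned about with back-of-the-envelope numbers; the figures in this sketch are illustrative assumptions, not benchmarks:

```python
# Back-of-the-envelope RTO/RPO estimates (all figures are illustrative assumptions).

# RTO: failure detection + promotion of a new primary + redirecting clients.
detection_s, promotion_s, redirect_s = 3.0, 2.0, 1.0
rto_s = detection_s + promotion_s + redirect_s          # ~6 s for automatic failover

# RPO: with asynchronous replication, up to the current replication lag may be lost;
# with synchronous replication the primary waits for the replica, so RPO is ~0.
replication_lag_s = 0.5
rpo_async_s = replication_lag_s
rpo_sync_s = 0.0

print(f"RTO ~= {rto_s:.0f}s, RPO(async) ~= {rpo_async_s}s, RPO(sync) = {rpo_sync_s}s")
```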
Understanding the components, workflows, and trade-offs of these architectures enables you to design or select the most suitable high-availability database solution for a specific scenario.