Multi-Version Concurrency Control (MVCC) for Data in Distributed Systems
Description
MVCC is a technique used to handle concurrent read and write operations, improving system concurrency performance by maintaining multiple versions of data. In distributed systems, MVCC is widely used in databases (such as Google Spanner, CockroachDB) and distributed storage systems to achieve non-blocking read operations and high-concurrency transaction processing. Its core idea is that each write operation creates a new version of data, while read operations access an old, yet consistent, snapshot version, thereby avoiding mutual blocking between read and write operations.
Problem-Solving Process
-
Basic Concepts: Why is MVCC needed?
- Problem: In traditional locking mechanisms, read and write operations block each other. For example, a long write operation can block subsequent read operations, degrading read performance. Similarly, a read operation can also block writes. In high-concurrency distributed environments, such blocking severely limits system throughput and responsiveness.
- Solution Approach: The core insight of MVCC is that read and write operations do not inherently need to conflict directly. If a read operation only needs to read the state of the data at a certain past point in time, it does not need to care whether a write operation is currently modifying the latest state of the data. Based on this, MVCC allows multiple versions of data to coexist.
-
Core Mechanisms: Versions and Snapshots
- Data Versioning: MVCC systems maintain multiple historical versions for each piece of data (e.g., a row in a database). Each update (UPDATE) or delete (DELETE) operation does not directly overwrite or remove old data but creates a new version. An insert (INSERT) operation creates the first version.
- Version Identification: Each version requires a unique identifier to determine its "effective time" and "expiration time". The most commonly used identifiers are:
- Timestamp: Using physical clocks or hybrid logical clocks.
- Monotonically Increasing Transaction ID (TID): The system assigns a globally unique, incrementing ID to each transaction.
- Snapshot Isolation: Each transaction, upon starting, is assigned a "snapshot". This snapshot defines the range of data versions visible to that transaction. Typically, a transaction can only see data versions that were committed before it started. This provides the transaction with a consistent, static view of the database, as if a snapshot of the entire database was taken at the transaction's start.
-
MVCC Workflow
Let's understand through an example. Assume the system uses monotonically increasing Transaction IDs (TID) to mark versions.- Initial State: Data item
Xhas an initial value of5, inserted by transaction T1 (TID=10). We record it as versionX@10 = 5. - Transaction T2 (Write Transaction, TID=20):
- T2 wants to update
Xto10. - It does not directly modify
X@10but creates a new versionX@20 = 10. - The system records that version
X@10becomes "invalid" after TID=20, and the new versionX@20becomes "effective" from TID=20.
- T2 wants to update
- Transaction T3 (Read Transaction, started after T1 committed but before T2 commits, its snapshot TID=15):
- T3 wants to read
X. - The system traverses
X's version chain: the latest version isX@20, but its effective TID=20 > T3's snapshot TID=15, so T3 cannot see this version. The system continues searching, findsX@10, its effective TID=10 <= T3's snapshot TID=15, so T3 reads the value5. - Key Point: Even though T2 might be in the process of committing
X@20, T3's read operation is completely unaffected because it accesses an old, committed version. This achieves non-blocking reads.
- T3 wants to read
- Initial State: Data item
-
Version Garbage Collection
- Problem: If new versions are created indefinitely, storage space will be rapidly exhausted. Therefore, the system needs to periodically clean up old versions that are no longer needed by any transaction.
- Solution: The system needs to maintain information about the oldest snapshot TID (or oldest start time) among all currently running transactions. Any data version whose effective time is earlier than this "oldest snapshot TID" and has been overwritten by a newer version is absolutely safe to delete because no ongoing transaction will need to access it. This cleanup process is often called "vacuuming".
-
Challenges and Enhancements of MVCC in Distributed Environments
- Global Version Numbers/Timestamps: In a distributed system, assigning a globally monotonically increasing TID or a globally consistent timestamp to each transaction is itself a challenge. This typically requires mechanisms like the TrueTime API (e.g., Spanner), Hybrid Logical Clocks (HLC), or a centralized timestamp granting service.
- Distributed Garbage Collection: The garbage collection process becomes more complex in a distributed environment because it requires coordinating multiple nodes to determine a global "oldest snapshot" boundary, after which each node independently cleans up its local expired versions.
- Integration with Consensus: In distributed databases based on state machine replication (like Raft), data updates (i.e., creating new versions) must be replicated to a majority of nodes via a consensus protocol before the version is considered "committed" and effective. Read operations can then read from the local state, as long as it is ensured that the read version is committed and meets the requirements of the snapshot isolation level.
Summary
MVCC cleverly decouples read and write operations through data multi-versioning and snapshot isolation, making it a key technology for building high-performance, high-concurrency distributed storage systems. Its core advantage lies in providing consistent snapshots for read operations, where reads do not block writes and writes do not block reads. Implementing MVCC requires solving problems related to version management, snapshot determination, and garbage collection. In distributed environments, these problems are closely intertwined with foundational distributed issues like global ordering and data replication.