Multiprocessor Synchronization and Locking Mechanisms in Operating Systems

In multiprocessor systems, multiple CPU cores may access shared data simultaneously, presenting more complex synchronization challenges than single-processor systems. We will delve into this problem and its core solutions.

1. Problem Background: Why is Multiprocessor Synchronization More Complex?

In single-processor systems, threads run concurrently but only one executes on the CPU at any given moment. Because thread execution interleaves serially in time, race conditions on shared data can often be masked or avoided with simple techniques such as disabling interrupts or pure-software mutual-exclusion algorithms. In multiprocessor systems, however, threads execute truly in parallel on different CPUs, and their reads and writes to shared memory can occur at the same physical instant, fully exposing race conditions and making them more severe.

Core Challenge: Ensuring that modifications to shared data made by a thread on one CPU are observed correctly and promptly by threads running on other CPUs. This involves two key concepts: cache coherence and memory barriers.

2. Foundational Tool: Atomic Operations

Atomic operations are the building blocks for all high-level synchronization primitives (such as locks).

  • Definition: An atomic operation is indivisible during its execution. From the perspective of the rest of the system (including other processors), this operation either has not executed at all or has fully completed; no intermediate state is visible.
  • Hardware Support: Modern processors provide special instructions to implement atomic operations, for example:
    • Test-and-Set: Atomically sets a memory location to a new value and returns its old value.
    • Compare-and-Swap (CAS): Atomically compares the value of a memory location with an expected value; if they are equal, it swaps in a new value.
    • Fetch-and-Add: Atomically fetches the value of a memory location and increments it by a specified amount.
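
To make these primitives concrete, here is a minimal C11 sketch (assuming a compiler with <stdatomic.h>; the function names are illustrative) showing an atomic increment written both ways: directly with fetch-and-add, and with the general CAS retry loop that underlies most lock-free code:

    #include <stdatomic.h>

    // Fetch-and-Add: the hardware performs the read-modify-write in one step.
    void increment_faa(atomic_int *counter) {
        atomic_fetch_add(counter, 1);
    }

    // Compare-and-Swap: the general retry pattern. Read the current
    // value, compute the new one, and publish it only if no other
    // thread changed the value in between; otherwise retry.
    void increment_cas(atomic_int *counter) {
        int expected = atomic_load(counter);
        while (!atomic_compare_exchange_weak(counter, &expected,
                                             expected + 1)) {
            // CAS failed: expected now holds the latest value; retry.
        }
    }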

3. Evolution of Spinlocks: From Simple Implementation to Scalability Optimizations

We use atomic operations to build the simplest lock—the spinlock. When a thread attempts to acquire a lock that is already held, it repeatedly "spins" in a loop checking the lock until it is released.

  • Simple Spinlock (Non-viable Version):
    The initial idea might be to use a simple boolean variable lock (0 for free, 1 for busy).

    // Incorrect example!
    void lock(int *lock) {
        while (*lock == 1); // Wait for lock to be released
        *lock = 1;         // Acquire lock
    }
    void unlock(int *lock) {
        *lock = 0;
    }
    

    Problem: Checking the lock (while (*lock == 1)) and setting it (*lock = 1) are two separate steps, not one atomic operation. Two threads can both observe lock as 0 between the check and the set, both write 1, and both enter the critical section.

  • Correct Spinlock Using Test-and-Set:
    We use atomic operations to solve the above problem.

    void spin_lock(int *lock) {
        while (test_and_set(lock) == 1); // If the old value was 1, keep spinning; if it was 0, acquisition succeeded.
    }
    void spin_unlock(int *lock) {
        *lock = 0;
    }
    

    How it works: The test_and_set instruction atomically performs the "read-modify-write" operation. Only the first thread executing this instruction reads 0 and sets it to 1, thus acquiring the lock. Other threads will only read 1 and continue looping.
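
    For reference, the same lock can be written in portable C11, where atomic_flag_test_and_set plays the role of the test_and_set instruction. This is a minimal sketch; production spinlocks also add CPU pause hints and backoff:

    #include <stdatomic.h>

    typedef struct { atomic_flag held; } spinlock_t;  // initialize with ATOMIC_FLAG_INIT

    void spin_lock(spinlock_t *l) {
        // Atomically set the flag and get its old value: true means
        // another thread already holds the lock, so keep spinning.
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
            // spin
        }
    }

    void spin_unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }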

  • Performance Issues of Simple Spinlocks:
    While the correct spinlock above works on small systems, it suffers from severe scalability issues on multiprocessors:

    1. Bus/Memory Bandwidth Contention: Every CPU waiting for the lock keeps executing test_and_set. Each execution is a read-modify-write that must take exclusive ownership of the lock's cache line, invalidating the copies held by other CPUs. The resulting bus and cache coherence traffic slows down the whole system, including the lock holder. A common mitigation, the test-and-test-and-set (TTAS) pattern, is sketched after this list.
    2. Fairness Problem: There is no guarantee that the thread waiting the longest will acquire the lock first.
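
    The TTAS pattern spins on an ordinary read, which is served from the local cache and generates no coherence traffic while the lock is held; the expensive atomic operation is attempted only when the lock looks free. A minimal sketch, again assuming C11 atomics:

    #include <stdatomic.h>

    void ttas_lock(atomic_int *lock) {
        for (;;) {
            // Phase 1: spin on a plain load against the local cache copy.
            while (atomic_load_explicit(lock, memory_order_relaxed) == 1) {
                // wait quietly, generating no bus traffic
            }
            // Phase 2: the lock looks free; try the atomic exchange.
            if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
                return; // old value was 0: acquisition succeeded
            // Another CPU won the race; fall back to local spinning.
        }
    }

    void ttas_unlock(atomic_int *lock) {
        atomic_store_explicit(lock, 0, memory_order_release);
    }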

4. Advanced Locking Mechanisms: Improving Scalability and Fairness

To address the problems of simple spinlocks, researchers have designed more advanced locks.

  • Queueing Lock (MCS Lock):
    The MCS lock avoids global contention by having each waiting thread spin on a local variable.

    • Data Structure: The lock itself holds only a pointer to the tail of a queue of waiting threads. Each thread that wants the lock brings its own node (containing a locked flag and a next pointer) and appends it to the tail of the queue with an atomic operation.
    • Workflow:
      1. Acquiring the lock: The thread atomically swaps the queue's tail pointer to point at its own node. If the old tail was null, the queue was empty and the thread holds the lock immediately. Otherwise, it sets the next pointer of its predecessor (the old tail) to its own node and spins on its own node's locked flag.
      2. Releasing the lock: The releasing thread checks whether a successor node is waiting. If so, it hands the lock over by clearing the successor's locked flag (typically setting it to 0). If not, it tries to reset the tail pointer to null with a compare-and-swap; a failed CAS means a new waiter is mid-enqueue, so the releaser waits for the successor link to appear and then hands over the lock.
    • Advantages: Each thread spins on a different memory address, eliminating the cache line "ping-pong effect" that test-and-set (and even test-and-test-and-set) locks suffer, and yielding excellent scalability. The queue also makes the lock FIFO-fair. A C11 sketch follows this list.
  • Reader-Writer Locks:
    Designed for "read-mostly" scenarios, reader-writer locks allow multiple readers to enter the critical section simultaneously, but writers require exclusive access.

    • Implementation Idea: The lock internally maintains a reader counter and a writer flag. To acquire a read lock, the reader count is incremented if there is no writer. To acquire a write lock, it must wait until there are no readers and no writer.
    • Challenge: Avoiding writer "starvation", where a continuous stream of new readers keeps the lock read-held so a writer never gets in. Fair reader-writer locks therefore block new readers from acquiring the lock once a writer is waiting. A sketch of the counter-based idea follows the MCS example below.
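
To make the MCS workflow concrete, here is a minimal C11 sketch. The names (mcs_node, mcs_acquire, mcs_release) are illustrative, and production versions pad each node so it occupies its own cache line:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;
    } mcs_node;

    typedef struct {
        _Atomic(mcs_node *) tail;   // NULL when the lock is free
    } mcs_lock;

    void mcs_acquire(mcs_lock *lock, mcs_node *self) {
        atomic_store_explicit(&self->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&self->locked, true, memory_order_relaxed);
        // Atomically append our node to the tail of the queue.
        mcs_node *pred = atomic_exchange_explicit(&lock->tail, self,
                                                  memory_order_acq_rel);
        if (pred == NULL)
            return;                 // queue was empty: lock acquired
        // Link behind the predecessor, then spin only on our own flag.
        atomic_store_explicit(&pred->next, self, memory_order_release);
        while (atomic_load_explicit(&self->locked, memory_order_acquire)) {
            // local spinning: no shared cache line is touched
        }
    }

    void mcs_release(mcs_lock *lock, mcs_node *self) {
        mcs_node *succ = atomic_load_explicit(&self->next, memory_order_acquire);
        if (succ == NULL) {
            // No visible successor: try to swing the tail back to NULL.
            mcs_node *expected = self;
            if (atomic_compare_exchange_strong_explicit(
                    &lock->tail, &expected, NULL,
                    memory_order_acq_rel, memory_order_acquire))
                return;             // queue really was empty
            // A new waiter is mid-enqueue: wait for its link to appear.
            do {
                succ = atomic_load_explicit(&self->next, memory_order_acquire);
            } while (succ == NULL);
        }
        // Hand the lock to the successor by clearing its flag.
        atomic_store_explicit(&succ->locked, false, memory_order_release);
    }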
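
And a minimal sketch of the counter-based reader-writer idea. This version is deliberately reader-preference, so it exhibits exactly the starvation problem described above; a fair variant would add a "writer waiting" flag that turns away new readers:

    #include <stdatomic.h>

    // state > 0: that many active readers; state == -1: a writer is
    // inside; state == 0: free.
    typedef struct { atomic_int state; } rwlock_t;

    void read_lock(rwlock_t *rw) {
        for (;;) {
            int s = atomic_load_explicit(&rw->state, memory_order_relaxed);
            // Enter only if no writer holds the lock.
            if (s >= 0 && atomic_compare_exchange_weak_explicit(
                    &rw->state, &s, s + 1,
                    memory_order_acquire, memory_order_relaxed))
                return;
        }
    }

    void read_unlock(rwlock_t *rw) {
        atomic_fetch_sub_explicit(&rw->state, 1, memory_order_release);
    }

    void write_lock(rwlock_t *rw) {
        int expected = 0;
        // Wait until there are no readers and no writer, then claim it.
        while (!atomic_compare_exchange_weak_explicit(
                &rw->state, &expected, -1,
                memory_order_acquire, memory_order_relaxed))
            expected = 0;
    }

    void write_unlock(rwlock_t *rw) {
        atomic_store_explicit(&rw->state, 0, memory_order_release);
    }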

5. Memory Barriers: Ensuring Instruction Execution Order

In multiprocessor systems, compilers and CPUs reorder instructions for performance. This can produce a very counterintuitive result: even when code is correct in program order, its memory operations may become visible to other CPUs in a different order, breaking cross-thread assumptions.

  • Problem Example:

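    // Shared variables: flag is initially 0; data starts uninitialized.
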
    // Thread A (executing on CPU A)
    data = 42;  // 1. Write data
    flag = 1;   // 2. Set flag, indicating data is ready
    
    // Thread B (executing on CPU B)
    while (flag == 0); // 3. Wait for flag to be ready
    print(data);       // 4. Read data
    

    Logically, by the time thread B executes print(data), data should already be 42. However, due to reordering, CPU A may make flag=1 visible before data=42 (the two writes target independent locations, so the hardware sees no reason to order them). Meanwhile, CPU B's cache may hold the updated flag but a stale copy of data. The result is that thread B prints an uninitialized value.

  • Solution: Use memory barriers.

    • Write Barrier: Ensures that all write operations before the barrier become visible to other CPUs before any write operation after the barrier. Inserting a write barrier before flag=1 in Thread A guarantees that the result of data=42 is seen by other CPUs before flag=1.
    • Read Barrier: Ensures that read operations after the barrier are not reordered before read operations that precede it. Inserting a read barrier between the flag check and print(data) in Thread B guarantees that data is read only after flag has been observed as 1. A C11 version of the corrected example, expressed with acquire/release orderings, is sketched below.
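
    In C11 these barriers are usually expressed through release/acquire orderings on the flag itself rather than as standalone fence instructions. A minimal sketch of the corrected example:

    #include <stdatomic.h>
    #include <stdio.h>

    int data;                // ordinary shared variable
    atomic_int flag = 0;     // synchronization flag

    // Thread A: the release store acts as the write barrier, making
    // the write to data visible before the write to flag.
    void thread_a(void) {
        data = 42;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    // Thread B: the acquire load acts as the read barrier, preventing
    // the read of data from being reordered before the read of flag.
    void thread_b(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0) {
            // wait for the flag
        }
        printf("%d\n", data);   // guaranteed to print 42
    }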

Summary

Multiprocessor synchronization is a complex and profound topic. Its core aspects are:

  1. Foundation: Relying on atomic operations (like CAS) provided by hardware to build the basic units of mutual exclusion.
  2. Evolution: Starting from simple spinlocks, to address their performance bottlenecks, more advanced locks like the MCS lock were developed, improving scalability by changing how the wait queue is organized.
  3. Extension: Designing variants like reader-writer locks based on access pattern (read/write) specificity to increase concurrency.
  4. Guarantee: Using memory barriers to enforce constraints on memory access order. This is key to ensuring program correctness in multiprocessor environments and is typically encapsulated within lock API implementations.

Understanding these mechanisms helps us write system software that is both correct and efficient in today's multi-core era.