High-Performance Cache System Design: From Local Cache to Distributed Cache
Problem Description
In high-concurrency systems, caching is one of the core technologies to improve performance. Please elaborate on how to design a high-performance cache system, including the application scenarios, core challenges, and key technical solutions for both local cache and distributed cache.
Solution Process
Step 1: Understanding the Basic Value and Core Challenges of Caching
The core goals of caching are reducing data access latency and lowering the load on underlying data sources (such as databases). Its working principle is to temporarily store frequently accessed data in a faster storage medium (e.g., memory).
- Key Challenge 1: Data Consistency
When data in the underlying source changes, how do we ensure the cached copy is updated or invalidated in time, so the application never reads stale or dirty data?
- Key Challenge 2: Cache Eviction/Invalidation Policy
Memory is a limited resource, and it is impossible to store all data; we must decide which data to retain and which to evict.
- Key Challenge 3: Cache Penetration, Breakdown, and Avalanche
These are three typical abnormal scenarios under high concurrency; poor design can lead to system failures.
Step 2: Starting with the Simplest Local Cache
A local cache stores data directly in the application process's memory, offering extremely fast access speed and zero network overhead.
- Implementation Solutions:
- Native Programming Language Structures: Such as Java's HashMap or ConcurrentHashMap. These are very simple but lack advanced features like automatic invalidation.
- Professional Local Cache Libraries: Such as Google Guava's LoadingCache or Caffeine. They provide rich features.
- Core Features and Configuration:
- Capacity Control: Set the maximum capacity of the cache (based on the number of entries or memory size).
- Eviction/Invalidation Policy:
- Time-based: expireAfterWrite invalidates data a fixed time after it was written to the cache; for example, setting it to 10 minutes suits data that is not frequently updated. expireAfterAccess invalidates data a fixed time after it was last accessed, which is suitable for keeping hotspot data available.
- Size-based (Eviction Policy): LRU (Least Recently Used) evicts the data that has not been accessed for the longest time and is the most commonly used strategy. LFU (Least Frequently Used) evicts the data with the lowest access frequency.
- Data Loading: A LoadingCache can specify a CacheLoader. On a cache miss, the loader is automatically invoked to load the data from a data source such as a database and put it into the cache, as shown in the sketch below.
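A minimal Caffeine sketch of these features (capacity limit, expireAfterWrite, and a CacheLoader); the class name and the loadUserFromDb lookup are illustrative assumptions, not part of any real API:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

import java.time.Duration;

public class LocalCacheExample {

    // Hypothetical data-source lookup; in a real system this would hit the database.
    private static String loadUserFromDb(String userId) {
        return "user-row-for-" + userId;
    }

    public static void main(String[] args) {
        LoadingCache<String, String> userCache = Caffeine.newBuilder()
                .maximumSize(10_000)                       // capacity control by entry count
                .expireAfterWrite(Duration.ofMinutes(10))  // time-based invalidation
                .build(LocalCacheExample::loadUserFromDb); // CacheLoader invoked on a miss

        // The first call misses and triggers the loader; later calls within 10 minutes hit memory.
        System.out.println(userCache.get("42"));
        System.out.println(userCache.get("42"));
    }
}
```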
- Limitations of Local Cache:
- Limited Memory Capacity: Cannot store a large amount of data.
- Difficulty in Ensuring Data Consistency: In a clustered deployment, when one application instance updates its local cache, other instances' caches cannot be automatically updated, leading to data inconsistency.
- Cache Data Not Shared: Each application instance maintains its own cache, causing memory waste.
Step 3: Introducing Distributed Cache to Address the Pain Points of Local Cache
A distributed cache stores data in an independent, scalable cluster. All application instances access this unified caching service, such as Redis or Memcached.
- Core Advantages:
- Data Sharing: All application instances access the same data, ensuring consistency.
- Scalability: The capacity and performance of the cache cluster can be dynamically expanded.
- Rich Data Structures: Such as strings, lists, hashes, sets provided by Redis, supporting more complex business scenarios.
- Key Technical Solutions:
- High Availability: Typically adopts a master-slave replication model. The master node handles writes, and slave nodes handle reads. If the master node fails, a slave can be promoted to master, ensuring service availability.
- Persistence: Although it is usually acceptable for cached data to be lost, persistence prevents a cold start from hitting the database with a flood of misses. Redis supports RDB (point-in-time snapshots) and AOF (a log of all write commands).
- Cluster Mode: When the data volume is massive, a single instance cannot handle it. Sharding technology is needed to distribute data across multiple nodes. Redis Cluster is the official solution, using a hash slot mechanism for data sharding.
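To make the hash slot mechanism concrete, here is a small sketch of how a key maps to one of the 16384 slots via CRC16(key) mod 16384; it deliberately ignores {hash tag} handling, and the class and method names are just for illustration:

```java
import java.nio.charset.StandardCharsets;

public class HashSlotExample {

    // CRC16-CCITT (XMODEM), the checksum variant Redis Cluster uses for key hashing.
    static int crc16(byte[] bytes) {
        int crc = 0x0000;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = (crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // Simplified slot calculation; a full client would first extract a {hash tag} if present.
    static int hashSlot(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        // Keys spread across slots 0..16383; the cluster assigns slot ranges to nodes.
        System.out.println(hashSlot("user:42"));
        System.out.println(hashSlot("order:2024"));
    }
}
```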
Step 4: Addressing Classic Cache Problems in High-Concurrency Scenarios
This is a crucial part of interviews. A clear understanding of the differences and solutions for the three problems is required.
- Cache Penetration
- Problem Description: A large number of requests query for data that does not exist in the database at all, so every request misses the cache and goes straight to the database.
- Solutions:
- Cache Null Objects: Even when the database returns nothing, store a null placeholder in the cache with a relatively short expiration time, so subsequent requests for the same key are intercepted at the cache layer.
- Bloom Filter: Place a Bloom filter in front of the cache. With minimal space overhead, it can quickly determine that a key definitely does not exist in the database; for such keys, return immediately without touching the cache or the database. A sketch of both countermeasures follows this list.
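A minimal sketch combining both countermeasures, assuming a Guava BloomFilter pre-populated with all existing ids, a Jedis client, and a hypothetical findUserInDb lookup:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import redis.clients.jedis.Jedis;

import java.nio.charset.StandardCharsets;

public class PenetrationGuard {

    // Pre-populated at startup with every id that actually exists in the database.
    private final BloomFilter<CharSequence> existingIds =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    private final Jedis redis = new Jedis("localhost", 6379);

    // Hypothetical database lookup; returns null when the row does not exist.
    private String findUserInDb(String id) {
        return null;
    }

    public String getUser(String id) {
        // 1. Bloom filter: "definitely not in the database" -> reject without touching cache or DB.
        if (!existingIds.mightContain(id)) {
            return null;
        }
        String cached = redis.get("user:" + id);
        if (cached != null) {
            return "NULL".equals(cached) ? null : cached; // "NULL" is the cached null object
        }
        String fromDb = findUserInDb(id);
        if (fromDb == null) {
            // 2. Cache a null object with a short TTL so repeated misses stop at the cache layer.
            redis.setex("user:" + id, 60, "NULL");
            return null;
        }
        redis.setex("user:" + id, 600, fromDb);
        return fromDb;
    }
}
```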
- Cache Breakdown
- Problem Description: At the moment a hotspot key expires in the cache, a large number of concurrent requests arrive, all find the cache empty, and all go to the database to reload the data, causing a sudden surge in database pressure.
- Solutions:
- Mutual Exclusion Lock (Mutex Lock): The first request that finds the cache empty acquires a distributed lock (e.g., via Redis's SETNX command), loads the data from the database, and rebuilds the cache. Meanwhile, other requests either wait for the lock to be released and then read the new cache entry, or return a default value. A sketch follows this list.
- Logical Expiration: Do not set a TTL (Time To Live) on the key, so it never physically expires. Instead, store a logical expiration time inside the value. When a request sees that the logical time has passed, it triggers a separate thread to rebuild the cache asynchronously while the current thread returns the old data, so data is always available.
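A minimal sketch of the mutex approach, assuming a Jedis client and a hypothetical loadHotDataFromDb; it uses SET with the NX and EX options (the atomic equivalent of SETNX plus an expiry) so the lock cannot be held forever if its holder crashes:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class BreakdownGuard {

    private final Jedis redis = new Jedis("localhost", 6379);

    // Hypothetical reload of the hotspot data from the database.
    private String loadHotDataFromDb() {
        return "fresh-value";
    }

    public String getHotData() throws InterruptedException {
        String value = redis.get("hot:key");
        if (value != null) {
            return value;
        }
        // Acquire the rebuild lock atomically: NX = only if absent, EX = auto-release after 10s.
        String acquired = redis.set("lock:hot:key", "1", SetParams.setParams().nx().ex(10));
        if ("OK".equals(acquired)) {
            try {
                value = loadHotDataFromDb();
                redis.setex("hot:key", 600, value);  // rebuild the cache with a fresh TTL
                return value;
            } finally {
                redis.del("lock:hot:key");           // release the lock
            }
        }
        // Lost the race: wait briefly, then retry reading the rebuilt cache (retry bound omitted).
        Thread.sleep(50);
        return getHotData();
    }
}
```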
- Cache Avalanche
- Problem Description: At a certain moment, a large number of cache keys expire simultaneously (e.g., cache service restart, or setting the same expiration time). This causes all requests to flood the database, potentially crashing it.
- Solutions:
- Set Random Expiration Times: When setting expiration times for cached data, add a random value to the base time (e.g., a random number between 1-5 minutes) to avoid a massive number of keys expiring at the same moment.
- Build a Highly Available Cache Cluster: Use the master-slave replication and cluster modes mentioned earlier to prevent the entire cache service from going down.
- Service Degradation and Circuit Breaking: When detecting excessive database pressure, return predefined default values (degradation) for non-core services, or temporarily stop the service (circuit breaking) to protect the database.
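A small sketch of expiration jitter for the first countermeasure above, assuming a Jedis client; the base TTL and jitter window are arbitrary example values:

```java
import redis.clients.jedis.Jedis;

import java.util.concurrent.ThreadLocalRandom;

public class AvalancheGuard {

    private final Jedis redis = new Jedis("localhost", 6379);

    // Write with a base TTL plus a random offset so keys loaded together do not expire together.
    public void putWithJitter(String key, String value) {
        int baseTtlSeconds = 30 * 60;                                       // 30 minutes
        int jitterSeconds = ThreadLocalRandom.current().nextInt(60, 300);   // 1-5 extra minutes
        redis.setex(key, baseTtlSeconds + jitterSeconds, value);
    }
}
```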
Step 5: Best Practice – Multi-Level Cache Architecture
In practical large-scale systems, local cache and distributed cache are often combined to form a multi-level cache, balancing performance and consistency.
- Typical Architecture: Application -> Local Cache -> Distributed Cache -> Database
- Workflow (a read-path sketch follows the list):
- A request arrives at the application and first queries the local cache (e.g., Caffeine).
- If there's a local cache hit, return the data directly.
- If there's a local cache miss, query the distributed cache (e.g., Redis).
- If there's a distributed cache hit, return the data to the application and also write it to the local cache (with a relatively short expiration time).
- If there's also a distributed cache miss, query the database, write the result to Redis, and return it to the application.
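A compact sketch of this read path, assuming a Caffeine cache as the local level, a Jedis client for Redis, and a hypothetical queryDb loader; the TTLs are illustrative:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import redis.clients.jedis.Jedis;

import java.time.Duration;

public class MultiLevelCache {

    // L1: short-lived in-process cache (absorbs hotspot reads, tolerates brief staleness).
    private final Cache<String, String> local = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofSeconds(30))
            .build();

    // L2: shared distributed cache.
    private final Jedis redis = new Jedis("localhost", 6379);

    // Hypothetical database query; returns null if the row does not exist.
    private String queryDb(String key) {
        return "value-from-db";
    }

    public String get(String key) {
        String value = local.getIfPresent(key);  // check the local cache first
        if (value != null) {
            return value;
        }
        value = redis.get(key);                  // then the distributed cache
        if (value == null) {
            value = queryDb(key);                // finally the database
            if (value != null) {
                redis.setex(key, 600, value);    // backfill Redis
            }
        }
        if (value != null) {
            local.put(key, value);               // backfill the local cache (short TTL)
        }
        return value;
    }
}
```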
- Data Synchronization: When data is updated, both the distributed cache and the local caches on various application nodes need to be cleared or updated. This can usually be achieved through a publish-subscribe pattern: after updating the database, publish a message to have all application instances clean up their local caches.
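A minimal sketch of that publish-subscribe invalidation, assuming Jedis and a hypothetical channel name cache-invalidation; each application instance runs the subscriber and evicts the affected key from its local cache when a message arrives:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

public class LocalCacheInvalidator {

    private final Cache<String, String> local = Caffeine.newBuilder().maximumSize(10_000).build();

    // Publisher side: after updating the database (and Redis), broadcast the changed key.
    public void publishInvalidation(String key) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.publish("cache-invalidation", key);
        }
    }

    // Subscriber side: runs on every application instance, typically on a dedicated thread.
    public void startSubscriber() {
        new Thread(() -> {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                jedis.subscribe(new JedisPubSub() {
                    @Override
                    public void onMessage(String channel, String message) {
                        local.invalidate(message);   // evict the stale entry from the local cache
                    }
                }, "cache-invalidation");
            }
        }, "cache-invalidation-subscriber").start();
    }
}
```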
Through this five-step progression, from a simple local cache to a multi-level architecture, a clear blueprint for a high-performance, highly available cache system emerges.