Hotspot Key Issue and Solutions in Distributed Caching
Problem Description:
In a distributed caching system under high concurrent access, when a specific key (a hotspot key) receives a large number of requests in a short period, that key is still routed to a single server node in the cluster even though the cache layer itself is distributed. That node then bears immense access pressure in a short time, potentially leading to soaring CPU usage, saturated bandwidth, exhausted connections, and ultimately service unavailability. Please analyze this problem and discuss systematic solutions.
Knowledge Explanation:
I. Problem Nature and Hazards
- Core Issue: Distributed caching systems (e.g., Redis Cluster) use algorithms like consistent hashing to distribute different keys across different nodes for load balancing. However, the granularity of this distribution is the key itself: if a single key becomes hot, all requests for that key hit the same node (a short routing sketch follows the hazards list below).
- Triggering Scenarios:
- Hot News/Weibo Posts: The cache key corresponding to the ID of a breaking news story.
- Top Celebrities/Influencers: The cache key for a celebrity's detailed information page.
- Flash Sale/Promotional Items: The cache key for stock information of a popular flash-sale product.
- Resulting Hazards:
- Physical Server Overload: The target cache node may crash due to resource exhaustion (CPU, memory, network bandwidth, connections).
- Service Avalanche: After the node crashes, requests for the hotspot key fall through the cache and penetrate to the database, potentially overwhelming it instantly and causing widespread service failure.
- Data Inconsistency: During periods of high node pressure or downtime, the data for that key may not be updated correctly.
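To see why the cluster's own sharding cannot spread a single hot key, recall how Redis Cluster routes keys: the hash slot is CRC16(key) mod 16384, and each slot is owned by exactly one master node. The minimal sketch below computes that slot with the CRC16-CCITT (XMODEM) variant Redis uses (hash tags are ignored for brevity); because the same key always yields the same slot, every request for it lands on the same node.

```java
import java.nio.charset.StandardCharsets;

public class SlotRouting {

    /** CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0x0000, as used by Redis Cluster. */
    static int crc16(byte[] bytes) {
        int crc = 0x0000;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    /** HASH_SLOT = CRC16(key) mod 16384 (hash tags {...} omitted for brevity). */
    static int slot(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        // The hot key always maps to the same slot, hence the same master node,
        // no matter how many requests arrive for it.
        System.out.println("product_info_123 -> slot " + slot("product_info_123"));
        // Distinct keys generally land on different slots, which is why sharding
        // balances "many keys" but not "many requests for one key".
        System.out.println("product_info_124 -> slot " + slot("product_info_124"));
    }
}
```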
II. Evolution of Solution Approaches
The core idea behind all of the solutions is the same: spread the hotspot pressure that would otherwise concentrate on a single node across multiple nodes in the system.
Solution 1: Local Cache + Random Expiration (Client-side Solution)
This is the simplest and most direct solution, implemented at the application layer (client-side).
- Procedure:
- Level 1 Cache: Before querying the distributed cache (e.g., Redis), first query the application server's local cache (e.g., Guava Cache, Caffeine).
- Cache Hotspot Data: When a key is identified as a definite hotspot (e.g., via a configured list of hotspot keys), the application server, after retrieving the data from Redis, also caches a copy in local memory.
- Set Short Expiration: To avoid prolonged inconsistency between local cache and central cache data, set a short expiration time for the local cache (e.g., 1-5 seconds).
- Add Random Jitter: To prevent local caches on all servers from expiring simultaneously, causing a surge of requests back to Redis, add a random value to the expiration time (e.g., base 3 seconds ± random 2 seconds).
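As a concrete illustration of Solution 1, here is a minimal sketch of a jittered short-TTL local cache built with Caffeine (one of the libraries named above). The `fetchFromRedis` method is a hypothetical placeholder for whatever Redis client the application actually uses, and the 3 s ± 2 s figures simply follow the example in the list; for brevity the sketch caches every key, whereas in practice only keys on the configured hotspot list would be stored locally.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Expiry;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class HotKeyLocalCache {

    // L1 cache: each entry lives for a short, randomized period (base 3 s +/- up to 2 s)
    // so that local copies on different app servers do not all expire at the same instant.
    private final Cache<String, String> localCache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfter(new Expiry<String, String>() {
                @Override
                public long expireAfterCreate(String key, String value, long currentTime) {
                    long jitterMs = ThreadLocalRandom.current().nextLong(-2000, 2001);
                    return TimeUnit.MILLISECONDS.toNanos(3000 + jitterMs);
                }
                @Override
                public long expireAfterUpdate(String key, String value, long currentTime, long currentDuration) {
                    return currentDuration; // keep the TTL chosen at creation
                }
                @Override
                public long expireAfterRead(String key, String value, long currentTime, long currentDuration) {
                    return currentDuration; // reads do not extend the TTL
                }
            })
            .build();

    /** Check the local cache first; only fall back to Redis on a local miss. */
    public String get(String key) {
        return localCache.get(key, this::fetchFromRedis);
    }

    /** Hypothetical placeholder for the real Redis lookup (Jedis, Lettuce, etc.). */
    private String fetchFromRedis(String key) {
        // In a real system this would call the distributed cache and,
        // on a miss there, fall back to the database.
        return "value-for-" + key;
    }
}
```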
- Advantages: Simple implementation, effectively reduces request pressure on the Redis hotspot node.
- Disadvantages:
- Poor data consistency, with short-term data staleness.
- Consumes application server memory.
- Ineffective for unpredictable, suddenly emerging hotspot keys.
Solution 2: Hot Key Detection and Splitting (Server-side Solution)
This solution is implemented on the cache server side (or a proxy layer), transparent to the application.
- Procedure:
- Real-time Monitoring: In the cache proxy layer (e.g., Codis Proxy) or on each Redis node, monitor the access frequency of each key in real-time. When a key's access rate within a unit time exceeds a preset threshold, it is identified as a hotspot key.
- Automatic Splitting: The system automatically splits this hotspot key into multiple new keys. For example, the original hotspot key `product_info_123` can be split into `product_info_123_1`, `product_info_123_2`, and `product_info_123_3`.
- Data Synchronization: Set the value for all these split keys to the same data.
- Request Distribution: At the proxy layer, when a request for `product_info_123` is received, instead of always accessing the same key, a rule (such as random or round-robin selection) is used to pick one key from the split list (`..._1`, `..._2`, `..._3`) to access. This distributes the traffic evenly across different nodes in the cluster.
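To make the detection-and-splitting idea concrete, here is a minimal single-process sketch of the logic a proxy might apply: count accesses per key in a fixed one-second window, and once a key exceeds a threshold, rewrite reads to a randomly chosen suffixed replica (the `_1`/`_2`/`_3` naming from the example above). The threshold, replica count, and window length are illustrative assumptions, not values taken from any particular middleware.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

public class HotKeySplittingProxy {

    private static final long HOT_THRESHOLD_PER_WINDOW = 5_000; // assumed per-window threshold
    private static final int REPLICA_COUNT = 3;                 // number of split keys (_1 .. _3)
    private static final long WINDOW_MILLIS = 1_000;            // counting window

    // Access counters for the current window, reset when the window rolls over.
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();
    private volatile long windowStart = System.currentTimeMillis();

    /** Record one access and decide whether the key is currently "hot". */
    public boolean isHot(String key) {
        long now = System.currentTimeMillis();
        if (now - windowStart >= WINDOW_MILLIS) {
            counters.clear();          // coarse reset; real systems use sliding windows
            windowStart = now;
        }
        long count = counters.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
        return count > HOT_THRESHOLD_PER_WINDOW;
    }

    /** Rewrite a read: hot keys are spread over product_info_123_1 .. _3 style replicas. */
    public String routeReadKey(String key) {
        if (!isHot(key)) {
            return key;
        }
        int replica = ThreadLocalRandom.current().nextInt(1, REPLICA_COUNT + 1);
        return key + "_" + replica;
    }

    /** On writes, every replica must receive the same value so reads stay consistent. */
    public String[] routeWriteKeys(String key) {
        String[] keys = new String[REPLICA_COUNT + 1];
        keys[0] = key;
        for (int i = 1; i <= REPLICA_COUNT; i++) {
            keys[i] = key + "_" + i;
        }
        return keys;
    }
}
```

Production systems typically run this logic inside the proxy or in a dedicated hot-key detection component, and use sliding windows or sampling rather than the coarse per-window reset shown here.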
- Advantages: Non-invasive to business code, capable of dynamically handling sudden hotspots.
- Disadvantages: Complex architecture; requires custom development of cache middleware or the use of an advanced cache version that already provides this feature.
Solution 3: Multi-level Cache Architecture (Architecture-level Solution)
This is a more thorough solution, building a multi-layered caching system to smooth traffic.
- Procedure:
- L1 - Local Cache (Application Layer): As described in Solution 1, using Guava/Caffeine.
- L2 - Distributed Sharded Cache Cluster (Cache Layer): Such as Redis Cluster, handling the majority of cached data.
- Introduce Cache Proxy or Gateway: Add a layer like Nginx or a specialized cache proxy between the application layer and the Redis cluster. This layer can also provide caching.
- Proxy Layer Caching: Caching can also be configured at the Nginx proxy layer. For hotspot requests, once the first request has passed through Nginx to the application server (which reads from Redis), Nginx caches the response; subsequent identical requests are answered directly at the Nginx layer without ever reaching the application server, significantly reducing backend pressure.
- Combined Use: Local cache + Nginx proxy cache + Redis cluster, forming a multi-level defense line.
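The read path of such a multi-level setup can be summarized as a simple fall-through chain. The sketch below shows only the application-side view (the Nginx proxy cache sits in front of the application and is configured in Nginx itself, not in code); `LocalCache`, `RedisClient`, and `Database` are hypothetical interfaces standing in for the concrete components.

```java
import java.util.Optional;

/** Application-side view of the multi-level read path: L1 local -> L2 Redis -> database. */
public class MultiLevelReader {

    // Hypothetical abstractions over the concrete components named in the text.
    interface LocalCache { Optional<String> get(String key); void put(String key, String value); }
    interface RedisClient { Optional<String> get(String key); void set(String key, String value); }
    interface Database { String query(String key); }

    private final LocalCache l1;
    private final RedisClient l2;
    private final Database db;

    public MultiLevelReader(LocalCache l1, RedisClient l2, Database db) {
        this.l1 = l1;
        this.l2 = l2;
        this.db = db;
    }

    public String read(String key) {
        // 1. L1: the in-process cache absorbs repeated hits on the hotspot key.
        Optional<String> local = l1.get(key);
        if (local.isPresent()) {
            return local.get();
        }
        // 2. L2: the distributed cache cluster serves most remaining traffic.
        Optional<String> remote = l2.get(key);
        if (remote.isPresent()) {
            l1.put(key, remote.get());   // backfill L1 (with a short, jittered TTL)
            return remote.get();
        }
        // 3. Last resort: the database, followed by backfilling both cache levels.
        String value = db.query(key);
        l2.set(key, value);
        l1.put(key, value);
        return value;
    }
}
```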
- Advantages: Strongest defensive capability, effective against various traffic spikes.
- Disadvantages: Very complex system architecture, high data consistency maintenance cost, and significant operational complexity.
Summary and Comparison
| Solution | Implementation Level | Advantages | Disadvantages | Applicable Scenarios |
|---|---|---|---|---|
| Local Cache + Random Expiration | Application Layer (Client) | Simple implementation, quick results | Weak consistency, ineffective for unknown hotspots | Foreseeable hotspots, scenarios with low consistency requirements |
| Hot Key Detection and Splitting | Cache Service/Proxy Layer | Transparent to application, dynamic response | Complex architecture, requires custom development | Large-scale high-concurrency systems with middleware R&D capabilities |
| Multi-level Cache Architecture | System Architecture Layer | Strongest defense, high performance | Complex architecture, high operational cost | Ultra-large-scale internet applications, e.g., e-commerce flash sales, social media hotspots |
In practical production, multiple solutions are often combined. For example, Solution 2 (Hot Key Detection and Splitting) may serve as the primary defense, while for the most critical core business (such as flash sales), Solution 1 (Local Cache) is used proactively for pre-warming and reinforcement.