Analysis and Solutions for Redis Cache Avalanche, Penetration, and Breakdown
Problem Description
In high-concurrency systems, caching (e.g., Redis) is widely used to enhance performance. However, when cache anomalies occur, requests may directly overwhelm the backend database, causing severe performance issues or even service unavailability. Among these, cache avalanche, cache penetration, and cache breakdown are three typical scenarios that must be prevented.
- Cache Avalanche: Refers to a large number of cache keys expiring at the same moment (or the cache service itself going down). Massive numbers of requests then miss the cache and flood the backend database, causing a sudden surge in pressure or even a crash.
- Cache Penetration: Refers to querying data that does not exist in the database. Since the data is also absent in the cache, each request directly accesses the database. If malicious attackers continuously launch large volumes of such requests, the database will be overwhelmed.
- Cache Breakdown: Refers to a hotspot key expiring at the very moment a large number of requests are accessing it. These requests then rush to the database together, potentially overwhelming it. The difference from an avalanche: a breakdown targets a single hotspot key, while an avalanche involves many keys.
Below, we analyze the solutions for each problem step by step.
I. Solutions for Cache Avalanche
The core idea is to avoid a large number of keys expiring at the same time.
Step 1: Set Random Expiration Times
Do not set the same expiration time (TTL) for all cache keys. For example, if business logic requires caching for approximately one hour, we can add a random offset to the base expiration time when setting the key.
- Example: Set the key's TTL to `3600 + random.nextInt(600)`, i.e., 1 hour plus a random offset of 0 to 10 minutes. The keys' expiration times are then spread between 1 hour and 1 hour 10 minutes, preventing collective expiration.
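As a minimal sketch of this, assuming the Jedis client and a local Redis instance (the class name, key, and 600-second jitter window are illustrative choices):

```java
import redis.clients.jedis.Jedis;
import java.util.concurrent.ThreadLocalRandom;

public class RandomTtlCache {
    // Single connection for brevity; a JedisPool would be used in practice.
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void cacheWithJitter(String key, String value) {
        // Base TTL of 1 hour plus 0-599 seconds of random jitter,
        // so keys written together do not all expire together.
        int ttlSeconds = 3600 + ThreadLocalRandom.current().nextInt(600);
        jedis.setex(key, ttlSeconds, value);
    }
}
```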
Step 2: Build a Highly Available Cache Architecture
If the "avalanche" is caused by the Redis service itself crashing, we need to ensure service availability at the architectural level.
- Solution: Adopt Redis Sentinel mode or Cluster mode. When the master node fails, Sentinels can automatically perform failover, electing a new master to ensure uninterrupted service.
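For illustration, the client connects through the sentinels rather than a fixed master address, so a failover is transparent to application code. The sketch below assumes Jedis, hypothetical sentinel addresses, and `mymaster` as the monitored master name:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisSentinelPool;
import java.util.HashSet;
import java.util.Set;

public class SentinelClientExample {
    public static void main(String[] args) {
        // Hypothetical sentinel addresses; replace with your deployment's.
        Set<String> sentinels = new HashSet<>();
        sentinels.add("192.168.1.10:26379");
        sentinels.add("192.168.1.11:26379");
        sentinels.add("192.168.1.12:26379");

        // The pool asks the sentinels for the current master; after a
        // failover it transparently hands out connections to the new master.
        try (JedisSentinelPool pool = new JedisSentinelPool("mymaster", sentinels);
             Jedis jedis = pool.getResource()) {
            jedis.set("greeting", "hello");
        }
    }
}
```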
Step 3: Enable Service Degradation and Circuit Breaking
As a final protective measure, when excessive database pressure is detected, the system can automatically trigger degradation strategies.
- Example: Use circuit breaker components like Hystrix. When the failure rate of database access exceeds a threshold, the circuit breaker "opens," and subsequent requests no longer access the database. Instead, they directly return a default value (e.g., empty value, error page), protecting the database. Recovery is attempted after a period.
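A minimal Hystrix sketch of this pattern (the command name, group key, and `queryDatabase` placeholder are illustrative):

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps a database lookup; when failures exceed the configured threshold,
// Hystrix opens the circuit and routes calls straight to getFallback().
public class UserQueryCommand extends HystrixCommand<String> {
    private final long userId;

    public UserQueryCommand(long userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserDb"));
        this.userId = userId;
    }

    @Override
    protected String run() {
        // Hypothetical database call; may throw on overload or failure.
        return queryDatabase(userId);
    }

    @Override
    protected String getFallback() {
        // Degraded default response returned while the circuit is open.
        return "";
    }

    private String queryDatabase(long userId) {
        throw new UnsupportedOperationException("placeholder for a real DAO call");
    }
}
```

Calling `new UserQueryCommand(42L).execute()` runs the query through the circuit breaker; once the failure rate trips the breaker, `execute()` returns the fallback immediately without touching the database.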
Step 4: Cache Never Expires (Use with Caution)
For extremely critical and infrequently changing data, consider setting the cache to never expire. Then, use a background job to asynchronously update the cache periodically. This avoids expiration-related issues but increases complexity regarding data consistency.
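One way to sketch the background refresh, assuming Jedis and an illustrative `config:site` key rebuilt every 30 minutes:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import redis.clients.jedis.Jedis;

public class CacheRefresher {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Jedis jedis = new Jedis("localhost", 6379); // assumed local Redis

    public void start() {
        // The key is written with plain SET (no TTL), so it never expires;
        // this job refreshes its value from the database on a fixed schedule.
        scheduler.scheduleAtFixedRate(() -> {
            String latest = loadFromDatabase(); // hypothetical DB loader
            jedis.set("config:site", latest);
        }, 0, 30, TimeUnit.MINUTES);
    }

    private String loadFromDatabase() {
        return "..."; // placeholder for a real query
    }
}
```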
II. Solutions for Cache Penetration
The core idea is to intercept queries for non-existent data before they reach the database.
Step 1: Parameter Validation and Filtering
The simplest and most effective step. Perform validity checks on request parameters at the API gateway or business logic layer. For example, if querying user information with a negative or obviously invalid user ID, return an error directly without accessing the cache or database.
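A minimal sketch of such a boundary check (the class and method names are illustrative):

```java
// Reject obviously invalid ids before any cache or database access happens.
public class UserQueryValidator {
    public String getUser(long userId) {
        if (userId <= 0) { // negative or zero ids can never exist
            throw new IllegalArgumentException("invalid user id: " + userId);
        }
        return lookup(userId); // continue: cache first, then database
    }

    private String lookup(long userId) {
        return "..."; // placeholder for the cache/DB flow
    }
}
```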
Step 2: Cache Empty Values
When a database query returns empty, we still cache this "empty result" (e.g., null, "") with a short expiration time (e.g., 5 minutes).
- Example: `redis.setex("user:99999", 300, "NULL")`. Subsequent requests for the same non-existent `user:99999` are then intercepted at the cache layer for 5 minutes. Note: to limit the memory wasted when attackers rotate through many different IDs, the expiration time for empty-value caching should be short.
Step 3: Use a Bloom Filter
This is a more efficient and space-saving solution. A Bloom Filter is a probabilistic data structure used to quickly determine whether an element definitely does not exist in a set.
- Preparation: On system startup, preload all existing keys (e.g., all valid user IDs) from the database into the Bloom Filter.
- Query Process:
- When a request arrives, first use the Bloom Filter to determine if the queried key exists.
- If judged as "non-existent": The key is definitely not in the database. Return an empty result directly without querying the cache or database.
- If judged as "exist": There is a small probability of false positives (but the key is highly likely to exist). Proceed with the normal flow of querying the cache and database.
- Advantage: Minimal memory usage, effectively resisting large-scale random key attacks.
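As a sketch of this flow using Guava's `BloomFilter` (the capacity of one million ids and the 1% false-positive rate are assumed figures):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class UserIdFilter {
    // Sized for ~1M ids with a 1% false-positive rate (assumed figures).
    private final BloomFilter<Long> filter =
            BloomFilter.create(Funnels.longFunnel(), 1_000_000, 0.01);

    // Called once at startup with every valid id loaded from the database.
    public void preload(Iterable<Long> allUserIds) {
        allUserIds.forEach(filter::put);
    }

    public boolean mightExist(long userId) {
        // false => definitely not in the database: skip cache and DB;
        // true  => probably exists (small false-positive chance): proceed.
        return filter.mightContain(userId);
    }
}
```

`mightContain` can return true for a few ids that were never inserted (the false positives), but it never returns false for an id that was, which is what makes the "definitely not present" short-circuit safe.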
III. Solutions for Cache Breakdown
The core idea is to prevent a large number of concurrent requests from accessing the database when a single hotspot key expires.
Step 1: Mutex Lock
This is the classic solution. When the cache expires, instead of allowing all requests to access the database, only one request is permitted to rebuild the cache, while others wait.
- Process (see the sketch after this list):
  - Request A finds that the cached `hotkey` has expired.
  - Request A attempts to acquire a distributed lock (e.g., using the Redis command `SET lock_key value NX PX 30000`, meaning set only if the lock does not exist, with a 30-second expiration).
  - If Request A successfully acquires the lock, it queries the database, writes the result to the cache, and finally releases the lock.
  - Other concurrent requests (B, C, D...) fail to acquire the lock. They wait a short time (e.g., spin or block) and then retry fetching data from the cache. By then, Request A has loaded the new data into the cache, so they retrieve the result directly.
- Note: The lock must have an expiration time to prevent deadlock if the request holding the lock crashes unexpectedly.
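A sketch of this mutex flow, assuming Jedis (the lock key prefix, retry delay, and `loadFromDatabase` placeholder are illustrative; a production version would also store a unique token in the lock and verify it before deleting, so one request cannot release another's lock):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class MutexRebuild {
    // Single connection for brevity; a pool would be used in practice.
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String get(String key) throws InterruptedException {
        String value = jedis.get(key);
        if (value != null) {
            return value; // cache hit, no lock needed
        }
        String lockKey = "lock:" + key;
        // SET lockKey 1 NX PX 30000: only one concurrent caller wins.
        String ok = jedis.set(lockKey, "1", SetParams.setParams().nx().px(30_000));
        if ("OK".equals(ok)) {
            try {
                value = loadFromDatabase(key); // hypothetical loader
                jedis.setex(key, 3600, value); // rebuild the cache
            } finally {
                jedis.del(lockKey);            // release the lock
            }
            return value;
        }
        // Lost the race: wait briefly, then retry against the cache.
        Thread.sleep(50);
        return get(key);
    }

    private String loadFromDatabase(String key) {
        return "..."; // placeholder for a real query
    }
}
```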
Step 2: Logical Expiration
Instead of setting an absolute expiration time (TTL) for cached data, store a logical expiration timestamp within the cache value.
- Data Structure: `value = {data: actual_data, expireTime: 1730000000000}`
- Process (see the sketch below):
  - The request retrieves the value from the cache and checks whether the current time is less than `expireTime`.
  - If not expired: Return `data` directly.
  - If expired: Attempt to acquire a mutex lock. The request that acquires the lock starts an asynchronous thread to update the cache, while the current request returns the old (expired) data. Other requests also return the old data directly.
- Advantage: User requests do not need to wait for cache rebuilding, providing a better experience. However, there may be brief data inconsistency (returning old data).
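To make the flow concrete, here is an in-process sketch of the logical-expiration pattern (with Redis, the entry would be serialized as JSON and the lock would be the `SET ... NX` lock from Step 1; all names here are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

public class LogicalExpireCache {
    // Value wrapper carrying the payload plus a logical expiry timestamp.
    static class Entry {
        final String data;
        final long expireTime; // epoch millis
        Entry(String data, long expireTime) { this.data = data; this.expireTime = expireTime; }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Map<String, AtomicBoolean> locks = new ConcurrentHashMap<>();
    private final ExecutorService rebuilder = Executors.newFixedThreadPool(2);

    public String get(String key) {
        Entry e = cache.get(key);
        if (e == null) return null; // not cached at all
        if (System.currentTimeMillis() < e.expireTime) {
            return e.data; // logically fresh: return immediately
        }
        // Logically expired: exactly one caller triggers an async rebuild;
        // everyone (including that caller) returns the stale value now.
        AtomicBoolean lock = locks.computeIfAbsent(key, k -> new AtomicBoolean());
        if (lock.compareAndSet(false, true)) {
            rebuilder.submit(() -> {
                try {
                    String fresh = loadFromDatabase(key); // hypothetical loader
                    cache.put(key, new Entry(fresh, System.currentTimeMillis() + 3_600_000));
                } finally {
                    lock.set(false);
                }
            });
        }
        return e.data; // stale but immediately available
    }

    private String loadFromDatabase(String key) {
        return "..."; // placeholder for a real query
    }
}
```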
Step 3: Never Expire
Similar to the strategy for handling avalanches, set extremely hot keys to never expire and update them via background scheduled tasks. This completely avoids breakdown issues but also requires handling data consistency.
Summary
| Problem | Core Cause | Key Solutions |
|---|---|---|
| Cache Avalanche | Many keys expire simultaneously | Set random expiration times, cache high availability, service circuit breaking |
| Cache Penetration | Querying non-existent data | Parameter validation, cache empty values, Bloom Filter |
| Cache Breakdown | Single hotspot key expires | Mutex lock, logical expiration |
In practical projects, it's often necessary to combine these strategies based on business scenarios. For example, use "mutex lock + logical expiration" for hotspot data, set random TTLs for all cache keys, and cache empty values for potentially non-existent query results to build a robust, high-performance caching system.