Data Compression and Storage Optimization in Databases
Description
Database data compression and storage optimization refer to techniques and algorithms that reduce the physical disk space occupied by data while maintaining, or even improving, query performance and I/O efficiency. As data volumes explode, compression has become a core feature of database systems, significantly lowering storage costs, increasing cache utilization, and reducing network transmission overhead. This topic covers compression principles, common technical approaches, implementation strategies, and their impact on performance.
Problem-Solving Process
1. Basic Principles of Compression
- Data Redundancy Identification: The essence of compression is eliminating redundant information in data. Common redundancy types include:
- Spatial Redundancy: The same value repeats in multiple locations (e.g., numerous "Male" values in a gender field).
- Dictionary Redundancy: Short strings repeat frequently (e.g., "ACTIVE"/"INACTIVE" in a status field).
- Numerical Redundancy: Consecutive data has small differences (e.g., timestamp sequence: 1001, 1002, 1003...).
- Encoding Optimization: Using shorter symbols to represent frequently occurring data patterns. Examples:
- Huffman Coding: High-frequency values use short codes, low-frequency values use long codes.
- Run-Length Encoding (RLE): Simplifies consecutive repeated values into "value + repetition count" (e.g., AAAAABBB compresses to A5B3).
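To make the RLE idea concrete, here is a minimal, self-contained sketch (not tied to any particular database engine) that collapses runs of identical symbols into value/count pairs and expands them back:

```python
from itertools import groupby

def rle_encode(s: str) -> list[tuple[str, int]]:
    """Collapse runs of identical characters into (value, count) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (value, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)

raw = "AAAAABBB"
encoded = rle_encode(raw)   # [('A', 5), ('B', 3)], i.e. "A5B3" in the notation above
assert rle_decode(encoded) == raw
```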
2. Classification of Database Compression Techniques
- Row-Level Compression:
- In-Page Dictionary Compression: Builds a dictionary within a single data page, replacing repeated field values with short integer IDs. For example, multiple occurrences of "Human Resources" in a "Department" field on a page can be mapped to the number 1.
- NULL Value Optimization: NULL values occupy no physical storage space and are only marked via a bitmap.
- Data Type Optimization: For example, using SMALLINT instead of INT to store small numbers.
- Column-Level Compression:
- Dictionary Compression: Builds a global dictionary for an entire column. For example, a "Country" column with hundreds of millions of rows but only 200 unique values can be encoded using 1-byte numbers instead of strings.
- Bitmap Encoding: For low-cardinality columns (few unique values), builds one bitmap per distinct value, with one bit per row (e.g., for a two-row Gender column containing Male then Female, the Male bitmap is 10 and the Female bitmap is 01).
- Delta Encoding: For sorted numerical columns, stores the difference between adjacent values (e.g., time series: 1000, 1001, 1003 → 1000, +1, +2). Minimal sketches of dictionary, bitmap, and delta encoding follow this list.
- Hybrid Compression:
- PAX (Partition Attributes Across): Stores data by column within a data page, improving CPU cache hit rates.
- Hybrid Row-Column Storage: Such as Apache Parquet, which splits a file into row groups and stores each column contiguously within a row group.
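A minimal sketch of column-level dictionary encoding, using an in-memory list to stand in for a column (real engines store the dictionary and the integer codes in their own page formats):

```python
def dictionary_encode(column):
    """Map each distinct value to a small integer code; return (dictionary, codes)."""
    dictionary = {}
    codes = []
    for value in column:
        # setdefault assigns the next free code the first time a value is seen
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[c] for c in codes]

countries = ["China", "USA", "China", "Germany", "USA", "China"]
dictionary, codes = dictionary_encode(countries)
# dictionary: {'China': 0, 'USA': 1, 'Germany': 2}; codes: [0, 1, 0, 2, 1, 0]
assert dictionary_decode(dictionary, codes) == countries
```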
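A similar sketch of bitmap encoding for a low-cardinality column; each distinct value gets one bit per row (a real system would pack the bits and often RLE-compress the bitmaps):

```python
def bitmap_encode(column):
    """Build one bitmap (list of 0/1, one entry per row) for each distinct value."""
    bitmaps = {value: [0] * len(column) for value in set(column)}
    for row, value in enumerate(column):
        bitmaps[value][row] = 1
    return bitmaps

gender = ["Male", "Female", "Female", "Male"]
bitmaps = bitmap_encode(gender)
# bitmaps["Male"]   == [1, 0, 0, 1]
# bitmaps["Female"] == [0, 1, 1, 0]
# A predicate such as Gender = 'Male' reduces to reading bitmaps["Male"],
# and combined predicates become bitwise AND/OR over bitmaps.
```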
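Finally, a minimal sketch of delta encoding for a sorted numeric column, storing a base value plus the differences between adjacent values:

```python
def delta_encode(values):
    """Store the first value, then each value's difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

timestamps = [1000, 1001, 1003]
deltas = delta_encode(timestamps)   # [1000, 1, 2]
assert delta_decode(deltas) == timestamps
```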
3. Compression Implementation Strategies
- Application-Layer Compression: The application compresses data (e.g., gzips a JSON field) before writing it to the database, which treats it as a binary large object. The advantage is flexibility, but the database cannot optimize queries on the compressed data (see the sketch after this list).
- Storage Engine Compression:
- Transparent Page Compression: Such as InnoDB's page compression, which compresses entire data pages (using zlib or LZ4), reducing I/O but increasing CPU overhead.
- Columnar Compression: In columnar storage (e.g., ClickHouse, Vertica), higher compression efficiency is achieved due to uniform data types within a column. Fast algorithms like LZ4 and Zstandard are commonly used.
- Index Compression:
- Prefix Compression: For B+ tree index keys, stores only the part that differs from the previous key (e.g., the index keys "apple", "application" can be compressed to "apple", "4ication", since the two keys share the 4-character prefix "appl"); a sketch follows this list.
- Bitmap Index Compression: Uses Run-Length Encoding (RLE) to compress long sequences of 0s and 1s.
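As a sketch of application-layer compression, a JSON document can be gzip-compressed before being written to a BLOB column; the database only ever sees opaque bytes. The table and column names in the comment are hypothetical:

```python
import gzip
import json

def pack_profile(profile: dict) -> bytes:
    """Serialize and gzip a JSON document before storing it in a BLOB column."""
    return gzip.compress(json.dumps(profile).encode("utf-8"))

def unpack_profile(blob: bytes) -> dict:
    """Reverse step after reading the BLOB back from the database."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

profile = {"status": "ACTIVE", "tags": ["vip"] * 100}
blob = pack_profile(profile)
# The blob would be bound to an INSERT parameter, e.g. a hypothetical
# cursor.execute("INSERT INTO user_profile (id, payload) VALUES (?, ?)", (1, blob))
assert unpack_profile(blob) == profile
```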
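A minimal sketch of prefix compression over a sorted run of index keys: each key after the first is stored as a (shared-prefix length, suffix) pair relative to its predecessor, which is the idea behind the "apple" / "4ication" example above:

```python
import os

def prefix_compress(sorted_keys):
    """Store each key as (shared_prefix_len, suffix) relative to the previous key."""
    out, prev = [], ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([prev, key]))
        out.append((shared, key[shared:]))
        prev = key
    return out

def prefix_decompress(entries):
    keys, prev = [], ""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys

keys = ["apple", "application", "apply"]
compressed = prefix_compress(keys)   # [(0, 'apple'), (4, 'ication'), (4, 'y')]
assert prefix_decompress(compressed) == keys
```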
4. Trade-off Between Compression and Performance
- I/O Benefit: Compression reduces the amount of data read from disk, which is especially beneficial for sequential scans. For example, a 10:1 compression ratio cuts the bytes read by a full table scan, and hence its I/O time, by roughly 90%.
- CPU Cost: Compression and decompression require additional computation, so choose an algorithm whose speed keeps pace with the storage device (e.g., fast SSDs call for lightweight algorithms); a back-of-the-envelope sketch follows this list.
- Cache Efficiency: After compression, more hot data can fit into the memory buffer pool, reducing cache misses.
- Write Amplification Problem: Updating compressed data may require recompressing and rewriting the entire compression unit (e.g., a page), so appropriate compression-unit sizes and flushing strategies are needed.
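A back-of-the-envelope sketch of this trade-off, using purely hypothetical throughput figures: scan time is roughly the time to read the compressed bytes plus the time to decompress them, so compression pays off whenever decompression keeps pace with the disk.

```python
def scan_time_seconds(table_gb: float,
                      compression_ratio: float,
                      disk_gbps: float,
                      decompress_gbps: float) -> float:
    """Rough full-scan estimate: read the compressed bytes, then decompress them.

    All figures are hypothetical; real systems overlap I/O with decompression.
    """
    compressed_gb = table_gb / compression_ratio
    return compressed_gb / disk_gbps + table_gb / decompress_gbps

# 100 GB table on a 0.5 GB/s disk (hypothetical numbers throughout):
uncompressed = scan_time_seconds(100, 1.0, 0.5, float("inf"))   # 200 s, nothing to decompress
fast_codec   = scan_time_seconds(100, 3.0, 0.5, 4.0)            # ~92 s: modest ratio, very fast decode
heavy_codec  = scan_time_seconds(100, 10.0, 0.5, 0.3)           # ~353 s: I/O drops 90% (200 s -> 20 s),
                                                                # but slow decompression dominates
print(uncompressed, fast_codec, heavy_codec)
```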
5. Practical Selection Recommendations
- Scenarios favoring heavy compression:
- Data warehouses (read-heavy, write-light) and archival data.
- Tables with many text fields or many enum-like values.
- Columnar storage databases.
- Scenarios favoring light or no compression:
- OLTP tables with frequent updates (to avoid write amplification).
- Data that is already encrypted or near-random (compresses poorly).
- Algorithm Selection (a benchmarking sketch follows this list):
- High Compression Ratio: Zlib (suitable for cold data).
- Balanced: Zstandard (good balance between speed and ratio).
- High Speed, Low Overhead: LZ4 (suitable for real-time queries).
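A rough benchmarking sketch for comparing codecs on your own data. zlib is in the Python standard library; the `zstandard` and `lz4` packages are third-party and assumed to be installed, so the sketch simply skips whichever is missing:

```python
import time
import zlib

# Sample payload: repetitive, enum-like text, the kind of data that compresses well in databases.
data = b'{"status": "ACTIVE", "country": "DE"}\n' * 200_000

def measure(name, compress):
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name:8s} ratio={len(data) / len(out):5.1f}  time={elapsed * 1000:7.1f} ms")

measure("zlib-6", lambda d: zlib.compress(d, 6))                       # higher ratio, slower: cold data
try:
    import zstandard
    measure("zstd-3", zstandard.ZstdCompressor(level=3).compress)      # balanced ratio and speed
except ImportError:
    pass
try:
    import lz4.frame
    measure("lz4", lz4.frame.compress)                                 # fastest, lowest ratio: hot data
except ImportError:
    pass
```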
By applying these techniques judiciously, a good balance can be struck among storage cost, I/O efficiency, and computational resources. In practice, testing and tuning based on data characteristics, access patterns, and hardware configuration are necessary.