Database Data Type Selection and Storage Optimization

Database Data Type Selection and Storage Optimization

Problem Description
Data type selection is a fundamental aspect of database design, directly impacting storage efficiency, computational performance, and data accuracy. This topic will delve into how to choose appropriate data types based on business scenarios, analyze their effects on storage space, index efficiency, and query performance, while also explaining practical techniques for data type conversion and optimization.

I. Core Principles of Data Type Selection

Precise Matching Principle
- Select the most precise type based on data characteristics (e.g., use INT for integers, not VARCHAR).
- Example: An age field should be TINYINT UNSIGNED (range 0-255), not INT or VARCHAR.
- Advantages: Reduces storage space and improves comparison operation efficiency.
Minimum Adequacy Principle
- Choose the smallest storage type that meets business requirements.
- Integer type selection ladder:
  TINYINT (1 byte) → SMALLINT (2 bytes) → MEDIUMINT (3 bytes) → INT (4 bytes) → BIGINT (8 bytes)
- Floating-point selection: Precision differences between FLOAT (4 bytes) and DOUBLE (8 bytes).
Simplicity Principle
- Prioritize simple types with high computational efficiency (e.g., integer > character > date > large object).
- Avoid using BLOB/TEXT types to store structured data.

II. In-depth Optimization of Numeric Types

Unsigned Optimization for Integer Types

-- Signed INT range: -2147483648 ~ 2147483647
-- Unsigned INT range: 0 ~ 4294967295
CREATE TABLE user (
  age TINYINT UNSIGNED  -- Saves 75% storage space compared to INT
);

Choosing Between Fixed-Point and Floating-Point Numbers
- Financial data must use DECIMAL: Avoid floating-point precision loss.
- Scientific calculations can use FLOAT: Sacrifice precision for storage space.

III. Contextual Selection of String Types

Fixed-Length vs. Variable-Length Comparison

Type Characteristics Applicable Scenarios

CHAR(N) Fixed length, padded with spaces Postal codes, MD5 hashes, etc.

VARCHAR(N) Variable length, extra 1-2 byte overhead Names, addresses, etc.
Impact of Character Set and Collation
- Storage differences between UTF8 and UTF8MB4 (Emojis require UTF8MB4).
- Collation selection: utf8mb4_bin (binary comparison) vs. utf8mb4_general_ci (case-insensitive).

Type	Characteristics	Applicable Scenarios
CHAR(N)	Fixed length, padded with spaces	Postal codes, MD5 hashes, etc.
VARCHAR(N)	Variable length, extra 1-2 byte overhead	Names, addresses, etc.

IV. Fine-Grained Design of Time Types

Time Type Selection Matrix

Type	Storage Space	Precision	Applicable Scenarios
DATETIME	8 bytes	Up to seconds	Requires large time range storage
TIMESTAMP	4 bytes	Up to seconds, with timezone	Requires automatic timestamp updates
DATE	3 bytes	Up to days	Only needs date information

Storage Optimization for Timestamps

-- Prefer TIMESTAMP over DATETIME
CREATE TABLE log (
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP  -- Automatically handles timezone conversion
);

V. Cautious Use of Large Object Types

Usage Limitations of BLOB/TEXT
- Sorting only uses prefix bytes (controlled by max_sort_length parameter).
- Cannot be used as fields in memory temporary tables (triggers disk temporary tables).
Alternative Design Solutions
- Store large files in object storage, with only paths stored in the database.
- Use JSON type instead of TEXT for storing structured/semi-structured data.

VI. Implicit Pitfalls of Data Type Conversion

Index Failure Due to Implicit Type Conversion

-- Assume user_id is VARCHAR but stores numbers
SELECT * FROM users WHERE user_id = 10086;
-- Actual execution: CONVERT(user_id TO INT) = 10086 (full table scan)

Best Practices for Explicit Conversion

-- Standard: Convert filter conditions to match the field type
SELECT * FROM users WHERE user_id = '10086';  -- Correctly uses index

VII. Practical Optimization Cases

Optimizing Gender Field with ENUM Type

-- Original solution: VARCHAR(10) storing "Male"/"Female"
-- Optimized solution: ENUM('Male','Female') or TINYINT(1) (reduces storage by 80%)

Integer Storage Optimization for IP Addresses

-- String storage: VARCHAR(15) requires 15 bytes
-- Integer storage: INT UNSIGNED only 4 bytes
INSERT INTO log (ip) VALUES (INET_ATON('192.168.1.1'));
SELECT INET_NTOA(ip) FROM log;

Summary
Data type selection is the cornerstone of database performance optimization, requiring comprehensive decisions based on business scenarios, data characteristics, and query patterns. Through precise type selection, avoiding implicit conversions, and rationally using advanced type features, dual improvements in storage space and query performance can be achieved.