Differences and Underlying Implementation Principles of SQL Inner and Outer Joins

Differences and Underlying Implementation Principles of SQL Inner and Outer Joins

Problem Description
An interviewer might ask: "Please explain in detail the differences between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN in SQL, and describe how databases implement these joins at the underlying level." This question tests your in-depth understanding of SQL join operations, including semantic differences and the implementation mechanisms behind performance.

I. Semantic Differences of Join Types

INNER JOIN
- Description: Returns only rows where the join condition matches in both tables.
- Example:
```
SELECT * FROM tableA A INNER JOIN tableB B ON A.id = B.id;  
```
  The result only contains records where the id is the same in both A and B.
LEFT JOIN (LEFT OUTER JOIN)
- Description: Returns all records from the left table plus matching records from the right table (fills with NULL when there is no match in the right table).
- Key Point: The left table is the primary table, and the right table is the secondary table.
RIGHT JOIN (RIGHT OUTER JOIN)
- Description: Returns all records from the right table plus matching records from the left table (fills with NULL when there is no match in the left table).
- Note: Can be converted to a LEFT JOIN by swapping the table order; it is less commonly used in practice.
FULL JOIN (FULL OUTER JOIN)
- Description: Returns all records from both left and right tables, merging matched parts and filling unmatched parts with NULL.
- Example: If table A has ids 1 and 2, and table B has ids 2 and 3, the result includes (1,Null), (2,MatchedValue), (Null,3).

II. Underlying Implementation Principles
Databases implement join operations through three core algorithms. The optimizer selects one based on factors such as data volume, indexes, etc.:

Nested Loop Join
- Steps:
  - Iterate through each row of the outer table (the driving table).
  - For each row, iterate through the inner table (the driven table) to find matching conditions.
- Applicable Scenarios:
  - One of the tables has a small amount of data, or the inner table has an efficient index (e.g., B+ tree).
- Example:
```
-- If tableB.id has an index, the optimizer might choose Nested Loop:  
FOR each row A in tableA:  
    SEARCH tableB WHERE B.id = A.id  -- Quick location via index  
```
Hash Join
- Steps:
  - Build Phase: Select the smaller table to build a hash table (key is the join field, value is the row data).
  - Probe Phase: Iterate through the larger table, compute the hash value for each row's join field, and look for matches in the hash table.
- Applicable Scenarios:
  - No index, large data volume, and the hash table for the smaller table can fit in memory.
- Key Limitation: Only applicable to equality joins (e.g., A.id = B.id).
Merge Join (Sort-Merge Join)
- Steps:
  - Sort both tables by the join field (can be skipped if already indexed in order).
  - Use two pointers to traverse the sorted tables and merge matching records (similar to merge sort).
- Applicable Scenarios:
  - Tables are already sorted, or the join result needs ordered output.
- Example:
```
Sorted table A: [1, 2, 5], Sorted table B: [2, 3, 5]  
Pointer i=0 (table A), j=0 (table B):  
   A[0]=1 < B[0]=2 → i++  
   A[1]=2 = B[0]=2 → output match, i++, j++  
   A[2]=5 > B[1]=3 → j++  
   ...  
```

III. How the Optimizer Chooses an Algorithm

Driven by Statistics:
- The optimizer estimates costs based on table row counts, data distribution, index selectivity, etc.
- Example: For a small table driving a large table, Nested Loop with an index has the lowest cost; if both tables lack indexes and have large data volumes, Hash Join is preferred.
Case Analysis:
- Scenario: Table A has 1000 rows, Table B has 100,000 rows. A.id has an index, B.id does not.
- Selection Logic:
  - Nested Loop: Using outer table A as the driver, inner table B requires a full table scan 100,000 times (high cost).
  - Hash Join: Build a hash table with table A (memory-friendly), scan table B once (more optimal).
  - The optimizer will likely choose Hash Join.

IV. Extended Interview Points

LEFT JOIN Filter Condition Pitfall:

-- Incorrect: WHERE B.score > 90 causes LEFT JOIN to degenerate into INNER JOIN  
SELECT * FROM A LEFT JOIN B ON A.id = B.id WHERE B.score > 90;  
-- Correct: Move condition to the ON clause  
SELECT * FROM A LEFT JOIN B ON A.id = B.id AND B.score > 90;

Multi-Table Join Order Optimization: The optimizer may dynamically adjust the join order (e.g., via dynamic programming algorithms) to reduce intermediate result sets.

By understanding the semantic differences and underlying algorithms, you can more accurately predict query performance and optimize indexes or rewrite SQL accordingly.