A Detailed Explanation of SQL Window Functions
Problem Description
SQL window functions are frequently asked about in interviews, particularly testing your ability to perform complex data analysis and grouped calculations. Questions may require you to compare regular aggregate functions with window functions, or to write SQL using window functions for specific scenarios (such as calculating cumulative percentages, rankings, moving averages, etc.). Mastering window functions can significantly improve data query efficiency and expressive power.
Problem-Solving Process
1. Core Concepts of Window Functions
- Definition: A Window Function performs calculations across a set of table rows that are somehow related to the current row. Unlike regular aggregate functions, it does not cause rows to become grouped into a single output row — the rows retain their separate identities.
- Key Differences:
- Regular aggregate functions (like
SUM(),AVG()) withGROUP BYcollapse rows into groups; - Window functions, defined by an
OVER()clause, define a window of rows and attach the calculation result directly to each row.
- Regular aggregate functions (like
- Basic Syntax:
function_name(expression) OVER ( [PARTITION BY column] -- Groups rows (similar to GROUP BY but without collapsing) [ORDER BY column] -- Orders rows within the partition, affecting calculation order [window_frame] -- Defines the set of rows used for the calculation (e.g., ROWS BETWEEN...) )
2. Categories of Window Functions and Common Functions
-
Ranking Functions:
ROW_NUMBER(): Assigns a unique, sequential rank to rows within a partition (different ranks for ties, based on order).RANK(): Assigns the same rank to ties, with subsequent ranks skipping numbers (e.g., 1, 1, 3).DENSE_RANK(): Assigns the same rank to ties, with ranks being consecutive and not skipping numbers (e.g., 1, 1, 2).
Example:
SELECT name, score, ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num, RANK() OVER (ORDER BY score DESC) AS rank, DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank FROM students; -
Distribution Functions:
PERCENT_RANK(): Calculates the relative rank of a row within a partition (formula: (rank - 1) / (total_rows - 1)).CUME_DIST(): Calculates the cumulative distribution (number of rows with values <= current value / total rows).
-
Value Functions (Accessing Other Rows):
LAG(column, n): Accesses data from a previous row in the result set (n rows before the current row).LEAD(column, n): Accesses data from a subsequent row in the result set (n rows after the current row).
Example (Calculating day-over-day sales growth rate):
SELECT sale_date, amount, LAG(amount, 1) OVER (ORDER BY sale_date) AS prev_amount, (amount - LAG(amount, 1) OVER (ORDER BY sale_date)) / LAG(amount, 1) OVER (ORDER BY sale_date) AS growth_rate FROM sales; -
Aggregate Window Functions:
SUM(),AVG(),COUNT(), etc., combined withOVER()to perform calculations like running totals or moving averages.
Example (Calculating cumulative sales):
SELECT sale_date, amount, SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum FROM sales;
3. Detailed Usage of Window Frames
- Purpose: Defines the precise set of rows used for the calculation when
ORDER BYis present. - Common Frame Clauses:
ROWS BETWEEN ...: Defines frame by physical row offsets.UNBOUNDED PRECEDING: Starts from the first row of the partition.n PRECEDING: n rows before the current row.CURRENT ROW: The current row.n FOLLOWING: n rows after the current row.
RANGE BETWEEN ...: Defines frame by logical value range (e.g., time intervals).
- Example (Calculating a 3-day moving average):
SELECT date, revenue, AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM daily_sales;
4. Comprehensive Application Scenario
Scenario: Query the top 3 highest-paid employees in each department.
SELECT * FROM (
SELECT name, department, salary,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees
) ranked
WHERE dept_rank <= 3;
- Step-by-step Analysis:
- Use
PARTITION BY departmentto create groups by department. - Within each group, order by salary in descending order and use
DENSE_RANK()to calculate the rank. - The outer query filters for rows with a rank less than or equal to 3.
- Use
5. Common Interview Pitfalls and Optimization Tips
- Pitfalls:
- Confusing
ROW_NUMBER()withRANK()leading to incorrect ranking logic. - Overlooking the impact of
ORDER BYon the window frame (withoutORDER BY, the frame defaults to the entire partition).
- Confusing
- Optimizations:
- Avoid using complex subqueries in the
OVER()clause; prioritize usingPARTITION BYfor grouping. - For large datasets, optimize columns used in
PARTITION BYandORDER BYwith indexes.
- Avoid using complex subqueries in the
By following these steps, you will not only be able to write correct window function queries but also explain their underlying logic and applicable scenarios in interviews.