A Detailed Explanation of SQL Window Functions

A Detailed Explanation of SQL Window Functions

Problem Description
SQL window functions are frequently asked about in interviews, particularly testing your ability to perform complex data analysis and grouped calculations. Questions may require you to compare regular aggregate functions with window functions, or to write SQL using window functions for specific scenarios (such as calculating cumulative percentages, rankings, moving averages, etc.). Mastering window functions can significantly improve data query efficiency and expressive power.

Problem-Solving Process

1. Core Concepts of Window Functions

  • Definition: A Window Function performs calculations across a set of table rows that are somehow related to the current row. Unlike regular aggregate functions, it does not cause rows to become grouped into a single output row — the rows retain their separate identities.
  • Key Differences:
    • Regular aggregate functions (like SUM(), AVG()) with GROUP BY collapse rows into groups;
    • Window functions, defined by an OVER() clause, define a window of rows and attach the calculation result directly to each row.
  • Basic Syntax:
    function_name(expression) OVER (
      [PARTITION BY column]   -- Groups rows (similar to GROUP BY but without collapsing)
      [ORDER BY column]       -- Orders rows within the partition, affecting calculation order
      [window_frame]          -- Defines the set of rows used for the calculation (e.g., ROWS BETWEEN...)
    )
    

2. Categories of Window Functions and Common Functions

  • Ranking Functions:

    • ROW_NUMBER(): Assigns a unique, sequential rank to rows within a partition (different ranks for ties, based on order).
    • RANK(): Assigns the same rank to ties, with subsequent ranks skipping numbers (e.g., 1, 1, 3).
    • DENSE_RANK(): Assigns the same rank to ties, with ranks being consecutive and not skipping numbers (e.g., 1, 1, 2).
      Example:
    SELECT name, score,
      ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
      RANK() OVER (ORDER BY score DESC) AS rank,
      DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
    FROM students;
    
  • Distribution Functions:

    • PERCENT_RANK(): Calculates the relative rank of a row within a partition (formula: (rank - 1) / (total_rows - 1)).
    • CUME_DIST(): Calculates the cumulative distribution (number of rows with values <= current value / total rows).
  • Value Functions (Accessing Other Rows):

    • LAG(column, n): Accesses data from a previous row in the result set (n rows before the current row).
    • LEAD(column, n): Accesses data from a subsequent row in the result set (n rows after the current row).
      Example (Calculating day-over-day sales growth rate):
    SELECT sale_date, amount,
      LAG(amount, 1) OVER (ORDER BY sale_date) AS prev_amount,
      (amount - LAG(amount, 1) OVER (ORDER BY sale_date)) / LAG(amount, 1) OVER (ORDER BY sale_date) AS growth_rate
    FROM sales;
    
  • Aggregate Window Functions:

    • SUM(), AVG(), COUNT(), etc., combined with OVER() to perform calculations like running totals or moving averages.
      Example (Calculating cumulative sales):
    SELECT sale_date, amount,
      SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum
    FROM sales;
    

3. Detailed Usage of Window Frames

  • Purpose: Defines the precise set of rows used for the calculation when ORDER BY is present.
  • Common Frame Clauses:
    • ROWS BETWEEN ...: Defines frame by physical row offsets.
      • UNBOUNDED PRECEDING: Starts from the first row of the partition.
      • n PRECEDING: n rows before the current row.
      • CURRENT ROW: The current row.
      • n FOLLOWING: n rows after the current row.
    • RANGE BETWEEN ...: Defines frame by logical value range (e.g., time intervals).
  • Example (Calculating a 3-day moving average):
    SELECT date, revenue,
      AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg
    FROM daily_sales;
    

4. Comprehensive Application Scenario
Scenario: Query the top 3 highest-paid employees in each department.

SELECT * FROM (
  SELECT name, department, salary,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
  FROM employees
) ranked
WHERE dept_rank <= 3;
  • Step-by-step Analysis:
    1. Use PARTITION BY department to create groups by department.
    2. Within each group, order by salary in descending order and use DENSE_RANK() to calculate the rank.
    3. The outer query filters for rows with a rank less than or equal to 3.

5. Common Interview Pitfalls and Optimization Tips

  • Pitfalls:
    • Confusing ROW_NUMBER() with RANK() leading to incorrect ranking logic.
    • Overlooking the impact of ORDER BY on the window frame (without ORDER BY, the frame defaults to the entire partition).
  • Optimizations:
    • Avoid using complex subqueries in the OVER() clause; prioritize using PARTITION BY for grouping.
    • For large datasets, optimize columns used in PARTITION BY and ORDER BY with indexes.

By following these steps, you will not only be able to write correct window function queries but also explain their underlying logic and applicable scenarios in interviews.