Understanding Window Functions in SQL

Window functions can be a useful tool when writing SQL queries. They allow us to include aggregate data across multiple rows without getting those pesky errors that sometimes occur when we try to use an aggregate function in the wrong way.

In this article, I aim to provide a simple overview of window functions and how they can be used to provide a more useful result set when running SQL queries.

What is a Window Function?

In SQL, a window function is a special aggregate-like function that enables partitioning and ordering of data within a result set. Unlike regular aggregate functions, which collapse multiple rows into a single result, window functions retain individual rows in the output, augmenting them with additional computed values.

If you’ve ever encountered an error when using an aggregate function without a GROUP BY clause, there’s a good chance that you could have solved the problem with a window function.

Window Function Syntax

The basic syntax for creating window functions goes like this:

OVER ( window_spec | window_name )

Where:

  • window_spec is the actual specification for the window function
  • window_name is the name of a window that’s defined in a WINDOW clause elsewhere in the query

Which basically means that we have the option of providing the actual specification for the window function within the OVER clause, or providing the name of a named window that contains the specification.

Here’s the basic syntax for providing the specification within the actual OVER clause:

OVER (   
       [ <PARTITION BY clause> ]  
       [ <ORDER BY clause> ]   
       [ <FRAME clause> ]  
      )  

So it consists of an OVER clause, followed by parentheses, which can contain other optional clauses.

Here’s a basic explanation of the above sub-clauses of the OVER clause:

  • <PARTITION BY clause> – Divides the result set into partitions.
  • <ORDER BY clause> – Defines the logical order of the rows within each partition of the result set.
  • <FRAME clause> – Creates frames within the partitions. A frame is a subset of the current partition, and the frame clause specifies how this subset is defined. The frame clause is typically defined in the form of a ROWS, RANGE, or GROUP clause (or similar, depending on the DBMS).

The exact syntax of each optional clause may differ between DBMS, so it’s best to check the documentation for your DBMS.

Window functions can usually be placed inside the SELECT list, or in the ORDER BY clause.

Example of a Window Function

Suppose we have a table with data that looks like this:

SELECT 
    Name,
    District, 
    Population
FROM City 
WHERE CountryCode = 'AUS';

Result:

+---------------+-----------------+------------+
| Name          | District        | Population |
+---------------+-----------------+------------+
| Sydney        | New South Wales |    3276207 |
| Melbourne     | Victoria        |    2865329 |
| Brisbane      | Queensland      |    1291117 |
| Perth         | West Australia  |    1096829 |
| Adelaide      | South Australia |     978100 |
| Canberra      | Capital Region  |     322723 |
| Gold Coast    | Queensland      |     311932 |
| Newcastle     | New South Wales |     270324 |
| Central Coast | New South Wales |     227657 |
| Wollongong    | New South Wales |     219761 |
| Hobart        | Tasmania        |     126118 |
| Geelong       | Victoria        |     125382 |
| Townsville    | Queensland      |     109914 |
| Cairns        | Queensland      |      92273 |
+---------------+-----------------+------------+
14 rows in set (0.00 sec)

This table contains a list of cities around the world. In this example I narrowed the results to just Australian cities (i.e. those with a CountryCode of AUS).

A common thing to do here might be to use the GROUP BY clause to group our results by district, along with the SQL SUM() function so that we can get the population of each district:

SELECT 
    District, 
    SUM(Population) AS "District Population"
FROM City WHERE CountryCode = 'AUS'
GROUP BY District;

Result:

+-----------------+---------------------+
| District        | District Population |
+-----------------+---------------------+
| New South Wales |             3993949 |
| Victoria        |             2990711 |
| Queensland      |             1805236 |
| West Australia  |             1096829 |
| South Australia |              978100 |
| Capital Region  |              322723 |
| Tasmania        |              126118 |
+-----------------+---------------------+
7 rows in set (0.00 sec)

Here, the values in the District Population column are calculated from the sum of all cities within the respective district. I used the SUM() function to achieve this.

Note that I haven’t used a window function yet. That’s next.

One side-effect of using the GROUP BY clause is that it collapses multiple rows into a single row. This may or may not be our desired effect.

But sometimes we might also want to see both the cities populations and the district populations. This is where a window function can come in handy – it can provide us with both the totals for each city, as well as the totals for each district.

Let’s modify the above query to use a window function:

SELECT 
    District, 
    Name, 
    Population AS "City Population", 
    SUM(Population) OVER(PARTITION BY District) AS "District Population" 
FROM City 
WHERE CountryCode = 'AUS' 
ORDER BY District, Name, "City Population";

Result:

+-----------------+---------------+-----------------+---------------------+
| District        | Name          | City Population | District Population |
+-----------------+---------------+-----------------+---------------------+
| Capital Region  | Canberra      |          322723 |              322723 |
| New South Wales | Central Coast |          227657 |             3993949 |
| New South Wales | Newcastle     |          270324 |             3993949 |
| New South Wales | Sydney        |         3276207 |             3993949 |
| New South Wales | Wollongong    |          219761 |             3993949 |
| Queensland      | Brisbane      |         1291117 |             1805236 |
| Queensland      | Cairns        |           92273 |             1805236 |
| Queensland      | Gold Coast    |          311932 |             1805236 |
| Queensland      | Townsville    |          109914 |             1805236 |
| South Australia | Adelaide      |          978100 |              978100 |
| Tasmania        | Hobart        |          126118 |              126118 |
| Victoria        | Geelong       |          125382 |             2990711 |
| Victoria        | Melbourne     |         2865329 |             2990711 |
| West Australia  | Perth         |         1096829 |             1096829 |
+-----------------+---------------+-----------------+---------------------+
14 rows in set (0.01 sec)

So this time the rows aren’t collapsed like they were with the GROUP BY clause. We get the same number of rows that we got in the original query (without the GROUP BY clause). We still get to see the population of each city. But this time we also have the District Population that shows us the total population of each city’s district.

The PARTITION BY clause is one of several optional arguments we can provide with the OVER clause. We can also provide an empty OVER clause (i.e. an OVER clause without any arguments).

Here’s what happens when I provide an empty OVER clause:

SELECT 
    District, 
    Name, 
    Population AS "City Population", 
    SUM(Population) OVER(PARTITION BY District) AS "District Population",
    SUM(Population) OVER() AS "Total Population"
FROM City 
WHERE CountryCode = 'AUS' 
ORDER BY District, Name, "City Population";

Result:

+-----------------+---------------+-----------------+---------------------+------------------+
| District        | Name          | City Population | District Population | Total Population |
+-----------------+---------------+-----------------+---------------------+------------------+
| Capital Region  | Canberra      |          322723 |              322723 |         11313666 |
| New South Wales | Central Coast |          227657 |             3993949 |         11313666 |
| New South Wales | Newcastle     |          270324 |             3993949 |         11313666 |
| New South Wales | Sydney        |         3276207 |             3993949 |         11313666 |
| New South Wales | Wollongong    |          219761 |             3993949 |         11313666 |
| Queensland      | Brisbane      |         1291117 |             1805236 |         11313666 |
| Queensland      | Cairns        |           92273 |             1805236 |         11313666 |
| Queensland      | Gold Coast    |          311932 |             1805236 |         11313666 |
| Queensland      | Townsville    |          109914 |             1805236 |         11313666 |
| South Australia | Adelaide      |          978100 |              978100 |         11313666 |
| Tasmania        | Hobart        |          126118 |              126118 |         11313666 |
| Victoria        | Geelong       |          125382 |             2990711 |         11313666 |
| Victoria        | Melbourne     |         2865329 |             2990711 |         11313666 |
| West Australia  | Perth         |         1096829 |             1096829 |         11313666 |
+-----------------+---------------+-----------------+---------------------+------------------+
14 rows in set (0.04 sec)

Here, I added an extra column with an empty OVER clause. It still works, but the empty OVER clause treats the entire set of query rows as a single partition, and therefore calculates the grand total of all cities across all districts.  

By contrast, adding a PARTITION BY clause partitions the result of the OVER clause, so that we can see aggregate data at a more granular level.

So in our example, the empty OVER clause allowed us to get the aggregate SUM() of all rows without getting an error. Here’s what happens when we remove the OVER clause:

SELECT 
    District, 
    Name, 
    Population AS "City Population", 
    SUM(Population) OVER(PARTITION BY District) AS "District Population",
    SUM(Population) AS "Total Population"
FROM City 
WHERE CountryCode = 'AUS' 
ORDER BY District, Name, "City Population";

Result:

ERROR 1140 (42000): In aggregated query without GROUP BY, expression #1 of SELECT list contains nonaggregated column 'world.City.District'; this is incompatible with sql_mode=only_full_group_by

I ran this in MySQL and so this is the error that MySQL returns when I try to run this code (at least when my sql_mode is set to only_full_group_by).

If I remove only_full_group_by from my SQL mode, I get the following result:

SELECT 
    District, 
    Name, 
    Population AS "City Population", 
    SUM(Population) OVER(PARTITION BY District) AS "District Population",
    SUM(Population) AS "Total Population"
FROM City 
WHERE CountryCode = 'AUS' 
ORDER BY District, Name, "City Population";

Result:

+-----------------+--------+-----------------+---------------------+------------------+
| District        | Name   | City Population | District Population | Total Population |
+-----------------+--------+-----------------+---------------------+------------------+
| New South Wales | Sydney |         3276207 |             3276207 |         11313666 |
+-----------------+--------+-----------------+---------------------+------------------+
1 row in set (0.00 sec)

Which is not exactly the result I wanted either.

When we do this (without the only_full_group_by SQL mode), MySQL is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic. Using the ORDER BY clause doesn’t influence the results either. Result set sorting occurs after values have been chosen, and the ORDER BY clause does not affect which value within each group the server chooses.

Of course, we could always go ahead and use the GROUP BY clause, but then that wouldn’t do what we’re trying to do, and it wouldn’t allow us to get the level of granularity that we can get when using the window function. In short, it would be a completely different query. So the window function can be handy in situations like this.

Non-Aggregate Window Functions

It’s possible to use certain non-aggregate functions as window functions. For example, ROW_NUMBER() is a function available in many RDBMSs that returns the row number for each row within its partition.

Here’s an example of modifying our query to include the ROW_NUMBER() function:

SELECT 
    District, 
    ROW_NUMBER() OVER(PARTITION BY District) AS "Row Number",
    Name, 
    Population AS "City Population"
FROM City 
WHERE CountryCode = 'AUS';

Result:

+-----------------+------------+---------------+-----------------+
| District        | Row Number | Name          | City Population |
+-----------------+------------+---------------+-----------------+
| Capital Region  |          1 | Canberra      |          322723 |
| New South Wales |          1 | Sydney        |         3276207 |
| New South Wales |          2 | Newcastle     |          270324 |
| New South Wales |          3 | Central Coast |          227657 |
| New South Wales |          4 | Wollongong    |          219761 |
| Queensland      |          1 | Brisbane      |         1291117 |
| Queensland      |          2 | Gold Coast    |          311932 |
| Queensland      |          3 | Townsville    |          109914 |
| Queensland      |          4 | Cairns        |           92273 |
| South Australia |          1 | Adelaide      |          978100 |
| Tasmania        |          1 | Hobart        |          126118 |
| Victoria        |          1 | Melbourne     |         2865329 |
| Victoria        |          2 | Geelong       |          125382 |
| West Australia  |          1 | Perth         |         1096829 |
+-----------------+------------+---------------+-----------------+
14 rows in set (0.01 sec)

Here we can see that the Row Number column tells us the row number of each row within each partition. In this case, the query is partitioned by district, and so we get the row number of each city within each district.

List of Window Functions

The list of functions that can be used as a window function will depend on your DBMS. But as an example, here’s a list of window functions supported by MySQL at the time of writing.

Aggregate Functions

The following aggregate functions can be used as window or non-window functions in MySQL, depending on whether or not the OVER clause is present:

AVG()
BIT_AND()
BIT_OR()
BIT_XOR()
COUNT()
JSON_ARRAYAGG()
JSON_OBJECTAGG()
MAX()
MIN()
STDDEV_POP()
STDDEV()
STD()
STDDEV_SAMP()
SUM()
VAR_POP()
VARIANCE()
VAR_SAMP()

Non-Aggregate Functions

Here are the non-aggregate window functions available in MySQL at the time of writing:

CUME_DIST()
DENSE_RANK()
FIRST_VALUE()
LAG()
LAST_VALUE()
LEAD()
NTH_VALUE()
NTILE()
PERCENT_RANK()
RANK()
ROW_NUMBER()

These non-aggregate functions are used only as window functions. When using them, the OVER clause is mandatory.

As with anything, check your DBMS’s documentation for a full list of supported window functions.

Named Windows

As mentioned, we can also define our window with the WINDOW clause. When we do this, we create a named window. A named window represents a group of rows in a table that can be used by a window function.

Here’s an example of modifying a previous query to use a named window:

SELECT 
    District, 
    Name, 
    Population AS "City Population", 
    SUM(Population) OVER(win) AS "District Population" 
FROM City 
WHERE CountryCode = 'AUS' 
WINDOW win AS (PARTITION BY District)
ORDER BY District, Name, "City Population";

Result:

+-----------------+---------------+-----------------+---------------------+
| District        | Name          | City Population | District Population |
+-----------------+---------------+-----------------+---------------------+
| Capital Region  | Canberra      |          322723 |              322723 |
| New South Wales | Central Coast |          227657 |             3993949 |
| New South Wales | Newcastle     |          270324 |             3993949 |
| New South Wales | Sydney        |         3276207 |             3993949 |
| New South Wales | Wollongong    |          219761 |             3993949 |
| Queensland      | Brisbane      |         1291117 |             1805236 |
| Queensland      | Cairns        |           92273 |             1805236 |
| Queensland      | Gold Coast    |          311932 |             1805236 |
| Queensland      | Townsville    |          109914 |             1805236 |
| South Australia | Adelaide      |          978100 |              978100 |
| Tasmania        | Hobart        |          126118 |              126118 |
| Victoria        | Geelong       |          125382 |             2990711 |
| Victoria        | Melbourne     |         2865329 |             2990711 |
| West Australia  | Perth         |         1096829 |             1096829 |
+-----------------+---------------+-----------------+---------------------+
14 rows in set (0.00 sec)

Here, my OVER clause contains a reference to a named window called win. That window is defined later in the query with the WINDOW clause. In this case, I provided PARTITION BY District as the window’s definition.

Using named windows can be useful if we have multiple columns that use the same window function definition. This saves us from having to create duplicate code when defining each window function.

See Understanding the WINDOW Clause in MySQL for examples of creating named windows in MySQL. The syntax should be similar, if not the same, across most major RDBMSs.