Window functions can be a useful tool when writing SQL queries. They allow us to include aggregate data across multiple rows without getting those pesky errors that sometimes occur when we try to use an aggregate function in the wrong way.
In this article, I aim to provide a simple overview of window functions and how they can be used to provide a more useful result set when running SQL queries.
What is a Window Function?
In SQL, a window function is a special aggregate-like function that enables partitioning and ordering of data within a result set. Unlike regular aggregate functions, which collapse multiple rows into a single result, window functions retain individual rows in the output, augmenting them with additional computed values.
If you’ve ever encountered an error when using an aggregate function without a GROUP BY
clause, there’s a good chance that you could have solved the problem with a window function.
Window Function Syntax
The basic syntax for creating window functions goes like this:
OVER ( window_spec | window_name )
Where:
window_spec
is the actual specification for the window functionwindow_name
is the name of a window that’s defined in aWINDOW
clause elsewhere in the query
Which basically means that we have the option of providing the actual specification for the window function within the OVER
clause, or providing the name of a named window that contains the specification.
Here’s the basic syntax for providing the specification within the actual OVER
clause:
OVER (
[ <PARTITION BY clause> ]
[ <ORDER BY clause> ]
[ <FRAME clause> ]
)
So it consists of an OVER
clause, followed by parentheses, which can contain other optional clauses.
Here’s a basic explanation of the above sub-clauses of the OVER
clause:
<PARTITION BY clause>
– Divides the result set into partitions.<ORDER BY clause>
– Defines the logical order of the rows within each partition of the result set.<FRAME clause>
– Creates frames within the partitions. A frame is a subset of the current partition, and the frame clause specifies how this subset is defined. The frame clause is typically defined in the form of aROWS
,RANGE
, orGROUP
clause (or similar, depending on the DBMS).
The exact syntax of each optional clause may differ between DBMS, so it’s best to check the documentation for your DBMS.
Window functions can usually be placed inside the SELECT
list, or in the ORDER BY
clause.
Example of a Window Function
Suppose we have a table with data that looks like this:
SELECT
Name,
District,
Population
FROM City
WHERE CountryCode = 'AUS';
Result:
+---------------+-----------------+------------+ | Name | District | Population | +---------------+-----------------+------------+ | Sydney | New South Wales | 3276207 | | Melbourne | Victoria | 2865329 | | Brisbane | Queensland | 1291117 | | Perth | West Australia | 1096829 | | Adelaide | South Australia | 978100 | | Canberra | Capital Region | 322723 | | Gold Coast | Queensland | 311932 | | Newcastle | New South Wales | 270324 | | Central Coast | New South Wales | 227657 | | Wollongong | New South Wales | 219761 | | Hobart | Tasmania | 126118 | | Geelong | Victoria | 125382 | | Townsville | Queensland | 109914 | | Cairns | Queensland | 92273 | +---------------+-----------------+------------+ 14 rows in set (0.00 sec)
This table contains a list of cities around the world. In this example I narrowed the results to just Australian cities (i.e. those with a CountryCode
of AUS
).
A common thing to do here might be to use the GROUP BY
clause to group our results by district, along with the SQL SUM()
function so that we can get the population of each district:
SELECT
District,
SUM(Population) AS "District Population"
FROM City WHERE CountryCode = 'AUS'
GROUP BY District;
Result:
+-----------------+---------------------+ | District | District Population | +-----------------+---------------------+ | New South Wales | 3993949 | | Victoria | 2990711 | | Queensland | 1805236 | | West Australia | 1096829 | | South Australia | 978100 | | Capital Region | 322723 | | Tasmania | 126118 | +-----------------+---------------------+ 7 rows in set (0.00 sec)
Here, the values in the District Population
column are calculated from the sum of all cities within the respective district. I used the SUM()
function to achieve this.
Note that I haven’t used a window function yet. That’s next.
One side-effect of using the GROUP BY
clause is that it collapses multiple rows into a single row. This may or may not be our desired effect.
But sometimes we might also want to see both the cities populations and the district populations. This is where a window function can come in handy – it can provide us with both the totals for each city, as well as the totals for each district.
Let’s modify the above query to use a window function:
SELECT
District,
Name,
Population AS "City Population",
SUM(Population) OVER(PARTITION BY District) AS "District Population"
FROM City
WHERE CountryCode = 'AUS'
ORDER BY District, Name, "City Population";
Result:
+-----------------+---------------+-----------------+---------------------+ | District | Name | City Population | District Population | +-----------------+---------------+-----------------+---------------------+ | Capital Region | Canberra | 322723 | 322723 | | New South Wales | Central Coast | 227657 | 3993949 | | New South Wales | Newcastle | 270324 | 3993949 | | New South Wales | Sydney | 3276207 | 3993949 | | New South Wales | Wollongong | 219761 | 3993949 | | Queensland | Brisbane | 1291117 | 1805236 | | Queensland | Cairns | 92273 | 1805236 | | Queensland | Gold Coast | 311932 | 1805236 | | Queensland | Townsville | 109914 | 1805236 | | South Australia | Adelaide | 978100 | 978100 | | Tasmania | Hobart | 126118 | 126118 | | Victoria | Geelong | 125382 | 2990711 | | Victoria | Melbourne | 2865329 | 2990711 | | West Australia | Perth | 1096829 | 1096829 | +-----------------+---------------+-----------------+---------------------+ 14 rows in set (0.01 sec)
So this time the rows aren’t collapsed like they were with the GROUP BY
clause. We get the same number of rows that we got in the original query (without the GROUP BY
clause). We still get to see the population of each city. But this time we also have the District Population
that shows us the total population of each city’s district.
The PARTITION BY
clause is one of several optional arguments we can provide with the OVER
clause. We can also provide an empty OVER
clause (i.e. an OVER
clause without any arguments).
Here’s what happens when I provide an empty OVER
clause:
SELECT
District,
Name,
Population AS "City Population",
SUM(Population) OVER(PARTITION BY District) AS "District Population",
SUM(Population) OVER() AS "Total Population"
FROM City
WHERE CountryCode = 'AUS'
ORDER BY District, Name, "City Population";
Result:
+-----------------+---------------+-----------------+---------------------+------------------+ | District | Name | City Population | District Population | Total Population | +-----------------+---------------+-----------------+---------------------+------------------+ | Capital Region | Canberra | 322723 | 322723 | 11313666 | | New South Wales | Central Coast | 227657 | 3993949 | 11313666 | | New South Wales | Newcastle | 270324 | 3993949 | 11313666 | | New South Wales | Sydney | 3276207 | 3993949 | 11313666 | | New South Wales | Wollongong | 219761 | 3993949 | 11313666 | | Queensland | Brisbane | 1291117 | 1805236 | 11313666 | | Queensland | Cairns | 92273 | 1805236 | 11313666 | | Queensland | Gold Coast | 311932 | 1805236 | 11313666 | | Queensland | Townsville | 109914 | 1805236 | 11313666 | | South Australia | Adelaide | 978100 | 978100 | 11313666 | | Tasmania | Hobart | 126118 | 126118 | 11313666 | | Victoria | Geelong | 125382 | 2990711 | 11313666 | | Victoria | Melbourne | 2865329 | 2990711 | 11313666 | | West Australia | Perth | 1096829 | 1096829 | 11313666 | +-----------------+---------------+-----------------+---------------------+------------------+ 14 rows in set (0.04 sec)
Here, I added an extra column with an empty OVER
clause. It still works, but the empty OVER
clause treats the entire set of query rows as a single partition, and therefore calculates the grand total of all cities across all districts.
By contrast, adding a PARTITION BY
clause partitions the result of the OVER
clause, so that we can see aggregate data at a more granular level.
So in our example, the empty OVER
clause allowed us to get the aggregate SUM()
of all rows without getting an error. Here’s what happens when we remove the OVER
clause:
SELECT
District,
Name,
Population AS "City Population",
SUM(Population) OVER(PARTITION BY District) AS "District Population",
SUM(Population) AS "Total Population"
FROM City
WHERE CountryCode = 'AUS'
ORDER BY District, Name, "City Population";
Result:
ERROR 1140 (42000): In aggregated query without GROUP BY, expression #1 of SELECT list contains nonaggregated column 'world.City.District'; this is incompatible with sql_mode=only_full_group_by
I ran this in MySQL and so this is the error that MySQL returns when I try to run this code (at least when my sql_mode
is set to only_full_group_by
).
If I remove only_full_group_by
from my SQL mode, I get the following result:
SELECT
District,
Name,
Population AS "City Population",
SUM(Population) OVER(PARTITION BY District) AS "District Population",
SUM(Population) AS "Total Population"
FROM City
WHERE CountryCode = 'AUS'
ORDER BY District, Name, "City Population";
Result:
+-----------------+--------+-----------------+---------------------+------------------+ | District | Name | City Population | District Population | Total Population | +-----------------+--------+-----------------+---------------------+------------------+ | New South Wales | Sydney | 3276207 | 3276207 | 11313666 | +-----------------+--------+-----------------+---------------------+------------------+ 1 row in set (0.00 sec)
Which is not exactly the result I wanted either.
When we do this (without the only_full_group_by
SQL mode), MySQL is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic. Using the ORDER BY
clause doesn’t influence the results either. Result set sorting occurs after values have been chosen, and the ORDER BY
clause does not affect which value within each group the server chooses.
Of course, we could always go ahead and use the GROUP BY
clause, but then that wouldn’t do what we’re trying to do, and it wouldn’t allow us to get the level of granularity that we can get when using the window function. In short, it would be a completely different query. So the window function can be handy in situations like this.
Non-Aggregate Window Functions
It’s possible to use certain non-aggregate functions as window functions. For example, ROW_NUMBER()
is a function available in many RDBMSs that returns the row number for each row within its partition.
Here’s an example of modifying our query to include the ROW_NUMBER()
function:
SELECT
District,
ROW_NUMBER() OVER(PARTITION BY District) AS "Row Number",
Name,
Population AS "City Population"
FROM City
WHERE CountryCode = 'AUS';
Result:
+-----------------+------------+---------------+-----------------+ | District | Row Number | Name | City Population | +-----------------+------------+---------------+-----------------+ | Capital Region | 1 | Canberra | 322723 | | New South Wales | 1 | Sydney | 3276207 | | New South Wales | 2 | Newcastle | 270324 | | New South Wales | 3 | Central Coast | 227657 | | New South Wales | 4 | Wollongong | 219761 | | Queensland | 1 | Brisbane | 1291117 | | Queensland | 2 | Gold Coast | 311932 | | Queensland | 3 | Townsville | 109914 | | Queensland | 4 | Cairns | 92273 | | South Australia | 1 | Adelaide | 978100 | | Tasmania | 1 | Hobart | 126118 | | Victoria | 1 | Melbourne | 2865329 | | Victoria | 2 | Geelong | 125382 | | West Australia | 1 | Perth | 1096829 | +-----------------+------------+---------------+-----------------+ 14 rows in set (0.01 sec)
Here we can see that the Row Number
column tells us the row number of each row within each partition. In this case, the query is partitioned by district, and so we get the row number of each city within each district.
List of Window Functions
The list of functions that can be used as a window function will depend on your DBMS. But as an example, here’s a list of window functions supported by MySQL at the time of writing.
Aggregate Functions
The following aggregate functions can be used as window or non-window functions in MySQL, depending on whether or not the OVER
clause is present:
AVG()
BIT_AND()
BIT_OR()
BIT_XOR()
COUNT()
JSON_ARRAYAGG()
JSON_OBJECTAGG()
MAX()
MIN()
STDDEV_POP()
STDDEV()
STD()
STDDEV_SAMP()
SUM()
VAR_POP()
VARIANCE()
VAR_SAMP()
Non-Aggregate Functions
Here are the non-aggregate window functions available in MySQL at the time of writing:
CUME_DIST()
DENSE_RANK()
FIRST_VALUE()
LAG()
LAST_VALUE()
LEAD()
NTH_VALUE()
NTILE()
PERCENT_RANK()
RANK()
ROW_NUMBER()
These non-aggregate functions are used only as window functions. When using them, the OVER
clause is mandatory.
As with anything, check your DBMS’s documentation for a full list of supported window functions.
Named Windows
As mentioned, we can also define our window with the WINDOW
clause. When we do this, we create a named window. A named window represents a group of rows in a table that can be used by a window function.
Here’s an example of modifying a previous query to use a named window:
SELECT
District,
Name,
Population AS "City Population",
SUM(Population) OVER(win) AS "District Population"
FROM City
WHERE CountryCode = 'AUS'
WINDOW win AS (PARTITION BY District)
ORDER BY District, Name, "City Population";
Result:
+-----------------+---------------+-----------------+---------------------+ | District | Name | City Population | District Population | +-----------------+---------------+-----------------+---------------------+ | Capital Region | Canberra | 322723 | 322723 | | New South Wales | Central Coast | 227657 | 3993949 | | New South Wales | Newcastle | 270324 | 3993949 | | New South Wales | Sydney | 3276207 | 3993949 | | New South Wales | Wollongong | 219761 | 3993949 | | Queensland | Brisbane | 1291117 | 1805236 | | Queensland | Cairns | 92273 | 1805236 | | Queensland | Gold Coast | 311932 | 1805236 | | Queensland | Townsville | 109914 | 1805236 | | South Australia | Adelaide | 978100 | 978100 | | Tasmania | Hobart | 126118 | 126118 | | Victoria | Geelong | 125382 | 2990711 | | Victoria | Melbourne | 2865329 | 2990711 | | West Australia | Perth | 1096829 | 1096829 | +-----------------+---------------+-----------------+---------------------+ 14 rows in set (0.00 sec)
Here, my OVER
clause contains a reference to a named window called win
. That window is defined later in the query with the WINDOW
clause. In this case, I provided PARTITION BY District
as the window’s definition.
Using named windows can be useful if we have multiple columns that use the same window function definition. This saves us from having to create duplicate code when defining each window function.
See Understanding the WINDOW
Clause in MySQL for examples of creating named windows in MySQL. The syntax should be similar, if not the same, across most major RDBMSs.