When running SQL database queries, sometimes we need to use a window function in order to get the results we’re looking for. A window function is an aggregate-like function that enables partitioning and ordering of data within a result set.
The OVER
clause is what enables us to create a window function.
The examples below demonstrate how we can incorporate the OVER
clause in our SQL queries.
Syntax
The exact syntax for creating a window function with SQL will depend on your DBMS, but the basic syntax tends to be pretty consistent across DBMSs:
OVER (
[ <PARTITION BY clause> ]
[ <ORDER BY clause> ]
[ <FRAME clause> ]
)
So the OVER
clause is followed by parentheses, which can contain other optional clauses, depending on our requirements.
The above syntax is how we’d do it if we define the window right within the OVER
clause. We can also define it in a named window. The OVER
clause would then refer to that window name. In this case, the syntax would look like this:
OVER ( window_name )
Where window_name
is the name of a window that’s defined in a WINDOW
clause elsewhere in the query.
As mentioned, the exact syntax of each optional clause may differ between DBMSs, so it’s best to check the documentation for your DBMS.
Window functions are typically permitted inside the SELECT
list, and in some cases in the ORDER BY
clause.
Example
Suppose we have a table with data that looks like this:
SELECT
Name,
District,
Population
FROM City
WHERE CountryCode = 'NLD';
Result:
+-------------------+---------------+------------+ | Name | District | Population | +-------------------+---------------+------------+ | Amsterdam | Noord-Holland | 731200 | | Rotterdam | Zuid-Holland | 593321 | | Haag | Zuid-Holland | 440900 | | Utrecht | Utrecht | 234323 | | Eindhoven | Noord-Brabant | 201843 | | Tilburg | Noord-Brabant | 193238 | | Groningen | Groningen | 172701 | | Breda | Noord-Brabant | 160398 | | Apeldoorn | Gelderland | 153491 | | Nijmegen | Gelderland | 152463 | | Enschede | Overijssel | 149544 | | Haarlem | Noord-Holland | 148772 | | Almere | Flevoland | 142465 | | Arnhem | Gelderland | 138020 | | Zaanstad | Noord-Holland | 135621 | | ´s-Hertogenbosch | Noord-Brabant | 129170 | | Amersfoort | Utrecht | 126270 | | Maastricht | Limburg | 122087 | | Dordrecht | Zuid-Holland | 119811 | | Leiden | Zuid-Holland | 117196 | | Haarlemmermeer | Noord-Holland | 110722 | | Zoetermeer | Zuid-Holland | 110214 | | Emmen | Drenthe | 105853 | | Zwolle | Overijssel | 105819 | | Ede | Gelderland | 101574 | | Delft | Zuid-Holland | 95268 | | Heerlen | Limburg | 95052 | | Alkmaar | Noord-Holland | 92713 | +-------------------+---------------+------------+ 28 rows in set (0.00 sec)
This table contains a list of cities around the world. In this example I narrowed the results to cities in the Netherlands (i.e. those with a CountryCode
of NLD
).
A common modification to this query might look like this:
SELECT
District,
SUM(Population) AS "District Population"
FROM City WHERE CountryCode = 'NLD'
GROUP BY District;
Result:
+---------------+---------------------+ | District | District Population | +---------------+---------------------+ | Noord-Holland | 1219028 | | Zuid-Holland | 1476710 | | Utrecht | 360593 | | Noord-Brabant | 684649 | | Groningen | 172701 | | Gelderland | 545548 | | Overijssel | 255363 | | Flevoland | 142465 | | Limburg | 217139 | | Drenthe | 105853 | +---------------+---------------------+ 10 rows in set (0.00 sec)
Here, I used the GROUP BY
clause to group the results by district, and the SUM()
function to calculate the total population of all cities within each district. This has the effect of collapsing multiple rows with the same district into a single row for that district, as well as providing the total population for that district. The total population for each district is calculated from the sum of all cities within the respective district.
Note that I haven’t used the OVER
clause yet. That’s next.
As seen in the above example, the GROUP BY
clause collapses multiple rows into a single row. This is perfect if we only want to see the results as they appear in that example.
But what if we want to see the populations for both the cities and the districts?
That’s where the OVER
clause comes to the rescue.
Let’s modify the above query to use the OVER
clause:
SELECT
District,
Name AS "City Name",
Population AS "City Population",
SUM(Population) OVER(PARTITION BY District) AS "District Population"
FROM City
WHERE CountryCode = 'NLD'
ORDER BY District, Name, "City Population";
Result:
+---------------+-------------------+-----------------+---------------------+ | District | City Name | City Population | District Population | +---------------+-------------------+-----------------+---------------------+ | Drenthe | Emmen | 105853 | 105853 | | Flevoland | Almere | 142465 | 142465 | | Gelderland | Apeldoorn | 153491 | 545548 | | Gelderland | Arnhem | 138020 | 545548 | | Gelderland | Ede | 101574 | 545548 | | Gelderland | Nijmegen | 152463 | 545548 | | Groningen | Groningen | 172701 | 172701 | | Limburg | Heerlen | 95052 | 217139 | | Limburg | Maastricht | 122087 | 217139 | | Noord-Brabant | ´s-Hertogenbosch | 129170 | 684649 | | Noord-Brabant | Breda | 160398 | 684649 | | Noord-Brabant | Eindhoven | 201843 | 684649 | | Noord-Brabant | Tilburg | 193238 | 684649 | | Noord-Holland | Alkmaar | 92713 | 1219028 | | Noord-Holland | Amsterdam | 731200 | 1219028 | | Noord-Holland | Haarlem | 148772 | 1219028 | | Noord-Holland | Haarlemmermeer | 110722 | 1219028 | | Noord-Holland | Zaanstad | 135621 | 1219028 | | Overijssel | Enschede | 149544 | 255363 | | Overijssel | Zwolle | 105819 | 255363 | | Utrecht | Amersfoort | 126270 | 360593 | | Utrecht | Utrecht | 234323 | 360593 | | Zuid-Holland | Delft | 95268 | 1476710 | | Zuid-Holland | Dordrecht | 119811 | 1476710 | | Zuid-Holland | Haag | 440900 | 1476710 | | Zuid-Holland | Leiden | 117196 | 1476710 | | Zuid-Holland | Rotterdam | 593321 | 1476710 | | Zuid-Holland | Zoetermeer | 110214 | 1476710 | +---------------+-------------------+-----------------+---------------------+ 28 rows in set (0.00 sec)
This time the rows aren’t collapsed like they were with the GROUP BY
clause. This time we get to see both the population of each city and the population of each district.
The ORDER BY
Clause
The PARTITION BY
clause (used in the previous example) is one of several optional arguments we can provide to the OVER
clause. We can also add an ORDER BY
clause or a FRAME
clause as shown in the syntax section previously.
Here’s an example that uses an ORDER BY
clause with the ROW_NUMBER()
window function:
SELECT
ROW_NUMBER() OVER(PARTITION BY District) AS "Default",
ROW_NUMBER() OVER(PARTITION BY District ORDER BY Name ASC) AS Ascending,
ROW_NUMBER() OVER(PARTITION BY District ORDER BY Name DESC) AS Descending,
District,
Name AS "City Name"
FROM City
WHERE CountryCode = 'NLD';
Result:
+---------+-----------+------------+---------------+-------------------+ | Default | Ascending | Descending | District | City Name | +---------+-----------+------------+---------------+-------------------+ | 1 | 1 | 1 | Drenthe | Emmen | | 1 | 1 | 1 | Flevoland | Almere | | 2 | 4 | 1 | Gelderland | Nijmegen | | 4 | 3 | 2 | Gelderland | Ede | | 3 | 2 | 3 | Gelderland | Arnhem | | 1 | 1 | 4 | Gelderland | Apeldoorn | | 1 | 1 | 1 | Groningen | Groningen | | 1 | 2 | 1 | Limburg | Maastricht | | 2 | 1 | 2 | Limburg | Heerlen | | 2 | 4 | 1 | Noord-Brabant | Tilburg | | 1 | 3 | 2 | Noord-Brabant | Eindhoven | | 3 | 2 | 3 | Noord-Brabant | Breda | | 4 | 1 | 4 | Noord-Brabant | ´s-Hertogenbosch | | 3 | 5 | 1 | Noord-Holland | Zaanstad | | 4 | 4 | 2 | Noord-Holland | Haarlemmermeer | | 2 | 3 | 3 | Noord-Holland | Haarlem | | 1 | 2 | 4 | Noord-Holland | Amsterdam | | 5 | 1 | 5 | Noord-Holland | Alkmaar | | 2 | 2 | 1 | Overijssel | Zwolle | | 1 | 1 | 2 | Overijssel | Enschede | | 1 | 2 | 1 | Utrecht | Utrecht | | 2 | 1 | 2 | Utrecht | Amersfoort | | 5 | 6 | 1 | Zuid-Holland | Zoetermeer | | 1 | 5 | 2 | Zuid-Holland | Rotterdam | | 4 | 4 | 3 | Zuid-Holland | Leiden | | 2 | 3 | 4 | Zuid-Holland | Haag | | 3 | 2 | 5 | Zuid-Holland | Dordrecht | | 6 | 1 | 6 | Zuid-Holland | Delft | +---------+-----------+------------+---------------+-------------------+ 28 rows in set (0.04 sec)
This example uses the ROW_NUMBER()
window function to return the row number of each row within its partition. We use OVER(PARTITION BY District)
to achieve this. However, we apply ROW_NUMBER()
to three columns, each with different sorting specifications.
The first column in the above example is unsorted, due to the fact that we didn’t explicitly use an ORDER BY
clause (by default, partition rows are unordered and row numbering is nondeterministic). The second column is sorted by the Name
column in ascending order, and the third column uses descending order.
The Frame Clause
In the syntax section above, I listed a <FRAME clause>
as being one of the optional clauses for the OVER
clause. The frame clause is typically defined in the form of a ROWS
, RANGE
, or GROUP
clause (or similar, depending on the DBMS). So, we don’t actually use a FRAME
keyword in our code. We use ROWS
, RANGE
, GROUP
, or whatever the DBMS specifies as the syntax for the FRAME
clause.
When we specify a frame clause, we create frames within the partitions. A frame is a subset of the current partition, and the frame clause specifies how this subset is defined.
Frames are determined with respect to the current row, which enables a frame to move within a partition depending on the location of the current row within its partition.
For example, we can compute running totals for each row by defining a frame to be all rows from the partition start to the current row. Or we can compute rolling averages by defining the frame as extending a specified number of rows on either side of the current row.
Here’s an example that demonstrates this concept:
SELECT
District,
Name,
Population,
SUM(Population) OVER (PARTITION BY District ORDER BY Name
ROWS UNBOUNDED PRECEDING)
AS "Running Total",
AVG(Population) OVER (PARTITION BY District ORDER BY Name
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
AS "Running Average"
FROM City
WHERE CountryCode = 'NLD'
ORDER BY District, Name;
Result:
+---------------+-------------------+------------+---------------+-----------------+ | District | Name | Population | Running Total | Running Average | +---------------+-------------------+------------+---------------+-----------------+ | Drenthe | Emmen | 105853 | 105853 | 105853.0000 | | Flevoland | Almere | 142465 | 142465 | 142465.0000 | | Gelderland | Apeldoorn | 153491 | 153491 | 145755.5000 | | Gelderland | Arnhem | 138020 | 291511 | 131028.3333 | | Gelderland | Ede | 101574 | 393085 | 130685.6667 | | Gelderland | Nijmegen | 152463 | 545548 | 127018.5000 | | Groningen | Groningen | 172701 | 172701 | 172701.0000 | | Limburg | Heerlen | 95052 | 95052 | 108569.5000 | | Limburg | Maastricht | 122087 | 217139 | 108569.5000 | | Noord-Brabant | ´s-Hertogenbosch | 129170 | 129170 | 144784.0000 | | Noord-Brabant | Breda | 160398 | 289568 | 163803.6667 | | Noord-Brabant | Eindhoven | 201843 | 491411 | 185159.6667 | | Noord-Brabant | Tilburg | 193238 | 684649 | 197540.5000 | | Noord-Holland | Alkmaar | 92713 | 92713 | 411956.5000 | | Noord-Holland | Amsterdam | 731200 | 823913 | 324228.3333 | | Noord-Holland | Haarlem | 148772 | 972685 | 330231.3333 | | Noord-Holland | Haarlemmermeer | 110722 | 1083407 | 131705.0000 | | Noord-Holland | Zaanstad | 135621 | 1219028 | 123171.5000 | | Overijssel | Enschede | 149544 | 149544 | 127681.5000 | | Overijssel | Zwolle | 105819 | 255363 | 127681.5000 | | Utrecht | Amersfoort | 126270 | 126270 | 180296.5000 | | Utrecht | Utrecht | 234323 | 360593 | 180296.5000 | | Zuid-Holland | Delft | 95268 | 95268 | 107539.5000 | | Zuid-Holland | Dordrecht | 119811 | 215079 | 218659.6667 | | Zuid-Holland | Haag | 440900 | 655979 | 225969.0000 | | Zuid-Holland | Leiden | 117196 | 773175 | 383805.6667 | | Zuid-Holland | Rotterdam | 593321 | 1366496 | 273577.0000 | | Zuid-Holland | Zoetermeer | 110214 | 1476710 | 351767.5000 | +---------------+-------------------+------------+---------------+-----------------+ 28 rows in set (0.00 sec)
In this case the query has two ROWS
clauses, each with slightly different specifications.
- The
Running Total
column uses theSUM()
function and anOVER
clause with a frame clause to calculate the running total of the population as the result set progresses through the rows. - The
Running Average
column uses theAVG()
function and anOVER
clause with a frame clause to calculate the rolling averages computed from the current row and the rows that immediately precede and follow it.
In some rows in the above example, the Running Total
and Running Average
columns share the same value as the Population
column. That’s because those rows are the only row in their respective partition, and so there are no other rows being used in the calculation.
Empty OVER
Clause
We can also provide an empty OVER
clause. An empty OVER
clause is an OVER
clause without any arguments.
Here’s an example that uses an empty OVER
clause:
SELECT
District,
Name,
Population AS "City Population",
SUM(Population) OVER(PARTITION BY District) AS "District Population",
SUM(Population) OVER() AS "Total Population"
FROM City
WHERE CountryCode = 'NLD'
ORDER BY District, Name, "City Population";
Result:
+---------------+-------------------+-----------------+---------------------+------------------+ | District | Name | City Population | District Population | Total Population | +---------------+-------------------+-----------------+---------------------+------------------+ | Drenthe | Emmen | 105853 | 105853 | 5180049 | | Flevoland | Almere | 142465 | 142465 | 5180049 | | Gelderland | Apeldoorn | 153491 | 545548 | 5180049 | | Gelderland | Arnhem | 138020 | 545548 | 5180049 | | Gelderland | Ede | 101574 | 545548 | 5180049 | | Gelderland | Nijmegen | 152463 | 545548 | 5180049 | | Groningen | Groningen | 172701 | 172701 | 5180049 | | Limburg | Heerlen | 95052 | 217139 | 5180049 | | Limburg | Maastricht | 122087 | 217139 | 5180049 | | Noord-Brabant | ´s-Hertogenbosch | 129170 | 684649 | 5180049 | | Noord-Brabant | Breda | 160398 | 684649 | 5180049 | | Noord-Brabant | Eindhoven | 201843 | 684649 | 5180049 | | Noord-Brabant | Tilburg | 193238 | 684649 | 5180049 | | Noord-Holland | Alkmaar | 92713 | 1219028 | 5180049 | | Noord-Holland | Amsterdam | 731200 | 1219028 | 5180049 | | Noord-Holland | Haarlem | 148772 | 1219028 | 5180049 | | Noord-Holland | Haarlemmermeer | 110722 | 1219028 | 5180049 | | Noord-Holland | Zaanstad | 135621 | 1219028 | 5180049 | | Overijssel | Enschede | 149544 | 255363 | 5180049 | | Overijssel | Zwolle | 105819 | 255363 | 5180049 | | Utrecht | Amersfoort | 126270 | 360593 | 5180049 | | Utrecht | Utrecht | 234323 | 360593 | 5180049 | | Zuid-Holland | Delft | 95268 | 1476710 | 5180049 | | Zuid-Holland | Dordrecht | 119811 | 1476710 | 5180049 | | Zuid-Holland | Haag | 440900 | 1476710 | 5180049 | | Zuid-Holland | Leiden | 117196 | 1476710 | 5180049 | | Zuid-Holland | Rotterdam | 593321 | 1476710 | 5180049 | | Zuid-Holland | Zoetermeer | 110214 | 1476710 | 5180049 | +---------------+-------------------+-----------------+---------------------+------------------+ 28 rows in set (0.03 sec)
In this example, the Total Population
column uses an empty OVER
clause. The empty OVER
clause treats the entire set of query rows as a single partition. In this case, we get the grand total of all cities across all districts.
The OVER
clause can help in cases where we’re getting an error, due to trying to calculate an aggregate value without a GROUP BY
clause. In the above example, the empty OVER
clause allowed us to get the aggregate SUM()
of all rows without getting an error. Here’s what happens when we remove the OVER
clause from the last column:
SELECT
District,
Name,
Population AS "City Population",
SUM(Population) OVER(PARTITION BY District) AS "District Population",
SUM(Population) AS "Total Population"
FROM City
WHERE CountryCode = 'NLD'
ORDER BY District, Name, "City Population";
Result:
ERROR 1140 (42000): In aggregated query without GROUP BY, expression #1 of SELECT list contains nonaggregated column 'world.City.District'; this is incompatible with sql_mode=only_full_group_by
I ran this in MySQL and so this is the error that MySQL returns when I try to run this code (at least when my sql_mode
is set to only_full_group_by
).
If I remove only_full_group_by
from my SQL mode, I get the following result:
SELECT
District,
Name,
Population AS "City Population",
SUM(Population) OVER(PARTITION BY District) AS "District Population",
SUM(Population) AS "Total Population"
FROM City
WHERE CountryCode = 'NLD'
ORDER BY District, Name, "City Population";
Result:
+---------------+-----------+-----------------+---------------------+------------------+ | District | Name | City Population | District Population | Total Population | +---------------+-----------+-----------------+---------------------+------------------+ | Noord-Holland | Amsterdam | 731200 | 731200 | 5180049 | +---------------+-----------+-----------------+---------------------+------------------+ 1 row in set (0.00 sec)
Which is not the desired result either.
When we do this (without the only_full_group_by
SQL mode), MySQL is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic. Using the ORDER BY
clause doesn’t influence the results either. Result set sorting occurs after values have been chosen, and the ORDER BY
clause does not affect which value within each group the server chooses.
Of course, we could always go ahead and use the GROUP BY
clause, but then that wouldn’t do what we’re trying to do, and it wouldn’t allow us to get the level of granularity that we can get when using the window function. In short, it would be a completely different query. So the OVER
clause can be handy in situations like this.
Named Windows
As mentioned, we can also define our window with the WINDOW
clause. When we do this, we create a named window. A named window represents a group of rows in a table that can be used by a window function.
Using named windows can be useful if we have multiple columns that use the same window function definition. This saves us from having to create duplicate code when defining each window function.
See Understanding the WINDOW
Clause in MySQL for examples of creating named windows in MySQL. The syntax should be similar, if not the same, across most major RDBMSs.