Introduction to the OVER Clause in SQL

When running SQL database queries, sometimes we need to use a window function in order to get the results we’re looking for. A window function is an aggregate-like function that enables partitioning and ordering of data within a result set.

The OVER clause is what enables us to create a window function.

The examples below demonstrate how we can incorporate the OVER clause in our SQL queries.

Syntax

The exact syntax for creating a window function with SQL will depend on your DBMS, but the basic syntax tends to be pretty consistent across DBMSs:

OVER (   
       [ <PARTITION BY clause> ]  
       [ <ORDER BY clause> ]   
       [ <FRAME clause> ]
      )  

So the OVER clause is followed by parentheses, which can contain other optional clauses, depending on our requirements.

The above syntax is how we’d do it if we define the window right within the OVER clause. We can also define it in a named window. The OVER clause would then refer to that window name. In this case, the syntax would look like this:

OVER ( window_name )

Where window_name is the name of a window that’s defined in a WINDOW clause elsewhere in the query.

As mentioned, the exact syntax of each optional clause may differ between DBMSs, so it’s best to check the documentation for your DBMS.

Window functions are typically permitted inside the SELECT list, and in some cases in the ORDER BY clause.

Example

Suppose we have a table with data that looks like this:

SELECT 
    Name,
    District, 
    Population
FROM City 
WHERE CountryCode = 'NLD';

Result:

+-------------------+---------------+------------+
| Name              | District      | Population |
+-------------------+---------------+------------+
| Amsterdam         | Noord-Holland |     731200 |
| Rotterdam         | Zuid-Holland  |     593321 |
| Haag              | Zuid-Holland  |     440900 |
| Utrecht           | Utrecht       |     234323 |
| Eindhoven         | Noord-Brabant |     201843 |
| Tilburg           | Noord-Brabant |     193238 |
| Groningen         | Groningen     |     172701 |
| Breda             | Noord-Brabant |     160398 |
| Apeldoorn         | Gelderland    |     153491 |
| Nijmegen          | Gelderland    |     152463 |
| Enschede          | Overijssel    |     149544 |
| Haarlem           | Noord-Holland |     148772 |
| Almere            | Flevoland     |     142465 |
| Arnhem            | Gelderland    |     138020 |
| Zaanstad          | Noord-Holland |     135621 |
| ´s-Hertogenbosch  | Noord-Brabant |     129170 |
| Amersfoort        | Utrecht       |     126270 |
| Maastricht        | Limburg       |     122087 |
| Dordrecht         | Zuid-Holland  |     119811 |
| Leiden            | Zuid-Holland  |     117196 |
| Haarlemmermeer    | Noord-Holland |     110722 |
| Zoetermeer        | Zuid-Holland  |     110214 |
| Emmen             | Drenthe       |     105853 |
| Zwolle            | Overijssel    |     105819 |
| Ede               | Gelderland    |     101574 |
| Delft             | Zuid-Holland  |      95268 |
| Heerlen           | Limburg       |      95052 |
| Alkmaar           | Noord-Holland |      92713 |
+-------------------+---------------+------------+
28 rows in set (0.00 sec)

This table contains a list of cities around the world. In this example I narrowed the results to cities in the Netherlands (i.e. those with a CountryCode of NLD).

A common modification to this query might look like this:

SELECT 
    District, 
    SUM(Population) AS "District Population"
FROM City WHERE CountryCode = 'NLD'
GROUP BY District;

Result:

+---------------+---------------------+
| District      | District Population |
+---------------+---------------------+
| Noord-Holland |             1219028 |
| Zuid-Holland  |             1476710 |
| Utrecht       |              360593 |
| Noord-Brabant |              684649 |
| Groningen     |              172701 |
| Gelderland    |              545548 |
| Overijssel    |              255363 |
| Flevoland     |              142465 |
| Limburg       |              217139 |
| Drenthe       |              105853 |
+---------------+---------------------+
10 rows in set (0.00 sec)

Here, I used the GROUP BY clause to group the results by district, and the SUM() function to calculate the total population of all cities within each district. This has the effect of collapsing multiple rows with the same district into a single row for that district, as well as providing the total population for that district. The total population for each district is calculated from the sum of all cities within the respective district.

Note that I haven’t used the OVER clause yet. That’s next.

As seen in the above example, the GROUP BY clause collapses multiple rows into a single row. This is perfect if we only want to see the results as they appear in that example.

But what if we want to see the populations for both the cities and the districts?

That’s where the OVER clause comes to the rescue.

Let’s modify the above query to use the OVER clause:

SELECT 
    District, 
    Name AS "City Name", 
    Population AS "City Population", 
    SUM(Population) OVER(PARTITION BY District) AS "District Population" 
FROM City 
WHERE CountryCode = 'NLD' 
ORDER BY District, Name, "City Population";

Result:

+---------------+-------------------+-----------------+---------------------+
| District      | City Name         | City Population | District Population |
+---------------+-------------------+-----------------+---------------------+
| Drenthe       | Emmen             |          105853 |              105853 |
| Flevoland     | Almere            |          142465 |              142465 |
| Gelderland    | Apeldoorn         |          153491 |              545548 |
| Gelderland    | Arnhem            |          138020 |              545548 |
| Gelderland    | Ede               |          101574 |              545548 |
| Gelderland    | Nijmegen          |          152463 |              545548 |
| Groningen     | Groningen         |          172701 |              172701 |
| Limburg       | Heerlen           |           95052 |              217139 |
| Limburg       | Maastricht        |          122087 |              217139 |
| Noord-Brabant | ´s-Hertogenbosch  |          129170 |              684649 |
| Noord-Brabant | Breda             |          160398 |              684649 |
| Noord-Brabant | Eindhoven         |          201843 |              684649 |
| Noord-Brabant | Tilburg           |          193238 |              684649 |
| Noord-Holland | Alkmaar           |           92713 |             1219028 |
| Noord-Holland | Amsterdam         |          731200 |             1219028 |
| Noord-Holland | Haarlem           |          148772 |             1219028 |
| Noord-Holland | Haarlemmermeer    |          110722 |             1219028 |
| Noord-Holland | Zaanstad          |          135621 |             1219028 |
| Overijssel    | Enschede          |          149544 |              255363 |
| Overijssel    | Zwolle            |          105819 |              255363 |
| Utrecht       | Amersfoort        |          126270 |              360593 |
| Utrecht       | Utrecht           |          234323 |              360593 |
| Zuid-Holland  | Delft             |           95268 |             1476710 |
| Zuid-Holland  | Dordrecht         |          119811 |             1476710 |
| Zuid-Holland  | Haag              |          440900 |             1476710 |
| Zuid-Holland  | Leiden            |          117196 |             1476710 |
| Zuid-Holland  | Rotterdam         |          593321 |             1476710 |
| Zuid-Holland  | Zoetermeer        |          110214 |             1476710 |
+---------------+-------------------+-----------------+---------------------+
28 rows in set (0.00 sec)

This time the rows aren’t collapsed like they were with the GROUP BY clause. This time we get to see both the population of each city and the population of each district.

The ORDER BY Clause

The PARTITION BY clause (used in the previous example) is one of several optional arguments we can provide to the OVER clause. We can also add an ORDER BY clause or a FRAME clause as shown in the syntax section previously.

Here’s an example that uses an ORDER BY clause with the ROW_NUMBER() window function:

SELECT 
    ROW_NUMBER() OVER(PARTITION BY District) AS "Default",
    ROW_NUMBER() OVER(PARTITION BY District ORDER BY Name ASC) AS Ascending,
    ROW_NUMBER() OVER(PARTITION BY District ORDER BY Name DESC) AS Descending,
    District,
    Name AS "City Name"
FROM City 
WHERE CountryCode = 'NLD';

Result:

+---------+-----------+------------+---------------+-------------------+
| Default | Ascending | Descending | District      | City Name         |
+---------+-----------+------------+---------------+-------------------+
|       1 |         1 |          1 | Drenthe       | Emmen             |
|       1 |         1 |          1 | Flevoland     | Almere            |
|       2 |         4 |          1 | Gelderland    | Nijmegen          |
|       4 |         3 |          2 | Gelderland    | Ede               |
|       3 |         2 |          3 | Gelderland    | Arnhem            |
|       1 |         1 |          4 | Gelderland    | Apeldoorn         |
|       1 |         1 |          1 | Groningen     | Groningen         |
|       1 |         2 |          1 | Limburg       | Maastricht        |
|       2 |         1 |          2 | Limburg       | Heerlen           |
|       2 |         4 |          1 | Noord-Brabant | Tilburg           |
|       1 |         3 |          2 | Noord-Brabant | Eindhoven         |
|       3 |         2 |          3 | Noord-Brabant | Breda             |
|       4 |         1 |          4 | Noord-Brabant | ´s-Hertogenbosch  |
|       3 |         5 |          1 | Noord-Holland | Zaanstad          |
|       4 |         4 |          2 | Noord-Holland | Haarlemmermeer    |
|       2 |         3 |          3 | Noord-Holland | Haarlem           |
|       1 |         2 |          4 | Noord-Holland | Amsterdam         |
|       5 |         1 |          5 | Noord-Holland | Alkmaar           |
|       2 |         2 |          1 | Overijssel    | Zwolle            |
|       1 |         1 |          2 | Overijssel    | Enschede          |
|       1 |         2 |          1 | Utrecht       | Utrecht           |
|       2 |         1 |          2 | Utrecht       | Amersfoort        |
|       5 |         6 |          1 | Zuid-Holland  | Zoetermeer        |
|       1 |         5 |          2 | Zuid-Holland  | Rotterdam         |
|       4 |         4 |          3 | Zuid-Holland  | Leiden            |
|       2 |         3 |          4 | Zuid-Holland  | Haag              |
|       3 |         2 |          5 | Zuid-Holland  | Dordrecht         |
|       6 |         1 |          6 | Zuid-Holland  | Delft             |
+---------+-----------+------------+---------------+-------------------+
28 rows in set (0.04 sec)

This example uses the ROW_NUMBER() window function to return the row number of each row within its partition. We use OVER(PARTITION BY District) to achieve this. However, we apply ROW_NUMBER() to three columns, each with different sorting specifications.

The first column in the above example is unsorted, due to the fact that we didn’t explicitly use an ORDER BY clause (by default, partition rows are unordered and row numbering is nondeterministic). The second column is sorted by the Name column in ascending order, and the third column uses descending order.

The Frame Clause

In the syntax section above, I listed a <FRAME clause> as being one of the optional clauses for the OVER clause. The frame clause is typically defined in the form of a ROWS, RANGE, or GROUP clause (or similar, depending on the DBMS). So, we don’t actually use a FRAME keyword in our code. We use ROWS, RANGE, GROUP, or whatever the DBMS specifies as the syntax for the FRAME clause.

When we specify a frame clause, we create frames within the partitions. A frame is a subset of the current partition, and the frame clause specifies how this subset is defined.

Frames are determined with respect to the current row, which enables a frame to move within a partition depending on the location of the current row within its partition.

For example, we can compute running totals for each row by defining a frame to be all rows from the partition start to the current row. Or we can compute rolling averages by defining the frame as extending a specified number of rows on either side of the current row.

Here’s an example that demonstrates this concept:

SELECT
    District,
    Name,
    Population,
    SUM(Population) OVER (PARTITION BY District ORDER BY Name
        ROWS UNBOUNDED PRECEDING)
    AS "Running Total",
    AVG(Population) OVER (PARTITION BY District ORDER BY Name
        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
    AS "Running Average"
FROM City
WHERE CountryCode = 'NLD'
ORDER BY District, Name;

Result:

+---------------+-------------------+------------+---------------+-----------------+
| District      | Name              | Population | Running Total | Running Average |
+---------------+-------------------+------------+---------------+-----------------+
| Drenthe       | Emmen             |     105853 |        105853 |     105853.0000 |
| Flevoland     | Almere            |     142465 |        142465 |     142465.0000 |
| Gelderland    | Apeldoorn         |     153491 |        153491 |     145755.5000 |
| Gelderland    | Arnhem            |     138020 |        291511 |     131028.3333 |
| Gelderland    | Ede               |     101574 |        393085 |     130685.6667 |
| Gelderland    | Nijmegen          |     152463 |        545548 |     127018.5000 |
| Groningen     | Groningen         |     172701 |        172701 |     172701.0000 |
| Limburg       | Heerlen           |      95052 |         95052 |     108569.5000 |
| Limburg       | Maastricht        |     122087 |        217139 |     108569.5000 |
| Noord-Brabant | ´s-Hertogenbosch  |     129170 |        129170 |     144784.0000 |
| Noord-Brabant | Breda             |     160398 |        289568 |     163803.6667 |
| Noord-Brabant | Eindhoven         |     201843 |        491411 |     185159.6667 |
| Noord-Brabant | Tilburg           |     193238 |        684649 |     197540.5000 |
| Noord-Holland | Alkmaar           |      92713 |         92713 |     411956.5000 |
| Noord-Holland | Amsterdam         |     731200 |        823913 |     324228.3333 |
| Noord-Holland | Haarlem           |     148772 |        972685 |     330231.3333 |
| Noord-Holland | Haarlemmermeer    |     110722 |       1083407 |     131705.0000 |
| Noord-Holland | Zaanstad          |     135621 |       1219028 |     123171.5000 |
| Overijssel    | Enschede          |     149544 |        149544 |     127681.5000 |
| Overijssel    | Zwolle            |     105819 |        255363 |     127681.5000 |
| Utrecht       | Amersfoort        |     126270 |        126270 |     180296.5000 |
| Utrecht       | Utrecht           |     234323 |        360593 |     180296.5000 |
| Zuid-Holland  | Delft             |      95268 |         95268 |     107539.5000 |
| Zuid-Holland  | Dordrecht         |     119811 |        215079 |     218659.6667 |
| Zuid-Holland  | Haag              |     440900 |        655979 |     225969.0000 |
| Zuid-Holland  | Leiden            |     117196 |        773175 |     383805.6667 |
| Zuid-Holland  | Rotterdam         |     593321 |       1366496 |     273577.0000 |
| Zuid-Holland  | Zoetermeer        |     110214 |       1476710 |     351767.5000 |
+---------------+-------------------+------------+---------------+-----------------+
28 rows in set (0.00 sec)

In this case the query has two ROWS clauses, each with slightly different specifications.

  • The Running Total column uses the SUM() function and an OVER clause with a frame clause to calculate the running total of the population as the result set progresses through the rows.
  • The Running Average column uses the AVG() function and an OVER clause with a frame clause to calculate the rolling averages computed from the current row and the rows that immediately precede and follow it.

In some rows in the above example, the Running Total and Running Average columns share the same value as the Population column. That’s because those rows are the only row in their respective partition, and so there are no other rows being used in the calculation.

Empty OVER Clause

We can also provide an empty OVER clause. An empty OVER clause is an OVER clause without any arguments.

Here’s an example that uses an empty OVER clause:

SELECT 
    District, 
    Name, 
    Population AS "City Population", 
    SUM(Population) OVER(PARTITION BY District) AS "District Population",
    SUM(Population) OVER() AS "Total Population"
FROM City 
WHERE CountryCode = 'NLD' 
ORDER BY District, Name, "City Population";

Result:

+---------------+-------------------+-----------------+---------------------+------------------+
| District      | Name              | City Population | District Population | Total Population |
+---------------+-------------------+-----------------+---------------------+------------------+
| Drenthe       | Emmen             |          105853 |              105853 |          5180049 |
| Flevoland     | Almere            |          142465 |              142465 |          5180049 |
| Gelderland    | Apeldoorn         |          153491 |              545548 |          5180049 |
| Gelderland    | Arnhem            |          138020 |              545548 |          5180049 |
| Gelderland    | Ede               |          101574 |              545548 |          5180049 |
| Gelderland    | Nijmegen          |          152463 |              545548 |          5180049 |
| Groningen     | Groningen         |          172701 |              172701 |          5180049 |
| Limburg       | Heerlen           |           95052 |              217139 |          5180049 |
| Limburg       | Maastricht        |          122087 |              217139 |          5180049 |
| Noord-Brabant | ´s-Hertogenbosch  |          129170 |              684649 |          5180049 |
| Noord-Brabant | Breda             |          160398 |              684649 |          5180049 |
| Noord-Brabant | Eindhoven         |          201843 |              684649 |          5180049 |
| Noord-Brabant | Tilburg           |          193238 |              684649 |          5180049 |
| Noord-Holland | Alkmaar           |           92713 |             1219028 |          5180049 |
| Noord-Holland | Amsterdam         |          731200 |             1219028 |          5180049 |
| Noord-Holland | Haarlem           |          148772 |             1219028 |          5180049 |
| Noord-Holland | Haarlemmermeer    |          110722 |             1219028 |          5180049 |
| Noord-Holland | Zaanstad          |          135621 |             1219028 |          5180049 |
| Overijssel    | Enschede          |          149544 |              255363 |          5180049 |
| Overijssel    | Zwolle            |          105819 |              255363 |          5180049 |
| Utrecht       | Amersfoort        |          126270 |              360593 |          5180049 |
| Utrecht       | Utrecht           |          234323 |              360593 |          5180049 |
| Zuid-Holland  | Delft             |           95268 |             1476710 |          5180049 |
| Zuid-Holland  | Dordrecht         |          119811 |             1476710 |          5180049 |
| Zuid-Holland  | Haag              |          440900 |             1476710 |          5180049 |
| Zuid-Holland  | Leiden            |          117196 |             1476710 |          5180049 |
| Zuid-Holland  | Rotterdam         |          593321 |             1476710 |          5180049 |
| Zuid-Holland  | Zoetermeer        |          110214 |             1476710 |          5180049 |
+---------------+-------------------+-----------------+---------------------+------------------+
28 rows in set (0.03 sec)

In this example, the Total Population column uses an empty OVER clause. The empty OVER clause treats the entire set of query rows as a single partition. In this case, we get the grand total of all cities across all districts.

The OVER clause can help in cases where we’re getting an error, due to trying to calculate an aggregate value without a GROUP BY clause. In the above example, the empty OVER clause allowed us to get the aggregate SUM() of all rows without getting an error. Here’s what happens when we remove the OVER clause from the last column:

SELECT 
    District, 
    Name, 
    Population AS "City Population", 
    SUM(Population) OVER(PARTITION BY District) AS "District Population",
    SUM(Population) AS "Total Population"
FROM City 
WHERE CountryCode = 'NLD' 
ORDER BY District, Name, "City Population";

Result:

ERROR 1140 (42000): In aggregated query without GROUP BY, expression #1 of SELECT list contains nonaggregated column 'world.City.District'; this is incompatible with sql_mode=only_full_group_by

I ran this in MySQL and so this is the error that MySQL returns when I try to run this code (at least when my sql_mode is set to only_full_group_by).

If I remove only_full_group_by from my SQL mode, I get the following result:

SELECT 
    District, 
    Name, 
    Population AS "City Population", 
    SUM(Population) OVER(PARTITION BY District) AS "District Population",
    SUM(Population) AS "Total Population"
FROM City 
WHERE CountryCode = 'NLD' 
ORDER BY District, Name, "City Population";

Result:

+---------------+-----------+-----------------+---------------------+------------------+
| District      | Name      | City Population | District Population | Total Population |
+---------------+-----------+-----------------+---------------------+------------------+
| Noord-Holland | Amsterdam |          731200 |              731200 |          5180049 |
+---------------+-----------+-----------------+---------------------+------------------+
1 row in set (0.00 sec)

Which is not the desired result either.

When we do this (without the only_full_group_by SQL mode), MySQL is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic. Using the ORDER BY clause doesn’t influence the results either. Result set sorting occurs after values have been chosen, and the ORDER BY clause does not affect which value within each group the server chooses.

Of course, we could always go ahead and use the GROUP BY clause, but then that wouldn’t do what we’re trying to do, and it wouldn’t allow us to get the level of granularity that we can get when using the window function. In short, it would be a completely different query. So the OVER clause can be handy in situations like this.

Named Windows

As mentioned, we can also define our window with the WINDOW clause. When we do this, we create a named window. A named window represents a group of rows in a table that can be used by a window function.

Using named windows can be useful if we have multiple columns that use the same window function definition. This saves us from having to create duplicate code when defining each window function.

See Understanding the WINDOW Clause in MySQL for examples of creating named windows in MySQL. The syntax should be similar, if not the same, across most major RDBMSs.