Understanding the PARTITION BY Clause in SQL

Sometimes when we run SQL queries we might want to group the results by a given column. A common way to do this is with the GROUP BY clause.

But sometimes the GROUP BY clause isn’t enough.

Sometimes we want the results presented in a way that the GROUP BY clause doesn’t cater for. The GROUP BY is perfect if we want to collapse groups of rows into a single row, so that we can get an aggregate value for each set of collapsed rows. But sometimes we don’t want this. Sometimes we want to see all of the rows, as well as the aggregate values.

This is where the PARTITION BY clause comes in.

What is a PARTITION CLAUSE?

The partition clause is one of the clauses that can be used as part of a window function. It can be used to divide the query result set into specified partitions.

A window function is a kind of aggregate-like operation that operates on a set of query rows. But window operations are different to aggregate operations. An aggregate operation groups query rows into a single result row, whereas a window function produces a result for each query row.

Window functions consist of an OVER clause, which itself can contain other clauses, one of which is the PARTITION BY clause. The other clauses can be used in conjunction with the PARTITION BY clause to further refine the results.

A window function could include the following clauses in its OVER clause:

  • <PARTITION BY clause> – Divides the result set into partitions.
  • <ORDER BY clause> – Defines the logical order of the rows within each partition of the result set.
  • <FRAME clause> – Creates frames within the partitions. A frame is a subset of the current partition, and the frame clause specifies how this subset is defined. The frame clause is typically defined in the form of a ROWS, RANGE, or GROUP clause (or similar, depending on the DBMS).

The exact syntax and clauses accepted in a window function will depend on the DBMS being used, so it pays to check the documentation for your DBMS.

Example

Consider the following query:

SELECT 
    Name,
    District, 
    Population
FROM City 
WHERE CountryCode = 'AUS';

Result:

+---------------+-----------------+------------+
| Name          | District        | Population |
+---------------+-----------------+------------+
| Sydney        | New South Wales |    3276207 |
| Melbourne     | Victoria        |    2865329 |
| Brisbane      | Queensland      |    1291117 |
| Perth         | West Australia  |    1096829 |
| Adelaide      | South Australia |     978100 |
| Canberra      | Capital Region  |     322723 |
| Gold Coast    | Queensland      |     311932 |
| Newcastle     | New South Wales |     270324 |
| Central Coast | New South Wales |     227657 |
| Wollongong    | New South Wales |     219761 |
| Hobart        | Tasmania        |     126118 |
| Geelong       | Victoria        |     125382 |
| Townsville    | Queensland      |     109914 |
| Cairns        | Queensland      |      92273 |
+---------------+-----------------+------------+
14 rows in set (0.00 sec)

This query returns a list of cities in Australia (i.e. those with a CountryCode of AUS).

A common modification to this query might be to group our results by district, and return the population of each district:

SELECT 
    District, 
    SUM(Population) AS "District Population"
FROM City WHERE CountryCode = 'AUS'
GROUP BY District;

Result:

+-----------------+---------------------+
| District        | District Population |
+-----------------+---------------------+
| New South Wales |             3993949 |
| Victoria        |             2990711 |
| Queensland      |             1805236 |
| West Australia  |             1096829 |
| South Australia |              978100 |
| Capital Region  |              322723 |
| Tasmania        |              126118 |
+-----------------+---------------------+
7 rows in set (0.00 sec)

Here, we use the GROUP BY clause to group the results by district. We also use the SUM() function to calculate the total population of all cities within each district.

Note that I haven’t used the PARTITION BY clause yet. That’s next.

The above query is great if all we want to see is the population of each district. But what if we also want to see the population of each city within each district? The problem with the GROUP BY clause in this case is that it collapses multiple rows into a single row. So by having each district collapsed into a single row, there’s only one row for which to list all cities within each district.

One solution is to output a comma-separated list of the cities into a single cell.

But that won’t help if we want a different row for each city. Fortunately, we can achieve this with the PARTITION BY clause.

Let’s modify the above query to use the PARTITION BY clause:

SELECT 
    District, 
    Name, 
    Population AS "City Population", 
    SUM(Population) OVER(PARTITION BY District) AS "District Population" 
FROM City 
WHERE CountryCode = 'AUS' 
ORDER BY District, Name, "City Population";

Result:

+-----------------+---------------+-----------------+---------------------+
| District        | Name          | City Population | District Population |
+-----------------+---------------+-----------------+---------------------+
| Capital Region  | Canberra      |          322723 |              322723 |
| New South Wales | Central Coast |          227657 |             3993949 |
| New South Wales | Newcastle     |          270324 |             3993949 |
| New South Wales | Sydney        |         3276207 |             3993949 |
| New South Wales | Wollongong    |          219761 |             3993949 |
| Queensland      | Brisbane      |         1291117 |             1805236 |
| Queensland      | Cairns        |           92273 |             1805236 |
| Queensland      | Gold Coast    |          311932 |             1805236 |
| Queensland      | Townsville    |          109914 |             1805236 |
| South Australia | Adelaide      |          978100 |              978100 |
| Tasmania        | Hobart        |          126118 |              126118 |
| Victoria        | Geelong       |          125382 |             2990711 |
| Victoria        | Melbourne     |         2865329 |             2990711 |
| West Australia  | Perth         |         1096829 |             1096829 |
+-----------------+---------------+-----------------+---------------------+
14 rows in set (0.01 sec)

Now we get to see the population of all cities and their districts.

Bear in mind that the PARTITION BY clause is an optional argument for the OVER clause. We can also provide other arguments or we can drop the arguments altogether (i.e. call OVER() without any arguments) to provide an empty OVER clause.

Here’s what happens when I provide an empty OVER clause:

SELECT 
    District, 
    Name, 
    Population AS "City Population", 
    SUM(Population) OVER(PARTITION BY District) AS "District Population",
    SUM(Population) OVER() AS "Total Population"
FROM City 
WHERE CountryCode = 'AUS' 
ORDER BY District, Name, "City Population";

Result:

+-----------------+---------------+-----------------+---------------------+------------------+
| District        | Name          | City Population | District Population | Total Population |
+-----------------+---------------+-----------------+---------------------+------------------+
| Capital Region  | Canberra      |          322723 |              322723 |         11313666 |
| New South Wales | Central Coast |          227657 |             3993949 |         11313666 |
| New South Wales | Newcastle     |          270324 |             3993949 |         11313666 |
| New South Wales | Sydney        |         3276207 |             3993949 |         11313666 |
| New South Wales | Wollongong    |          219761 |             3993949 |         11313666 |
| Queensland      | Brisbane      |         1291117 |             1805236 |         11313666 |
| Queensland      | Cairns        |           92273 |             1805236 |         11313666 |
| Queensland      | Gold Coast    |          311932 |             1805236 |         11313666 |
| Queensland      | Townsville    |          109914 |             1805236 |         11313666 |
| South Australia | Adelaide      |          978100 |              978100 |         11313666 |
| Tasmania        | Hobart        |          126118 |              126118 |         11313666 |
| Victoria        | Geelong       |          125382 |             2990711 |         11313666 |
| Victoria        | Melbourne     |         2865329 |             2990711 |         11313666 |
| West Australia  | Perth         |         1096829 |             1096829 |         11313666 |
+-----------------+---------------+-----------------+---------------------+------------------+
14 rows in set (0.04 sec)

Here, I added an extra column with an empty OVER clause. It still works, but the empty OVER clause treats the entire set of query rows as a single partition, and therefore calculates the grand total of all cities across all districts.  

I think this is a good demonstration of how the PARTITION BY clause works because we can see the empty OVER clause next to one with the PARTITION BY clause. The PARTITION BY clause partitions the results of the OVER clause, so that we can see aggregate data at a more granular level.

The ORDER BY Clause

We can add an ORDER BY clause to sort the rows within each partition.

Here’s an example that demonstrates this:

SELECT 
    ROW_NUMBER() OVER(PARTITION BY District) AS "Default",
    ROW_NUMBER() OVER(PARTITION BY District ORDER BY Name ASC) AS Ascending,
    ROW_NUMBER() OVER(PARTITION BY District ORDER BY Name DESC) AS Descending,
    District,
    Name AS "City Name"
FROM City 
WHERE CountryCode = 'AUS';

Result:

+---------+-----------+------------+-----------------+---------------+
| Default | Ascending | Descending | District        | City Name     |
+---------+-----------+------------+-----------------+---------------+
|       1 |         1 |          1 | Capital Region  | Canberra      |
|       4 |         4 |          1 | New South Wales | Wollongong    |
|       1 |         3 |          2 | New South Wales | Sydney        |
|       2 |         2 |          3 | New South Wales | Newcastle     |
|       3 |         1 |          4 | New South Wales | Central Coast |
|       3 |         4 |          1 | Queensland      | Townsville    |
|       2 |         3 |          2 | Queensland      | Gold Coast    |
|       4 |         2 |          3 | Queensland      | Cairns        |
|       1 |         1 |          4 | Queensland      | Brisbane      |
|       1 |         1 |          1 | South Australia | Adelaide      |
|       1 |         1 |          1 | Tasmania        | Hobart        |
|       1 |         2 |          1 | Victoria        | Melbourne     |
|       2 |         1 |          2 | Victoria        | Geelong       |
|       1 |         1 |          1 | West Australia  | Perth         |
+---------+-----------+------------+-----------------+---------------+
14 rows in set (0.00 sec)

This example uses the ROW_NUMBER() window function to return the row number of the current row. We use OVER(PARTITION BY District) to specify that it’s for each row within its partition. So instead of having the row numbering go from 1 to 14, it resets to 1 at the start of each new partition.

In this example I applied ROW_NUMBER() to three columns, each with different sorting specifications. The first column is unsorted, due to the fact that we didn’t explicitly use an ORDER BY clause (by default, partition rows are unordered and row numbering is nondeterministic). The second column is sorted by the Name column in ascending order, and the third column uses descending order.

So the actual values in each of those three columns is different, depending on how we’ve sorted it – or not sorted it – using the ORDER BY clause.

The Frame Clause

We have the option of adding a frame clause to our PARTITION BY clause. The frame clause is typically defined in the form of a ROWS, RANGE, or GROUP clause (or similar, depending on the DBMS).

When we specify a frame clause, we create frames within the partitions. A frame is a subset of the current partition, and the frame clause specifies how this subset is defined.

Frames are determined with respect to the current row, which enables a frame to move within a partition depending on the location of the current row within its partition.

For example, we can compute running totals for each row by defining a frame to be all rows from the partition start to the current row. Or we can compute rolling averages by defining the frame as extending a specified number of rows on either side of the current row.

The following example demonstrates this:

SELECT
    District,
    Name,
    Population,
    SUM(Population) OVER (PARTITION BY District ORDER BY Name
        ROWS UNBOUNDED PRECEDING) AS "Running Total",
    AVG(Population) OVER (PARTITION BY District ORDER BY Name
        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS "Running Average"
FROM City
WHERE CountryCode = 'AUS'
ORDER BY District, Name;

Result:

+-----------------+---------------+------------+---------------+-----------------+
| District        | Name          | Population | Running Total | Running Average |
+-----------------+---------------+------------+---------------+-----------------+
| Capital Region  | Canberra      |     322723 |        322723 |     322723.0000 |
| New South Wales | Central Coast |     227657 |        227657 |     248990.5000 |
| New South Wales | Newcastle     |     270324 |        497981 |    1258062.6667 |
| New South Wales | Sydney        |    3276207 |       3774188 |    1255430.6667 |
| New South Wales | Wollongong    |     219761 |       3993949 |    1747984.0000 |
| Queensland      | Brisbane      |    1291117 |       1291117 |     691695.0000 |
| Queensland      | Cairns        |      92273 |       1383390 |     565107.3333 |
| Queensland      | Gold Coast    |     311932 |       1695322 |     171373.0000 |
| Queensland      | Townsville    |     109914 |       1805236 |     210923.0000 |
| South Australia | Adelaide      |     978100 |        978100 |     978100.0000 |
| Tasmania        | Hobart        |     126118 |        126118 |     126118.0000 |
| Victoria        | Geelong       |     125382 |        125382 |    1495355.5000 |
| Victoria        | Melbourne     |    2865329 |       2990711 |    1495355.5000 |
| West Australia  | Perth         |    1096829 |       1096829 |    1096829.0000 |
+-----------------+---------------+------------+---------------+-----------------+
14 rows in set (0.01 sec)

In this case the query has two ROWS clauses, each with slightly different specifications.

  • The Running Total column uses the SUM() function and an OVER clause with a frame clause to calculate the running total of the population as the result set progresses through the rows. In this case, the frame clause includes UNBOUNDED PRECEDING, which specifies that the bound is the first partition row (i.e. the row values are calculated based on the first row in the partition).
  • The Running Average column uses the AVG() function and an OVER clause with a frame clause to calculate the rolling averages computed from the current row and the rows that immediately precede and follow it. In this case, instead of using UNBOUNDED PRECEDING, we use ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING instead.

In some rows in our example, the Running Total and Running Average columns share the same value as the Population column (the first row is one such row). That’s because those rows are the only row in their respective partition, and so there are no other rows being used in the calculation.

Conclusion

The PARTITION BY clause is one of the clauses that can be used as part of a window function. It can be used to divide the query result set into specified partitions.

There are also some optional clauses that we can use to modify how data is presented within each partition. This includes ordering the results within each partition, as well as creating “frames” (or subsets) within each partition.

The above examples should work in most of the major DBMSs that support window functions, but each DBMS has its own syntax, so it pays to consult with your DBMS’s documentation for the complete syntax and functionality available to you.