Sometimes when we run SQL queries we might want to group the results by a given column. A common way to do this is with the GROUP BY
clause.
But sometimes the GROUP BY
clause isn’t enough.
Sometimes we want the results presented in a way that the GROUP BY
clause doesn’t cater for. The GROUP BY
is perfect if we want to collapse groups of rows into a single row, so that we can get an aggregate value for each set of collapsed rows. But sometimes we don’t want this. Sometimes we want to see all of the rows, as well as the aggregate values.
This is where the PARTITION BY
clause comes in.
What is a PARTITION CLAUSE
?
The partition clause is one of the clauses that can be used as part of a window function. It can be used to divide the query result set into specified partitions.
A window function is a kind of aggregate-like operation that operates on a set of query rows. But window operations are different to aggregate operations. An aggregate operation groups query rows into a single result row, whereas a window function produces a result for each query row.
Window functions consist of an OVER
clause, which itself can contain other clauses, one of which is the PARTITION BY
clause. The other clauses can be used in conjunction with the PARTITION BY
clause to further refine the results.
A window function could include the following clauses in its OVER
clause:
<PARTITION BY clause>
– Divides the result set into partitions.<ORDER BY clause>
– Defines the logical order of the rows within each partition of the result set.<FRAME clause>
– Creates frames within the partitions. A frame is a subset of the current partition, and the frame clause specifies how this subset is defined. The frame clause is typically defined in the form of aROWS
,RANGE
, orGROUP
clause (or similar, depending on the DBMS).
The exact syntax and clauses accepted in a window function will depend on the DBMS being used, so it pays to check the documentation for your DBMS.
Example
Consider the following query:
SELECT
Name,
District,
Population
FROM City
WHERE CountryCode = 'AUS';
Result:
+---------------+-----------------+------------+ | Name | District | Population | +---------------+-----------------+------------+ | Sydney | New South Wales | 3276207 | | Melbourne | Victoria | 2865329 | | Brisbane | Queensland | 1291117 | | Perth | West Australia | 1096829 | | Adelaide | South Australia | 978100 | | Canberra | Capital Region | 322723 | | Gold Coast | Queensland | 311932 | | Newcastle | New South Wales | 270324 | | Central Coast | New South Wales | 227657 | | Wollongong | New South Wales | 219761 | | Hobart | Tasmania | 126118 | | Geelong | Victoria | 125382 | | Townsville | Queensland | 109914 | | Cairns | Queensland | 92273 | +---------------+-----------------+------------+ 14 rows in set (0.00 sec)
This query returns a list of cities in Australia (i.e. those with a CountryCode
of AUS
).
A common modification to this query might be to group our results by district, and return the population of each district:
SELECT
District,
SUM(Population) AS "District Population"
FROM City WHERE CountryCode = 'AUS'
GROUP BY District;
Result:
+-----------------+---------------------+ | District | District Population | +-----------------+---------------------+ | New South Wales | 3993949 | | Victoria | 2990711 | | Queensland | 1805236 | | West Australia | 1096829 | | South Australia | 978100 | | Capital Region | 322723 | | Tasmania | 126118 | +-----------------+---------------------+ 7 rows in set (0.00 sec)
Here, we use the GROUP BY
clause to group the results by district. We also use the SUM()
function to calculate the total population of all cities within each district.
Note that I haven’t used the PARTITION BY
clause yet. That’s next.
The above query is great if all we want to see is the population of each district. But what if we also want to see the population of each city within each district? The problem with the GROUP BY
clause in this case is that it collapses multiple rows into a single row. So by having each district collapsed into a single row, there’s only one row for which to list all cities within each district.
One solution is to output a comma-separated list of the cities into a single cell.
But that won’t help if we want a different row for each city. Fortunately, we can achieve this with the PARTITION BY
clause.
Let’s modify the above query to use the PARTITION BY
clause:
SELECT
District,
Name,
Population AS "City Population",
SUM(Population) OVER(PARTITION BY District) AS "District Population"
FROM City
WHERE CountryCode = 'AUS'
ORDER BY District, Name, "City Population";
Result:
+-----------------+---------------+-----------------+---------------------+ | District | Name | City Population | District Population | +-----------------+---------------+-----------------+---------------------+ | Capital Region | Canberra | 322723 | 322723 | | New South Wales | Central Coast | 227657 | 3993949 | | New South Wales | Newcastle | 270324 | 3993949 | | New South Wales | Sydney | 3276207 | 3993949 | | New South Wales | Wollongong | 219761 | 3993949 | | Queensland | Brisbane | 1291117 | 1805236 | | Queensland | Cairns | 92273 | 1805236 | | Queensland | Gold Coast | 311932 | 1805236 | | Queensland | Townsville | 109914 | 1805236 | | South Australia | Adelaide | 978100 | 978100 | | Tasmania | Hobart | 126118 | 126118 | | Victoria | Geelong | 125382 | 2990711 | | Victoria | Melbourne | 2865329 | 2990711 | | West Australia | Perth | 1096829 | 1096829 | +-----------------+---------------+-----------------+---------------------+ 14 rows in set (0.01 sec)
Now we get to see the population of all cities and their districts.
Bear in mind that the PARTITION BY
clause is an optional argument for the OVER
clause. We can also provide other arguments or we can drop the arguments altogether (i.e. call OVER()
without any arguments) to provide an empty OVER
clause.
Here’s what happens when I provide an empty OVER
clause:
SELECT
District,
Name,
Population AS "City Population",
SUM(Population) OVER(PARTITION BY District) AS "District Population",
SUM(Population) OVER() AS "Total Population"
FROM City
WHERE CountryCode = 'AUS'
ORDER BY District, Name, "City Population";
Result:
+-----------------+---------------+-----------------+---------------------+------------------+ | District | Name | City Population | District Population | Total Population | +-----------------+---------------+-----------------+---------------------+------------------+ | Capital Region | Canberra | 322723 | 322723 | 11313666 | | New South Wales | Central Coast | 227657 | 3993949 | 11313666 | | New South Wales | Newcastle | 270324 | 3993949 | 11313666 | | New South Wales | Sydney | 3276207 | 3993949 | 11313666 | | New South Wales | Wollongong | 219761 | 3993949 | 11313666 | | Queensland | Brisbane | 1291117 | 1805236 | 11313666 | | Queensland | Cairns | 92273 | 1805236 | 11313666 | | Queensland | Gold Coast | 311932 | 1805236 | 11313666 | | Queensland | Townsville | 109914 | 1805236 | 11313666 | | South Australia | Adelaide | 978100 | 978100 | 11313666 | | Tasmania | Hobart | 126118 | 126118 | 11313666 | | Victoria | Geelong | 125382 | 2990711 | 11313666 | | Victoria | Melbourne | 2865329 | 2990711 | 11313666 | | West Australia | Perth | 1096829 | 1096829 | 11313666 | +-----------------+---------------+-----------------+---------------------+------------------+ 14 rows in set (0.04 sec)
Here, I added an extra column with an empty OVER
clause. It still works, but the empty OVER
clause treats the entire set of query rows as a single partition, and therefore calculates the grand total of all cities across all districts.
I think this is a good demonstration of how the PARTITION BY
clause works because we can see the empty OVER
clause next to one with the PARTITION BY
clause. The PARTITION BY
clause partitions the results of the OVER
clause, so that we can see aggregate data at a more granular level.
The ORDER BY
Clause
We can add an ORDER BY
clause to sort the rows within each partition.
Here’s an example that demonstrates this:
SELECT
ROW_NUMBER() OVER(PARTITION BY District) AS "Default",
ROW_NUMBER() OVER(PARTITION BY District ORDER BY Name ASC) AS Ascending,
ROW_NUMBER() OVER(PARTITION BY District ORDER BY Name DESC) AS Descending,
District,
Name AS "City Name"
FROM City
WHERE CountryCode = 'AUS';
Result:
+---------+-----------+------------+-----------------+---------------+ | Default | Ascending | Descending | District | City Name | +---------+-----------+------------+-----------------+---------------+ | 1 | 1 | 1 | Capital Region | Canberra | | 4 | 4 | 1 | New South Wales | Wollongong | | 1 | 3 | 2 | New South Wales | Sydney | | 2 | 2 | 3 | New South Wales | Newcastle | | 3 | 1 | 4 | New South Wales | Central Coast | | 3 | 4 | 1 | Queensland | Townsville | | 2 | 3 | 2 | Queensland | Gold Coast | | 4 | 2 | 3 | Queensland | Cairns | | 1 | 1 | 4 | Queensland | Brisbane | | 1 | 1 | 1 | South Australia | Adelaide | | 1 | 1 | 1 | Tasmania | Hobart | | 1 | 2 | 1 | Victoria | Melbourne | | 2 | 1 | 2 | Victoria | Geelong | | 1 | 1 | 1 | West Australia | Perth | +---------+-----------+------------+-----------------+---------------+ 14 rows in set (0.00 sec)
This example uses the ROW_NUMBER()
window function to return the row number of the current row. We use OVER(PARTITION BY District)
to specify that it’s for each row within its partition. So instead of having the row numbering go from 1 to 14, it resets to 1 at the start of each new partition.
In this example I applied ROW_NUMBER()
to three columns, each with different sorting specifications. The first column is unsorted, due to the fact that we didn’t explicitly use an ORDER BY
clause (by default, partition rows are unordered and row numbering is nondeterministic). The second column is sorted by the Name
column in ascending order, and the third column uses descending order.
So the actual values in each of those three columns is different, depending on how we’ve sorted it – or not sorted it – using the ORDER BY
clause.
The Frame Clause
We have the option of adding a frame clause to our PARTITION BY
clause. The frame clause is typically defined in the form of a ROWS
, RANGE
, or GROUP
clause (or similar, depending on the DBMS).
When we specify a frame clause, we create frames within the partitions. A frame is a subset of the current partition, and the frame clause specifies how this subset is defined.
Frames are determined with respect to the current row, which enables a frame to move within a partition depending on the location of the current row within its partition.
For example, we can compute running totals for each row by defining a frame to be all rows from the partition start to the current row. Or we can compute rolling averages by defining the frame as extending a specified number of rows on either side of the current row.
The following example demonstrates this:
SELECT
District,
Name,
Population,
SUM(Population) OVER (PARTITION BY District ORDER BY Name
ROWS UNBOUNDED PRECEDING) AS "Running Total",
AVG(Population) OVER (PARTITION BY District ORDER BY Name
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS "Running Average"
FROM City
WHERE CountryCode = 'AUS'
ORDER BY District, Name;
Result:
+-----------------+---------------+------------+---------------+-----------------+ | District | Name | Population | Running Total | Running Average | +-----------------+---------------+------------+---------------+-----------------+ | Capital Region | Canberra | 322723 | 322723 | 322723.0000 | | New South Wales | Central Coast | 227657 | 227657 | 248990.5000 | | New South Wales | Newcastle | 270324 | 497981 | 1258062.6667 | | New South Wales | Sydney | 3276207 | 3774188 | 1255430.6667 | | New South Wales | Wollongong | 219761 | 3993949 | 1747984.0000 | | Queensland | Brisbane | 1291117 | 1291117 | 691695.0000 | | Queensland | Cairns | 92273 | 1383390 | 565107.3333 | | Queensland | Gold Coast | 311932 | 1695322 | 171373.0000 | | Queensland | Townsville | 109914 | 1805236 | 210923.0000 | | South Australia | Adelaide | 978100 | 978100 | 978100.0000 | | Tasmania | Hobart | 126118 | 126118 | 126118.0000 | | Victoria | Geelong | 125382 | 125382 | 1495355.5000 | | Victoria | Melbourne | 2865329 | 2990711 | 1495355.5000 | | West Australia | Perth | 1096829 | 1096829 | 1096829.0000 | +-----------------+---------------+------------+---------------+-----------------+ 14 rows in set (0.01 sec)
In this case the query has two ROWS
clauses, each with slightly different specifications.
- The
Running Total
column uses theSUM()
function and anOVER
clause with a frame clause to calculate the running total of the population as the result set progresses through the rows. In this case, the frame clause includesUNBOUNDED PRECEDING
, which specifies that the bound is the first partition row (i.e. the row values are calculated based on the first row in the partition). - The
Running Average
column uses theAVG()
function and anOVER
clause with a frame clause to calculate the rolling averages computed from the current row and the rows that immediately precede and follow it. In this case, instead of usingUNBOUNDED PRECEDING
, we useROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
instead.
In some rows in our example, the Running Total
and Running Average
columns share the same value as the Population
column (the first row is one such row). That’s because those rows are the only row in their respective partition, and so there are no other rows being used in the calculation.
Conclusion
The PARTITION BY
clause is one of the clauses that can be used as part of a window function. It can be used to divide the query result set into specified partitions.
There are also some optional clauses that we can use to modify how data is presented within each partition. This includes ordering the results within each partition, as well as creating “frames” (or subsets) within each partition.
The above examples should work in most of the major DBMSs that support window functions, but each DBMS has its own syntax, so it pays to consult with your DBMS’s documentation for the complete syntax and functionality available to you.