Most major RDBMSs support the DISTINCT
clause, which allows us to get unique – or “distinct” – rows from our SQL queries. But PostgreSQL’s implementation of this clause has an extra option that most other RDBMSs don’t include.
PostgreSQL allows us to include the ON()
option when using DISTINCT
. This enables us to specify exactly which columns should be evaluated by the DISTINCT
modifier, while simultaneously allowing us to return columns that aren’t evaluated.
Basically, it means we can return whatever columns we want, while applying DISTINCT
to just a subset of them. In other words, not all columns returned in our query need to be evaluated by DISTINCT
.
Without the ON
option, all columns in our SELECT
list are evaluated by the DISTINCT
modifier.
Example
Suppose we have the following table:
SELECT * FROM Dogs;
Result:
dogid | firstname | lastname | age -------+-----------+-----------+----- 1 | Bark | Smith | 10 2 | Wag | Frazier | 7 3 | Bark | Jones | 10 4 | Bite | Rattlebag | 10
Here’s an example of using DISTINCT ON
against that table:
SELECT
DISTINCT ON(FirstName, Age)
*
FROM Dogs
ORDER BY FirstName, Age;
Result:
dogid | firstname | lastname | age -------+-----------+-----------+----- 1 | Bark | Smith | 10 4 | Bite | Rattlebag | 10 2 | Wag | Frazier | 7
This query returned all columns for the rows that are distinct across their FirstName
and Age
columns. So if two or more rows have the same values in both their FirstName
and Age
columns, then they’re presented as just one row.
The good thing about the ON
option is that it allows us to present as many columns as we like, without them needing to be evaluated by the DISTINCT
modifier. That’s why we were able to return all columns in the above query, but only evaluate two of them for the DISTINCT
modifier.
Here it is with just one column evaluated by DISTINCT ON()
:
SELECT
DISTINCT ON(Age)
*
FROM Dogs
ORDER BY Age;
Result:
dogid | firstname | lastname | age -------+-----------+----------+----- 2 | Wag | Frazier | 7 1 | Bark | Smith | 10
The results are narrowed down further. That’s because we’re applying DISTINCT ON
to just one column – the Age
column. The table contains three rows with the same value, and so only one row is returned for that value.
Removing the ON
Option
Most other RDBMSs don’t include the ON
option (at least at the time of writing), and so we can only return those columns that are evaluated in the DISTINCT
modifier.
With PostgreSQL, we can also remove the ON
option.
Here’s how we’d return the same rows without the ON
option:
SELECT
DISTINCT FirstName, Age
FROM Dogs
ORDER BY FirstName, Age;
Result:
firstname | age -----------+----- Bark | 10 Bite | 10 Wag | 7
So even though we got the same rows, we weren’t able to return all columns. If we wanted to return all columns, then the DISTINCT
clause would be looking for rows that are unique across all columns. This wouldn’t work in our case, as each row is unique.
Here’s what I mean:
SELECT
DISTINCT *
FROM Dogs
ORDER BY FirstName, Age;
Result:
dogid | firstname | lastname | age -------+-----------+-----------+----- 1 | Bark | Smith | 10 3 | Bark | Jones | 10 4 | Bite | Rattlebag | 10 2 | Wag | Frazier | 7
The whole table is returned.
So we can see that the ON
option can be very handy for separating the columns returned from those that are evaluated by DISTINCT
.
Evaluating Columns Not in the SELECT
List
Given the ON
option allows us to separate the SELECT
list from the DISTINCT
list, we can return a completely different set of columns from those evaluated by the DISTINCT
modifier.
Example:
SELECT
DISTINCT ON(FirstName, Age)
DogId,
LastName
FROM Dogs
ORDER BY FirstName, Age;
Result:
dogid | lastname -------+----------- 1 | Smith 4 | Rattlebag 2 | Frazier
In this example I returned the DogId
and LastName
columns, but my DISTINCT
modifier evaluated the FirstName
and Age
columns.
About the ORDER BY
Clause
One thing to be mindful of when using DISTINCT ON
, is that the ORDER BY
clause, if provided, must start with the same columns that are being evaluated by the DISTINCT ON
clause.
In other words, we can’t start the ORDER BY
clause with a column that isn’t passed to DISTINCT ON()
. If we do that, we’ll get an error, like this:
SELECT
DISTINCT ON(FirstName, Age)
*
FROM Dogs
ORDER BY DogId, FirstName, Age;
Result:
SELECT DISTINCT ON expressions must match initial ORDER BY expressions
Here I started my ORDER BY
clause with the DogId
column, but I didn’t include that column in my DISTINCT ON
clause.
If I want to include DogId
in the ORDER BY
clause, either I’d need to add that to the DISTINCT ON
list, or I should put it after the columns that are in the DISTINCT ON
list.
Let’s do that later:
SELECT
DISTINCT ON(FirstName, Age)
*
FROM Dogs
ORDER BY FirstName, Age, DogId;
Result:
dogid | firstname | lastname | age -------+-----------+-----------+----- 1 | Bark | Smith | 10 4 | Bite | Rattlebag | 10 2 | Wag | Frazier | 7
This time it worked without error.