Understanding the DISTINCT ON Option in PostgreSQL

Most major RDBMSs support the DISTINCT clause, which allows us to get unique – or “distinct” – rows from our SQL queries. But PostgreSQL’s implementation of this clause has an extra option that most other RDBMSs don’t include.

PostgreSQL allows us to include the ON() option when using DISTINCT. This enables us to specify exactly which columns should be evaluated by the DISTINCT modifier, while simultaneously allowing us to return columns that aren’t evaluated.

Basically, it means we can return whatever columns we want, while applying DISTINCT to just a subset of them. In other words, not all columns returned in our query need to be evaluated by DISTINCT.

Without the ON option, all columns in our SELECT list are evaluated by the DISTINCT modifier.

Example

Suppose we have the following table:

SELECT * FROM Dogs;

Result:

 dogid | firstname | lastname  | age 
-------+-----------+-----------+-----
     1 | Bark      | Smith     |  10
     2 | Wag       | Frazier   |   7
     3 | Bark      | Jones     |  10
     4 | Bite      | Rattlebag |  10

Here’s an example of using DISTINCT ON against that table:

SELECT 
    DISTINCT ON(FirstName, Age) 
    *
FROM Dogs
ORDER BY FirstName, Age;

Result:

 dogid | firstname | lastname  | age 
-------+-----------+-----------+-----
     1 | Bark      | Smith     |  10
     4 | Bite      | Rattlebag |  10
     2 | Wag       | Frazier   |   7

This query returned all columns for the rows that are distinct across their FirstName and Age columns. So if two or more rows have the same values in both their FirstName and Age columns, then they’re presented as just one row.

The good thing about the ON option is that it allows us to present as many columns as we like, without them needing to be evaluated by the DISTINCT modifier. That’s why we were able to return all columns in the above query, but only evaluate two of them for the DISTINCT modifier.

Here it is with just one column evaluated by DISTINCT ON():

SELECT 
    DISTINCT ON(Age) 
    *
FROM Dogs
ORDER BY Age;

Result:

 dogid | firstname | lastname | age 
-------+-----------+----------+-----
     2 | Wag       | Frazier  |   7
     1 | Bark      | Smith    |  10

The results are narrowed down further. That’s because we’re applying DISTINCT ON to just one column – the Age column. The table contains three rows with the same value, and so only one row is returned for that value.

Removing the ON Option

Most other RDBMSs don’t include the ON option (at least at the time of writing), and so we can only return those columns that are evaluated in the DISTINCT modifier.

With PostgreSQL, we can also remove the ON option.

Here’s how we’d return the same rows without the ON option:

SELECT 
    DISTINCT FirstName, Age
FROM Dogs
ORDER BY FirstName, Age;

Result:

 firstname | age 
-----------+-----
 Bark      |  10
 Bite      |  10
 Wag       |   7

So even though we got the same rows, we weren’t able to return all columns. If we wanted to return all columns, then the DISTINCT clause would be looking for rows that are unique across all columns. This wouldn’t work in our case, as each row is unique.

Here’s what I mean:

SELECT 
    DISTINCT *
FROM Dogs
ORDER BY FirstName, Age;

Result:

 dogid | firstname | lastname  | age 
-------+-----------+-----------+-----
     1 | Bark      | Smith     |  10
     3 | Bark      | Jones     |  10
     4 | Bite      | Rattlebag |  10
     2 | Wag       | Frazier   |   7

The whole table is returned.

So we can see that the ON option can be very handy for separating the columns returned from those that are evaluated by DISTINCT.

Evaluating Columns Not in the SELECT List

Given the ON option allows us to separate the SELECT list from the DISTINCT list, we can return a completely different set of columns from those evaluated by the DISTINCT modifier.

Example:

SELECT 
    DISTINCT ON(FirstName, Age) 
    DogId,
    LastName
FROM Dogs
ORDER BY FirstName, Age;

Result:

 dogid | lastname  
-------+-----------
     1 | Smith
     4 | Rattlebag
     2 | Frazier

In this example I returned the DogId and LastName columns, but my DISTINCT modifier evaluated the FirstName and Age columns.

About the ORDER BY Clause

One thing to be mindful of when using DISTINCT ON, is that the ORDER BY clause, if provided, must start with the same columns that are being evaluated by the DISTINCT ON clause.

In other words, we can’t start the ORDER BY clause with a column that isn’t passed to DISTINCT ON(). If we do that, we’ll get an error, like this:

SELECT 
    DISTINCT ON(FirstName, Age) 
    *
FROM Dogs
ORDER BY DogId, FirstName, Age;

Result:

SELECT DISTINCT ON expressions must match initial ORDER BY expressions

Here I started my ORDER BY clause with the DogId column, but I didn’t include that column in my DISTINCT ON clause.

If I want to include DogId in the ORDER BY clause, either I’d need to add that to the DISTINCT ON list, or I should put it after the columns that are in the DISTINCT ON list.

Let’s do that later:

SELECT 
    DISTINCT ON(FirstName, Age) 
    *
FROM Dogs
ORDER BY FirstName, Age, DogId;

Result:

 dogid | firstname | lastname  | age 
-------+-----------+-----------+-----
     1 | Bark      | Smith     |  10
     4 | Bite      | Rattlebag |  10
     2 | Wag       | Frazier   |   7

This time it worked without error.