Using the CUME_DIST() Function to Get the Cumulative Distribution in PostgreSQL

In PostgreSQL, we can use the cume_dist() function to return the cumulative distribution of a value within a group of values.

It calculates this as follows: (the number of partition rows preceding or peers with current row) / (total partition rows).

The return value ranges from 1/N to 1.

Syntax

The syntax goes like this:

cume_dist() over_clause

The OVER clause determines how the rowset is partitioned and ordered before the window function is applied.

Example

Here’s a basic example to demonstrate how it works:

SELECT 
    vendorid,
    productname,
    productprice,
    cume_dist() OVER ( ORDER BY productprice )
FROM products;

Result:

 vendorid |           productname           | productprice | cume_dist 
----------+---------------------------------+--------------+-----------
     1004 | Bottomless Coffee Mugs (4 Pack) |         9.99 |       0.1
     1003 | Hammock                         |        10.00 |       0.2
     1001 | Long Weight (green)             |        11.99 |       0.3
     1004 | Tea Pot                         |        12.45 |       0.4
     1001 | Long Weight (blue)              |        14.75 |       0.5
     1001 | Left handed screwdriver         |        25.99 |       0.7
     1001 | Right handed screwdriver        |        25.99 |       0.7
     1002 | Sledge Hammer                   |        33.49 |       0.8
     1003 | Straw Dog Box                   |        55.99 |       0.9
     1003 | Chainsaw                        |       245.00 |         1
(10 rows)

In this example, both the Left handed screwdriver and Right handed screwdriver have the same price. Therefore, they share the same cumulative distribution (in this case, 0.7).

The ORDER BY clause is important when we use the cume_dist() function. If we omit it, the ordering is nondeterministic, and all rows are peers (which results in all rows getting the same cumulative distribution):

SELECT 
    vendorid,
    productname,
    productprice,
    cume_dist() OVER ( )
FROM products;

Result:

 vendorid |           productname           | productprice | cume_dist 
----------+---------------------------------+--------------+-----------
     1001 | Left handed screwdriver         |        25.99 |         1
     1001 | Right handed screwdriver        |        25.99 |         1
     1001 | Long Weight (blue)              |        14.75 |         1
     1001 | Long Weight (green)             |        11.99 |         1
     1002 | Sledge Hammer                   |        33.49 |         1
     1003 | Chainsaw                        |       245.00 |         1
     1003 | Straw Dog Box                   |        55.99 |         1
     1003 | Hammock                         |        10.00 |         1
     1004 | Bottomless Coffee Mugs (4 Pack) |         9.99 |         1
     1004 | Tea Pot                         |        12.45 |         1
(10 rows)

So we can see that by omitting the ORDER BY clause, all rows now have a cumulative distribution of 1, which is not very helpful.

We haven’t applied any partitioning yet, and so the cumulative distribution is applied across the whole result set.

Partitioning

This example uses a PARTITION BY clause to partition the results by the vendor:

SELECT 
    v.vendorname,
    p.productname,
    p.productprice,
    cume_dist() OVER ( 
        PARTITION BY p.vendorid 
        ORDER BY p.productprice 
        )
FROM products p INNER JOIN 
vendors v ON p.vendorid = v.vendorid;

Result:

  vendorname   |           productname           | productprice |     cume_dist      
---------------+---------------------------------+--------------+--------------------
 Mars Supplies | Long Weight (green)             |        11.99 |               0.25
 Mars Supplies | Long Weight (blue)              |        14.75 |                0.5
 Mars Supplies | Right handed screwdriver        |        25.99 |                  1
 Mars Supplies | Left handed screwdriver         |        25.99 |                  1
 Randy Roofers | Sledge Hammer                   |        33.49 |                  1
 Pedal Medals  | Hammock                         |        10.00 | 0.3333333333333333
 Pedal Medals  | Straw Dog Box                   |        55.99 | 0.6666666666666666
 Pedal Medals  | Chainsaw                        |       245.00 |                  1
 Katty Kittens | Bottomless Coffee Mugs (4 Pack) |         9.99 |                0.5
 Katty Kittens | Tea Pot                         |        12.45 |                  1
(10 rows)

This time the cumulative distribution values are constrained to each partition. Therefore, they reset with each partition.

In this case I performed a SQL join on the vendors table in order to return the vendor name instead of the vendor ID.

Changing the Order

Here’s what happens when we change the order to descending order:

SELECT 
    v.vendorname,
    p.productname,
    p.productprice,
    cume_dist() OVER ( 
        PARTITION BY p.vendorid 
        ORDER BY p.productprice DESC
        )
FROM products p INNER JOIN 
vendors v ON p.vendorid = v.vendorid;

Result:

  vendorname   |           productname           | productprice |     cume_dist      
---------------+---------------------------------+--------------+--------------------
 Mars Supplies | Left handed screwdriver         |        25.99 |                0.5
 Mars Supplies | Right handed screwdriver        |        25.99 |                0.5
 Mars Supplies | Long Weight (blue)              |        14.75 |               0.75
 Mars Supplies | Long Weight (green)             |        11.99 |                  1
 Randy Roofers | Sledge Hammer                   |        33.49 |                  1
 Pedal Medals  | Chainsaw                        |       245.00 | 0.3333333333333333
 Pedal Medals  | Straw Dog Box                   |        55.99 | 0.6666666666666666
 Pedal Medals  | Hammock                         |        10.00 |                  1
 Katty Kittens | Tea Pot                         |        12.45 |                0.5
 Katty Kittens | Bottomless Coffee Mugs (4 Pack) |         9.99 |                  1
(10 rows)

The data within each partition is returned in a different order, and the rows are assigned their cumulative distribution based on that order, which in some cases means that a product has a different cumulative distribution than in the previous example. For instance, both screwdrivers now get a cumulative distribution of 0.5, whereas they were previously 1.

The OVER Clause is Required

We must provide an OVER clause when using the cume_dist() function. This determines how the rowset is partitioned and ordered before the window function is applied.

Omitting the OVER clause results in an error:

SELECT 
    vendorid,
    productname,
    productprice,
    cume_dist()
FROM products;

Result:

ERROR:  window function cume_dist requires an OVER clause