In PostgreSQL, we can use the cume_dist()
function to return the cumulative distribution of a value within a group of values.
It calculates this as follows: (the number of partition rows preceding or peers with current row) / (total partition rows).
The return value ranges from 1/N
to 1.
Syntax
The syntax goes like this:
cume_dist() over_clause
The OVER
clause determines how the rowset is partitioned and ordered before the window function is applied.
Example
Here’s a basic example to demonstrate how it works:
SELECT
vendorid,
productname,
productprice,
cume_dist() OVER ( ORDER BY productprice )
FROM products;
Result:
vendorid | productname | productprice | cume_dist ----------+---------------------------------+--------------+----------- 1004 | Bottomless Coffee Mugs (4 Pack) | 9.99 | 0.1 1003 | Hammock | 10.00 | 0.2 1001 | Long Weight (green) | 11.99 | 0.3 1004 | Tea Pot | 12.45 | 0.4 1001 | Long Weight (blue) | 14.75 | 0.5 1001 | Left handed screwdriver | 25.99 | 0.7 1001 | Right handed screwdriver | 25.99 | 0.7 1002 | Sledge Hammer | 33.49 | 0.8 1003 | Straw Dog Box | 55.99 | 0.9 1003 | Chainsaw | 245.00 | 1 (10 rows)
In this example, both the Left handed screwdriver
and Right handed screwdriver
have the same price. Therefore, they share the same cumulative distribution (in this case, 0.7
).
The ORDER BY
clause is important when we use the cume_dist()
function. If we omit it, the ordering is nondeterministic, and all rows are peers (which results in all rows getting the same cumulative distribution):
SELECT
vendorid,
productname,
productprice,
cume_dist() OVER ( )
FROM products;
Result:
vendorid | productname | productprice | cume_dist ----------+---------------------------------+--------------+----------- 1001 | Left handed screwdriver | 25.99 | 1 1001 | Right handed screwdriver | 25.99 | 1 1001 | Long Weight (blue) | 14.75 | 1 1001 | Long Weight (green) | 11.99 | 1 1002 | Sledge Hammer | 33.49 | 1 1003 | Chainsaw | 245.00 | 1 1003 | Straw Dog Box | 55.99 | 1 1003 | Hammock | 10.00 | 1 1004 | Bottomless Coffee Mugs (4 Pack) | 9.99 | 1 1004 | Tea Pot | 12.45 | 1 (10 rows)
So we can see that by omitting the ORDER BY
clause, all rows now have a cumulative distribution of 1
, which is not very helpful.
We haven’t applied any partitioning yet, and so the cumulative distribution is applied across the whole result set.
Partitioning
This example uses a PARTITION BY
clause to partition the results by the vendor:
SELECT
v.vendorname,
p.productname,
p.productprice,
cume_dist() OVER (
PARTITION BY p.vendorid
ORDER BY p.productprice
)
FROM products p INNER JOIN
vendors v ON p.vendorid = v.vendorid;
Result:
vendorname | productname | productprice | cume_dist ---------------+---------------------------------+--------------+-------------------- Mars Supplies | Long Weight (green) | 11.99 | 0.25 Mars Supplies | Long Weight (blue) | 14.75 | 0.5 Mars Supplies | Right handed screwdriver | 25.99 | 1 Mars Supplies | Left handed screwdriver | 25.99 | 1 Randy Roofers | Sledge Hammer | 33.49 | 1 Pedal Medals | Hammock | 10.00 | 0.3333333333333333 Pedal Medals | Straw Dog Box | 55.99 | 0.6666666666666666 Pedal Medals | Chainsaw | 245.00 | 1 Katty Kittens | Bottomless Coffee Mugs (4 Pack) | 9.99 | 0.5 Katty Kittens | Tea Pot | 12.45 | 1 (10 rows)
This time the cumulative distribution values are constrained to each partition. Therefore, they reset with each partition.
In this case I performed a SQL join on the vendors
table in order to return the vendor name instead of the vendor ID.
Changing the Order
Here’s what happens when we change the order to descending order:
SELECT
v.vendorname,
p.productname,
p.productprice,
cume_dist() OVER (
PARTITION BY p.vendorid
ORDER BY p.productprice DESC
)
FROM products p INNER JOIN
vendors v ON p.vendorid = v.vendorid;
Result:
vendorname | productname | productprice | cume_dist ---------------+---------------------------------+--------------+-------------------- Mars Supplies | Left handed screwdriver | 25.99 | 0.5 Mars Supplies | Right handed screwdriver | 25.99 | 0.5 Mars Supplies | Long Weight (blue) | 14.75 | 0.75 Mars Supplies | Long Weight (green) | 11.99 | 1 Randy Roofers | Sledge Hammer | 33.49 | 1 Pedal Medals | Chainsaw | 245.00 | 0.3333333333333333 Pedal Medals | Straw Dog Box | 55.99 | 0.6666666666666666 Pedal Medals | Hammock | 10.00 | 1 Katty Kittens | Tea Pot | 12.45 | 0.5 Katty Kittens | Bottomless Coffee Mugs (4 Pack) | 9.99 | 1 (10 rows)
The data within each partition is returned in a different order, and the rows are assigned their cumulative distribution based on that order, which in some cases means that a product has a different cumulative distribution than in the previous example. For instance, both screwdrivers now get a cumulative distribution of 0.5
, whereas they were previously 1
.
The OVER
Clause is Required
We must provide an OVER
clause when using the cume_dist()
function. This determines how the rowset is partitioned and ordered before the window function is applied.
Omitting the OVER
clause results in an error:
SELECT
vendorid,
productname,
productprice,
cume_dist()
FROM products;
Result:
ERROR: window function cume_dist requires an OVER clause