In MySQL, CUME_DIST()
is a window function that returns the cumulative distribution of a value within a group of values. This is the percentage of partition values less than or equal to the value in the current row. The return values range from 0 to 1.
Syntax
The syntax goes like this:
CUME_DIST() over_clause
So it requires an OVER
clause, which determines how the rowset is partitioned and ordered before the window function is applied.
Example
Suppose we have a table with the following data:
SELECT * FROM Idiots;
Result:
+---------+-------------------+------+ | IdiotId | IdiotName | IQ | +---------+-------------------+------+ | 1 | Dumb | 60 | | 2 | Dumber | 50 | | 3 | Dumbest | 30 | | 4 | Homer | 40 | | 5 | Peter Griffin | 50 | | 6 | Patrick Star | 40 | | 7 | Ed | 40 | | 8 | Ralph Wiggum | 65 | | 9 | Cosmo | 55 | | 10 | Richard Watterson | 70 | +---------+-------------------+------+
The following query uses CUME_DIST()
to get the cumulative distribution of the IQ column:
SELECT
IdiotName,
IQ,
CUME_DIST() OVER ( ORDER BY IQ ) AS "Cumulative Distribution"
FROM Idiots;
Result:
+-------------------+------+-------------------------+ | IdiotName | IQ | Cumulative Distribution | +-------------------+------+-------------------------+ | Dumbest | 30 | 0.1 | | Homer | 40 | 0.4 | | Patrick Star | 40 | 0.4 | | Ed | 40 | 0.4 | | Dumber | 50 | 0.6 | | Peter Griffin | 50 | 0.6 | | Cosmo | 55 | 0.7 | | Dumb | 60 | 0.8 | | Ralph Wiggum | 65 | 0.9 | | Richard Watterson | 70 | 1 | +-------------------+------+-------------------------+
We can see that the cumulative distribution ranges between 0.1 and 1. As mentioned, the function returns values in the range 0 to 1.
The cumulative distribution represents the number of rows preceding or peer with the current row in the window ordering of the window partition divided by the total number of rows in the window partition.
Here it is alongside the PERCENT_RANK()
function:
SELECT
IdiotName,
IQ,
CUME_DIST() OVER win AS "Cumulative Distribution",
PERCENT_RANK() OVER win AS "Percentage Rank"
FROM Idiots
WINDOW win AS ( ORDER BY IQ );
Result:
+-------------------+------+-------------------------+--------------------+ | IdiotName | IQ | Cumulative Distribution | Percentage Rank | +-------------------+------+-------------------------+--------------------+ | Dumbest | 30 | 0.1 | 0 | | Homer | 40 | 0.4 | 0.1111111111111111 | | Patrick Star | 40 | 0.4 | 0.1111111111111111 | | Ed | 40 | 0.4 | 0.1111111111111111 | | Dumber | 50 | 0.6 | 0.4444444444444444 | | Peter Griffin | 50 | 0.6 | 0.4444444444444444 | | Cosmo | 55 | 0.7 | 0.6666666666666666 | | Dumb | 60 | 0.8 | 0.7777777777777778 | | Ralph Wiggum | 65 | 0.9 | 0.8888888888888888 | | Richard Watterson | 70 | 1 | 1 | +-------------------+------+-------------------------+--------------------+
So it follows a similar pattern, but with different values.
Given this example uses two window functions, I decided to use a named window. That way, I could simply reference that named window in my OVER
clauses (I defined the named window in the WINDOW
clause later in the query). That saved me from having to duplicate the ORDER BY
clause for both window functions. I was able to put the ORDER BY
clause into the named window, then reference that named window from both window functions.
Omitting the ORDER BY
Clause
The CUME_DIST()
function should be used with the ORDER BY
clause. Omitting the ORDER BY
clause will cause all rows are peers and to have the value N
/N
= 1, where N
is the partition size.
Here’s what happens when we omit the ORDER BY
clause from the above example:
SELECT
IdiotName,
IQ,
CUME_DIST() OVER () AS "Cumulative Distribution",
PERCENT_RANK() OVER () AS "Percentage Rank"
FROM Idiots;
Result:
+-------------------+------+-------------------------+-----------------+ | IdiotName | IQ | Cumulative Distribution | Percentage Rank | +-------------------+------+-------------------------+-----------------+ | Dumb | 60 | 1 | 0 | | Dumber | 50 | 1 | 0 | | Dumbest | 30 | 1 | 0 | | Homer | 40 | 1 | 0 | | Peter Griffin | 50 | 1 | 0 | | Patrick Star | 40 | 1 | 0 | | Ed | 40 | 1 | 0 | | Ralph Wiggum | 65 | 1 | 0 | | Cosmo | 55 | 1 | 0 | | Richard Watterson | 70 | 1 | 0 | +-------------------+------+-------------------------+-----------------+
All rows contain the same value (1
) for the Cumulative Distribution
column. The same is true for the Percentage Rank
column (although that column returns 0
instead of 1
).