Understanding the CUME_DIST() Function in MySQL

In MySQL, CUME_DIST() is a window function that returns the cumulative distribution of a value within a group of values. This is the percentage of partition values less than or equal to the value in the current row. The return values range from 0 to 1.

Syntax

The syntax goes like this:

CUME_DIST() over_clause

So it requires an OVER clause, which determines how the rowset is partitioned and ordered before the window function is applied.

Example

Suppose we have a table with the following data:

SELECT * FROM Idiots;

Result:

+---------+-------------------+------+
| IdiotId | IdiotName         | IQ   |
+---------+-------------------+------+
|       1 | Dumb              |   60 |
|       2 | Dumber            |   50 |
|       3 | Dumbest           |   30 |
|       4 | Homer             |   40 |
|       5 | Peter Griffin     |   50 |
|       6 | Patrick Star      |   40 |
|       7 | Ed                |   40 |
|       8 | Ralph Wiggum      |   65 |
|       9 | Cosmo             |   55 |
|      10 | Richard Watterson |   70 |
+---------+-------------------+------+

The following query uses CUME_DIST() to get the cumulative distribution of the IQ column:

SELECT 
    IdiotName,
    IQ,
    CUME_DIST() OVER ( ORDER BY IQ ) AS "Cumulative Distribution"
FROM Idiots;

Result:

+-------------------+------+-------------------------+
| IdiotName         | IQ   | Cumulative Distribution |
+-------------------+------+-------------------------+
| Dumbest           |   30 |                     0.1 |
| Homer             |   40 |                     0.4 |
| Patrick Star      |   40 |                     0.4 |
| Ed                |   40 |                     0.4 |
| Dumber            |   50 |                     0.6 |
| Peter Griffin     |   50 |                     0.6 |
| Cosmo             |   55 |                     0.7 |
| Dumb              |   60 |                     0.8 |
| Ralph Wiggum      |   65 |                     0.9 |
| Richard Watterson |   70 |                       1 |
+-------------------+------+-------------------------+

We can see that the cumulative distribution ranges between 0.1 and 1. As mentioned, the function returns values in the range 0 to 1.

The cumulative distribution represents the number of rows preceding or peer with the current row in the window ordering of the window partition divided by the total number of rows in the window partition.

Here it is alongside the PERCENT_RANK() function:

SELECT 
    IdiotName,
    IQ,
    CUME_DIST() OVER win AS "Cumulative Distribution",
    PERCENT_RANK() OVER win AS "Percentage Rank"
FROM Idiots
WINDOW win AS ( ORDER BY IQ );

Result:

+-------------------+------+-------------------------+--------------------+
| IdiotName         | IQ   | Cumulative Distribution | Percentage Rank    |
+-------------------+------+-------------------------+--------------------+
| Dumbest           |   30 |                     0.1 |                  0 |
| Homer             |   40 |                     0.4 | 0.1111111111111111 |
| Patrick Star      |   40 |                     0.4 | 0.1111111111111111 |
| Ed                |   40 |                     0.4 | 0.1111111111111111 |
| Dumber            |   50 |                     0.6 | 0.4444444444444444 |
| Peter Griffin     |   50 |                     0.6 | 0.4444444444444444 |
| Cosmo             |   55 |                     0.7 | 0.6666666666666666 |
| Dumb              |   60 |                     0.8 | 0.7777777777777778 |
| Ralph Wiggum      |   65 |                     0.9 | 0.8888888888888888 |
| Richard Watterson |   70 |                       1 |                  1 |
+-------------------+------+-------------------------+--------------------+

So it follows a similar pattern, but with different values.

Given this example uses two window functions, I decided to use a named window. That way, I could simply reference that named window in my OVER clauses (I defined the named window in the WINDOW clause later in the query). That saved me from having to duplicate the ORDER BY clause for both window functions. I was able to put the ORDER BY clause into the named window, then reference that named window from both window functions.

Omitting the ORDER BY Clause

The CUME_DIST() function should be used with the ORDER BY clause. Omitting the ORDER BY clause will cause all rows are peers and to have the value N/N = 1, where N is the partition size.

Here’s what happens when we omit the ORDER BY clause from the above example:

SELECT 
    IdiotName,
    IQ,
    CUME_DIST() OVER () AS "Cumulative Distribution",
    PERCENT_RANK() OVER () AS "Percentage Rank"
FROM Idiots;

Result:

+-------------------+------+-------------------------+-----------------+
| IdiotName         | IQ   | Cumulative Distribution | Percentage Rank |
+-------------------+------+-------------------------+-----------------+
| Dumb              |   60 |                       1 |               0 |
| Dumber            |   50 |                       1 |               0 |
| Dumbest           |   30 |                       1 |               0 |
| Homer             |   40 |                       1 |               0 |
| Peter Griffin     |   50 |                       1 |               0 |
| Patrick Star      |   40 |                       1 |               0 |
| Ed                |   40 |                       1 |               0 |
| Ralph Wiggum      |   65 |                       1 |               0 |
| Cosmo             |   55 |                       1 |               0 |
| Richard Watterson |   70 |                       1 |               0 |
+-------------------+------+-------------------------+-----------------+

All rows contain the same value (1) for the Cumulative Distribution column. The same is true for the Percentage Rank column (although that column returns 0 instead of 1).