Most RDBMSs have functions that make it easy to calculate the median value from a column in our queries.
Below are three SQL functions we can use to calculate the median value in a column.
The MEDIAN()
Function
Some DBMSs (such as MariaDB, Oracle Database, DB2, Snowflake, Amazon Redshift) provide a MEDIAN()
function specifically for calculating the median value. If your DBMS has a MEDIAN()
function, then your work is done. All you need to do is pass the name of the column for which you want the median value calculated.
Here’s an example I ran in MariaDB:
SELECT
VendorId,
ProductName,
ProductPrice,
MEDIAN(ProductPrice) OVER(PARTITION BY VendorId) AS "Median"
FROM Products;
Result:
VendorId ProductName ProductPrice Median -------------- ------------------------------ -------------- ------------- 1001 Left handed screwdriver 25.99 20.3700000000 1001 Right handed screwdriver 25.99 20.3700000000 1001 Long Weight (blue) 14.75 20.3700000000 1001 Long Weight (green) 11.99 20.3700000000 1002 Sledge Hammer 33.49 33.4900000000 1003 Chainsaw 245.00 55.9900000000 1003 Straw Dog Box 55.99 55.9900000000 1003 Hammock 10.00 55.9900000000 1004 Tea Pot 12.45 11.2200000000 1004 Bottomless Coffee Mugs (4 Pack) 9.99 11.2200000000
You may have noticed that the median value returned is not necessarily in the actual underlying data. That’s because the MEDIAN()
function returns an interpolated value if required. In other words, it calculates the median and returns that value, regardless of whether that value is actually in the underlying data.
We can see an example of this with the first vendor. Here, the median
column contains 20.3700000000
, but that value isn’t actually in the data set. Same with the last vendor. Its median
column contains 11.2200000000
, but that value isn’t in the underlying data.
If you don’t want an interpolated value, then use the PERCENTILE_DISC()
function if your DBMS has it (example of this function later).
The PERCENTILE_CONT()
Function
If your DBMS doesn’t have a MEDIAN()
function, perhaps it has a PERCENTILE_CONT()
function. If it does, you can use it to calculate the median value.
The PERCENTILE_CONT()
function accepts a numeric argument between 0.0
and 1.0
. Providing an argument of 0.5
is what we need to do to get the median.
Here’s an example I ran in SQL Server:
SELECT
VendorId,
ProductName,
ProductPrice,
PERCENTILE_CONT( 0.5 )
WITHIN GROUP (
ORDER BY ProductPrice
)
OVER (
PARTITION BY VendorId
) AS PERCENTILE_CONT
FROM Products;
Result:
VendorId ProductName ProductPrice PERCENTILE_CONT -------- ------------------------------- ------------ ------------------ 1001 Long Weight (green) 11.99 20.369999999999997 1001 Long Weight (blue) 14.75 20.369999999999997 1001 Left handed screwdriver 25.99 20.369999999999997 1001 Right handed screwdriver 25.99 20.369999999999997 1002 Sledge Hammer 33.49 33.49 1003 Hammock 10 55.99 1003 Straw Dog Box 55.99 55.99 1003 Chainsaw 245 55.99 1004 Bottomless Coffee Mugs (4 Pack) 9.99 11.219999999999999 1004 Tea Pot 12.45 11.219999999999999
An argument of 0.5
was used to calculate the median price of all products from each vendor. This is important – any other value and it wouldn’t be the median.
I used the WITHIN GROUP
clause to sort the set by product price, and I used the OVER
clause with a PARTITION BY
clause to partition the results by vendor. Removing the PARTITION BY
clause will calculate the median value across the whole data set.
Like the MEDIAN()
function above, the PERCENTILE_CONT()
function doesn’t necessarily return a value from the actual data set. It takes the percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification.
As an Aggregate Function
Depending on your DBMS, you may or may not be able to run PERCENTILE_CONT()
as a window function (like in the above example). In PostgreSQL for example, the PERCENTILE_CONT()
function is implemented as an aggregate function, but not as a window function, and it doesn’t support the OVER
clause (at least, not at the time of writing).
Here’s an example of using the PERCENTILE_CONT()
function in PostgreSQL:
SELECT
PERCENTILE_CONT( 0.5 )
WITHIN GROUP (
ORDER BY ProductPrice
) AS Median
FROM Products;
Result:
median -------------------- 20.369999999999997
So in this case it’s used as an aggregate function to return the median value of the whole data set.
Of course, we can also narrow it down to just a single vendor:
SELECT
PERCENTILE_CONT( 0.5 )
WITHIN GROUP (
ORDER BY ProductPrice
) AS Median
FROM Products
WHERE VendorId = 1004;
Result:
median -------------------- 11.219999999999999
And with a bit of work, you may be able to modify the query to return results similar to the window function.
Some DBMSs (such as DB2, Oracle Database, SingleStoreDB) support the OVER
clause as optional. In this case, we have the choice of using PERCENTILE_CONT()
as a window function or aggregate function.
DBMSs that implement a PERCENTILE_CONT()
function include SQL Server, Oracle Database, PostgreSQL, MariaDB, DB2, Snowflake, Amazon Redshift, SingleStoreDB, and more.
The PERCENTILE_DISC()
Function
If your RDBMS supports the PERCENTILE_CONT()
function, it most likely also includes a PERCENTILE_DISC()
function that works in a similar way to PERCENTILE_CONT()
, except that it always returns an actual value from the data set.
Example:
SELECT
VendorId,
ProductName,
ProductPrice,
PERCENTILE_DISC( 0.5 )
WITHIN GROUP (
ORDER BY ProductPrice
)
OVER (
PARTITION BY VendorId
) AS PERCENTILE_DISC
FROM Products;
Result:
VendorId ProductName ProductPrice PERCENTILE_DISC -------- ------------------------------- ------------ --------------- 1001 Long Weight (green) 11.99 14.75 1001 Long Weight (blue) 14.75 14.75 1001 Left handed screwdriver 25.99 14.75 1001 Right handed screwdriver 25.99 14.75 1002 Sledge Hammer 33.49 33.49 1003 Hammock 10 55.99 1003 Straw Dog Box 55.99 55.99 1003 Chainsaw 245 55.99 1004 Bottomless Coffee Mugs (4 Pack) 9.99 9.99 1004 Tea Pot 12.45 9.99
Here, I used the same code from the previous example, except that I switched PERCENTILE_CONT()
to PERCENTILE_DISC()
. We can see that PERCENTILE_DISC()
doesn’t interpolate any values. It simply uses whatever values are already in the data set.
What is the Median Value?
The median is the value in the middle of a data set, meaning that half of the values are smaller or equal to the median and half are higher or equal to the median. This is not to be confused with the average value, which can be vastly different to the median. To calculate the average, use the AVG()
function instead.