Using DuckDB’s FAVG() Function for More Accurate Floating Point Averages

The favg() function in DuckDB calculates averages using a floating point summation method known as Kahan summation (or compensated summation). This method helps reduce the accumulation of rounding errors that can occur when many floating point numbers are involved when using the regular avg() function.

What is Kahan Summation?

Kahan summation, developed by William Kahan, is an algorithm that reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers. The algorithm compensates for rounding errors by keeping a separate “running compensation” term.

Syntax

The basic syntax goes like this:

favg(arg)

Where arg is typically a column for which we want to calculate the average value of.

Example 1: Basic Example

Suppose we create and populate the following table:

CREATE TABLE scientific_readings (
    id INT,
    measurement DOUBLE
);

INSERT INTO scientific_readings VALUES
    (1, 0.000000001),
    (2, 0.000000002),
    (3, 1000000.0),
    (4, 0.000000001),
    (5, 0.000000003);

We can use the favg() function to get the average value in the measurement column:

SELECT
    favg(measurement)
FROM scientific_readings;

Result:

+-------------------+
| favg(measurement) |
+-------------------+
| 200000.0000000014 |
+-------------------+

Example 2: Comparing `favg()` and `avg()`

Let’s compare the results of favg() to those of the avg() function:

SELECT
    avg(measurement) AS standard_avg,
    favg(measurement) AS kahan_avg,
    favg(measurement) - avg(measurement) AS difference
FROM scientific_readings;

Result:

+--------------------+-------------------+-------------------------+
|    standard_avg    |     kahan_avg     |       difference        |
+--------------------+-------------------+-------------------------+
| 200000.00000000143 | 200000.0000000014 | -2.9103830456733704e-11 |
+--------------------+-------------------+-------------------------+

We can see that we get a slightly different result, depending on which function we use. The difference might be small but can be significant in scientific computations or financial calculations.

Example 3: Using a Larger Dataset

Let’s create another table, but this time we’ll insert thousands of rows:

-- Create a table
CREATE TABLE measurements (
    id INT,
    value DOUBLE
);

-- Insert reciprocal values (1/1, 1/2, 1/3, etc.)
INSERT INTO measurements 
SELECT 
    generate_series AS id,
    1.0/generate_series AS value
FROM generate_series(1, 10000);

-- Check the first 15 rows
SELECT * FROM measurements LIMIT 15;

Output:

+----+---------------------+
| id |        value        |
+----+---------------------+
| 1  | 1.0                 |
| 2  | 0.5                 |
| 3  | 0.3333333333333333  |
| 4  | 0.25                |
| 5  | 0.2                 |
| 6  | 0.16666666666666666 |
| 7  | 0.14285714285714285 |
| 8  | 0.125               |
| 9  | 0.1111111111111111  |
| 10 | 0.1                 |
| 11 | 0.09090909090909091 |
| 12 | 0.08333333333333333 |
| 13 | 0.07692307692307693 |
| 14 | 0.07142857142857142 |
| 15 | 0.06666666666666667 |
+----+---------------------+

The table contains 10,000 rows of data, but that’s just the top fifteen.

Now let’s run a query that calculates the average of all the values in the above table:

SELECT 
    avg(value) AS standard_avg,
    favg(value) AS kahan_avg,
    favg(value) - avg(value) AS difference
FROM measurements;

Result:

+-----------------------+----------------------+------------------------+
|     standard_avg      |      kahan_avg       |       difference       |
+-----------------------+----------------------+------------------------+
| 0.0009787606036044348 | 0.000978760603604438 | 3.2526065174565133e-18 |
+-----------------------+----------------------+------------------------+

Again favg() and avg() returned slightly different results.

Example 4: Using `favg()` with Window Functions

Here’s an example that uses a window function:

SELECT 
    id,
    value,
    avg(value) OVER (ORDER BY id) AS running_avg,
    favg(value) OVER (ORDER BY id) AS accurate_running_avg
FROM measurements
WHERE id <= 10
ORDER BY id;

Result:

+----+---------------------+---------------------+----------------------+
| id |        value        |     running_avg     | accurate_running_avg |
+----+---------------------+---------------------+----------------------+
| 1  | 1.0                 | 1.0                 | 1.0                  |
| 2  | 0.5                 | 0.75                | 0.75                 |
| 3  | 0.3333333333333333  | 0.611111111111111   | 0.611111111111111    |
| 4  | 0.25                | 0.5208333333333333  | 0.5208333333333334   |
| 5  | 0.2                 | 0.45666666666666667 | 0.45666666666666667  |
| 6  | 0.16666666666666666 | 0.40833333333333327 | 0.40833333333333344  |
| 7  | 0.14285714285714285 | 0.37040816326530607 | 0.3704081632653061   |
| 8  | 0.125               | 0.3397321428571428  | 0.33973214285714287  |
| 9  | 0.1111111111111111  | 0.31432980599647264 | 0.3143298059964727   |
| 10 | 0.1                 | 0.2928968253968254  | 0.2928968253968254   |
+----+---------------------+---------------------+----------------------+

In this case, some of the results returned by avg() and favg() are the same and some aren’t.

Example 5: When `avg()` and `favg()` Return the Same Result

As we can see by the previous example, it’s not always the case that avg() and favg() will return different results. It depends on the data.

Here’s another example that shows them both returning the same result:

CREATE TABLE t1 (
    value DOUBLE
);

INSERT INTO t1 VALUES 
    (1.0),
    (0.3333333333333333),
    (0.14285714285714285);

SELECT 
    avg(value) AS standard_avg,
    favg(value) AS kahan_avg,
    favg(value) - avg(value) AS difference
FROM t1;

Result:

+---------------------+---------------------+------------+
|    standard_avg     |      kahan_avg      | difference |
+---------------------+---------------------+------------+
| 0.49206349206349204 | 0.49206349206349204 | 0.0        |
+---------------------+---------------------+------------+

When to Use `favg()`

You should consider using favg() instead of avg() when:

Working with datasets containing many floating point numbers
Dealing with scientific computations requiring high precision
Processing financial data where small rounding errors could accumulate
Computing averages of values with widely different magnitudes

Conclusion

While avg() is suitable for many use cases, favg() provides higher precision when dealing with floating point numbers. The difference becomes particularly important when:

Getting the average across many numbers
Working with values of significantly different magnitudes
Dealing with applications requiring high numerical precision

The trade-off is slightly increased computational complexity for better numerical accuracy.

In summary, if you need high precision when calculating averages in DuckDB, try using favg() instead of avg().

What is Kahan Summation?

Syntax

Example 1: Basic Example

Example 2: Comparing favg() and avg()

Example 3: Using a Larger Dataset

Example 4: Using favg() with Window Functions

Example 5: When avg() and favg() Return the Same Result

When to Use favg()

Conclusion

Example 2: Comparing `favg()` and `avg()`

Example 4: Using `favg()` with Window Functions

Example 5: When `avg()` and `favg()` Return the Same Result

When to Use `favg()`