How APPROX_COUNT_DISTINCT() Works in SQL Server

APPROX_COUNT_DISTINCT() is one of the new functions introduced in SQL Server 2019. This function returns the approximate number of unique non-null values in a group.

Basically, you can use it to get an approximate idea of the number of non-duplicate rows in a large table or result set. It works similar to the COUNT_BIG() and COUNT() functions (when using the DISTINCT clause), but it returns an approximate number rather than a precise number.

APPROX_COUNT_DISTINCT() is aimed mainly at big data scenarios. It’s designed for accessing large data sets with more than a million rows, and aggregation of a column or columns that have many distinct values. It is intended for scenarios where responsiveness is more critical than absolute precision.

Microsoft states that the function implementation guarantees up to a 2% error rate within a 97% probability.

At the time of writing, APPROX_COUNT_DISTINCT() is a public preview feature. It was introduced in SQL Server 2019, which is also currently in preview status.

Note that Microsoft states that preview features are not intended for production use.

Syntax

The syntax goes like this:

APPROX_COUNT_DISTINCT ( expression )

The expression can be of any type, except image, sql_variant, ntext, or text.

Example 1 – COUNT() vs APPROX_COUNT_DISTINCT

Here’s a basic example comparing COUNT() with APPROX_COUNT_DISTINCT():

USE WideWorldImporters;
SELECT 
  COUNT(OrderLineId) 'Actual Count',
  COUNT(DISTINCT OrderLineId) 'Actual Distinct Count',
  APPROX_COUNT_DISTINCT(OrderLineId) 'Approx Distinct Count'
FROM Sales.OrderLines;

Result:

+----------------+-------------------------+-------------------------+
| Actual Count   | Actual Distinct Count   | Approx Distinct Count   |
|----------------+-------------------------+-------------------------|
| 231412         | 231412                  | 238493                  |
+----------------+-------------------------+-------------------------+

In this case, the actual count and the actual distinct count is the same (this just means that there were no duplicates in the OrderLineId column).

However, we see that APPROX_COUNT_DISTINCT() returned a different value. This is to be expected, as it only returns an approximation.

Example 2 – A Smaller Number

In this example, I specify a different column (Description) to count:

SELECT 
  COUNT(Description) 'Actual Count',
  COUNT(DISTINCT Description) 'Actual Distinct Count',
  APPROX_COUNT_DISTINCT(Description) 'Approx Distinct Count'
FROM Sales.OrderLines;

Result:

+----------------+-------------------------+-------------------------+
| Actual Count   | Actual Distinct Count   | Approx Distinct Count   |
|----------------+-------------------------+-------------------------|
| 231412         | 227                     | 226                     |
+----------------+-------------------------+-------------------------+

In this case, the actual count and the actual distinct count is different. This is because the Description column contains a lot of duplicate values.

We can see that APPROX_COUNT_DISTINCT() still returned a different value, but it’s pretty close.

As mentioned, APPROX_COUNT_DISTINCT() is mainly intended for larger result sets. Smaller result sets like the ones here run quickly regardless of which function I use.

Check the Data Type

APPROX_COUNT_DISTINCT() returns its result as a bigint, so in that regard it’s more similar to COUNT_BIG() than it is to COUNT() (which returns an int). But let’s confirm that:

EXEC sp_describe_first_result_set N'SELECT APPROX_COUNT_DISTINCT(OrderLineId) FROM Sales.OrderLines', null, 0;

Result (using vertical output):

is_hidden                    | 0
column_ordinal               | 1
name                         | NULL
is_nullable                  | 1
system_type_id               | 127
system_type_name             | bigint
max_length                   | 8
precision                    | 19
scale                        | 0
collation_name               | NULL
user_type_id                 | NULL
user_type_database           | NULL
user_type_schema             | NULL
user_type_name               | NULL
assembly_qualified_type_name | NULL
xml_collection_id            | NULL
xml_collection_database      | NULL
xml_collection_schema        | NULL
xml_collection_name          | NULL
is_xml_document              | 0
is_case_sensitive            | 0
is_fixed_length_clr_type     | 0
source_server                | NULL
source_database              | NULL
source_schema                | NULL
source_table                 | NULL
source_column                | NULL
is_identity_column           | 0
is_part_of_unique_key        | NULL
is_updateable                | 0
is_computed_column           | 0
is_sparse_column_set         | 0
ordinal_in_order_by_list     | NULL
order_by_is_descending       | NULL
order_by_list_length         | NULL
tds_type_id                  | 38
tds_length                   | 8
tds_collation_id             | NULL
tds_collation_sort_id        | NULL

We can see that system_type_name is bigint. This tells us that our query returns its results as a bigint data type, as expected. The max_length and precision values are also consistent with the bigint data type.