APPROX_COUNT_DISTINCT()
is one of the new functions introduced in SQL Server 2019. This function returns the approximate number of unique non-null values in a group.
Basically, you can use it to get an approximate idea of the number of non-duplicate rows in a large table or result set. It works similar to the COUNT_BIG()
and COUNT()
functions (when using the DISTINCT
clause), but it returns an approximate number rather than a precise number.
APPROX_COUNT_DISTINCT()
is aimed mainly at big data scenarios. It’s designed for accessing large data sets with more than a million rows, and aggregation of a column or columns that have many distinct values. It is intended for scenarios where responsiveness is more critical than absolute precision.
Microsoft states that the function implementation guarantees up to a 2% error rate within a 97% probability.
At the time of writing, APPROX_COUNT_DISTINCT()
is a public preview feature. It was introduced in SQL Server 2019, which is also currently in preview status.
Note that Microsoft states that preview features are not intended for production use.
Syntax
The syntax goes like this:
APPROX_COUNT_DISTINCT ( expression )
The expression can be of any type, except image, sql_variant, ntext, or text.
Example 1 – COUNT() vs APPROX_COUNT_DISTINCT
Here’s a basic example comparing COUNT()
with APPROX_COUNT_DISTINCT()
:
USE WideWorldImporters; SELECT COUNT(OrderLineId) 'Actual Count', COUNT(DISTINCT OrderLineId) 'Actual Distinct Count', APPROX_COUNT_DISTINCT(OrderLineId) 'Approx Distinct Count' FROM Sales.OrderLines;
Result:
+----------------+-------------------------+-------------------------+ | Actual Count | Actual Distinct Count | Approx Distinct Count | |----------------+-------------------------+-------------------------| | 231412 | 231412 | 238493 | +----------------+-------------------------+-------------------------+
In this case, the actual count and the actual distinct count is the same (this just means that there were no duplicates in the OrderLineId column).
However, we see that APPROX_COUNT_DISTINCT()
returned a different value. This is to be expected, as it only returns an approximation.
Example 2 – A Smaller Number
In this example, I specify a different column (Description) to count:
SELECT COUNT(Description) 'Actual Count', COUNT(DISTINCT Description) 'Actual Distinct Count', APPROX_COUNT_DISTINCT(Description) 'Approx Distinct Count' FROM Sales.OrderLines;
Result:
+----------------+-------------------------+-------------------------+ | Actual Count | Actual Distinct Count | Approx Distinct Count | |----------------+-------------------------+-------------------------| | 231412 | 227 | 226 | +----------------+-------------------------+-------------------------+
In this case, the actual count and the actual distinct count is different. This is because the Description column contains a lot of duplicate values.
We can see that APPROX_COUNT_DISTINCT()
still returned a different value, but it’s pretty close.
As mentioned, APPROX_COUNT_DISTINCT()
is mainly intended for larger result sets. Smaller result sets like the ones here run quickly regardless of which function I use.
Check the Data Type
APPROX_COUNT_DISTINCT()
returns its result as a bigint, so in that regard it’s more similar to COUNT_BIG()
than it is to COUNT()
(which returns an int). But let’s confirm that:
EXEC sp_describe_first_result_set N'SELECT APPROX_COUNT_DISTINCT(OrderLineId) FROM Sales.OrderLines', null, 0;
Result (using vertical output):
is_hidden | 0 column_ordinal | 1 name | NULL is_nullable | 1 system_type_id | 127 system_type_name | bigint max_length | 8 precision | 19 scale | 0 collation_name | NULL user_type_id | NULL user_type_database | NULL user_type_schema | NULL user_type_name | NULL assembly_qualified_type_name | NULL xml_collection_id | NULL xml_collection_database | NULL xml_collection_schema | NULL xml_collection_name | NULL is_xml_document | 0 is_case_sensitive | 0 is_fixed_length_clr_type | 0 source_server | NULL source_database | NULL source_schema | NULL source_table | NULL source_column | NULL is_identity_column | 0 is_part_of_unique_key | NULL is_updateable | 0 is_computed_column | 0 is_sparse_column_set | 0 ordinal_in_order_by_list | NULL order_by_is_descending | NULL order_by_list_length | NULL tds_type_id | 38 tds_length | 8 tds_collation_id | NULL tds_collation_sort_id | NULL
We can see that system_type_name is bigint. This tells us that our query returns its results as a bigint data type, as expected. The max_length and precision values are also consistent with the bigint data type.