Aggregating Uniques in Snowflake
In pursuit of fast queries, we often want to pre-compute aggregated metrics, but counting unique values is a challenge because distinct counts can't be summed (e.g. today's unique count plus yesterday's unique count can't be deduplicated after the fact). How can a data model support both daily and monthly uniques without storing a list of every unique value?
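To make the problem concrete, here's a minimal Python sketch (the user names are illustrative) of why per-day distinct counts can't simply be summed:

```python
# Daily unique visitors: the same user can appear on both days.
monday = {"user1", "user2", "user3"}
tuesday = {"user2", "user3", "user4"}

summed = len(monday) + len(tuesday)   # naive sum double-counts the overlap
true_uniques = len(monday | tuesday)  # correct, but needs the raw values

print(summed)        # 6
print(true_uniques)  # 4
```

Getting the true number requires access to the underlying values, which is exactly what a pre-aggregated daily table no longer has.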
Well, if perfect accuracy isn't a requirement, we can use HyperLogLog++ (HLL++) to pre-compute an accumulated sketch state per day, then combine the states and estimate the count at query time.
For example, given an event-level table from which we want to build a daily data model that can still answer unique-count queries, we'd do something like the following:
-- Event-level table
create or replace table events as
select $1::date as date, $2 as user_id
from values
    ('2023-01-01', 'user1'),
    ('2023-01-01', 'user2');

-- Data model with daily aggregation
create or replace table events_daily as
select date, hll_accumulate(user_id) as hll_a
from events
group by all;

-- Example query with monthly aggregation
select
    date_trunc(month, date) as period,
    hll_estimate(hll_combine(hll_a)) as unique_users
from events_daily
group by all;
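Under the hood, the accumulated state is essentially an array of registers that merge by element-wise max, which is why daily states can be rolled up into a monthly estimate in any order. Here's a toy Python sketch of the idea (a plain HyperLogLog without the HLL++ bias corrections, not Snowflake's actual implementation; all names are illustrative):

```python
import hashlib
import math

P = 12       # 2**P registers
M = 1 << P   # 4096

def _hash(value: str) -> int:
    # 64-bit hash derived from SHA-1 (any well-mixed hash works)
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

def accumulate(values):
    """Build a sketch state (register array) from an iterable of values."""
    registers = [0] * M
    for v in values:
        h = _hash(v)
        bucket = h & (M - 1)   # low P bits pick the register
        rest = h >> P
        # rank = position of the highest set bit in the remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    return registers

def combine(a, b):
    """Merge two states: element-wise max, so order never matters."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(registers):
    """Standard HLL estimator (no HLL++ corrections)."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:  # small-range (linear counting) correction
        return M * math.log(M / zeros)
    return raw

# Two "days" with 500 overlapping users; 1500 true uniques overall.
day1 = accumulate(f"user{i}" for i in range(1000))
day2 = accumulate(f"user{i}" for i in range(500, 1500))
monthly = combine(day1, day2)
print(round(estimate(monthly)))  # close to 1500, not 2000
```

The key property is that `combine` is associative and commutative, so daily states can be merged into weekly or monthly rollups without ever revisiting the raw events.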
This is fast, and HLL++ is typically accurate to within 1-2%, even at small cardinalities. Problem solved!