Aggregating Uniques in Snowflake
In pursuit of fast queries, we often want to pre-compute aggregated metrics, but counting unique values is a challenge because uniques can't be summed (e.g. today's unique count plus yesterday's unique count cannot be deduplicated). How can a data model support both daily uniques and monthly uniques without storing a list of all the unique values?
Well, if perfect accuracy isn’t a requirement, then we can use HyperLogLog++ to pre-compute the accumulated state, then combine + estimate at query time.
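To build intuition for why HLL states can be combined while plain counts cannot: a state is essentially an array of registers, and combining two states is an element-wise max, which deduplicates automatically. Here is a toy HyperLogLog sketch in Python — a simplified illustration (basic hash and bias handling), not Snowflake's actual implementation:

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog sketch: registers hold max leading-zero ranks."""

    def __init__(self, p=12):
        self.p = p                     # 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash: top p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Deduplication for free: element-wise max of the register arrays.
        out = HLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:  # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw

# Two "days" with 500 overlapping users: the true unique count is 1500.
day1, day2 = HLL(), HLL()
for i in range(1000):
    day1.add(f"user{i}")
for i in range(500, 1500):
    day2.add(f"user{i}")

merged = day1.merge(day2)
# Summing per-day counts would say 2000; the merged state estimates ~1500.
```

Because max is associative and commutative, merging the two daily states yields exactly the same registers as accumulating all the events at once — which is what makes pre-aggregation safe.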
For example, given an event-level table from which we want to build a daily data model that can also answer unique-count questions, we'd do something like the following:
-- Event-level table
create or replace table events as
select date::date as date, user_id
from values
    ('2023-01-01', 'user1'),
    ('2023-01-01', 'user2') as t(date, user_id);

-- Data model with daily aggregation, storing the HLL state
create or replace table events_daily as
select date, hll_accumulate(user_id) as hll_a
from events
group by all;

-- Example query with monthly aggregation over the daily model
select date_trunc(month, date) as period,
       hll_estimate(hll_combine(hll_a)) as unique_users
from events_daily
group by all;
This is fast, and HLL++ is accurate to within 1-2%, even at small cardinalities. Problem solved.