Nat TaylorBlog, AI, Product Management & Tinkering

Aggregating Uniques in Snowflake

Published on .

In pursuit of fast queries, we often want to pre-compute aggregated metrics, but this can be a challenge with counting unique values since they can’t be summed (e.g. today’s unique count + yesterday’s unique count cannot be deduplicated). How can a data model support daily uniques and monthly uniques without storing a list of all the unique values?

Well, if perfect accuracy isn’t a requirement, then we can use HyperLogLog++ to pre-compute the accumulated state, then combine + estimate at query time.

For example, given an event level table that we want to aggregate a daily data model from which we can also calculate uniques, we’d do something like the following:

-- Event Level table
select *
from values
  ('2023-01-01', 'user1'),
  ('2023-01-01', 'user2') as t(date, user_id)

-- Data Model with daily aggregation
select date, hll_accumulate(user_id) as hll_a from events group by all

-- Example query with monthly aggregation
select date_trunc(month, date) as period, hll_estimate(hll_combine(hll_a)) as unique_users
from events_daily
group by all

This is fast and HLL++ is accurate to within 1-2% even for small cardinality! Problem solved.

Post Navigation

«
»