Snowflake array_agg & object_agg Performance
Recently I needed to aggregate some rows in SQL and then check for the membership of an ID in the aggregated row. I first turned to creating a table as select using array_agg() and then querying it with array_contains().
The performance was OK for a month, so I thought it was done, but when I tried to backfill a year I hit a major performance bottleneck. It took over 37 minutes for a dedicated 2XL cluster to finish a CTAS statement that scanned just 36GB of data!
After some panic and head scratching, I re-implemented it with object_agg() and is_null_value(), resulting in an 86x speed-up of the CTAS and a 23x speed-up of SELECTs. Within the CTAS, processing took 2,000 seconds versus 16 seconds on the same 36GB of data!
I’ve included some similar toy SQL and a bunch of the query profile metrics below, but this basically came down to:
select id, array_agg(cid) from foo group by 1;
select id, object_agg(cid, 1) from foo group by 1;
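And on the query side, the membership checks end up looking roughly like this. This is a sketch only: it assumes the aggregated column is aliased as cids and the tables are named foo_array and foo_object, none of which are in the original queries:

-- Array version: look for the ID inside the aggregated array.
select id
from foo_array
where array_contains(123::variant, cids);

-- Object version: the ID is present if its key maps to a non-null value.
select id
from foo_object
where not coalesce(is_null_value(cids['123']), true);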
(Note: I wanted to future-proof the inclusion of new cid values, so I didn’t want to explode them out into structured columns, one per cid. I did not really consider using pivot(), but maybe that would have been good. My array_agg() implementation assumes that cid values don’t repeat, which turned out not to be the case, so array_agg(distinct cid) would have been better. I didn’t guarantee lab-like benchmarking conditions, although things are pretty similar.)
Contemplating The Performance Differences
Snowflake support said the difference was due to “data skew” but I wonder if there’s more to it.
When I checked your query most of the time was spent on data skew, due to this, not all the worker’s nodes not utilized to their full extent. With the change, you made the data is evenly distributed across the nodes and so it is much faster.
— Snowflake Support
They weren’t wrong! A few of the arrays had 100s of elements.
| ARRAY_SIZE(ID_ARRAY) | _ROWS |
| --- | --- |
| 1 | 755,177 |
| 2 | 89,542 |
| 3 | 21,348 |
| 4 | 14,062 |
| 5 | 3,668 |
| 6 | 3,337 |
| 7 | 784 |
| 8 | 1,308 |
| 9 | 198 |
| … | … |
| 42 | 7 |
| 49 | 49 |
| 75 | 75 |
| 177 | 177 |
| 284 | 284 |
| 454 | 454 |
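The distribution above came from a simple group-by over the aggregated table, roughly like this (agg_array is a hypothetical name for the array_agg() CTAS output):

-- How the array sizes are distributed across rows.
select array_size(ID_ARRAY) as array_size, count(*) as _rows
from agg_array
group by 1
order by 1;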
My suspicion is that handling of objects versus arrays also comes into play, with objects being much more efficient. It could also be the time for hashmap lookup versus array contains, but these arrays are usually short.
On objects, Snowflake says:
Frequently common paths are detected, projected out, and stored in separate (typed and compressed) columns in table file
The Snowflake Elastic Data Warehouse
On arrays, Snowflake says:
For better pruning and less storage consumption, we recommend flattening your OBJECT and key data into separate relational columns if your semi-structured data includes: Arrays
Considerations for Semi-structured Data Stored in VARIANT
So we don’t know exactly what’s going on, but we can conclude that in my query:
- For array_agg() we end up with a single column of difficult-to-compress arrays. Plus the partition stats probably don’t really work, since for an array, how do you calculate min/max values, distinct values, sum, histogram, # nulls, dictionary, bloom filters, etc.?
- For object_agg() we (probably) end up with many different columns of easier-to-compress 1s and nulls, which are easy to calculate stats on too.
This is evident in the amount of data we have to scan for otherwise identical SELECT statements, which is about 3x lower for the object_agg() table.
I have come to the obvious conclusion: prefer object_agg() over array_agg() when possible!
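If you want to sanity-check the compression story on your own tables, one option is to compare their reported sizes; a minimal sketch, assuming the two CTAS outputs are named AGG_ARRAY and AGG_OBJECT (hypothetical names):

-- Compare the on-disk footprint of the two tables.
select table_name, row_count, bytes
from information_schema.tables
where table_name in ('AGG_ARRAY', 'AGG_OBJECT');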
CTAS Query Stats
Here are the metrics for the CTAS statements, where it is evident that, while scanning the same amount of bytes, the array_agg() approach was 86x slower than the object_agg() approach.
It was curious to me that so much of the time was spent on the CTAS itself, and not on the processing that took place beforehand in the SELECT steps. Spillage is known to be slow and surely contributed here, but the biggest difference is 2,000 seconds of processing versus just 16 seconds for the same input data!
| Metric | array_agg | object_agg |
| --- | --- | --- |
| Total Execution Time (s) | 2607 | 30 |
| Bytes scanned | 36.40GB | 36.44GB |
| Percentage scanned from cache | 5.16% | 5.16% |
| Bytes written | 2.74GB | 2.58GB |
| Bytes sent over the network | 36.65GB | 33.27GB |
| Partitions scanned | 27798 | 27798 |
| Partitions total | 115627 | 115627 |
| Bytes spilled to local storage | 31.86GB | 6.91GB |
| Processing | 84.80% | 56.00% |
| Local Disk I/O | 0.20% | 1.90% |
| Remote Disk I/O | 0 | 34.80% |
| Network Communication | 0 | 2.10% |
| Synchronization | 2.10% | 3.00% |
| Initialization | 12.90% | 2.30% |
| Scan progress | 100.00% | 100.00% |
| Most Expensive Nodes | CreateTableAsSelect 80.6%, Sort 6.5% | TableScan 41.7%, Aggregate 18.4%, CreateTableAsSelect 14.6% |
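Most of these numbers come straight from the query profile UI, but a subset can also be pulled programmatically; a rough sketch against the ACCOUNT_USAGE query history, with the query IDs left as placeholders:

-- A few of the same profile metrics, pulled for the two CTAS statements.
select query_id,
       total_elapsed_time / 1000 as total_seconds,
       bytes_scanned,
       bytes_spilled_to_local_storage,
       partitions_scanned,
       partitions_total
from snowflake.account_usage.query_history
where query_id in ('<array_agg_ctas_query_id>', '<object_agg_ctas_query_id>');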
SELECT Query Stats
Here are the stats for SELECT queries using array_contains() versus is_null_value(ID_OBJECT.test). It’s 23x faster! It scans less data! And there’s WAY less processing.
What’s also very revealing is the Columns row. As promised, Snowflake does not need to scan the entire ID_OBJECT column; instead it did its thing of extracting common paths ("1337" in this example) into separate columns and scanned just that.
| Metric | array_contains | is_null_value |
| --- | --- | --- |
| Total Execution Time (s) | 153 | 6.6 |
| Bytes scanned | 147.33MB | 47.31MB |
| Percentage scanned from cache | 0.00% | 0.00% |
| Bytes sent over the network | 1.40MB | 2.45MB |
| Partitions scanned | 81 | 70 |
| Partitions total | 298 | 256 |
| Processing | 88.70% | 2.50% |
| Local Disk I/O | 0 | 0.30% |
| Synchronization | 0.40% | 0 |
| Remote Disk I/O | 0 | 56.90% |
| Initialization | 10.90% | 40.30% |
| Scan progress | 100.00% | 100.00% |
| Columns | EVENT_DATE, ID, ID_ARRAY | EVENT_DATE, MERCH_ID, GET_PATH(ID_OBJECT, '["1337"]') (Extracted Variant Path) |
| Most Expensive Nodes | TableScan 89.1% | TableScan 59.7% |
Clustering Info
Here’s the clustering info. There’s a similar number of micro-partitions in each table, but it’s evident that the object_agg() approach has better (fewer) overlaps and better (less) depth. Most notably, the array_agg() table has a few micro-partitions with very high depth, and those don’t help.
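The numbers below come from SYSTEM$CLUSTERING_INFORMATION(); roughly like this, with the table name and the exact clustering expression as approximations of what I used:

-- Clustering info for the array_agg() table (agg_array is a placeholder name).
select system$clustering_information(
  'agg_array',
  '(event_date, array_contains(1125::variant, ID_ARRAY))'
);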
| | array_agg() table | object_agg() table |
| --- | --- | --- |
| cluster_by_keys | LINEAR(event_date, ARRAY_CONTAINS(CAST(1125 AS VARIANT), ID_ARRAY)) | LINEAR(event_date, COALESCE(IS_NULL_VALUE(ID_OBJECT['1125']), TRUE)) |
| total_partition_count | 298 | 256 |
| total_constant_partition_count | 0 | 0 |
| average_overlaps | 3.698 | 1.7031 |
| average_depth | 4.0336 | 2.0234 |
| partition_depth_histogram | | |
| 0 | 0 | 0 |
| 1 | 3 | 4 |
| 2 | 238 | 242 |
| 3 | 10 | 10 |
| 4 | 6 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
| 9 | 8 | 0 |
| 10 | 0 | 0 |
| 11 | 0 | 0 |
| 12 | 0 | 0 |
| 13 | 12 | 0 |
| 14 | 0 | 0 |
| 15 | 0 | 0 |
| 16 | 0 | 0 |
| 32 | 21 | 0 |
The following SQL is an example of what I’m talking about, but with 300e6+ rows.
WITH auctions AS (
SELECT
$1 AS auction_id,
$2 AS participant_id
FROM VALUES
(1, 123),
(1, 456),
(1, 789),
(2, 123),
(2, 789)
), agg AS (
SELECT
auction_id,
ARRAY_AGG(participant_id) AS array,
OBJECT_AGG(participant_id, 1) AS obj
FROM auctions
GROUP BY
1
)
SELECT
auction_id,
ARRAY_CONTAINS(CAST(456 AS VARIANT), array) AS array_contains,
COALESCE(IS_NULL_VALUE(obj['456']), TRUE) = FALSE AS object_key
FROM agg;
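For those five rows the two checks agree: auction 1 contains participant 456 and auction 2 does not, so the output looks like this:

AUCTION_ID | ARRAY_CONTAINS | OBJECT_KEY
1          | TRUE           | TRUE
2          | FALSE          | FALSE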