Snowflake array_agg & object_agg Performance
Recently I needed to aggregate some rows in SQL and then check for the membership of an ID in the aggregated row. I first turned to creating a table as select using array_agg() and then querying it with array_contains().
The performance was OK for a month, so I thought it was done, but when I tried to backfill a year I hit a major performance bottleneck. It took over 37 minutes for a dedicated 2XL cluster to finish a CTAS statement that scanned just 36GB of data!
After some panic and head scratching, I re-implemented it with object_agg() and is_null_value(), resulting in an 86x speed-up of the CTAS and a 23x speed-up of SELECTs. Within the CTAS, processing took 2,000 seconds versus 16 seconds on the same 36GB of data!
I’ve included some similar toy SQL and a bunch of the query profile metrics below, but this basically came down to:
select id, array_agg(cid) from foo group by 1;
select id, object_agg(cid, 1) from foo group by 1;
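And on the query side, the membership checks end up looking roughly like this. This is a sketch only: it assumes the aggregated column is aliased as cids and the tables are named foo_array and foo_object, none of which are in the original queries:

-- Array version: look for the ID inside the aggregated array.
select id
from foo_array
where array_contains(123::variant, cids);

-- Object version: the ID is present if its key maps to a non-null value.
select id
from foo_object
where not coalesce(is_null_value(cids['123']), true);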
(Note: I wanted to future-proof the inclusion of new cid values, so I didn’t want to explode them out into structured columns, one per cid. I did not really consider using pivot(), but maybe that would have been good. My array_agg() implementation assumes that cid values don’t repeat, which turned out not to be the case, so array_agg(distinct cid) would have been better. I didn’t guarantee lab-like benchmarking conditions, although things are pretty similar.)
Contemplating The Performance Differences
Snowflake support said the difference was due to “data skew” but I wonder if there’s more to it.
When I checked your query most of the time was spent on data skew, due to this, not all the worker’s nodes not utilized to their full extent. With the change, you made the data is evenly distributed across the nodes and so it is much faster.
— Snowflake Support
They weren’t wrong! A few of the arrays had 100s of elements.
| ARRAY_SIZE(ID_ARRAY) | _ROWS |
| --- | --- |
| 1 | 755,177 |
| 2 | 89,542 |
| 3 | 21,348 |
| 4 | 14,062 |
| 5 | 3,668 |
| 6 | 3,337 |
| 7 | 784 |
| 8 | 1,308 |
| 9 | 198 |
| … | … |
| 42 | 7 |
| 49 | 49 |
| 75 | 75 |
| 177 | 177 |
| 284 | 284 |
| 454 | 454 |
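The distribution above came from a simple group-by over the aggregated table, roughly like this (agg_array is a hypothetical name for the array_agg() CTAS output):

-- How the array sizes are distributed across rows.
select array_size(ID_ARRAY) as array_size, count(*) as _rows
from agg_array
group by 1
order by 1;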
My suspicion is that handling of objects versus arrays also comes into play, with objects being much more efficient. It could also be the time for hashmap lookup versus array contains, but these arrays are usually short.
On objects, Snowflake says:
Frequently common paths are detected, projected out, and stored in separate (typed and compressed) columns in table file
The Snowflake Elastic Data Warehouse
On arrays, Snowflake says:
For better pruning and less storage consumption, we recommend flattening your OBJECT and key data into separate relational columns if your semi-structured data includes: Arrays
Considerations for Semi-structured Data Stored in VARIANT
So we don’t know exactly what’s going on, but we can conclude that in my query:
- For array_agg() we end up with a single column of difficult-to-compress arrays. Plus the partition stats probably don’t really work, since for an array, how do you calculate min/max values, distinct values, sum, histogram, # nulls, dictionary, bloom filters, etc.?
- For object_agg() we (probably) end up with many different columns of easier-to-compress 1s and nulls, which are easy to calculate stats on too.
This is evident in the amount of data we have to scan for otherwise identical SELECT statements, which is about 3x lower for the object_agg() table.
I have come to the obvious conclusion: prefer object_agg() over array_agg() when possible!
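If you want to sanity-check the compression story on your own tables, one option is to compare their reported sizes; a minimal sketch, assuming the two CTAS outputs are named AGG_ARRAY and AGG_OBJECT (hypothetical names):

-- Compare the on-disk footprint of the two tables.
select table_name, row_count, bytes
from information_schema.tables
where table_name in ('AGG_ARRAY', 'AGG_OBJECT');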
CTAS Query Stats
Here are the metrics for the CTAS statements, where it is evident that, while scanning the same amount of bytes, the array_agg() approach was 86x slower than the object_agg() approach.
It was curious to me that so much of the time was spent on the CTAS itself, and not on the processing that took place beforehand in the SELECT steps. Spillage is known to be slow and surely contributed here, but the biggest difference is 2,000 seconds of processing versus just 16 seconds for the same input data!
| Metric | array_agg | object_agg |
| --- | --- | --- |
| Total Execution Time (s) | 2607 | 30 |
| Bytes scanned | 36.40GB | 36.44GB |
| Percentage scanned from cache | 5.16% | 5.16% |
| Bytes written | 2.74GB | 2.58GB |
| Bytes sent over the network | 36.65GB | 33.27GB |
| Partitions scanned | 27798 | 27798 |
| Partitions total | 115627 | 115627 |
| Bytes spilled to local storage | 31.86GB | 6.91GB |
| Processing | 84.80% | 56.00% |
| Local Disk I/O | 0.20% | 1.90% |
| Remote Disk I/O | 0 | 34.80% |
| Network Communication | 0 | 2.10% |
| Synchronization | 2.10% | 3.00% |
| Initialization | 12.90% | 2.30% |
| Scan progress | 100.00% | 100.00% |
| Most Expensive Nodes | CreateTableAsSelect 80.6%, Sort 6.5% | TableScan 41.7%, Aggregate 18.4%, CreateTableAsSelect 14.6% |
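Most of these numbers come straight from the query profile UI, but a subset can also be pulled programmatically; a rough sketch against the ACCOUNT_USAGE query history, with the query IDs left as placeholders:

-- A few of the same profile metrics, pulled for the two CTAS statements.
select query_id,
       total_elapsed_time / 1000 as total_seconds,
       bytes_scanned,
       bytes_spilled_to_local_storage,
       partitions_scanned,
       partitions_total
from snowflake.account_usage.query_history
where query_id in ('<array_agg_ctas_query_id>', '<object_agg_ctas_query_id>');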
SELECT Query Stats
Here are the stats for SELECT queries using array_contains() versus is_null_value(ID_OBJECT.test). It’s 23x faster! It scans less data! And there’s WAY less processing.
What’s also very revealing is the Columns row. As promised, Snowflake does not need to scan the entire ID_OBJECT column; instead it did its thing of extracting common paths ("1337" in this example) into separate columns and scanned just that.
| Metric | array_contains | is_null_value |
| --- | --- | --- |
| Total Execution Time (s) | 153 | 6.6 |
| Bytes scanned | 147.33MB | 47.31MB |
| Percentage scanned from cache | 0.00% | 0.00% |
| Bytes sent over the network | 1.40MB | 2.45MB |
| Partitions scanned | 81 | 70 |
| Partitions total | 298 | 256 |
| Processing | 88.70% | 2.50% |
| Local Disk I/O | 0 | 0.30% |
| Synchronization | 0.40% | 0 |
| Remote Disk I/O | 0 | 56.90% |
| Initialization | 10.90% | 40.30% |
| Scan progress | 100.00% | 100.00% |
| Columns | EVENT_DATE, ID, ID_ARRAY | EVENT_DATE, MERCH_ID, GET_PATH(ID_OBJECT, '["1337"]') (Extracted Variant Path) |
| Most Expensive Nodes | TableScan 89.1% | TableScan 59.7% |
Clustering Info
Here’s the clustering info. There’s a similar number of micro-partitions in each table, but it’s evident that the object_agg() approach has better (fewer) overlaps and better (less) depth. Most notably, the array_agg() table has a few micro-partitions with very high depth, and those don’t help.
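The numbers below come from SYSTEM$CLUSTERING_INFORMATION(); roughly like this, with the table name and the exact clustering expression as approximations of what I used:

-- Clustering info for the array_agg() table (agg_array is a placeholder name).
select system$clustering_information(
  'agg_array',
  '(event_date, array_contains(1125::variant, ID_ARRAY))'
);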
| | array_agg() table | object_agg() table |
| --- | --- | --- |
| cluster_by_keys | LINEAR(event_date, ARRAY_CONTAINS(CAST(1125 AS VARIANT), ID_ARRAY)) | LINEAR(event_date, COALESCE(IS_NULL_VALUE(ID_OBJECT['1125']), TRUE)) |
| total_partition_count | 298 | 256 |
| total_constant_partition_count | 0 | 0 |
| average_overlaps | 3.698 | 1.7031 |
| average_depth | 4.0336 | 2.0234 |
| partition_depth_histogram | | |
| 0 | 0 | 0 |
| 1 | 3 | 4 |
| 2 | 238 | 242 |
| 3 | 10 | 10 |
| 4 | 6 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
| 9 | 8 | 0 |
| 10 | 0 | 0 |
| 11 | 0 | 0 |
| 12 | 0 | 0 |
| 13 | 12 | 0 |
| 14 | 0 | 0 |
| 15 | 0 | 0 |
| 16 | 0 | 0 |
| 32 | 21 | 0 |
The following SQL is an example of what I’m talking about, but with 300e6+ rows.
WITH auctions AS (
SELECT
$1 AS auction_id,
$2 AS participant_id
FROM VALUES
(1, 123),
(1, 456),
(1, 789),
(2, 123),
(2, 789)
), agg AS (
SELECT
auction_id,
ARRAY_AGG(participant_id) AS array,
OBJECT_AGG(participant_id, 1) AS obj
FROM auctions
GROUP BY
1
)
SELECT
auction_id,
ARRAY_CONTAINS(CAST(456 AS VARIANT), array) AS array_contains,
COALESCE(IS_NULL_VALUE(obj['456']), TRUE) = FALSE AS object_key
FROM agg;
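For those five rows the two checks agree: auction 1 contains participant 456 and auction 2 does not, so the output looks like this:

AUCTION_ID | ARRAY_CONTAINS | OBJECT_KEY
1          | TRUE           | TRUE
2          | FALSE          | FALSE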