
Snowflake array_agg & object_agg Performance


Recently I needed to aggregate some rows in SQL, and then check for the membership of an ID in the aggregated row. I first turned to creating a table as select using array_agg() and then querying it with array_contains(). The performance was OK for a month, so I thought it was done, but when I tried to backfill a year I hit a major performance bottleneck: it took over 37 minutes for a dedicated 2XL cluster to finish a CTAS statement that scanned just 36GB of data!

After some panic and head scratching, I re-implemented it with object_agg() and is_null_value(), resulting in an 86x speed-up of the CTAS and a 23x speed-up of the SELECTs. Within the two CTAS statements, 2,000 seconds versus 16 seconds were spent on processing the same 36GB of data!

I’ve included some similar toy SQL and a bunch of the query profile metrics below, but this basically came down to:

  1. select id,array_agg(cid) from foo group by 1;
  2. select id,object_agg(cid, 1) from foo group by 1;
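
In CTAS form, the two variants look roughly like this (a sketch reusing the toy names foo and cid from the list above; the table names are stand-ins, and depending on your column types OBJECT_AGG may need explicit casts such as cid::VARCHAR and 1::VARIANT):

-- Sketch only: foo, cid and the table names are stand-ins for the real schema.
CREATE TABLE foo_by_array AS
SELECT id, ARRAY_AGG(cid) AS cid_array
FROM foo
GROUP BY 1;

CREATE TABLE foo_by_object AS
SELECT id, OBJECT_AGG(cid, 1) AS cid_object
FROM foo
GROUP BY 1;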

(Note: I wanted to future-proof the inclusion of new cid values, so I didn't want to explode them out into structured columns for each cid. I did not really consider using pivot(), but maybe that would have been good. My array_agg() implementation assumes that cid values don't repeat, which turned out not to be the case, so array_agg(distinct cid) would have been better. I didn't guarantee lab-like benchmarking conditions, although things are pretty similar.)

Contemplating The Performance Differences

Snowflake support said the difference was due to “data skew” but I wonder if there’s more to it.

When I checked your query most of the time was spent on data skew, due to this, not all the worker’s nodes not utilized to their full extent. With the change, you made the data is evenly distributed across the nodes and so it is much faster.

— Snowflake Support

They weren’t wrong! A few of the arrays had 100s of elements.

ARRAY_SIZE(ID_ARRAY)    ROWS
1                       755,177
2                       89,542
3                       21,348
4                       14,062
5                       3,668
6                       3,337
7                       784
8                       1,308
9                       198
42                      7
49                      49
75                      75
177                     177
284                     284
454                     454
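
For reference, a distribution like the one above can be produced with something along these lines (the table name is a stand-in; ID_ARRAY is the aggregated column):

-- Count how many rows have arrays of each size.
SELECT ARRAY_SIZE(id_array) AS array_size, COUNT(*) AS num_rows
FROM ids_by_array
GROUP BY 1
ORDER BY 1;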

My suspicion is that the handling of objects versus arrays also comes into play, with objects being much more efficient. It could also be the cost of a hash-map key lookup versus scanning an array, but these arrays are usually short.

On objects, Snowflake says:

Frequently common paths are detected, projected out, and stored in separate (typed and compressed) columns in table file

The Snowflake Elastic Data Warehouse

On arrays, Snowflake says:

For better pruning and less storage consumption, we recommend flattening your OBJECT and key data into separate relational columns if your semi-structured data includes: Arrays

Considerations for Semi-structured Data Stored in VARIANT

So we don't know exactly what's going on, but we can conclude that in my query the object keys benefit from Snowflake's extraction of common paths into separate typed columns, while the array values do not. This is evident in the amount of data we have to scan for otherwise identical SELECT statements, which is about 3x lower for the object_agg() table.

I have come to the obvious conclusion to prefer object_agg() over array_agg() when possible!

CTAS Query Stats

Here are the metrics for the CTAS statements, where it is evident that while scanning the same amount of bytes, the array_agg() approach was 86x slower than the object_agg() approach.

It was curious to me that so much of the time was spent on the CreateTableAsSelect node, and not on the processing that took place beforehand in the SELECT steps. Spillage is known to be slow and surely contributed here, but the biggest difference is 2,000 seconds of processing versus just 16 seconds for the same input data!

Metric                          array_agg                  object_agg
Total Execution Time (s)        2,607                      30
Bytes scanned                   36.40GB                    36.44GB
Percentage scanned from cache   5.16%                      5.16%
Bytes written                   2.74GB                     2.58GB
Bytes sent over the network     36.65GB                    33.27GB
Partitions scanned              27,798                     27,798
Partitions total                115,627                    115,627
Bytes spilled to local storage  31.86GB                    6.91GB
Processing                      84.80%                     56.00%
Local Disk I/O                  0.20%                      1.90%
Remote Disk I/O                 0%                         34.80%
Network Communication           0%                         2.10%
Synchronization                 2.10%                      3.00%
Initialization                  12.90%                     2.30%
Scan progress                   100.00%                    100.00%
Most Expensive Nodes            CreateTableAsSelect 80.6%  TableScan 41.7%
                                Sort 6.5%                  Aggregate 18.4%
                                                           CreateTableAsSelect 14.6%
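
(If you want to pull comparable metrics for your own queries, something like the following against the ACCOUNT_USAGE.QUERY_HISTORY view works; the query IDs are placeholders and the view lags real time a bit.)

-- Compare the two CTAS runs by query_id (placeholders below).
SELECT
  query_id,
  total_elapsed_time / 1000 AS total_elapsed_s,  -- total_elapsed_time is in milliseconds
  bytes_scanned,
  partitions_scanned,
  partitions_total,
  bytes_spilled_to_local_storage
FROM snowflake.account_usage.query_history
WHERE query_id IN ('<array_agg_ctas_query_id>', '<object_agg_ctas_query_id>');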

SELECT Query Stats

Here are the stats for SELECT queries using array_contains() on ID_ARRAY versus is_null_value() on ID_OBJECT.
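
Roughly, the two query shapes being compared look like this (a sketch with stand-in table names; 1337 is just the ID that shows up in the profile below):

-- Array version: membership check that has to read the whole ID_ARRAY column.
SELECT COUNT(*)
FROM ids_by_array
WHERE ARRAY_CONTAINS(1337::VARIANT, id_array);

-- Object version: a key lookup that Snowflake can serve from an extracted sub-column.
SELECT COUNT(*)
FROM ids_by_object
WHERE COALESCE(IS_NULL_VALUE(id_object['1337']), TRUE) = FALSE;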

It’s 23x faster! It scans less data! And there’s WAY less processing.

What's also very revealing is the Columns row. As promised, Snowflake does not need to scan the entire ID_OBJECT column; instead it did its thing of extracting common paths (1337 in this example) into separate columns and scanning just those.

Metric                          array_contains     is_null_value
Total Execution Time (s)        153                6.6
Bytes scanned                   147.33MB           47.31MB
Percentage scanned from cache   0.00%              0.00%
Bytes sent over the network     1.40MB             2.45MB
Partitions scanned              81                 70
Partitions total                298                256
Processing                      88.70%             2.50%
Local Disk I/O                  0%                 0.30%
Synchronization                 0.40%              0%
Remote Disk I/O                 0%                 56.90%
Initialization                  10.90%             40.30%
Scan progress                   100.00%            100.00%
Columns                         EVENT_DATE         EVENT_DATE
                                ID                 MERCH_ID
                                ID_ARRAY           GET_PATH(ID_OBJECT, '["1337"]') (Extracted Variant Path)
Most Expensive Nodes            TableScan 89.1%    TableScan 59.7%

Clustering Info

Here's the clustering info, where there's a similar number of micro-partitions but it's evident that the object_agg() approach has better (fewer) overlaps and better (less) depth. Most notably, there are a few partitions at very high depth, and those don't help.
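
(The numbers below presumably come from SYSTEM$CLUSTERING_INFORMATION, i.e. something like the following with stand-in table names; it returns the JSON that the table summarizes.)

SELECT SYSTEM$CLUSTERING_INFORMATION('ids_by_array');
SELECT SYSTEM$CLUSTERING_INFORMATION('ids_by_object');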

cluster_by_keys
  array_agg:   LINEAR( event_date, ARRAY_CONTAINS(CAST(1125 AS VARIANT), ID_ARRAY) )
  object_agg:  LINEAR( event_date, coalesce(IS_NULL_VALUE(ID_OBJECT['1125']), TRUE) )

Metric                          array_agg    object_agg
total_partition_count           298          256
total_constant_partition_count  0            0
average_overlaps                3.698        1.7031
average_depth                   4.0336       2.0234
partition_depth_histogram       array_agg    object_agg
  depth 0                       0            0
  depth 1                       3            4
  depth 2                       238          242
  depth 3                       10           10
  depth 4                       6            0
  depth 5                       0            0
  depth 6                       0            0
  depth 7                       0            0
  depth 8                       0            0
  depth 9                       8            0
  depth 10                      0            0
  depth 11                      0            0
  depth 12                      0            0
  depth 13                      12           0
  depth 14                      0            0
  depth 15                      0            0
  depth 16                      0            0
  depth 32                      2            1

The following SQL is a toy example of what I'm talking about; the real data had 300e6+ rows.

WITH auctions AS (
  SELECT
    $1 AS auction_id,
    $2 AS participant_id
  FROM VALUES
    (1, 123),
    (1, 456),
    (1, 789),
    (2, 123),
    (2, 789)
), agg AS (
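  -- Aggregate participant_id per auction both ways: as an ARRAY and as an OBJECT keyed by participant_id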
  SELECT
    auction_id,
    ARRAY_AGG(participant_id) AS array,
    OBJECT_AGG(participant_id, 1) AS obj
  FROM auctions
  GROUP BY
    1
)
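-- Check whether participant 456 appears in each auction's aggregate, both ways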
SELECT
  auction_id,
  ARRAY_CONTAINS(CAST(456 AS VARIANT), array) AS array_contains,
  COALESCE(IS_NULL_VALUE(obj['456']), TRUE) = FALSE AS object_key
FROM agg;
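
For the toy data both checks agree: participant 456 appears only in auction 1, so array_contains and object_key come back TRUE for auction_id 1 and FALSE for auction_id 2.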
