
Snowflake Database Internals


by Nat Taylor <nattaylor@gmail.com>

I am routinely amazed by how fast and easy Snowflake is to use, so I’ve poked and prodded at the internals, and when I have an “a ha” moment, I write it down. I’ve also been a Top 20 answerer on Stack Overflow for questions tagged #snowflake-cloud-data-platform.

This page annotates selected content from the creators’ SIGMOD presentation “The Snowflake Elastic Data Warehouse” with those “a ha” moments, focusing on its performance-related details. Click a citation to see a note.

  1. Table Storage
    1. Snowflake uses PAX [Ailamaki01[#]] aka hybrid columnar storage[#]
    2. Tables horizontally partitioned into immutable[#] micro-partitions (~16 MB)
      1. Updates add or remove entire files
      2. Values of each column grouped together and compressed[#]
      3. Queries read header + columns they need[#] 
  2. Execution Engine
    1. Columnar [MonetDB[#], C-Store, many more]
      1. Effective use of CPU caches[#], SIMD[#] instructions, and compression[#]
    2. Vectorized [Zukowski05[#]]
      1. Operators handle batches of a few thousand rows in columnar format
      2. Avoids materialization of intermediate results
    3. Push-based [Neumann11[#] and many before that]
      1. Operators push results to downstream operators (no Volcano iterators[#])
      2. Removes control logic from tight loops
      3. Works well with DAG-shaped[#] plans
    4. No transaction management, no buffer pool
      1. But: most operators (join, group by, sort) can spill to disk and recurse
  3. Pruning
    1. Database adage: The fastest way to process data? Don’t.
      1. Limiting access only to relevant data is key aspect of query processing
    2. Traditional solution: B+-trees and other indices
      1. Poor fit for us: random accesses, high load time, manual tuning
    3. Snowflake approach: pruning[#]
      1. AKA small materialized aggregates [Moerkotte98[#]], zone maps [Netezza[#]], data skipping [IBM[#]]
      2. Per file min/max values, #distinct values, #nulls, bloom filters etc.
      3. Use metadata to decide which files are relevant for a given query[#]
      4. Smaller than indices, more load-friendly[#], no user input required
  4. Schema-Less Data
    1. Cloudera Impala, Google BigQuery/Dremel
      1. Columnar storage and processing of semi-structured data
      2. But: full schema required up front!
    2. Snowflake introduces automatic type inference and columnar storage for schema-less data (VARIANT[#])
      1. Frequently occurring paths are detected, projected out, and stored in separate (typed and compressed) columns in the table file[#]
      2. Collect metadata on these columns for use by optimizer → pruning
      3. Independent for each micro-partition → schema evolution
  5. Metadata
    1. Metadata stored in a transactional key-value store[#] (not S3)
      1. Which table consists of which S3 objects
      2. Optimizer statistics, lock tables, transaction logs etc.
    2. Bloom filters store the keys of semi-structured data, but not the values (unless the search optimization service is enabled)

Notes

The paper is titled “Weaving Relations for Cache Performance”.

The approach achieves high CPU cache performance by changing the in-page data placement from row to columnar without a penalty.

  1. This is a data page in the database sense, so typically about 4 KB of rows is laid out onto a “page”
  2. This works well, and Snowflake (and the authors of that paper) want to keep that data on that page
  3. But the research proposes a new way to organize (lay out) the data on the page, such that when the CPU asks for the next value, that value (and the next, and the next) is usually already in the CPU cache, which avoids a slow lookup from RAM or disk
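
Here’s a toy sketch (plain Python, not Snowflake’s or the paper’s actual format) of the difference between a row-major page and a PAX page, where each column’s values sit together so a single-column scan stays cache-friendly:

# Toy illustration of row-major vs PAX-style (column-grouped) page layout.
# Names and sizes are made up; real pages also carry headers, offsets, etc.
rows = [(1, "alice", 9.5), (2, "bob", 7.2), (3, "carol", 8.8)]

# Row-major ("NSM") layout: values of different columns interleave, so
# scanning one column strides over unrelated data.
nsm_page = [value for row in rows for value in row]

# PAX layout: the same rows, but each column's values are stored together
# in their own minipage, so a single-column scan reads contiguous values.
pax_page = {
    "id":    [r[0] for r in rows],
    "name":  [r[1] for r in rows],
    "score": [r[2] for r in rows],
}

# A scan of "score" touches only pax_page["score"]; in the NSM layout it
# would have to skip over id and name for every row.
print(nsm_page)
print(pax_page["score"])
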
Parquet implements PAX (Partition Attributes Across), which is columnar storage split (partitioned!) into sets of rows.

This is why updates are expensive: all the metadata has to be recalculated and the whole micro-partition must be rewritten, thanks to columnar storage.

Blob stores like S3 support byte-range reads, so one file can contain many columns and a query reads only the columns it needs. This even works for pseudo-columns within semi-structured data. Hence the tips on avoiding SELECT * FROM FOO, since that requires reading all the columns.

See the Zukowski05 paper below.

See the Ailamaki01 paper above.

Single instruction, multiple data. So… sum vectors of integers in parallel, etc.
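
To illustrate the byte-range reads mentioned above, here is a rough sketch (the bucket, key, and file layout are hypothetical, invented just to show the idea, and it assumes AWS credentials are configured):

import json
import boto3  # assumes AWS credentials are configured

s3 = boto3.client("s3")
BUCKET, KEY = "my-demo-bucket", "table/part-0001.dat"  # hypothetical names

def read_bytes(start, length):
    # Fetch only [start, start+length) of the object via an HTTP range read.
    resp = s3.get_object(
        Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{start + length - 1}"
    )
    return resp["Body"].read()

# Hypothetical layout: a small, null-padded JSON header at the start of the
# file maps each column name to its (offset, length) within the file.
header = json.loads(read_bytes(0, 256).rstrip(b"\x00"))

# Read just the one column the query needs -- no need to download the file.
offset, length = header["columns"]["orderdate"]
orderdate_bytes = read_bytes(offset, length)
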

Snowflake docs suggest they typically compress 50-500 MB down to around 16 MB and that Snowflake “automatically determines the most efficient compression algorithm for the column”.

Apache Parquet describes encodings including:

  • Dictionary Encoding
  • Run Length Encoding / Bit-Packing Hybrid
  • Delta Encoding
  • Delta-length byte array
  • Delta Strings
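
Two of those encodings are easy to sketch in a few lines of Python (the ideas only, not Parquet’s exact bit-level formats):

def run_length_encode(values):
    # Collapse runs of repeated values into (value, run_length) pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def dictionary_encode(values):
    # Replace repeated values with small integer codes plus a dictionary.
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

print(run_length_encode(["US", "US", "US", "CA", "CA"]))  # [['US', 3], ['CA', 2]]
print(dictionary_encode(["US", "US", "CA", "US", "MX"]))  # (['US', 'CA', 'MX'], [0, 0, 1, 0, 2])
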
They implement query execution with vector processing to avoid low instructions-per-cycle. The paper is titled “MonetDB/X100: Hyper-Pipelining Query Execution”. See also Spark Tungsten.
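
A minimal sketch of tuple-at-a-time versus vectorized processing, with NumPy standing in for the SIMD-friendly inner loop (illustrative only):

import numpy as np

# Tuple-at-a-time: one function call (and its interpretation overhead) per row.
def filter_rows(rows, threshold):
    return [r for r in rows if r["amount"] > threshold]

# Vectorized: the operator receives a batch of a few thousand values for one
# column and processes it in a single tight loop (NumPy stands in for SIMD).
def filter_batch(amount_column, threshold):
    # Returns a selection vector (boolean mask) handed to the next operator
    # instead of materializing intermediate rows.
    return amount_column > threshold

batch = np.array([10.0, 3.5, 42.0, 7.7])
mask = filter_batch(batch, 8.0)
print(batch[mask])  # [10. 42.]
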

The paper is “Efficiently Compiling Efficient Query Plans for Modern Hardware”. From the abstract:

“As main memory grows, query performance is more and more determined by the raw CPU costs of query processing itself. The classical iterator style query processing technique is very simple and flexible, but shows poor performance on modern CPUs due to lack of locality and frequent instruction mispredictions. Several techniques like batch oriented processing or vectorized tuple processing have been proposed in the past to improve this situation, but even these techniques are frequently out-performed by hand-written execution plans.
In this work we present a novel compilation strategy that translates a query into compact and efficient machine code using the LLVM compiler framework. By aiming at good code and data locality and predictable branch layout the resulting code frequently rivals the performance of handwritten C++ code. We integrated these techniques into the HyPer main memory database system and show that this results in excellent query performance while requiring only modest compilation time.”

Volcano iterators cause a lot of CPU instruction cache misses.

Directed Acyclic Graph: a graph that flows in one direction, where no element can be a child of itself.
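
Here is a toy contrast between the two models in Python (not Snowflake’s engine): a Volcano-style operator pulls one row at a time through next(), while a push-based operator hands whole batches to its consumer and keeps control logic out of the tight loop.

# Volcano / iterator model: every operator exposes next(), and each row costs
# a chain of calls up and down the plan.
class VolcanoScan:
    def __init__(self, rows):
        self.it = iter(rows)
    def next(self):
        return next(self.it, None)

class VolcanoFilter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def next(self):
        while True:
            row = self.child.next()          # pull one row from upstream
            if row is None:
                return None
            if self.predicate(row):
                return row                   # hand exactly one row downstream

plan = VolcanoFilter(VolcanoScan([1, 8, 3, 9, 2]), lambda x: x > 2)
pulled = []
while (row := plan.next()) is not None:      # one call per row
    pulled.append(row)
print(pulled)  # [8, 3, 9]

# Push model: the producer drives the loop and pushes batches into a consumer
# callback; there is no per-row control flow between operators.
def push_scan(batches, consumer):
    for batch in batches:
        consumer(batch)

def push_filter(predicate, downstream):
    def consume(batch):
        downstream([row for row in batch if predicate(row)])
    return consume

pushed = []
push_scan([[1, 8, 3], [9, 2]], push_filter(lambda x: x > 2, pushed.extend))
print(pushed)  # [8, 3, 9]
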

You can see the “scanned partitions” versus “total partitions” in the Query Profile.

“Snowflake keeps pruning-related metadata for every individual table file. The metadata not only covers plain relational columns, but also a selection of auto-detected columns inside of semi-structured data, see Section 4.3.2. During optimization, the metadata is checked against the query predicates to reduce (“prune”) the set of input files for query execution. The optimizer performs pruning not only for simple base-value predicates, but also for more complex expressions such as WEEKDAY(orderdate) IN (6, 7).
Besides this static pruning, Snowflake also performs dynamic pruning during execution. For example, as part of hash join processing, Snowflake collects statistics on the distribution of join keys in the build-side records. This information is then pushed to the probe side and used to filter and possibly skip entire files on the probe side. This is in addition to other well-known techniques such as bloom joins [40].”
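
A toy sketch of both kinds of pruning, assuming a made-up metadata shape with per-file min/max values:

# Static pruning: compare each file's min/max against the query predicate.
files = [
    {"name": "part-0001", "orderdate_min": "2023-01-01", "orderdate_max": "2023-01-31"},
    {"name": "part-0002", "orderdate_min": "2023-02-01", "orderdate_max": "2023-02-28"},
    {"name": "part-0003", "orderdate_min": "2023-03-01", "orderdate_max": "2023-03-31"},
]

def static_prune(files, lo, hi):
    # Keep only files whose [min, max] range can overlap the predicate range.
    return [f for f in files
            if f["orderdate_max"] >= lo and f["orderdate_min"] <= hi]

# WHERE orderdate BETWEEN '2023-02-10' AND '2023-02-20' touches one file.
print([f["name"] for f in static_prune(files, "2023-02-10", "2023-02-20")])  # ['part-0002']

# Dynamic pruning sketch: the min/max of join keys seen on the build side of
# a hash join is pushed to the probe side and used to skip probe-side files.
build_side_keys = [105, 110, 142]
probe_files = [
    {"name": "probe-01", "key_min": 1,   "key_max": 99},
    {"name": "probe-02", "key_min": 100, "key_max": 150},
]
lo, hi = min(build_side_keys), max(build_side_keys)
print([f["name"] for f in probe_files
       if f["key_max"] >= lo and f["key_min"] <= hi])  # ['probe-02']
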

This paper is “Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing” and tracks per-partition stats like

  • min/max values
  • #distinct values
  • sum
  • histogram
  • #nulls
  • dictionary
  • bloom filters etc.
The paper “Fast Loads and Fast Queries” describes a way to implement small materialized aggregates.
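
Here is a minimal sketch of computing such per-partition aggregates at load time (field names are illustrative, not Snowflake’s actual metadata schema):

def column_metadata(values):
    # Compute small materialized aggregates for one column of a partition.
    non_null = [v for v in values if v is not None]
    return {
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "num_distinct": len(set(non_null)),
        "num_nulls": len(values) - len(non_null),
    }

partition = [5, 9, None, 9, 12, None, 7]
meta = column_metadata(partition)
print(meta)  # {'min': 5, 'max': 12, 'num_distinct': 4, 'num_nulls': 2}

# At query time, a predicate like WHERE x > 100 can skip this whole partition
# because meta["max"] < 100 -- no data pages are ever read.
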

They decide how to partition with a strategy from the paper “Fine-grained Partitioning for Aggressive Data Skipping”.

Using 10-20 filters (features) and 50k-100k rows to determine the best partitioning strategy results in a 2-5x query-time speedup over hash (or similar) partitioning.

The feature-vectors can be re-used for pruning if a query contains one or more of the feature-filters.

“We first extract representative filters in a workload as features using frequent itemset mining.”

NT: They look at query history

“Based on these features, each data tuple can be represented as a feature vector.”

NT: They also group-by here, to weight the vectors and reduce the # of vectors

“We then formulate the blocking problem as an optimization problem on the feature vectors, called Balanced MaxSkip Partitioning.”

NT: Balance is key so that the partitions come out to be the same size

“To find an approximate solution efficiently, we adopt the bottom-up clustering framework.”
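
A rough sketch of the feature-vector idea, with made-up filters and a naive grouping step standing in for frequent itemset mining and Balanced MaxSkip clustering:

from collections import defaultdict

# Each "feature" is a representative filter mined from the query workload.
features = [
    lambda row: row["country"] == "US",
    lambda row: row["amount"] > 100,
]

rows = [
    {"country": "US", "amount": 50},
    {"country": "US", "amount": 250},
    {"country": "DE", "amount": 250},
    {"country": "US", "amount": 60},
]

def feature_vector(row):
    # Represent each row as a bit vector of which features it satisfies.
    return tuple(int(f(row)) for f in features)

# Naive stand-in for the clustering step: co-locate rows with identical
# feature vectors so a filter can skip whole partitions.
partitions = defaultdict(list)
for row in rows:
    partitions[feature_vector(row)].append(row)

for vec, part in partitions.items():
    print(vec, part)
# A query filtering on country = 'US' only scans partitions whose vector has
# the first bit set; the others are skipped entirely.
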

For every partition, you can deem it relevant, not-relevant or ambivalent, and immediately toss out the not-relevant ones.

They still have to do a bunch of statistics at load time and store a bunch of metadata, but it is much less work than a B-tree.

What happens behind the scenes appears to be the creation of virtual columns, for frequent paths, that benefit from all the Snowflake goodness (e.g. great compression and byte-range lookups).

Creating virtual columns for paths within JSON is amazing to me.

You can get an idea of how they might calculate the stats for this here: Listing Distinct Key Names/Common Schema in JSON Data

So you could build a list of the most common paths in a JSON object and then project those out, keeping metadata statistics useful for joining, etc.
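
A toy sketch of detecting the common paths and their types (the 80% threshold and names are invented for illustration):

import json
from collections import Counter

docs = [
    '{"user": {"id": 1, "name": "alice"}, "event": "click"}',
    '{"user": {"id": 2, "name": "bob"}, "event": "view"}',
    '{"user": {"id": 3}, "event": "click", "extra": true}',
]

def paths(obj, prefix=""):
    # Yield (dotted path, value type name) for every leaf in a JSON object.
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from paths(value, path)
        else:
            yield path, type(value).__name__

counts = Counter()
for doc in docs:
    for path, typ in paths(json.loads(doc)):
        counts[(path, typ)] += 1

# Paths present in (say) >= 80% of documents become separate typed columns;
# the rest stay in the generic VARIANT storage.
threshold = 0.8 * len(docs)
virtual_columns = [pt for pt, n in counts.items() if n >= threshold]
print(virtual_columns)  # [('user.id', 'int'), ('event', 'str')]
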

Snowflake announced that they use FoundationDB for this.

You can see some query profiles say “Metadata only”

“To make it easy to add new metadata objects, we built an object-mapping layer on top of key-values. Schema definition, evolution and metadata versioning are done by this layer as well. User-visible objects, such as catalog definitions, users, sessions, access control, copy history and others all have metadata backing them. Every statement executed has a metadata entry, along with statistics of its execution. Transaction state and lock queues are also persisted in FoundationDB. In fact, lock queues are implemented using the watch feature mentioned earlier. A data manipulation statement is enqueued on a resource’s lock queue, and a FoundationDB watch notifies the statement when the statement reached the front of the resource’s queue. There is also user-invisible metadata such as data distribution information, servers and encryption keys.”
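
A minimal, hypothetical sketch of what mapping metadata objects onto a key-value store might look like (keys and layout are invented; the real object-mapping layer on FoundationDB is far more involved):

kv = {}  # stand-in for a transactional key-value store

def register_file(table, file_name, column_stats):
    # "Which table consists of which S3 objects" becomes one key per file,
    # with pruning stats stored in the value.
    kv[f"table/{table}/files/{file_name}"] = {"stats": column_stats}

def list_files(table):
    prefix = f"table/{table}/files/"
    return [k[len(prefix):] for k in kv if k.startswith(prefix)]

register_file("orders", "part-0001", {"orderdate": ("2023-01-01", "2023-01-31")})
register_file("orders", "part-0002", {"orderdate": ("2023-02-01", "2023-02-28")})
print(list_files("orders"))  # ['part-0001', 'part-0002']
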
