Snowflake Bloom Filters
Published on . Updated on
In 2019 I posted Snowflake Database Internals which contained many insights, but had one note that glossed over a very important detail. I wrote:
“Per file min/max values, #distinct values, #nulls, bloom filters etc.”
Last week I learned that “per file […] bloom filters” is only partially true from a post that said:
“The Search Optimization Service builds a set of Bloom Filters to track the partitions where the data isn’t. Using a patented Bloom filter solution, Snowflake automatically prunes (skips) micro partitions which means the fewer matching rows returned, the more extreme the performance gain.”
“Snowflake Search Optimization Service Best Practices” By John Ryan
So I reviewed the Snowflake paper and noticed I missed a very important phrase in bold below.
“Snowflake […] computes Bloom filters over all paths (not values!) present in the documents.”
The Snowflake Elastic Data Warehouse
The situation is now clear to me:
- By default, Snowflake micro-partitions maintain a bloom filter of the semi-structured paths contained within. So for example, if you inserted a bunch of documents like
{"foo": 456}
then a Bloom filter would be created with a entry forfoo
in the micro-partition metadata. Then if you queried formycol.bar = 123
, Snowflake can check for the membership ofbar
in the Bloom filter prior to scanning entire partition. - With SOS, Snowflake maintains additional metadata (a “pruning index”) about the values of a column, that allow the query engine to skip partitions that definitely don’t contain a certain value. The pruning index is a set of blocked bloom filters (a faster, CPU-cache-friendly, space efficient alternative to the standard bloom filter). There’s a bunch of research described in Patent US11803551B2 Pruning index generation and enhancement about how they choose the parameters for the bloom filters to trade off size, CPU cost, and accuracy. So when you insert
{"foo": 456}
then the blocked bloom filter is updated. When you query formycol.foo=123
snowflake can check whether 123 is definitely not in the set of distinct values for the column, and skip scanning the partition.
The patent is pretty dense for me, but if the blocked bloom filter erroneously says that 123 is present, then it results in scanning extra data. But accuracy comes at the cost of more disk space (any of longer bit-length of the bloom filter, larger n
or lower density) and/or more CPU use (e.g. more hash functuons).
There is no simple answer. For example, when the partitions are cached locally, then its not as costly to scan extra partitions. When updates are frequent, the CPU time is crucial. The optimal parameters may be different for a new partition versus an existing one. Everything also depends on the cardinality of the column.
Here is an except from the patent that I was able to mostly understand:
To this end, blocks within the pruning index are organized in a hierarchy that encodes the level of decomposition of the domain of values. As an example of the foregoing, FIG. 6 illustrates a single bloom filter 600 of a pruning index. In the example illustrated in FIG. 6, bloom filter is 2048 bytes and can represent 64 distinct values with a false positive rate of 1/1,000,0000. If the corresponding micro-partition of the source table contains more than 64 distinct values, the false positive rate would degrade as soon as the density of the bloom filter is larger than ½ (e.g., more bits are set than bits are unset). To address this issue, the compute service manage can, in some embodiments, build two bloom filters, with one bloom filter for each half of the domain.
Each of the bloom filters will be represented by two rows in the pruning index, identified by their level and slice number. Consistent with some embodiments, a particular value and its corresponding hash value maps to a single one of the blocks across all micro-partitions of the source table. Regardless of the level, a bit encodes a fixed subset of the domain.
In some embodiments, the number of hash functions to compute per bloom filter can be varied to improve performance. This optimization can reduce the CPU cost of building the pruning index while maintaining a target false positive rate for extremely large tables. Accordingly, in some embodiments, a user may specify a target false positive rate and the compute service manager may determine the number of hash functions to compute per bloom filter as well as the level based on the target false positive rate.
Pruning index generation and enhancement