Release Review - BigQuery Table Sampling

Thu Apr 01, 2021 in optimize, bigquery, release review

In April-2021, Google released¹ for Preview a BigQuery enhancement supporting Table Sampling². This feature allows you to process a subset of the data in a BigQuery table while reducing costs associated with scanning and processing the entire table or partition.

In this article, we’ll take a closer look at this feature, with an aim to better understand its cost implications and to help you determine if it will be helpful to your cost-management efforts.

Feature Overview

Background - BigQuery Usage Based Billing

When you query a BigQuery dataset, an entire table or partition may be processed. If you are using Pay-Per-Use, you a billed according to the amount of data processed. This cost is reflected in the “Analysis” SKU for the BigQuery product billing.

There are ways to prune the amount of data processed, including best practices such as: only SELECTing columns you need, and by leveraging sharding, partitioning, and clustering³. This, however, requires the analyst to understand how a table is partitioned and to write their queries accordingly, and may still result in excessive data being processed.

A common misconception is that adding a LIMIT clause to a BigQuery query will reduce a query’s cost, however, this has no effect on the amount of data processed, and therefore no effect on the cost of the query.

There are ‘Preview’ mechanisms³ made available by BigQuery that are helpful for table exploration, however, their usefulness is hindered by not being usable via SQL query, and not JOINable within a query.

Using `TABLESAMPLE` - An Example

The new Table Sampling feature provides a way to reduce the amount of data processed by a query when partial results are acceptable. It can be used instead of, or in conjunction with, partitioning, to reduce the amount of data evaluated by a query.

For example, this simple query pulls a sample of data from an 18.2TiB table, and will result in a scan of the entire table:

BigQuery - Select

This query will process 18.2TiB. Adding a LIMIT BY doesn’t change how much data is processed by this query:

BigQuery - Select with LIMIT

Using a partition key, we can reduce the amount of data to 90.7GiB, which is better, but may still be more than want:

BigQuery - Select with Partition

Using a TABLESAMPLE clause instead of partitioning reduces the amount of data processed to just 1.8GiB::

BigQuery - Select with TABLESAMPLE

Using both the table partition to select a single partition as well as TABLESAMPLE at the same time allows us to reduce the amount of data processed to only 61.5MiB:

BigQuery - Select with TABLESAMPLE and Partition

The reduction in data returned will improve query performance as well as cost.

Cost Implications

There are two pricing models for BigQuery analysis costs: Fixed costs via Reserved Slots, and Pay-Per-Use.

The distinctions between these two models are covered in detail in another article³, but the primary cost savings using this strategy will be for those under the Pay-per-use model.

Under that model, you are billed $5/TB. The preceding example would result in a $91 reduction in cost.

For those under a Reserved Slot model, the primary advantage of using TABLESAMPLE will be in reduced slot usage as well as improved query performance, which may, in turn, lead to cost-saving opportunities if you are able to decrease the number of reserved slots you require.

Summary

BigQuery’s new Table Sample feature is another tool in your toolbox for managing the cost of BigQuery analysis. It allows you to use SQL syntax to reduce the amount of data processed by a query when you do not need the full table to be scanned and evaluated for your results. This can be helpful for exploratory, development, testing queries, or any query where partial results are acceptable.

Feature Overview

Background - BigQuery Usage Based Billing

Using TABLESAMPLE - An Example

Cost Implications

Summary

Using `TABLESAMPLE` - An Example