In April-2021, Google released1 for Preview a BigQuery enhancement supporting Table Sampling2. This feature allows you to process a subset of the data in a BigQuery table while reducing costs associated with scanning and processing the entire table or partition.
In this article, we’ll take a closer look at this feature, with an aim to better understand its cost implications and to help you determine if it will be helpful to your cost-management efforts.
Background - BigQuery Usage Based Billing
When you query a BigQuery dataset, an entire table or partition may be processed. If you are using Pay-Per-Use, you a billed according to the amount of data processed. This cost is reflected in the “Analysis” SKU for the BigQuery product billing.
There are ways to prune the amount of data processed, including best practices such as: only
SELECTing columns you need, and by leveraging sharding, partitioning, and clustering3. This, however, requires the analyst to understand how a table is partitioned and to write their queries accordingly, and may still result in excessive data being processed.
A common misconception is that adding a
LIMIT clause to a BigQuery query will reduce a query’s cost, however, this has no effect on the amount of data processed, and therefore no effect on the cost of the query.
There are ‘Preview’ mechanisms3 made available by BigQuery that are helpful for table exploration, however, their usefulness is hindered by not being usable via SQL query, and not
JOINable within a query.
TABLESAMPLE - An Example
The new Table Sampling feature provides a way to reduce the amount of data processed by a query when partial results are acceptable. It can be used instead of, or in conjunction with, partitioning, to reduce the amount of data evaluated by a query.
For example, this simple query pulls a sample of data from an 18.2TiB table, and will result in a scan of the entire table:
This query will process 18.2TiB. Adding a
LIMIT BY doesn’t change how much data is processed by this query:
Using a partition key, we can reduce the amount of data to 90.7GiB, which is better, but may still be more than want:
TABLESAMPLE clause instead of partitioning reduces the amount of data processed to just 1.8GiB::
Using both the table partition to select a single partition as well as
TABLESAMPLE at the same time allows us to reduce the amount of data processed to only 61.5MiB:
The reduction in data returned will improve query performance as well as cost.
There are two pricing models for BigQuery analysis costs: Fixed costs via Reserved Slots, and Pay-Per-Use.
The distinctions between these two models are covered in detail in another article3, but the primary cost savings using this strategy will be for those under the Pay-per-use model.
Under that model, you are billed $5/TB. The preceding example would result in a $91 reduction in cost.
For those under a Reserved Slot model, the primary advantage of using
TABLESAMPLE will be in reduced slot usage as well as improved query performance, which may, in turn, lead to cost-saving opportunities if you are able to decrease the number of reserved slots you require.
BigQuery’s new Table Sample feature is another tool in your toolbox for managing the cost of BigQuery analysis. It allows you to use SQL syntax to reduce the amount of data processed by a query when you do not need the full table to be scanned and evaluated for your results. This can be helpful for exploratory, development, testing queries, or any query where partial results are acceptable.