Release Review - Dataproc Stop/Start-able Clusters

Wed Mar 24, 2021 in optimize, dataproc, release review

In March 2021, Google released¹ to General Availability a Dataproc enhancement supporting Stop/Start-able Clusters². This feature lets you stop an inactive Dataproc cluster and start it again when it is needed.

In this article, we’ll take a closer look at this feature, with an aim to better understand its cost implications and to help you determine if it will be helpful to your cost-reduction efforts.

Feature Overview

When you create a Dataproc cluster, you pay for the GCE instances that back the cluster for its entire lifetime.

This new feature provides the ability to stop the GCE instances while the cluster is not needed for jobs, and then to start the cluster again when you need it.

e.g.:

$ gcloud dataproc clusters stop cluster-name \
    --region=region

$ gcloud dataproc clusters start cluster-name \
    --region=region

This feature is useful if you are using Dataproc with local HDFS storage that takes a substantial time to populate. In that situation you probably keep long-running (24/7) Dataproc clusters, because deleting the cluster would discard the data held in HDFS.

Some Dataproc use-cases are ephemeral. Clusters are only created for the duration of a job or set of jobs, and then the cluster is deleted.

For ephemeral use-cases, Stop/Start is only a beneficial alternative if avoiding the cluster provisioning process is important to you.

Cost Implications

By stopping the GCE instances, you are no longer billed for CPU or Memory. You are, however, still billed for the persistent disks for these instances, and potentially other resources such as public IP addresses.
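To make the trade-off concrete, here is a rough sketch that compares a month of always-on compute against a month where the cluster is stopped while idle. Every number in it (hourly VM rate, disk rate, node count, disk size, running hours) is a made-up assumption for illustration, not a published price, and the estimate_monthly_cost helper is hypothetical. It deliberately ignores other charges such as public IP addresses and any Dataproc service fee.

```shell
#!/bin/sh
# Illustrative figures only: replace these assumptions with your own rates.
VM_HOURLY=0.20         # assumed CPU + memory cost per VM-hour (USD)
DISK_GB_MONTHLY=0.04   # assumed persistent disk cost per GB-month (USD)
NODES=10               # assumed number of VMs in the cluster
DISK_GB_PER_NODE=500   # assumed persistent disk size per VM (GB)

# Hypothetical helper: $1 is the number of hours the VMs run in the month.
# Compute is billed only for running hours; disks are billed all month.
estimate_monthly_cost() {
    awk -v h="$1" -v vm="$VM_HOURLY" -v disk="$DISK_GB_MONTHLY" \
        -v n="$NODES" -v gb="$DISK_GB_PER_NODE" \
        'BEGIN { printf "%.2f\n", n * vm * h + n * gb * disk }'
}

ALWAYS_ON=$(estimate_monthly_cost 730)   # running for a full ~730-hour month
WITH_STOP=$(estimate_monthly_cost 200)   # running 200 hours, stopped otherwise

echo "always on:         $ALWAYS_ON USD/month"
echo "stopped when idle: $WITH_STOP USD/month"
```

With these assumed inputs the disk charge is the same in both cases, so the saving comes entirely from the compute hours avoided. The larger the disks are relative to the VMs, the smaller the relative saving from stopping the cluster.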

Additional Considerations

A standard practice we recommend for ephemeral Dataproc use-cases is to use the Cluster Scheduled Deletion --max-idle flag to automatically de-provision the cluster after a period of inactivity.
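As a minimal sketch of that practice, the wrapper below creates a cluster that deletes itself after a period of inactivity. The function name create_ephemeral_cluster and the GCLOUD override variable are hypothetical conveniences added here for illustration; only the gcloud dataproc clusters create command with its --region and --max-idle flags comes from the gcloud CLI.

```shell
# Hypothetical wrapper. Set GCLOUD=echo to print the command instead of running it.
create_ephemeral_cluster() {
    # $1 = cluster name, $2 = region, $3 = idle duration such as 30m or 2h
    "${GCLOUD:-gcloud}" dataproc clusters create "$1" \
        --region="$2" \
        --max-idle="$3"
}
```

For example, create_ephemeral_cluster cluster-name region 30m would request a cluster that is deleted once it has been idle for 30 minutes.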

We also recommend making use of Cluster Autoscaling and Preemptible Secondary Workers, where your Dataproc use allows.

Summary

Dataproc Stop/Start functionality can help you reduce costs for long-running Dataproc clusters that have a significant amount of data seeded in their local HDFS filesystem.