In March 2021, Google announced General Availability of a Dataproc enhancement that supports Stop/Start clusters. This feature allows you to stop an inactive Dataproc cluster and re-start it later when it is needed again.
In this article, we’ll take a closer look at this feature, with an aim to better understand its cost implications and to help you determine if it will be helpful to your cost-reduction efforts.
When you create a Dataproc cluster, you are paying for the GCE instances to support that cluster for the duration of that cluster’s lifetime.
This new feature provides the ability to stop the GCE instances while the cluster is not needed for jobs, and then to start the cluster again when you need it.
$ gcloud dataproc clusters stop cluster-name \
    --region=region

$ gcloud dataproc clusters start cluster-name \
    --region=region
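After issuing a stop or start, you can confirm the transition by polling the cluster's state. Here, cluster-name and region are placeholders for your own values:

```shell
# Print the current state of the cluster, e.g. RUNNING, STOPPING,
# STOPPED, or STARTING. Substitute your own cluster name and region.
$ gcloud dataproc clusters describe cluster-name \
    --region=region \
    --format="value(status.state)"
```

A cluster must reach the STOPPED state before it can be started again.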
This feature is useful if you are using Dataproc with local HDFS storage that takes a substantial time to populate. You probably have long-running (24/7) Dataproc clusters in this situation.
Some Dataproc use-cases are ephemeral. Clusters are only created for the duration of a job or set of jobs, and then the cluster is deleted.
For ephemeral use-cases, Stop/Start is only a beneficial alternative to deletion if eliminating the cluster provisioning time is important to you.
By stopping the GCE instances, you are no longer billed for CPU or Memory. You are, however, still billed for the persistent disks for these instances, and potentially other resources such as public IP addresses.
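As a rough illustration of this billing difference, the sketch below compares the hourly cost of one node while running versus while stopped. All of the rates are made-up placeholders, not actual GCP pricing:

```python
# Rough cost sketch for a running vs. stopped Dataproc node.
# All hourly rates below are hypothetical placeholders, NOT real GCP pricing.
VCPU_RATE = 0.03     # $/vCPU-hour (hypothetical)
MEM_RATE = 0.004     # $/GB-hour of memory (hypothetical)
DISK_RATE = 0.00005  # $/GB-hour of persistent disk (hypothetical)

def hourly_cost(vcpus: int, mem_gb: int, disk_gb: int, stopped: bool) -> float:
    """Return the hypothetical hourly cost of one cluster node."""
    disk = disk_gb * DISK_RATE
    if stopped:
        # A stopped instance is no longer billed for CPU or memory,
        # but its persistent disks continue to accrue charges.
        return disk
    return vcpus * VCPU_RATE + mem_gb * MEM_RATE + disk

# Example: a 4-vCPU, 16 GB node with a 500 GB persistent disk.
running = hourly_cost(4, 16, 500, stopped=False)
stopped = hourly_cost(4, 16, 500, stopped=True)
print(f"running: ${running:.4f}/h, stopped: ${stopped:.4f}/h")
```

With these placeholder rates, the stopped node costs only the disk portion of the running rate, which is why stopping can be worthwhile for clusters with long idle windows but non-trivial local storage.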
A standard practice we recommend for ephemeral Dataproc use-cases is to use the Cluster Scheduled Deletion --max-idle flag to automatically de-provision the cluster after a period of inactivity.
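For example, a cluster can be created so that it deletes itself after 30 minutes of inactivity (cluster-name and region are placeholders for your own values):

```shell
# Create a cluster that is automatically deleted after 30 minutes
# with no submitted jobs. Substitute your own cluster name and region.
$ gcloud dataproc clusters create cluster-name \
    --region=region \
    --max-idle=30m
```

This keeps ephemeral clusters from accruing costs when a job finishes and no follow-up work arrives.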
Dataproc Stop/Start functionality can help you reduce costs for long-running Dataproc clusters where a significant amount of data is seeded in the local HDFS filesystem.