Notebooks as Controllers

Last updated on 2025-10-30

Overview

Questions

  • How do you set up and use Vertex AI Workbench notebooks for machine learning tasks?
  • How can you manage compute resources efficiently using a “controller” notebook approach in GCP?

Objectives

  • Describe how to use Vertex AI Workbench notebooks for ML workflows.
  • Set up a Jupyter-based Workbench instance as a controller to manage compute tasks.
  • Use the Vertex AI SDK to launch training and tuning jobs on scalable instances.

Setting up our notebook environment


Vertex AI Workbench provides JupyterLab-based environments that can be used to orchestrate machine learning workflows. In this workshop, we will use a Workbench Instance, which is the recommended option going forward because the older Workbench environments (managed and user-managed notebooks) are being deprecated.

Workbench Instances come with JupyterLab 3 pre-installed and are configured with GPU-enabled ML frameworks (TensorFlow, PyTorch, etc.), making it easy to start experimenting without additional setup. Learn more in the Workbench Instances documentation.

Using the notebook as a controller


The notebook instance functions as a controller to manage more resource-intensive tasks. By selecting a modest machine type (e.g., n1-standard-4), you can perform lightweight operations locally in the notebook while using the Vertex AI Python SDK to launch compute-heavy jobs on larger machines (e.g., GPU-accelerated) when needed.

This approach minimizes costs while giving you access to scalable infrastructure for demanding tasks like model training, batch prediction, and hyperparameter tuning.

We will follow these steps to create our first Workbench Instance:

1. Navigate to Workbench

  • In the Google Cloud Console, search for “Workbench.”
  • Click the “Instances” tab (this is the supported path going forward).
  • Pin Workbench to your navigation bar for quick access.

2. Create a new Workbench Instance

Initial settings

  • Click Create New near the top of the Workbench page
  • Name: For this workshop, we can use the following naming convention to easily locate our notebooks: teamname-yourname-purpose (e.g., sinkorswim-johndoe-train)
  • Region: Choose the same region as your storage bucket (e.g., us-central1). This avoids cross-region transfer charges and keeps data access latency low.
    • If you are unsure, check your bucket’s location in the Cloud Storage console (click the bucket name → look under “Location”).
  • Zone: us-central1-a (or another zone in us-central1, like -b or -c)
    • If capacity or GPU availability is limited in one zone, switch to another zone in the same region.
  • NVIDIA T4 GPU: Leave unchecked for now
    • We will request GPUs for training jobs separately. Attaching here increases idle costs.
  • Apache Spark and BigQuery Kernels: Leave unchecked
    • Enable only if you specifically need Spark or BigQuery notebooks; otherwise, it adds unnecessary images.
  • Network in this project: Required selection
    • This option must be selected; shared environments do not allow using external or default networks.
    • This ensures your instance connects to the shared VPC for the workshop.
  • Network / Subnetwork: Leave as pre-filled.
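
The region advice above can be checked with a few lines of Python. This is a minimal sketch, assuming you have read the bucket's location from the Cloud Storage console, where regional locations are shown in uppercase (e.g., US-CENTRAL1). The helper names `region_of_zone` and `bucket_matches_zone` are hypothetical and not part of any Google library.

```python
def region_of_zone(zone: str) -> str:
    """Return the region part of a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    region, _, suffix = zone.rpartition("-")
    if not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def bucket_matches_zone(bucket_location: str, zone: str) -> bool:
    """True when a bucket's location is the same region as the given zone.

    The comparison ignores case because bucket locations are displayed
    in uppercase while zone names are lowercase.
    """
    return bucket_location.strip().lower() == region_of_zone(zone).lower()


# Illustrative values only; substitute your own bucket location and zone.
print(bucket_matches_zone("US-CENTRAL1", "us-central1-a"))
print(bucket_matches_zone("US", "us-central1-a"))
```

A multi-region bucket location such as US does not match a single region here, which is a deliberate simplification: the sketch only answers whether the bucket and the instance are in exactly the same region.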

Advanced settings: Details (tagging)

  • IMPORTANT: Open the “Advanced options” menu next.
    • Labels (required for cost tracking): Under the Details menu, add the following labels (all lowercase) so that you can track the total cost of your activity on GCP later:
      • project = teamname (your team’s name)
      • name = firstname-lastname (your own name)
      • purpose = train (the notebook’s overall purpose: train, tune, rag, etc.)
Required tags for notebook.
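
Because labels must be lowercase, it can help to normalize them before typing them into the console. The following is a minimal sketch, assuming a simplified version of Google Cloud's label rules (lowercase ASCII letters, digits, underscores, and hyphens; at most 63 characters; keys start with a letter). The function name `make_labels` is hypothetical.

```python
import re

# Simplified label rules (an assumption of this sketch, ASCII only).
_KEY_RE = re.compile(r"[a-z][a-z0-9_-]{0,62}")
_VALUE_RE = re.compile(r"[a-z0-9_-]{0,63}")


def make_labels(team: str, full_name: str, purpose: str) -> dict:
    """Build the three workshop labels, lowercased and hyphenated."""
    raw = {"project": team, "name": full_name, "purpose": purpose}
    labels = {}
    for key, value in raw.items():
        # Lowercase and turn spaces into hyphens, e.g. "John Doe" -> "john-doe".
        value = value.strip().lower().replace(" ", "-")
        if not _KEY_RE.fullmatch(key) or not _VALUE_RE.fullmatch(value):
            raise ValueError(f"invalid label {key}={value!r}")
        labels[key] = value
    return labels


# Illustrative values only.
print(make_labels("SinkOrSwim", "John Doe", "train"))
```

Characters outside the allowed set, such as a period in a name, raise a `ValueError` rather than being silently rewritten, so you notice the problem before the label reaches the cost report.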

Advanced Settings: Environment

While we won’t modify environment settings during this workshop, it’s useful to understand what these options control when creating or editing a Workbench Instance in Vertex AI Workbench.

All Workbench environments use JupyterLab 3 by default, with the latest NVIDIA GPU drivers, CUDA libraries, and Intel optimizations preinstalled. You can optionally select JupyterLab 4 (currently in preview) or provide a custom container image to run your own environment (for example, a Docker image containing specialized ML frameworks or dependencies). If needed, you can also specify a post-startup script stored in Cloud Storage (gs://path/to/script.sh) to automatically configure the instance (install packages, mount buckets, etc.) when it boots.

See: Vertex AI Workbench release notes for supported versions and base images.

Advanced settings: Machine Type

  • Machine type: Select a small machine (e.g., n2-standard-2) to act as the controller.
    • This keeps costs low while you delegate heavy lifting to training jobs.
    • For guidance on common machine types for ML, refer to Instances for ML on GCP.
  • Set idle shutdown: To save on costs when you aren’t doing anything in your notebook, lower the default idle shutdown time to 60 (minutes).
Enable Idle Shutdown

Advanced Settings: Disks

Each Vertex AI Workbench instance uses Persistent Disks (PDs) to store your system files and data. You’ll configure two disks when creating a notebook: a boot disk and a data disk. We’ll leave these at their default settings, but it’s useful to understand the settings for future work.

Boot Disk

Keep this fixed at 100 GB (Balanced PD) — the default minimum.
It holds the OS, JupyterLab, and ML libraries but not your datasets.
Estimated cost: about $10 / month (~$0.014 / hr).
You rarely need to resize this, though you can increase to 150–200 GB if you plan to install large environments, custom CUDA builds, or multiple frameworks.

Data Disk

This is where your datasets, checkpoints, and outputs live.
Use a Balanced PD by default, or an SSD PD only for high-I/O workloads.
A good rule of thumb is to allocate ≈ 2× your dataset size, with a minimum of 150 GB and a maximum of 1 TB.
For example:
- 20 GB dataset → 150 GB data disk (minimum)
- 100 GB dataset → 200 GB data disk
- Larger datasets → keep the full dataset in Cloud Storage (gs://) and copy only subsets locally.

Persistent Disks can be increased in size at any time without downtime (they cannot be shrunk), so it’s best to start small and expand when needed.
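
The data disk rule of thumb can be written as a small function. This is a sketch of the sizing rule described above, not an official formula. The function name `recommend_data_disk_gb` is hypothetical, and treating 1 TB as 1000 GB is an assumption.

```python
import math


def recommend_data_disk_gb(dataset_gb: float,
                           minimum_gb: int = 150,
                           maximum_gb: int = 1000) -> int:
    """Suggest a data disk size: about 2x the dataset, clamped to [min, max].

    maximum_gb=1000 assumes 1 TB is counted as 1000 GB.
    """
    if dataset_gb < 0:
        raise ValueError("dataset_gb must be non-negative")
    doubled = math.ceil(2 * dataset_gb)
    return min(max(doubled, minimum_gb), maximum_gb)


for size in (20, 100, 800):
    print(size, "GB dataset ->", recommend_data_disk_gb(size), "GB data disk")
```

When the result hits the maximum, that is the signal to keep the full dataset in Cloud Storage and copy only subsets to the disk.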

Deletion behavior

The ‘Delete to trash’ option is unchecked by default, which is what you want.
When left unchecked, deleted files are removed immediately, freeing up disk space right away.
If you check this box, files will move to the system trash instead — meaning they still take up space (and cost) until you empty it.

Keep this unchecked to avoid paying for deleted files that remain in the trash.

Cost awareness

Persistent Disks are fast but cost more than Cloud Storage.
Typical rates:
- Balanced PD: ~$0.10–$0.12 / GB / month
- SSD PD: ~$0.17–$0.20 / GB / month
- Cloud Storage (Standard): ~$0.026 / GB / month

Rule of thumb: use PDs only for active work; store everything else in Cloud Storage.
Example: storing a 200 GB dataset costs ~$24/month on a Balanced PD (at $0.12 / GB) but only ~$5/month in Cloud Storage.
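
The arithmetic behind this comparison is easy to reproduce. The sketch below uses the upper end of each approximate rate listed above. These are not live prices, so check the pricing pages before relying on the numbers. The names `RATES_PER_GB_MONTH` and `monthly_cost` are hypothetical.

```python
# Approximate USD per GB per month, taken from the upper end of the
# ranges quoted in this section (assumed values, not current prices).
RATES_PER_GB_MONTH = {
    "balanced_pd": 0.12,
    "ssd_pd": 0.20,
    "cloud_storage_standard": 0.026,
}


def monthly_cost(size_gb: float, storage_type: str) -> float:
    """Estimated monthly storage cost in USD, rounded to cents."""
    return round(size_gb * RATES_PER_GB_MONTH[storage_type], 2)


dataset_gb = 200
for storage_type in RATES_PER_GB_MONTH:
    print(storage_type, monthly_cost(dataset_gb, storage_type))
```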

Check the latest pricing here:
- Persistent Disk & Image pricing
- Cloud Storage pricing

Advanced settings: Networking - External IP Access

  • Assign External IP address: Leave this option checked — you need an external IP.

Create notebook

  • Click Create to create the instance. Your notebook instance will start in a few minutes. When its status is “Active,” you can open JupyterLab and begin working.

Managing training and tuning with the controller notebook

In the following episodes, we will use the Vertex AI Python SDK (google-cloud-aiplatform) from this notebook to submit compute-heavy tasks on more powerful machines. Examples include:

  • Training a model on a GPU-backed instance.
  • Running hyperparameter tuning jobs managed by Vertex AI.

This pattern keeps costs low by running your notebook on a modest VM while only incurring charges for larger resources when they are actively in use.
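
As a preview of the controller pattern, the sketch below separates describing a job from submitting it. It assumes the `google-cloud-aiplatform` package is installed in the notebook. The script name, container image URI, project ID, and bucket are hypothetical placeholders, and the machine and accelerator choices are illustrative rather than the settings used in later episodes. The helper names are also hypothetical.

```python
def build_training_job_config(display_name: str,
                              script_path: str,
                              container_uri: str,
                              machine_type: str = "n1-standard-8",
                              accelerator_type: str = "NVIDIA_TESLA_T4",
                              accelerator_count: int = 1) -> dict:
    """Collect the arguments for a Vertex AI custom training job."""
    return {
        # Arguments for aiplatform.CustomTrainingJob(...)
        "job": {
            "display_name": display_name,
            "script_path": script_path,
            "container_uri": container_uri,
        },
        # Arguments for CustomTrainingJob.run(...): the larger machine
        # that exists (and bills) only while the job is running.
        "run": {
            "machine_type": machine_type,
            "accelerator_type": accelerator_type,
            "accelerator_count": accelerator_count,
            "replica_count": 1,
        },
    }


def submit_training_job(config: dict, project: str, location: str,
                        staging_bucket: str):
    """Submit the job from the controller notebook and wait for it."""
    # Imported here so the config helper works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location,
                    staging_bucket=staging_bucket)
    job = aiplatform.CustomTrainingJob(**config["job"])
    job.run(**config["run"])
    return job


# Placeholder values (hypothetical); nothing is submitted by this line.
config = build_training_job_config(
    display_name="sinkorswim-johndoe-train",
    script_path="train.py",
    container_uri="REPLACE_WITH_TRAINING_CONTAINER_URI",
)
print(config["run"])
```

Calling `submit_training_job(config, project="your-project-id", location="us-central1", staging_bucket="gs://your-bucket")` would start the job. The notebook VM stays small throughout, because the GPU is attached to the training job rather than to the notebook.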

Challenge

Challenge: Notebook Roles

Your university provides different compute options: laptops, on-prem HPC, and GCP.

  • What role does a Workbench Instance notebook play compared to an HPC login node or a laptop-based JupyterLab?
  • Which tasks should stay in the notebook (lightweight control, visualization) versus being launched to larger cloud resources?

The notebook serves as a lightweight control plane.
- Like an HPC login node, it is not meant for heavy computation.
- Suitable for small preprocessing, visualization, and orchestrating jobs.
- Resource-intensive tasks (training, tuning, batch jobs) should be submitted to scalable cloud resources (GPU/large VM instances) via the Vertex AI SDK.

Load pre-filled Jupyter notebooks

Once your newly created instance shows as Active (green checkmark), click Open JupyterLab. From there, we can create as many Jupyter notebooks as we would like within the instance environment.

We will then select the standard python3 environment to start our first .ipynb notebook (Jupyter notebook). We can use this environment since we aren’t doing any training/tuning just yet.

Within the Jupyter notebook, run the following command to clone the lesson repo into our Jupyter environment:

SH

!git clone https://github.com/qualiaMachine/Intro_GCP_for_ML.git

Then, navigate to /Intro_GCP_for_ML/notebooks/04-Accessing-and-managing-data.ipynb to begin the first notebook.
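
Rerunning the clone cell fails once the folder exists, because git refuses to clone into a non-empty directory. A minimal sketch of a rerunnable alternative is shown below. The function name `clone_if_missing` is hypothetical, and the sketch assumes `git` is available on the instance.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/qualiaMachine/Intro_GCP_for_ML.git"


def clone_if_missing(repo_url: str, parent_dir: str = "."):
    """Clone repo_url into parent_dir unless the folder already exists.

    Returns (path, cloned), where cloned is False if the clone was skipped.
    """
    name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[:-4]
    target = Path(parent_dir) / name
    if target.exists():
        return target, False
    subprocess.run(["git", "clone", repo_url, str(target)], check=True)
    return target, True
```

In a notebook cell, `clone_if_missing(REPO_URL)` can be run repeatedly; it only contacts GitHub the first time.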

Key Points
  • Use a small Workbench Instance notebook as a controller to manage larger, resource-intensive tasks.
  • Always navigate to the “Instances” tab in Workbench, since older notebook types are deprecated.
  • Choose the same region for your Workbench Instance and storage bucket to avoid extra transfer costs.
  • Submit training and tuning jobs to scalable instances using the Vertex AI SDK.
  • Labels help track costs effectively, especially in shared or multi-project environments.
  • Workbench Instances come with JupyterLab 3 and GPU frameworks preinstalled, making them an easy entry point for ML workflows.
  • Enable idle shutdown to avoid unexpected charges when notebooks are left running.