Content from Overview of Google Cloud for Machine Learning and AI


Last updated on 2026-03-04

Overview

Questions

  • Why would I run ML/AI experiments in the cloud instead of on my laptop or an HPC cluster?
  • What does GCP offer for ML/AI, and how is it organized?
  • What is the “notebook as controller” pattern?

Objectives

  • Identify when cloud compute makes sense for ML/AI work.
  • Describe what GCP and Vertex AI provide for ML/AI researchers.
  • Explain the notebook-as-controller pattern used throughout this workshop.

Why run ML/AI in the cloud?


You have ML/AI code that works on your laptop. But at some point you need more — a bigger GPU (or multiple GPUs), a dataset that won’t fit on disk, or the ability to run dozens of training experiments overnight. You could invest in local hardware or compete for time on a shared HPC cluster, but cloud platforms let you rent exactly the hardware you need, for exactly as long as you need it, and then shut it down.

Cloud vs. university HPC clusters

Most universities offer shared HPC clusters with GPUs. These are excellent resources — but they have tradeoffs worth understanding:

| Factor | University HPC | Cloud (GCP) |
|---|---|---|
| Cost | Free or subsidized | Pay per hour |
| GPU availability | Shared queue; wait times during peak periods and per-job runtime limits (often 24–72 hrs) that may require checkpointing long training runs | On-demand (subject to quota); jobs run as long as needed |
| Hardware variety | Fixed hardware refresh cycle (3–5 years) | Latest GPUs available immediately (A100, H100, L4) |
| Scaling | Limited by cluster size | Spin up hundreds of jobs in parallel |
| Multi-GPU / NVLink | Sometimes available, depends on cluster | Available on demand (e.g., A2/A3 instances with NVLink-connected multi-GPU nodes) — essential for training, fine-tuning, or serving large LLMs that don’t fit in a single GPU’s memory |
| Job orchestration | Writing scheduler scripts, packaging environments, and wiring up parallel job arrays can take days of refactoring | A few SDK calls: define a job, set hardware, call .run() — parallelism (e.g., tuning trials) is built in |
| Software environment | Module system; some clusters support Apptainer/Singularity containers — research computing staff can often help with setup | Vertex AI provides prebuilt containers for common ML frameworks (PyTorch, XGBoost, TensorFlow); add extra packages via a requirements list, or bring your own Docker image for full control |
| Power & cooling | Paid for by the university; campus data centers often spend nearly as much energy on cooling as on the computers themselves | Google’s data centers are roughly twice as energy-efficient as a typical campus facility — and power, cooling, and hardware failures are their problem, not yours |

The short version: use your university cluster when it has the hardware you need and the queue isn’t blocking you. Use the cloud when you need hardware your cluster doesn’t have, need to scale beyond what the queue allows, or need a specific software environment you can’t easily get on campus.

Many researchers use both — develop and test on HPC, then scale to cloud for large experiments or specialized hardware. This workshop teaches the cloud side of that workflow.

When does model size justify cloud compute?

Not every model needs cloud hardware. Here’s a rough guide:

| Model scale | Parameters | Example models | Where to run |
|---|---|---|---|
| Small | < 10M | Logistic regression, small CNNs, XGBoost | Laptop or HPC — cloud adds overhead without much benefit |
| Medium | 10M–500M | ResNets, BERT-base, mid-sized transformers | HPC with a single GPU (RTX 2080 Ti, L40) or cloud (T4, L4) |
| Large | 500M–10B | GPT-2, LLaMA-7B, fine-tuning large transformers | HPC with A100 (40/80 GB) or cloud — both work well |
| Very large | 10B–70B | LLaMA-70B, Mixtral | HPC with H100/H200 (80–141 GB) or cloud multi-GPU nodes |
| Frontier | 70B+ | GPT-4-scale, multi-expert models | Cloud — requires multi-node clusters beyond what most HPC queues offer |

CHTC’s GPU Lab covers more than you might think. The GPU Lab includes A100s (40 and 80 GB), H100s (80 GB), and H200s (141 GB) — enough VRAM to run inference or fine-tune models up to ~70B parameters on a single GPU with quantization. For many UW researchers, this hardware handles “large model” workloads without needing cloud. Jobs have time limits (12 hrs for short, 24 hrs for medium, 7 days for long jobs), so plan your training runs accordingly.

Cloud becomes the clear choice when you need interconnected multi-GPU nodes (NVLink) for large distributed training, hardware beyond what the GPU Lab queue offers, or when queue wait times are blocking a deadline.

A note on cloud costs

Cloud computing is not free, but it’s worth putting costs in context:

  • Hardware is expensive and ages fast. A single A100 GPU costs ~ $15,000 and is outdated within a few years. Cloud lets you rent the latest hardware by the hour.
  • You pay only for what you use. Stop a VM and the meter stops — valuable for bursty research workloads.
  • Managed services save development time. You don’t have to build DAGs, write scheduling logic, package custom containers, or maintain orchestration infrastructure — GCP handles that plumbing so you can focus on the ML.
  • Budgets and alerts keep you safe. GCP billing dashboards and budget alerts help prevent surprise bills. We cover cleanup in Episode 9.

The key habit: choose the right machine size, stop resources when idle, and monitor spending. We’ll reinforce this throughout.

Callout

For UW-Madison researchers

UW-Madison offers reduced-overhead cloud billing, NIH STRIDES discounts, Google Cloud research credits (up to $5,000), free on-campus GPUs via CHTC, and dedicated support from the Public Cloud Team. See the UW-Madison Cloud Resources page for details.

Google Cloud Platform (GCP) is one of several cloud platforms that support this kind of on-demand research computing. The rest of this episode explains what GCP offers for ML/AI and how the pieces fit together.

What GCP provides for ML/AI


GCP gives you three things that matter for applied ML/AI research:

Flexible compute. You pick the hardware that fits your workload:

  • CPUs for lightweight models, preprocessing, or feature engineering.
  • GPUs (NVIDIA T4, L4, V100, A100, H100) for training deep learning models. For help choosing, see Compute for ML.
  • TPUs (Tensor Processing Units) — Google’s custom hardware for matrix-heavy workloads. TPUs work best with TensorFlow and JAX; PyTorch support is improving but still less mature.

Scalable storage. Google Cloud Storage (GCS) buckets give you a place to store datasets, scripts, and model artifacts that any job or notebook can access. Think of it as a shared filesystem for your project.

Managed ML/AI services. Vertex AI is Google’s ML/AI platform. It wraps compute, storage, and tooling into a set of services designed for ML/AI workflows — managed notebooks, training jobs, hyperparameter tuning, model hosting, and access to foundation models like Gemini.

How the pieces fit together: Vertex AI


Google Cloud has many products and brand names. Here are the ones you’ll use in this workshop and how they relate:

| Term | What it is |
|---|---|
| GCP | Google Cloud Platform — the overall cloud: compute, storage, networking. |
| Vertex AI | Google’s ML platform — notebooks, training jobs, tuning, model hosting. Everything below lives under this umbrella. |
| Workbench | Managed Jupyter notebooks that run on a Compute Engine VM. Your interactive environment. |
| Training & tuning jobs | How you run code on Vertex AI hardware. You submit a script and a machine spec; Vertex AI provisions the VM, runs it, and shuts it down. The SDK offers several flavors — CustomTrainingJob (Ep 4–5), HyperparameterTuningJob (Ep 6) — and the CLI equivalent is gcloud ai custom-jobs (Ep 8). |
| Cloud Storage (GCS) | Object storage for files. Similar to AWS S3. |
| Compute Engine | Virtual machines you configure with CPUs, GPUs, or TPUs. Workbench and training jobs run on Compute Engine under the hood. |
| Gemini | Google’s family of large language models, accessed through the Vertex AI API. |

For a full list of terms, see the Glossary.

The notebook-as-controller pattern


The central idea of this workshop is simple: you work in a lightweight Vertex AI Workbench notebook — a small, cheap VM — and use the Vertex AI Python SDK to dispatch work to managed services. The notebook itself does not run heavy compute. Instead, it orchestrates:

  • Training jobs (Eps 4–5) — run your script on auto-provisioned GPU hardware, then shut down when complete.
  • Hyperparameter tuning jobs (Ep 6) — search a parameter space across parallel trials and return the best configuration.
  • Cloud Storage (Ep 3) — shared persistent storage for datasets, model artifacts, logs, and results.
  • Gemini API (Ep 7) — embeddings and generation for Retrieval-Augmented Generation (RAG) pipelines.

All of these are accessed via SDK calls from the notebook. This keeps costs low (the notebook VM stays small) and keeps your work reproducible (each job is a clean, logged run on dedicated hardware).

Architecture diagram showing a Workbench notebook at the center orchestrating four managed services via SDK calls: Training Jobs (Eps 4-5), HP Tuning Jobs (Ep 6), Cloud Storage (Ep 3), and Gemini API (Ep 7).
Notebook as controller — overview of workshop architecture
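The “dispatch via SDK calls” step ultimately comes down to handing Vertex AI a machine specification. The sketch below builds a plain dictionary in the shape of Vertex AI’s worker_pool_specs; the machine type, GPU name, and container image URI are illustrative placeholders, and the actual submission (done through the google-cloud-aiplatform SDK in Episodes 4–5) is omitted:

```python
def make_worker_pool_spec(machine_type,
                          accelerator_type=None,
                          accelerator_count=0,
                          image_uri="gcr.io/my-project/trainer:latest"):
    """Build a worker pool spec in the shape Vertex AI custom jobs expect.

    Illustrative sketch only — image_uri is a hypothetical placeholder.
    """
    machine_spec = {"machine_type": machine_type}
    if accelerator_type:  # GPUs are optional; CPU-only jobs omit these keys
        machine_spec["accelerator_type"] = accelerator_type
        machine_spec["accelerator_count"] = accelerator_count
    return [{
        "machine_spec": machine_spec,
        "replica_count": 1,  # single-node training
        "container_spec": {"image_uri": image_uri},
    }]

# A GPU-backed spec for a training job vs. a cheap CPU-only spec
gpu_spec = make_worker_pool_spec("n1-standard-8", "NVIDIA_TESLA_T4", 1)
cpu_spec = make_worker_pool_spec("n2-standard-2")
print(gpu_spec[0]["machine_spec"])
```

The controller notebook itself stays on a small machine; only the spec it submits names the expensive hardware.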
Callout

Console, notebooks, or CLI — your choice

This workshop uses the GCP web console and Workbench notebooks for most tasks because they’re visual and easy to follow for beginners. But nearly everything we do can also be done from the gcloud command-line tool — submitting training jobs, managing buckets, checking quotas. Episode 8 covers the CLI equivalents. If you prefer terminal-based workflows or need to automate jobs in scripts and CI/CD pipelines, that episode shows you how.

One important caveat: whether you use the console, notebooks, or CLI, resources you create (VMs, training jobs, endpoints) keep running and billing until you explicitly stop them. There’s no automatic shutdown. We cover cleanup habits in Episode 9, but the short version is: always check for running resources before you walk away.

Discussion

Your current setup

Think about how you currently run ML experiments:

  • What hardware do you use — laptop, HPC cluster, cloud?
  • What’s the biggest infrastructure pain point in your workflow (GPU access, environment setup, data transfer, cost)?
  • What would you most like to offload to a managed service?

Take 3–5 minutes to discuss with a partner or share in the workshop chat.

Key Points
  • Cloud platforms let you rent hardware on demand instead of buying or waiting for shared resources.
  • GCP organizes its ML/AI services under Vertex AI — notebooks, training jobs, tuning, and model hosting.
  • The notebook-as-controller pattern keeps your notebook cheap while offloading heavy training to dedicated Vertex AI jobs.
  • Everything in this workshop can also be done from the gcloud CLI (Episode 8).

Content from Notebooks as Controllers


Last updated on 2026-03-05

Overview

Questions

  • How do you set up and use Vertex AI Workbench notebooks for machine learning tasks?
  • How can you manage compute resources efficiently using a “controller” notebook approach in GCP?

Objectives

  • Describe how Vertex AI Workbench notebooks fit into ML/AI workflows on GCP.
  • Set up a Jupyter-based Workbench Instance as a lightweight controller to manage compute tasks.
  • Configure a Workbench Instance with appropriate machine type, labels, and idle shutdown for cost-efficient orchestration.

Setting up our notebook environment


Vertex AI Workbench provides JupyterLab-based environments that can be used to orchestrate ML/AI workflows. In this workshop, we will use a Workbench Instance—the recommended option going forward, as other Workbench environments are being deprecated.

Workbench Instances come with JupyterLab 3 pre-installed and are configured with GPU-enabled ML frameworks (TensorFlow, PyTorch, etc.), making it easy to start experimenting without additional setup. Learn more in the Workbench Instances documentation.

Using the notebook as a controller


The notebook instance functions as a controller to manage more resource-intensive tasks. By selecting a modest machine type (e.g., n2-standard-2), you can perform lightweight operations locally in the notebook while using the Vertex AI Python SDK to launch compute-heavy jobs on larger machines (e.g., GPU-accelerated) when needed.

This approach minimizes costs while giving you access to scalable infrastructure for demanding tasks like model training, batch prediction, and hyperparameter tuning.

One practical advantage of Workbench notebooks: authentication is automatic. A Workbench VM inherits the permissions of its attached service account, so calls to Cloud Storage, Vertex AI, and the Gemini API work with no extra credential setup — no API keys or login commands needed. If you later run the same code from your laptop or an HPC cluster, you’ll need to set up credentials separately (see the GCP authentication docs). (Prefer working from a terminal? Episode 8: CLI Workflows covers how to do everything in this workshop using gcloud commands instead of notebooks.)

We will follow these steps to create our first Workbench Instance:

1. Navigate to Workbench

  • Open the Google Cloud Console (console.cloud.google.com) — this is the web dashboard where you manage all GCP resources. Search for “Workbench.”
  • Click the “Instances” tab (this is the supported path going forward).

2. Create a new Workbench Instance

Initial settings

  • Click Create New near the top of the Workbench page
  • Name: Use the convention lastname-purpose (e.g., doe-workshop). GCP resource names only allow lowercase letters, numbers, and hyphens. We’ll use a single instance for training, tuning, RAG, and more, so workshop is a good general-purpose label.
  • Region: Select us-central1. When we create a storage bucket in Episode 3, we’ll use the same region — keeping compute and storage co-located avoids cross-region transfer charges and keeps data access fast.
  • Zone: us-central1-a (or another zone in us-central1, like -b or -c)
    • If capacity or GPU availability is limited in one zone, switch to another zone in the same region.
  • NVIDIA T4 GPU: Leave unchecked for now
    • We will request GPUs for training jobs separately. Attaching here increases idle costs.
  • Apache Spark and BigQuery Kernels: Leave unchecked
    • BigQuery kernels let you run SQL analytics directly in a notebook, but we won’t need them in this workshop. Leave unchecked to avoid pulling extra container images.
  • Network in this project: If you’re working in a shared workshop environment, select the network provided by your administrator (shared environments typically do not allow using external or default networks). If using a personal GCP project, the default network is fine.
  • Network / Subnetwork: Leave as pre-filled.
Notebook settings (part 1)
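If you want to sanity-check a name before clicking Create, the naming rule above (lowercase letters, digits, and hyphens, typically starting with a letter) translates to a simple regex. This is a convenience sketch, not an official validator:

```python
import re

# Lowercase letters, digits, and hyphens; starts with a letter and does not
# end with a hyphen (the typical GCP instance-name rule).
NAME_RE = re.compile(r"^[a-z]([a-z0-9-]*[a-z0-9])?$")

def valid_instance_name(name):
    return bool(NAME_RE.match(name))

print(valid_instance_name("doe-workshop"))   # True
print(valid_instance_name("Doe_Workshop"))   # False
```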

Advanced settings: Details (tagging)

  • IMPORTANT: Open the “Advanced options” menu next.
    • Labels (required for cost tracking): Under the Details menu, add the following labels (all lowercase) so that you can track the total cost of your activity on GCP later:
      • name = firstname-lastname
      • purpose = workshop
Screenshot showing required tags for notebook
Required tags for notebook.

Advanced Settings: Environment

Leave environment settings at their defaults for this workshop. Workbench uses JupyterLab 3 by default with NVIDIA GPU drivers, CUDA, and common ML frameworks preinstalled. For future reference, you can optionally select JupyterLab 4, provide a custom Docker image, or specify a post-startup script (gs://path/to/script.sh) to auto-configure the instance at boot.

Advanced settings: Machine Type

  • Machine type: Select a small machine (e.g., n2-standard-2, ~ $0.07/hr) to act as the controller.
  • Set idle shutdown: To save on costs when you aren’t doing anything in your notebook, lower the default idle shutdown time to 60 (minutes).
Set Idle Shutdown
Enable Idle Shutdown

Advanced Settings: Disks

Leave disk settings at their defaults for this workshop. Each Workbench Instance has two disks: a boot disk (100 GB — holds the OS and libraries) and a data disk (150 GB default — holds your datasets and outputs). Both use Balanced Persistent Disks. Keep “Delete to trash” unchecked so deleted files free space immediately.

Rule of thumb: allocate ≈ 2× your dataset size for the data disk, and keep bulk data in Cloud Storage (gs://) rather than on local disk — PDs cost ~ $0.10/GB/month vs. ~ $0.02/GB/month for Cloud Storage.

Callout

Disk sizing and cost details

  • Boot disk: Rarely needs resizing. Increase to 150–200 GB only for large custom environments or multiple frameworks.
  • Data disk: Use SSD PD only for high-I/O workloads. Disks can be resized anytime without downtime, so start small and expand when needed.
  • Cost comparison: At the per-GB rates above, a 200 GB dataset costs ~ $20/month on a PD but only ~ $4/month in Cloud Storage.
  • Pricing: Persistent Disk pricing · Cloud Storage pricing

Advanced settings: Networking - External IP Access

  • Assign External IP address: Leave this option checked — you need an external IP.

Create notebook

  • Click Create to create the instance. Provisioning typically takes 3–5 minutes. You’ll see the status change from “Provisioning” to “Active” with a green checkmark. While waiting, work through the challenges below.
Challenge

Challenge 1: Notebook Roles

Your university provides different compute options: laptops, on-prem HPC, and GCP.

  • What role does a Workbench Instance notebook play compared to an HPC login node or a laptop-based JupyterLab?
  • Which tasks should stay in the notebook (lightweight control, visualization) versus being launched to larger cloud resources?

The notebook serves as a lightweight control plane.

  • Like an HPC login node, it is not meant for heavy computation.
  • Suitable for small preprocessing, visualization, and orchestrating jobs.
  • Resource-intensive tasks (training, tuning, batch jobs) should be submitted to scalable cloud resources (GPU/large VM instances) via the Vertex AI SDK.
Challenge

Challenge 2: Controller Cost Estimate

Your controller notebook uses an n2-standard-2 instance (~ $0.07/hr — see Compute for ML for other common machine types and costs).

  • Estimate the monthly cost if you use it 8 hours/day, 5 days/week, with idle shutdown enabled.
  • Compare that to leaving it running 24/7 for the same month.
  • With idle shutdown: 8 hrs × 5 days × 4 weeks = 160 hrs → 160 × $0.07 ≈ $11.20/month
  • Running 24/7: 24 hrs × 30 days = 720 hrs → 720 × $0.07 ≈ $50.40/month
  • Idle shutdown saves you ~ $39/month on a single small controller instance. The savings are even larger for bigger machine types.
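The same estimates can be reproduced with a couple of lines of Python, assuming the ~ $0.07/hr rate quoted above:

```python
RATE = 0.07  # ~$/hr for n2-standard-2, as quoted above

def monthly_cost(hours_per_day, days):
    """Rough monthly controller cost at the assumed hourly rate."""
    return round(hours_per_day * days * RATE, 2)

print(monthly_cost(8, 20))   # 11.2  (8 hrs/day, 5 days/wk, 4 wks)
print(monthly_cost(24, 30))  # 50.4  (running 24/7)
```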

Managing your instance

You don’t have to wait for idle shutdown — you can manually stop your instance anytime from the Workbench Instances list by selecting the checkbox and clicking Stop. To resume work, click Start. You only pay for compute while the instance is running (disk charges continue while stopped).

To permanently remove an instance, select it and click Delete. Full cleanup is covered in Episode 9.

Managing training and tuning with the controller notebook

In the following episodes, we will use the Vertex AI Python SDK (google-cloud-aiplatform) from this notebook to submit compute-heavy tasks on more powerful machines. Examples include:

  • Training a model on a GPU-backed instance.
  • Running hyperparameter tuning jobs managed by Vertex AI.

Here’s how the notebook, jobs, and storage connect:

Architecture diagram showing how a lightweight Workbench notebook uses the Vertex AI SDK to launch training jobs and HP tuning jobs on powerful GPUs, with all artifacts stored in GCS.
Training and tuning workflow

This pattern keeps costs low by running your notebook on a modest VM while only incurring charges for larger resources when they are actively in use.

Callout

You don’t need a notebook to use Vertex AI

We start with Vertex AI Workbench notebooks because they give you authenticated access to buckets, training jobs, and other GCP services out of the box — no credential setup required. The Console UI also lets you see and manage running jobs directly, which matters when you’re learning: accidentally submitting a duplicate training job is easy to spot and cancel in the Console, harder to notice from a terminal.

Episode 8 introduces the gcloud CLI once these concepts are familiar. Notebooks are not required for any of the workflows covered here — everything we do through the Python SDK can also be done from:

  • A plain Python script run from your terminal or an HPC scheduler.
  • The gcloud CLI (e.g., gcloud ai custom-jobs create ...).
  • A CI/CD pipeline (GitHub Actions, Cloud Build, etc.).

The real work happens in the training scripts and SDK calls — the notebook is just a convenient starting point.

Callout

Troubleshooting

  • VM stuck in “Provisioning” for more than 5 minutes? Try deleting the instance and re-creating it in a different zone within the same region (e.g., us-central1-b instead of us-central1-a).
  • Instance stopped unexpectedly? Check the idle shutdown setting — it may have timed out. Restart from the Instances list by clicking Start.
  • Can’t see the project or get permission errors? Ensure you’re signed into the correct Google account and that IAM permissions have propagated (this can take a few minutes after initial setup).

Load pre-filled Jupyter notebooks

Once your instance shows as “Active” (green checkmark), click Open JupyterLab. From the Launcher, select Python 3 (ipykernel) under Notebook to create a new notebook — we don’t need the TensorFlow or PyTorch kernels yet, as those are used in later episodes for training jobs.

Run the following command to clone the lesson repository. This contains pre-filled notebooks for each episode and the training scripts we’ll use later, so you won’t need to write boilerplate code from scratch.

SH

!git clone https://github.com/qualiaMachine/Intro_GCP_for_ML.git

Then, navigate to /Intro_GCP_for_ML/notebooks/03-Data-storage-and-access.ipynb to begin the next episode.

Key Points
  • Use a small Workbench Instance as a controller — delegate heavy training to Vertex AI jobs.
  • Workbench VMs inherit service account permissions automatically, simplifying authentication.
  • Choose the same region for your Workbench Instance and storage bucket to avoid extra transfer costs.
  • Apply labels to all resources for cost tracking, and enable idle auto-stop to avoid surprise charges.

Content from Data Storage and Access


Last updated on 2026-03-05

Overview

Questions

  • How can I store and manage data effectively in GCP for Vertex AI workflows?
  • What are the advantages of Google Cloud Storage (GCS) compared to local or VM storage for machine learning projects?
  • How can I load data from GCS into a Vertex AI Workbench notebook?

Objectives

  • Explain data storage options in GCP for machine learning projects.
  • Set up a GCS bucket and upload data.
  • Read data directly from a GCS bucket into memory in a Vertex AI notebook.
  • Monitor storage usage and estimate costs.
  • Upload new files from the Vertex AI environment back to the GCS bucket.

ML/AI projects rely on data, making efficient storage and management essential. Google Cloud offers several storage options, but the most common for ML/AI workflows are Virtual Machine (VM) disks and Google Cloud Storage (GCS) buckets.

Consult your institution’s IT before handling sensitive data in GCP

As with AWS, do not upload restricted or sensitive data to GCP services unless explicitly approved by your institution’s IT or cloud security team. For regulated datasets (HIPAA, FERPA, proprietary), work with your institution to ensure encryption, restricted access, and compliance with policies.

Options for storage: VM Disks or GCS


What is a VM disk?

A VM disk is the storage volume attached to a Compute Engine VM or a Vertex AI Workbench notebook. It can store datasets and intermediate results, but it is tied to the lifecycle of the VM.

When to store data directly on a VM disk

  • Useful for small, temporary datasets processed interactively.
  • Data persists if the VM is stopped, but storage costs continue as long as the disk exists.
  • Not ideal for collaboration, scaling, or long-term dataset storage.
Callout

Limitations of VM disk storage

  • Scalability: Limited by disk size quota.
  • Sharing: Harder to share across projects or team members.
  • Cost: More expensive per GB compared to GCS for long-term storage.

What is a GCS bucket?

For most ML/AI workflows in GCP, Google Cloud Storage (GCS) buckets are recommended. A GCS bucket is a container in Google’s object storage service where you can store an essentially unlimited number of files. Data in GCS can be accessed from Vertex AI training jobs, Workbench notebooks, and other GCP services using a GCS URI (e.g., gs://your-bucket-name/your-file.csv). Think of GCS URIs as cloud file paths — you’ll use them throughout the workshop to reference data in training scripts, notebooks, and SDK calls.

Creating a GCS bucket


1. Sign in to Google Cloud Console

  • Go to console.cloud.google.com and log in with your credentials.
  • Select your project from the project dropdown at the top of the page. If you’re using the shared workshop project, the instructor will provide the project name.

2. Navigate to Cloud Storage

  • In the search bar, type Storage.
  • Click Cloud Storage > Buckets.

3. Create a new bucket

  • Click Create bucket and configure the following settings:

  • Bucket name: Enter a globally unique name using the convention lastname-dataname (e.g., doe-titanic).

  • Labels: Add cost-tracking labels (same keys you used for the Workbench Instance in Episode 2, plus a dataset tag):

    • name = firstname-lastname
    • purpose = workshop
    • dataset = titanic

    In shared accounts, labels are mandatory.

  • Location: Choose Region, then select us-central1 (same region as your compute to avoid egress charges).

  • Storage class: Standard (best for active ML/AI workflows).

  • Access control: Uniform (simpler IAM-based permissions).

  • Protection: Leave default soft delete enabled; skip versioning and retention policies.

Click Create if everything looks good.

4. Upload files to the bucket

  • If you haven’t yet, download the data for this workshop (Right-click → Save as): data.zip
    • Extract the zip folder contents (Right-click → Extract all on Windows; double-click on macOS).
    • The zip contains the Titanic dataset — passenger information (age, class, fare, etc.) with a survival label. This is a classic binary classification task we’ll use for training in later episodes.
  • In the bucket dashboard, click Upload Files.
  • Select your Titanic CSVs (titanic_train.csv and titanic_test.csv) and upload.

Note the GCS URI for your data

After uploading, click on a file and find its gs:// URI (e.g., gs://doe-titanic/titanic_test.csv). This URI will be used to access the data in your notebook.
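Since you’ll pass these URIs around constantly, it helps to know they split cleanly into a bucket name and an object path. A small hypothetical helper (not part of the google-cloud-storage library) makes the anatomy explicit:

```python
def parse_gcs_uri(uri):
    """Split a gs:// URI into (bucket, blob path) — an illustrative helper."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a GCS URI: {uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the object path
    bucket, _, blob = uri[len("gs://"):].partition("/")
    return bucket, blob

print(parse_gcs_uri("gs://doe-titanic/titanic_test.csv"))
# ('doe-titanic', 'titanic_test.csv')
```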

Adjust bucket permissions


Your bucket exists, but your notebooks and training jobs don’t automatically have permission to use it. GCP follows the principle of least privilege — services only get the access you explicitly grant. In this section we’ll find the service account that Vertex AI uses and give it the right roles on your bucket.

Check your project ID

First, confirm which project your notebook is connected to. Run this cell in your Workbench notebook:

PYTHON

from google.cloud import storage
client = storage.Client()
print(client.project)

Copy the output — you’ll paste it into Cloud Shell commands below.

Callout

These commands run in Cloud Shell, not in a notebook

Open Cloud Shell — a browser-based terminal built into the Google Cloud Console (click the >_ icon in the top-right toolbar). Copy the commands below and paste them into that terminal.

Set your project

If Cloud Shell doesn’t already know your project, set it first:

SH

gcloud config set project YOUR_PROJECT_ID

Replace YOUR_PROJECT_ID with the project ID you copied above. For the shared MLM25 workshop the project ID is doit-rci-mlm25-4626.

Find your service account

When you create a GCP project, Google automatically provisions a Compute Engine default service account. This is the identity that Vertex AI Workbench notebooks and training jobs use when they call other GCP services (like Cloud Storage). By default this account may not have access to your bucket, so we need to grant it the right IAM roles explicitly.

First, look up the service account email:

SH

gcloud iam service-accounts list --filter="displayName:Compute Engine default service account" --format="value(email)"

This will return an email like 123456789-compute@developer.gserviceaccount.com. Copy it — you’ll paste it into the commands below.

Grant permissions

Now we give that service account the ability to read from and write to your bucket. Without these roles, your notebooks would get “Access Denied” errors when trying to load training data or save model artifacts.

Replace YOUR_BUCKET_NAME and YOUR_SERVICE_ACCOUNT, then run:

SH

# objectViewer — lets notebooks READ data (e.g., load CSVs for training)
gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET_NAME \
  --member="serviceAccount:YOUR_SERVICE_ACCOUNT" \
  --role="roles/storage.objectViewer"

# objectCreator — lets training jobs WRITE outputs (e.g., saved models, logs)
gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET_NAME \
  --member="serviceAccount:YOUR_SERVICE_ACCOUNT" \
  --role="roles/storage.objectCreator"

# objectAdmin — adds OVERWRITE and DELETE (only needed if you want to
# re-run jobs that replace existing files or clean up old artifacts)
gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET_NAME \
  --member="serviceAccount:YOUR_SERVICE_ACCOUNT" \
  --role="roles/storage.objectAdmin"
Callout

gcloud storage vs. gsutil

Older tutorials often reference gsutil for Cloud Storage operations. Google now recommends gcloud storage as the primary CLI. Both work, but gcloud storage is actively maintained and consistent with the rest of the gcloud CLI.

Data transfer & storage costs


GCS costs are based on three things: storage class (how you store data), data transfer (moving data in or out of GCP), and operations (API requests). Operations are the individual actions your code performs against Cloud Storage — every time a notebook reads a file or a training job writes a model, that’s an API request.

  • Standard storage: ~ $0.02 per GB per month in us-central1.
  • Uploading data (ingress): Free.
  • Downloading data out of GCP (egress): ~ $0.12 per GB.
  • Cross-region access: ~ $0.01–$0.02 per GB within North America.
  • GET requests (reading/downloading objects): ~ $0.004 per 10,000 requests.
  • PUT/POST requests (creating/uploading objects): ~ $0.05 per 10,000 requests.
  • Deleting data: Free (but Nearline/Coldline/Archive early-deletion fees apply).

For detailed pricing, see GCS Pricing Information.

Challenge

Challenge 1: Estimating Storage Costs

1. Estimate the total cost of storing 1 GB in GCS Standard storage (us-central1) for one month assuming:
   • Dataset read from the bucket 100 times within GCP (e.g., each training or tuning run fetches the data via a GET request — this stays inside Google’s network, so no egress charge)
   • Data is downloaded once out of GCP to your laptop at the end of the project (this does incur an egress charge)

2. Repeat the above calculation for datasets of 10 GB, 100 GB, and 1 TB (1024 GB).

Hints: Storage $0.02/GB/month, Egress $0.12/GB, GET requests negligible at this scale.

  1. 1 GB: Storage $0.02 + Egress $0.12 = $0.14
  2. 10 GB: $0.20 + $1.20 = $1.40
  3. 100 GB: $2.00 + $12.00 = $14.00
  4. 1 TB: $20.48 + $122.88 = $143.36
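The same arithmetic can be wrapped in a short helper, using the approximate rates from the hints above:

```python
STORAGE_PER_GB = 0.02   # Standard storage, us-central1 (~$/GB/month)
EGRESS_PER_GB = 0.12    # downloading out of GCP (~$/GB)

def project_cost(gb, months=1, downloads_out=1):
    """Approximate cost: storage plus egress. In-GCP reads incur no egress,
    and GET request charges are negligible at this scale."""
    return round(gb * (STORAGE_PER_GB * months + EGRESS_PER_GB * downloads_out), 2)

for gb in (1, 10, 100, 1024):
    print(gb, project_cost(gb))
```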

Accessing data from your notebook


Now that our bucket is set up, let’s use it from the Workbench notebook you created in the previous episode.

If you haven’t already cloned the repository, open JupyterLab from your Workbench Instance and run !git clone https://github.com/qualiaMachine/Intro_GCP_for_ML.git. Then navigate to /Intro_GCP_for_ML/notebooks/03-Data-storage-and-access.ipynb.

Set up GCP environment

If you haven’t already, initialize the storage client (same code from the permissions section earlier). The storage.Client() call creates a connection using the credentials already attached to your Workbench VM.

PYTHON

from google.cloud import storage
client = storage.Client()
print(client.project)

Reading data directly into memory

The code below downloads a CSV from your bucket and loads it into a pandas DataFrame. The blob.download_as_bytes() call pulls the file contents as raw bytes, and io.BytesIO wraps those bytes in a file-like object that pd.read_csv can read — no temporary file on disk needed.

PYTHON

import pandas as pd
import io

bucket_name = "doe-titanic" # ADJUST to your bucket's name

bucket = client.bucket(bucket_name)
blob = bucket.blob("titanic_train.csv")
train_data = pd.read_csv(io.BytesIO(blob.download_as_bytes()))
print(train_data.shape)
train_data.head()

The Titanic dataset contains passenger information (age, class, fare, etc.) and a binary survival label — we’ll train a classifier on this data in Episode 4.

PYTHON

train_data.info()
train_data.describe()
Callout

Alternative: reading directly with pandas

Vertex AI Workbench comes with gcsfs pre-installed, which lets pandas read GCS URIs directly — no BytesIO conversion needed:

PYTHON

train_data = pd.read_csv("gs://doe-titanic/titanic_train.csv")  # ADJUST bucket name

This is convenient for quick exploration. We use the storage.Client approach above because it gives you more control (listing blobs, checking sizes, uploading), which you’ll need in the sections that follow.

Callout

Common errors

  • Forbidden (403) — Your service account lacks permission. Revisit the Adjust bucket permissions section above.
  • NotFound (404) — The bucket name or file path is wrong. Double-check bucket_name and the blob path with client.list_blobs(bucket_name).
  • DefaultCredentialsError — The notebook cannot find credentials. Make sure you are running on a Vertex AI Workbench Instance (not a local machine).

Monitoring storage usage and costs


It’s good practice to periodically check how much storage your bucket is using. The code below sums up all object sizes.

PYTHON

total_size_bytes = 0
bucket = client.bucket(bucket_name)

for blob in client.list_blobs(bucket_name):
    total_size_bytes += blob.size

total_size_mb = total_size_bytes / (1024**2)
print(f"Total size of bucket '{bucket_name}': {total_size_mb:.2f} MB")

PYTHON

storage_price_per_gb = 0.02   # $/GB/month for Standard storage
egress_price_per_gb = 0.12    # $/GB for internet egress (same-region transfers are free)
total_size_gb = total_size_bytes / (1024**3)

monthly_storage = total_size_gb * storage_price_per_gb
egress_cost = total_size_gb * egress_price_per_gb

print(f"Bucket size: {total_size_gb:.4f} GB")
print(f"Estimated monthly storage cost: ${monthly_storage:.4f}")
print(f"Estimated annual storage cost:  ${monthly_storage*12:.4f}")
print(f"One-time full download (egress) cost: ${egress_cost:.4f}")

Writing output files to GCS


PYTHON

# Create a sample file locally on the notebook VM
file_path = "/home/jupyter/Notes.txt"
with open(file_path, "w") as f:
    f.write("This is a test note for GCS.")

PYTHON

bucket = client.bucket(bucket_name)
blob = bucket.blob("docs/Notes.txt")
blob.upload_from_filename(file_path)
print("File uploaded successfully.")

List bucket contents:

PYTHON

for blob in client.list_blobs(bucket_name):
    print(blob.name)
Challenge

Challenge 2: Read and explore the test dataset

Read titanic_test.csv from your GCS bucket and display its shape. How does the test set compare to the training set in size and columns?

PYTHON

blob = client.bucket(bucket_name).blob("titanic_test.csv")
test_data = pd.read_csv(io.BytesIO(blob.download_as_bytes()))
print("Test shape:", test_data.shape)
print("Train shape:", train_data.shape)
print("Same columns?", list(test_data.columns) == list(train_data.columns))
test_data.head()

Both datasets share the same 12 columns (including Survived). The test set is a smaller held-out subset (179 rows vs 712 in training) — roughly an 80/20 split used for final evaluation after the model is trained.

Challenge

Challenge 3: Upload a summary CSV to GCS

Using train_data, compute the survival rate by passenger class (Pclass) and upload the result as results/survival_by_class.csv to your bucket.

PYTHON

summary = train_data.groupby("Pclass")["Survived"].mean().reset_index()
summary.columns = ["Pclass", "SurvivalRate"]
print(summary)

# Save locally then upload
summary.to_csv("/home/jupyter/survival_by_class.csv", index=False)
blob = client.bucket(bucket_name).blob("results/survival_by_class.csv")
blob.upload_from_filename("/home/jupyter/survival_by_class.csv")
print("Summary uploaded to GCS.")

Removing unused data (complete after the workshop)


After you are done using your data, remove unused files and buckets to avoid ongoing charges.

You can delete files programmatically. Let’s clean up the notes file we uploaded earlier:

PYTHON

blob = client.bucket(bucket_name).blob("docs/Notes.txt")
blob.delete()
print("docs/Notes.txt deleted.")

Verify it’s gone:

PYTHON

for blob in client.list_blobs(bucket_name):
    print(blob.name)

For larger clean-up tasks, use the Cloud Console:

  • Delete files only – In your bucket, select the files you want to remove and click Delete.
  • Delete the bucket entirely – In Cloud Storage > Buckets, select your bucket and click Delete.

For a detailed walkthrough of cleaning up all workshop resources, see Episode 9: Resource Management and Cleanup.

Key Points
  • Use GCS for scalable, cost-effective, and persistent storage in GCP.
  • Persistent disks are suitable only for small, temporary datasets.
  • Load data from GCS into memory with storage.Client or directly via pd.read_csv("gs://...").
  • Periodically check storage usage and track storage, transfer, and request costs to manage your GCS budget.
  • Regularly delete unused data or buckets to avoid ongoing costs.

Content from Training Models in Vertex AI: Intro


Last updated on 2026-03-05 | Edit this page

Overview

Questions

  • What are the differences between training locally in a Vertex AI notebook and using Vertex AI-managed training jobs?
  • How do custom training jobs in Vertex AI streamline the training process for various frameworks?
  • How does Vertex AI handle scaling across CPUs, GPUs, and TPUs?

Objectives

  • Understand the difference between local training in a Vertex AI Workbench notebook and submitting managed training jobs.
  • Learn to configure and use Vertex AI custom training jobs for different frameworks (e.g., XGBoost, PyTorch, SKLearn).
  • Understand scaling options in Vertex AI, including when to use CPUs, GPUs, or TPUs.
  • Compare performance, cost, and setup between custom scripts and pre-built containers in Vertex AI.
  • Conduct training with data stored in GCS and monitor training job status using the Google Cloud Console.
Callout

Cost awareness: training jobs

Training jobs bill per VM-hour while the job is running. An n1-standard-4 (CPU) costs ~ $0.19/hr; adding a T4 GPU brings the total to ~ $0.54/hr. Jobs automatically stop (and stop billing) when the script finishes. For a complete cost reference, see the Compute for ML page and the cost table in Episode 9.
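As a rough sketch, a job’s cost is just the hourly rate times its duration. The rates below are the approximate figures quoted above; actual GCP billing is finer-grained and varies by region:

```python
# Approximate hourly rates from the callout above.
CPU_RATE = 0.19   # $/hr, n1-standard-4
GPU_RATE = 0.54   # $/hr, n1-standard-4 + T4 GPU

def job_cost(minutes, rate_per_hr):
    """Estimated cost of a training job billed per VM-hour."""
    return minutes / 60 * rate_per_hr

print(f"5-min CPU job:  ${job_cost(5, CPU_RATE):.3f}")
print(f"2-hour GPU job: ${job_cost(120, GPU_RATE):.2f}")
```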

Here’s the architecture we introduced in Episode 2 — your lightweight notebook orchestrates training jobs that run on separate, more powerful VMs, with all artifacts stored in GCS:

Architecture diagram showing how a lightweight Workbench notebook uses the Vertex AI SDK to launch training jobs and HP tuning jobs on powerful GPUs, with all artifacts stored in GCS.
Training and tuning workflow

Initial setup


1. Open pre-filled notebook

Navigate to /Intro_GCP_for_ML/notebooks/04-Training-models-in-VertexAI.ipynb to begin this notebook.

2. CD to instance home directory

To ensure we’re all in the same starting spot, change directory to your Jupyter home directory.

PYTHON

%cd /home/jupyter/

3. Set environment variables

This code initializes the Vertex AI environment by importing the Python SDK, setting the project, region, and defining a GCS bucket for input/output data.

  • PROJECT_ID: Identifies your GCP project.
  • REGION: Determines where training jobs run (choose a region close to your data).

PYTHON

from google.cloud import storage
client = storage.Client()
PROJECT_ID = client.project
REGION = "us-central1"
BUCKET_NAME = "doe-titanic" # ADJUST to your bucket's name
LAST_NAME = "DOE" # ADJUST to your last name or name
print(f"project = {PROJECT_ID}\nregion = {REGION}\nbucket = {BUCKET_NAME}")
Callout

How does storage.Client() know your project?

When you call storage.Client() without arguments, the library automatically discovers your credentials and project ID. This works because Vertex AI Workbench VMs run on Google Compute Engine, which provides a metadata server at a known internal address. The client library queries this server to retrieve the project ID and a service-account token — no keys or config files needed. If you ran the same code on your laptop, you would need to authenticate first with gcloud auth application-default login (see Episode 8 for details).

Testing train_xgboost.py locally in the notebook


Before submitting a managed training job to Vertex AI, let’s first examine and test the training script on our notebook VM. This ensures the code runs without errors before we spend money on cloud compute.

Callout

One script, two environments

A key design goal of train_xgboost.py is that the same script runs unchanged on your laptop, inside a Workbench notebook, and as a Vertex AI managed training job. Two patterns make this possible:

  1. GCS-aware I/O helpers (read_csv_any, save_model_any): These functions check whether a path starts with gs://. If it does, they use the google-cloud-storage client to read or write. If not, they use plain local file I/O. This means you can pass --train ./titanic_train.csv for a local test and --train=gs://my-bucket/titanic_train.csv for a cloud job without changing any code.

  2. AIP_MODEL_DIR environment variable: When Vertex AI runs a CustomTrainingJob with base_output_dir set, it injects AIP_MODEL_DIR (a gs:// path) into the container. The script reads this variable to decide where to save the model. Locally, the variable is unset, so it falls back to the current directory (.).

This “write once, run anywhere” approach means you can debug locally first (fast, free) and then submit the exact same script to Vertex AI (scalable, managed) with confidence.
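The gist of both patterns can be sketched in a few lines. This is a simplified illustration — the real helpers live in train_xgboost.py and also perform the actual GCS reads and writes:

```python
import os

def is_gcs_path(path):
    # GCS-aware helpers branch on this check: gs:// paths go through the
    # google-cloud-storage client, anything else uses plain local file I/O.
    return path.startswith("gs://")

def resolve_output_dir():
    # Vertex AI injects AIP_MODEL_DIR (a gs:// path) when base_output_dir is
    # set on the job; locally the variable is unset, so fall back to ".".
    return os.environ.get("AIP_MODEL_DIR", ".")

print(is_gcs_path("gs://my-bucket/titanic_train.csv"))  # True
print(is_gcs_path("./titanic_train.csv"))               # False
print(resolve_output_dir())
```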

Challenge

Understanding the XGBoost Training Script

Take a moment to review Intro_GCP_for_ML/scripts/train_xgboost.py. This is a standard XGBoost training script — it handles preprocessing, training, and saving a model. What makes it cloud-ready is that it also supports GCS (gs://) paths and adapts to Vertex AI conventions (e.g., AIP_MODEL_DIR), so the same script runs locally or as a managed training job without changes.

Try answering the following questions:

  1. Data preprocessing: What transformations are applied to the dataset before training?
  2. Training function: What does the train_model() function do? Why print the training time?
  3. Command-line arguments: What is the purpose of argparse in this script? How would you change the number of training rounds?
  4. Handling local vs. GCP runs: How does the script let you run the same code locally, in Workbench, or as a Vertex AI job? Which environment variable controls where the model artifact is written?
  5. Training and saving the model: What format is the dataset converted to before training, and why? How does the script save to a local path vs. a gs:// destination?

After reviewing, discuss any questions or observations with your group.

  1. Data preprocessing: The script fills missing values (Age with median, Embarked with mode), maps categorical fields to numeric (Sex → {male:1, female:0}, Embarked → {S:0, C:1, Q:2}), and drops non-predictive columns (PassengerId, Name, Ticket, Cabin).
  2. Training function: train_model() constructs and fits an XGBoost model with the provided parameters and prints wall-clock training time. Timing helps compare runs and make sensible scaling choices.
  3. Command-line arguments: argparse lets you set hyperparameters and file paths without editing code (e.g., --max_depth, --eta, --num_round, --train). To change rounds: python train_xgboost.py --num_round 200
  4. Handling local vs. GCP runs:
    • Input: You pass --train as either a local path (train.csv) or a GCS URI (gs://bucket/path.csv). The script automatically detects gs:// and reads the file directly from Cloud Storage using the Python client.
    • Output: If the environment variable AIP_MODEL_DIR is set (as it is in Vertex AI CustomJobs), the trained model is written there—often a gs:// path. Otherwise, the model is saved in the current working directory, which works seamlessly in both local and Workbench environments.
  5. Training and saving the model: The training data is converted into an XGBoost DMatrix, an optimized format that speeds up training and reduces memory use. The trained model is serialized with joblib. When saving locally, the file is written directly to disk. If saving to a Cloud Storage path (gs://...), the model is first saved to a temporary file and then uploaded to the specified bucket.
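The preprocessing described in point 1 can be condensed into a short sketch. The authoritative version is preprocess_data in train_xgboost.py; the two-row DataFrame below is invented purely for illustration:

```python
import pandas as pd

def preprocess_sketch(df):
    """Condensed version of the transformations described above."""
    df = df.copy()
    # Fill missing values: Age with median, Embarked with mode.
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    # Map categorical fields to numeric codes.
    df["Sex"] = df["Sex"].map({"male": 1, "female": 0})
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    # Drop non-predictive columns.
    return df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"],
                   errors="ignore")

toy = pd.DataFrame({
    "PassengerId": [1, 2], "Name": ["A", "B"], "Ticket": ["t1", "t2"],
    "Cabin": [None, "C85"], "Sex": ["male", "female"],
    "Age": [22.0, None], "Embarked": ["S", None], "Survived": [0, 1],
})
print(preprocess_sketch(toy))
```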

Before scaling training jobs onto managed resources, it’s essential to test your training script locally. This prevents wasting GPU/TPU time on bugs or misconfigured code. Skipping these checks can lead to silent data bugs, runtime blowups at scale, inefficient experiments, or broken model artifacts.

Sanity checks before scaling

  • Reproducibility – Do you get the same result each time? If not, set seeds controlling randomness.
  • Data loads correctly – Dataset loads without errors, expected columns exist, missing values handled.
  • Overfitting check – Train on a tiny dataset (e.g., 100 rows). If it doesn’t overfit, something is off.
  • Loss behavior – Verify training loss decreases and doesn’t diverge.
  • Runtime estimate – Get a rough sense of training time on small data before committing to large compute.
  • Memory estimate – Check approximate memory use to choose the right machine type.
  • Save & reload – Ensure model saves, reloads, and infers without errors.
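The reproducibility check can be as simple as running the pipeline twice with the same seed and comparing results — sketched here with the stdlib random module standing in for a real training run:

```python
import random

def noisy_result(seed):
    # Stand-in for a training run whose only nondeterminism is the RNG.
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(1000))

# Same seed -> identical result; different seeds -> (almost surely) different.
assert noisy_result(42) == noisy_result(42)
assert noisy_result(42) != noisy_result(43)
print("reproducibility check passed")
```

For real training code, make sure you seed every source of randomness (e.g., NumPy, the ML framework, and any data shuffling) before comparing runs.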

Download data into notebook environment


Sometimes it’s helpful to keep a copy of data in your notebook VM for quick iteration, even though GCS is the preferred storage location. For example, downloading locally lets you test your training script without any GCS dependencies, making debugging faster. Once you’ve verified everything works, the actual Vertex AI job will read directly from GCS.

PYTHON

bucket = client.bucket(BUCKET_NAME)

blob = bucket.blob("titanic_train.csv")
blob.download_to_filename("/home/jupyter/titanic_train.csv")

print("Downloaded titanic_train.csv")

Local test run of train_xgboost.py


Running a quick test on the Workbench notebook VM is cheap — it’s a lightweight machine that costs only ~$0.19/hr. The real cost comes later when you launch managed training jobs with larger machines or GPUs. Think of your notebook as a low-cost controller: use it to catch bugs and verify logic before spending on cloud compute.

As you gain confidence, you can skip the notebook VM entirely and run these tests on your own laptop or lab machine — then submit jobs to Vertex AI via the gcloud CLI or Python SDK from anywhere (see Episode 8). That eliminates the VM cost altogether.

  • For large datasets, test locally with a small representative sample — just enough to verify the code runs and the model can overfit it nearly perfectly after enough training epochs.
  • For large models, test locally with a smaller equivalent (e.g., 100M instead of 7B parameters).

PYTHON

# Pin the same XGBoost version used by the Vertex AI prebuilt container
# (xgboost-cpu.2-1) so local and cloud results are identical.
!pip install xgboost==2.1.0

PYTHON

# Training configuration parameters for XGBoost
MAX_DEPTH = 3         # maximum depth of each decision tree (controls model complexity)
ETA = 0.1             # learning rate (how much each tree contributes to the overall model)
SUBSAMPLE = 0.8       # fraction of training samples used per boosting round (prevents overfitting)
COLSAMPLE = 0.8       # fraction of features (columns) sampled per tree (adds randomness and diversity)
NUM_ROUND = 100       # number of boosting iterations (trees) to train

import time as t
start = t.time()

# Run the custom training script with hyperparameters defined above
!python Intro_GCP_for_ML/scripts/train_xgboost.py \
    --max_depth $MAX_DEPTH \
    --eta $ETA \
    --subsample $SUBSAMPLE \
    --colsample_bytree $COLSAMPLE \
    --num_round $NUM_ROUND \
    --train titanic_train.csv

print(f"Total local runtime: {t.time() - start:.2f} seconds")

Training on this small dataset should take <1 minute. Log runtime as a baseline. You should see the following output file:

  • xgboost-model — Serialized XGBoost model (Booster) via joblib; load with joblib.load() for reuse.

Evaluate the trained model on validation data


Now that we’ve trained and saved an XGBoost model, we want to do the most important sanity check:
Does this model make reasonable predictions on unseen data?

This step:

  1. Loads the serialized model artifact that was written by train_xgboost.py
  2. Loads a test set of Titanic passenger data
  3. Applies the same preprocessing as training
  4. Generates predictions
  5. Computes simple accuracy

First, we’ll download the test data:

PYTHON

blob = bucket.blob("titanic_test.csv")
blob.download_to_filename("titanic_test.csv")

print("Downloaded titanic_test.csv")

Then we apply the same preprocessing function used by our training script before running the model on the test data.

Note: The import below treats the repo as a Python package. This works because we cloned the repo into /home/jupyter/ and the directory contains an __init__.py. If you get an ImportError, make sure your working directory is /home/jupyter/ (run %cd /home/jupyter/ first).
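If the import still fails, one workaround (a sketch — the path assumes the clone location used in this workshop) is to put the repository’s parent directory on sys.path:

```python
import sys

# Directory that contains the Intro_GCP_for_ML/ clone (workshop default).
repo_parent = "/home/jupyter"
if repo_parent not in sys.path:
    sys.path.insert(0, repo_parent)

# After this, `from Intro_GCP_for_ML.scripts.train_xgboost import preprocess_data`
# can resolve regardless of the notebook's current working directory.
print(sys.path[0])
```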

Note on test data: The training script internally splits its input data 80/20 for training and validation. The titanic_test.csv file we use here is a separate, held-out test set that was never seen during training — not even by the internal validation split. This gives us an unbiased measure of model performance.

PYTHON

import pandas as pd
import xgboost as xgb
import joblib
from sklearn.metrics import accuracy_score
from Intro_GCP_for_ML.scripts.train_xgboost import preprocess_data  # reuse same preprocessing

# Load test data
test_df = pd.read_csv("titanic_test.csv")

# Apply same preprocessing from training
X_test, y_test = preprocess_data(test_df)

# Load trained model from local file
model = joblib.load("xgboost-model")

# Predict on test data
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)
y_pred_binary = (y_pred > 0.5).astype(int)

# Compute accuracy
acc = accuracy_score(y_test, y_pred_binary)
print(f"Test accuracy: {acc:.3f}")

You should see test accuracy in the range of 0.78–0.82. If accuracy is significantly lower, double-check that the test data downloaded correctly and that the preprocessing matches the training script.

Challenge

Experiment with hyperparameters

Try changing NUM_ROUND to 200 and re-running the local training and evaluation cells above. Does accuracy improve? How does the runtime change? Then try MAX_DEPTH = 6. What happens to accuracy — does the model improve, or does it start overfitting?

Increasing NUM_ROUND from 100 to 200 may marginally improve accuracy but roughly doubles runtime. Increasing MAX_DEPTH from 3 to 6 lets trees capture more complex patterns but can lead to overfitting on a small dataset like Titanic — you may see training accuracy increase while test accuracy stays flat or drops. This is why testing hyperparameters locally before scaling is important.

Training via Vertex AI custom training job


Unlike “local” training using our notebook’s VM, this next approach launches a managed training job that runs on scalable compute. Vertex AI handles provisioning, scaling, logging, and saving outputs to GCS.

Which machine type to start with?

Start with a small CPU machine like n1-standard-4. Only scale up to GPUs/TPUs once you’ve verified your script. See Compute for ML for guidance.

PYTHON

MACHINE = 'n1-standard-4'

Creating a custom training job with the SDK

Reminder: We’re using the Python SDK from a notebook here, but the same aiplatform.CustomTrainingJob calls work identically in a standalone .py script, a shell session, or a CI pipeline. You can also submit jobs entirely from the command line with gcloud ai custom-jobs create. See the callout in Episode 2 for more details.

We’ll first initialize the Vertex AI platform with our environment variables. We’ll also set a RUN_ID and ARTIFACT_DIR to help store outputs.

PYTHON

from google.cloud import aiplatform
import datetime as dt
RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/xgb/{RUN_ID}/"  # everything will live beside this
print(f"project = {PROJECT_ID}\nregion = {REGION}\nbucket = {BUCKET_NAME}\nartifact_dir = {ARTIFACT_DIR}")

# Staging bucket is only for the SDK's temp code tarball (aiplatform-*.tar.gz)
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}/.vertex_staging")

What does aiplatform.init() do?

aiplatform.init() sets session-wide defaults for the Vertex AI Python SDK. Every SDK call you make afterward (creating jobs, uploading models, querying metadata, etc.) will inherit these values so you don’t have to repeat them each time. The three arguments we pass here are:

Argument Purpose
project The Google Cloud project that owns (and is billed for) all Vertex AI resources you create.
location The region where jobs run and artifacts are stored (e.g., us-central1). Must match the region of any buckets or endpoints you reference.
staging_bucket A Cloud Storage path where the SDK automatically packages and uploads your training code as a tarball (e.g., aiplatform-2025-01-15-…-.tar.gz). The training VM downloads this tarball at startup to run your script. We point it to a .vertex_staging subfolder to keep these temporary archives separate from your real data and model artifacts.

You only need to call aiplatform.init() once per notebook or script session. If you ever need to override a default for a single call (e.g., run a job in a different region), you can pass the argument directly to that method and it will take precedence.

A CustomTrainingJob is the Vertex AI SDK object that ties together three things: your training script, a container image to run it in, and metadata such as a display name. Think of it as a reusable job definition — it doesn’t start any compute by itself. Only when you call job.run() (next step) does Vertex AI actually provision a VM, ship your code to it, and execute the script.

The code below creates a CustomTrainingJob that points to train_xgboost.py, uses Google’s prebuilt XGBoost training container (which already includes common dependencies like google-cloud-storage), and sets a display_name for tracking the job in the Vertex AI console.

Tip: If your script needs packages not included in the prebuilt container, you can pass a requirements list to CustomTrainingJob (e.g., requirements=["scikit-learn>=1.3"]).

Prebuilt containers for training

Vertex AI provides prebuilt Docker container images for model training. These containers are organized by machine learning frameworks and framework versions and include common dependencies that you might want to use in your training code. To learn more about prebuilt training containers, see Prebuilt containers for custom training.

PYTHON


job = aiplatform.CustomTrainingJob(
    display_name=f"{LAST_NAME}_xgb_{RUN_ID}",
    script_path="Intro_GCP_for_ML/scripts/train_xgboost.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.2-1:latest",
)

Version alignment: Notice that the container tag xgboost-cpu.2-1 matches the xgboost==2.1.0 we installed locally. This is intentional — pinning the same library version in both environments ensures that local and cloud training produce identical results given the same data and random seed.

Finally, this next block launches the custom training job on Vertex AI using the configuration defined earlier. We won’t be charged for our selected MACHINE until we call job.run() below. For an n1-standard-4 running 2–5 minutes, expect a cost of roughly $0.01–$0.02 — negligible, but good to be aware of as you scale to larger machines. This marks the point when our script actually begins executing remotely on the Vertex AI training infrastructure. Once job.run() is called, Vertex AI handles packaging your training script, transferring it to the managed training environment, provisioning the requested compute instance, and monitoring the run. The job’s status and logs can be viewed directly in the Vertex AI Console under Training → Custom jobs.

If you need to cancel or modify a job mid-run, you can do so from the console or via the SDK by calling job.cancel(). When the job completes, Vertex automatically tears down the compute resources so you only pay for the active training time.

  • The args list passes command-line parameters directly into your training script, including hyperparameters and the path to the training data in GCS.
  • replica_count=1 means we run a single training worker. Increase this for distributed training across multiple machines (e.g., data-parallel training with large datasets).
  • base_output_dir specifies where all outputs (model, metrics, logs) will be written in Cloud Storage.
  • machine_type controls the compute resources used for training.
  • When sync=True, the notebook waits until the job finishes before continuing, making it easier to inspect results immediately after training.

PYTHON

job.run(
    args=[
        f"--train=gs://{BUCKET_NAME}/titanic_train.csv",
        f"--max_depth={MAX_DEPTH}",
        f"--eta={ETA}",
        f"--subsample={SUBSAMPLE}",
        f"--colsample_bytree={COLSAMPLE}",
        f"--num_round={NUM_ROUND}",
    ],
    replica_count=1,
    machine_type=MACHINE, # MACHINE variable defined above; adjust to something more powerful when needed
    base_output_dir=ARTIFACT_DIR,  # sets AIP_MODEL_DIR for your script
    sync=True,
)

print("Model + logs folder:", ARTIFACT_DIR)

This launches a managed training job with Vertex AI. The job should take 2–5 minutes to complete.

Understanding the training output message

After your job finishes, you may see a message like: Training did not produce a Managed Model returning None. This is expected when running a CustomTrainingJob without specifying deployment parameters. Vertex AI supports two modes:

  • CustomTrainingJob (research/development) – You control training and save models/logs to Cloud Storage via AIP_MODEL_DIR. This is ideal for experimentation and cost control.
  • CustomTrainingJob with model registration (for deployment) – You include model_serving_container_image_uri and model_display_name, and Vertex automatically registers a Managed Model in the Model Registry for deployment to an endpoint.

In our setup, we’re intentionally using the simpler CustomTrainingJob path without model registration. Your trained model is safely stored under your specified artifact directory (e.g., gs://{BUCKET_NAME}/artifacts/xgb/{RUN_ID}/), and you can later register or deploy it manually when ready.

Monitoring training jobs in the Console


Why do I see both a Training Pipeline and a Custom Job? Under the hood, CustomTrainingJob.run() creates a TrainingPipeline resource, which in turn launches a CustomJob to do the actual compute work. This is normal — the pipeline is a thin wrapper that manages job lifecycle and (optionally) model registration. You can monitor progress from either view, but Custom Jobs shows the most useful details (logs, machine type, status).

  1. Go to the Google Cloud Console.
  2. Navigate to Vertex AI > Training > Custom Jobs.
  3. Click on your job name to see status, logs, and output model artifacts.
  4. Cancel jobs from the console if needed (be careful not to stop jobs you don’t own in shared projects).

Visit the console to verify it’s running.

Navigate to Vertex AI > Training > Custom Jobs in the Google Cloud Console to view your running or completed jobs.

If your job fails

Job failures are common when first getting started. Here’s how to debug:

  1. Check the logs first. In the Console, click your job name → Logs tab. The error message is usually near the bottom.
  2. Common failure modes:
    • Quota exceeded — Your project may not have enough quota for the requested machine type. Check IAM & Admin > Quotas.
    • Script error — A bug in your training script. The traceback will appear in the logs. Fix the bug and re-run locally before resubmitting.
    • Wrong container — Mismatched framework version or CPU/GPU container. Verify your container_uri.
    • Permission denied on GCS — The training service account can’t access your bucket. Check bucket permissions.
  3. Re-test locally with the same arguments before resubmitting to avoid burning compute time on the same error.

Training artifacts


After the training run completes, we can manually view our bucket using the Google Cloud Console or run the below code.

PYTHON

total_size_bytes = 0

for blob in client.list_blobs(BUCKET_NAME):
    total_size_bytes += blob.size
    print(blob.name)

total_size_mb = total_size_bytes / (1024**2)
print(f"Total size of bucket '{BUCKET_NAME}': {total_size_mb:.2f} MB")

Training Artifacts → ARTIFACT_DIR

This is your intended output location, set via base_output_dir.
It contains everything your training script explicitly writes. In our case, this includes:

  • {BUCKET_NAME}/artifacts/xgb/{RUN_ID}/model/xgboost-model — Serialized XGBoost model (Booster) saved via joblib; reload later with joblib.load() for reuse or deployment.

System-Generated Staging Files

You’ll also notice files under .vertex_staging/ — one timestamped tarball per job submission:

.vertex_staging/aiplatform-2026-03-04-05:51:20.248-aiplatform_custom_trainer_script-0.1.tar.gz
.vertex_staging/aiplatform-2026-03-04-05:53:28.009-aiplatform_custom_trainer_script-0.1.tar.gz
...

Each time you call job.run(...), the SDK packages your training script into a .tar.gz, uploads it here, and the training VM downloads it at startup. These accumulate quickly — the truncated listing above comes from a single day of iteration that produced 19 such archives. They are safe to delete once a job finishes, and you can automate cleanup with Object Lifecycle Management rules (e.g., auto-delete objects under .vertex_staging/ after 7 days).
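A lifecycle rule like the following would implement that cleanup (a sketch — adjust the age and prefix to taste). Save it as lifecycle.json, then apply it with gcloud storage buckets update --lifecycle-file=lifecycle.json:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 7, "matchesPrefix": [".vertex_staging/"]}
    }
  ]
}
```

This deletes any object under .vertex_staging/ once it is more than 7 days old, while leaving everything else in the bucket untouched.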

To delete all staging files now, run:

PYTHON

!gcloud storage rm -r gs://{BUCKET_NAME}/.vertex_staging/

This won’t affect your model artifacts under artifacts/.

Evaluate the trained model stored on GCS


Now let’s compare the model produced by our Vertex AI job to the one we trained locally. This time, instead of loading from the local disk, we’ll load both the test data and model artifact directly from GCS into memory — the recommended approach for production workflows.

PYTHON

import io

# Load test data directly from GCS into memory
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob("titanic_test.csv")
test_df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))

# Apply same preprocessing logic used during training
X_test, y_test = preprocess_data(test_df)

# Load the model artifact from GCS
MODEL_BLOB_PATH = f"artifacts/xgb/{RUN_ID}/model/xgboost-model"
model_blob = bucket.blob(MODEL_BLOB_PATH)
model_bytes = model_blob.download_as_bytes()
model = joblib.load(io.BytesIO(model_bytes))

# Run predictions and compute accuracy
dtest = xgb.DMatrix(X_test)
y_pred_prob = model.predict(dtest)
y_pred = (y_pred_prob >= 0.5).astype(int)

acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy (model from Vertex job): {acc:.3f}")
Challenge

Compare local vs. Vertex AI accuracy

Compare the test accuracy from your local training run with the accuracy from the Vertex AI job. Are they the same? Why or why not?

The two accuracy values should be very close (within ~1–2 percentage points) but may not be byte-for-byte identical, even though both runs use the same script, hyperparameters, data, and random seed (seed=42).

Why? The subsample=0.8 and colsample_bytree=0.8 settings randomly sample rows and columns each boosting round. A seed guarantees determinism only within the exact same library version, NumPy build, and BLAS/LAPACK backend. The Workbench notebook VM and the prebuilt training container ship different underlying numerical libraries (e.g., OpenBLAS vs. MKL), so even with identical XGBoost versions the random sampling sequence can diverge slightly — producing a different model and therefore a small accuracy difference.

If you want exact reproducibility, set subsample=1.0 and colsample_bytree=1.0 (no random sampling) or accept that minor variation across environments is normal and expected in practice.

Challenge

Explore job logs in the Console

Navigate to Vertex AI > Training > Custom Jobs in the Google Cloud Console. Find your most recent job and click on it. Can you locate:

  1. The Logs tab showing your script’s print() output?
  2. The training time printed by train_model()?
  3. The output artifact path?
  1. Click your job name, then select the Logs tab (or View logs link). Your script’s print() statements — including train/val sizes, training time, and model save path — appear in the log stream.
  2. Look for the line Training time: X.XX seconds in the logs. This comes from the train_model() function in train_xgboost.py.
  3. The artifact path is shown in the log line Model saved to gs://... and also appears in the job details panel under output configuration.

Looking ahead: when training takes too long

The Titanic dataset is tiny, so our job finishes in minutes. In your real work, you’ll encounter datasets and models where a single training run takes hours or days. When that happens, Vertex AI gives you two main levers:

Option 1: Upgrade to more powerful machine types - Use a larger machine or add GPUs (e.g., T4, V100, A100). This is the simplest approach and works well for datasets under ~10 GB.

Option 2: Use distributed training with multiple replicas - Split the dataset across replicas with synchronized gradient updates. This becomes worthwhile when datasets exceed 10–50 GB or single-machine training takes more than 10 hours.
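As a rough sketch, the two levers map to different job.run(...) keyword arguments. The machine types and counts below are illustrative, not recommendations:

PYTHON

```python
# Option 1: scale UP. One bigger machine, optionally with a GPU attached.
scale_up = {
    "replica_count": 1,
    "machine_type": "n1-standard-16",
    "accelerator_type": "NVIDIA_TESLA_T4",
    "accelerator_count": 1,
}

# Option 2: scale OUT. Several replicas sharing the work (your training
# script must support distributed training, e.g., PyTorch DDP).
scale_out = {
    "replica_count": 4,
    "machine_type": "n1-standard-8",
}

# Either dict would be splatted into the job launch, e.g.:
# job.run(args=[...], base_output_dir=ARTIFACT_DIR, **scale_up)
```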

We’ll explore both options hands-on in the next episode when we train a PyTorch neural network with GPU acceleration.

Key Points
  • Environment initialization: Use aiplatform.init() to set defaults for project, region, and bucket.
  • Local vs managed training: Test locally before scaling into managed jobs.
  • Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
  • Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
  • Monitoring: Track job logs and artifacts in the Vertex AI Console.

Content from Training Models in Vertex AI: PyTorch Example


Last updated on 2026-03-05 | Edit this page

Overview

Questions

  • When should you consider a GPU (or TPU) instance for PyTorch training in Vertex AI, and what are the trade‑offs for small vs. large workloads?
  • How do you launch a script‑based training job and write all artifacts (model, metrics, logs) next to each other in GCS without deploying a managed model?

Objectives

  • Prepare the Titanic dataset and save train/val arrays to compressed .npz files in GCS.
  • Submit a CustomTrainingJob that runs a PyTorch script and explicitly writes outputs to a chosen gs://…/artifacts/.../ folder.
  • Co‑locate artifacts: model.pt (or .joblib), metrics.json, eval_history.csv, and training.log for reproducibility.
  • Choose CPU vs. GPU instances sensibly; understand when distributed training is (not) worth it.

Initial setup


1. Open pre-filled notebook

Navigate to /Intro_GCP_for_ML/notebooks/05-Training-models-in-VertexAI-GPUs.ipynb to begin this notebook. Select the PyTorch environment (kernel). Local PyTorch is only needed for local tests. Your Vertex AI job uses the container specified by container_uri (e.g., pytorch-xla.2-4.py310 for CPU or pytorch-gpu.2-4.py310 for GPU), so it brings its own framework at run time.

2. CD to instance home directory

To ensure we’re all in the same starting spot, change directory to your Jupyter home directory.

PYTHON

%cd /home/jupyter/

3. Set environment variables

This code initializes the Vertex AI environment by importing the Python SDK, setting the project, region, and defining a GCS bucket for input/output data.

PYTHON

from google.cloud import aiplatform, storage
client = storage.Client()
PROJECT_ID = client.project
REGION = "us-central1"
BUCKET_NAME = "doe-titanic" # ADJUST to your bucket's name
LAST_NAME = 'DOE' # ADJUST to your last name. Since we're in a shared account environment, this will help us track down jobs in the Console

print(f"project = {PROJECT_ID}\nregion = {REGION}\nbucket = {BUCKET_NAME}")

# Initialize the Vertex AI environment with the correct project and location.
# The staging bucket is where the SDK uploads your packaged (.tar.gz) training code for training/tuning jobs.
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}/.vertex_staging") # store tarballs in a staging folder

Prepare data as .npz


Unlike the XGBoost script from Episode 4 (which handles preprocessing internally from raw CSV), our PyTorch script expects pre-processed NumPy arrays. We’ll prepare those here and save them as .npz files.

Why .npz? NumPy’s .npz files are binary archive containers that can store multiple arrays (e.g., features and labels) together in a single file (np.savez writes an uncompressed archive; np.savez_compressed adds compression):

  • Compact & fast: smaller than CSV, and one file can hold multiple arrays (X_train, y_train).
  • Cloud-friendly: each .npz is a single GCS object — one network call to read instead of streaming many small files, reducing latency and egress costs.
  • Vertex AI integration: our training script downloads each .npz in a single call at job startup, so np.load(...) reads from local memory at run time instead of streaming from GCS mid-training.
  • Reproducible: unlike CSV, .npz preserves exact dtypes and shapes across environments.
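To make the dtype point concrete, here is a tiny self-contained round trip you can run anywhere (the file name demo.npz is arbitrary):

PYTHON

```python
import numpy as np

# Save two arrays into one archive, then read them back.
X = np.arange(6, dtype=np.float32).reshape(3, 2)
y = np.array([0, 1, 1], dtype=np.int64)

np.savez("demo.npz", X=X, y=y)   # np.savez_compressed also shrinks the file
d = np.load("demo.npz")

# Dtypes and shapes survive exactly; a CSV round trip would not preserve them.
assert d["X"].dtype == np.float32 and d["X"].shape == (3, 2)
assert d["y"].dtype == np.int64
```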

PYTHON

import pandas as pd
import io
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load Titanic CSV (from local or GCS you've already downloaded to the notebook)
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob("titanic_train.csv")
df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))

# Minimal preprocessing to numeric arrays
sex_enc = LabelEncoder().fit(df["Sex"])
df["Sex"] = sex_enc.transform(df["Sex"])
df["Embarked"] = df["Embarked"].fillna("S")
emb_enc = LabelEncoder().fit(df["Embarked"])
df["Embarked"] = emb_enc.transform(df["Embarked"])
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

X = df[["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]].values
y = df["Survived"].values

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

np.savez("/home/jupyter/train_data.npz", X_train=X_train, y_train=y_train)
np.savez("/home/jupyter/val_data.npz",   X_val=X_val,   y_val=y_val)

We can then upload the files to our GCS bucket.

PYTHON

# Upload to GCS
bucket.blob("data/train_data.npz").upload_from_filename("/home/jupyter/train_data.npz")
bucket.blob("data/val_data.npz").upload_from_filename("/home/jupyter/val_data.npz")
print(f"Uploaded: gs://{BUCKET_NAME}/data/train_data.npz and val_data.npz")

Verify the upload by listing your bucket contents (same pattern as Episode 3):

PYTHON

for blob in client.list_blobs(BUCKET_NAME):
    print(blob.name)

Minimal PyTorch training script (train_nn.py) - local test


Running a quick test on the Workbench notebook VM is cheap — it’s a lightweight machine that costs only ~$0.19/hr. The real cost comes later when you launch managed training jobs with larger machines or GPUs. Think of your notebook as a low-cost controller: use it to catch bugs and verify logic before spending on cloud compute.

As you gain confidence, you can skip the notebook VM entirely and run these tests on your own laptop or lab machine — then submit jobs to Vertex AI via the gcloud CLI or Python SDK from anywhere (see Episode 8). That eliminates the VM cost altogether.

  • For large datasets, test locally with a small representative sample (just enough to verify the code runs and the model can overfit it nearly perfectly given enough epochs).
  • For large models, test locally with a smaller equivalent (e.g., a 100M-parameter model standing in for a 7B one).

Find this file in our repo: Intro_GCP_for_ML/scripts/train_nn.py. It does three things:

  1. Loads .npz files from local or GCS paths (transparently handles both).
  2. Trains a small neural network (a 3-layer MLP) with early stopping.
  3. Writes all outputs side-by-side (model + metrics + eval history + training.log) to the folder specified by the AIP_MODEL_DIR environment variable (set automatically by Vertex AI via base_output_dir), falling back to the current directory for local runs.

Callout

What’s inside train_nn.py? (Quick reference)

You don’t need to understand every line of the PyTorch code for this workshop — the focus is on how to package and run any training script on Vertex AI. That said, here’s a quick orientation:

  • GCS helpers (top of file): read_npz_any() and save_*_any() functions detect gs:// paths and use the GCS Python client automatically. This is the key pattern that makes the same script work locally and in the cloud.
  • AIP_MODEL_DIR: Vertex AI sets this environment variable to tell your script where to write artifacts. The script reads it at the top of main().
  • Model: A small feedforward network (TitanicNet) — the architecture details aren’t important for this lesson.
  • Early stopping: Training halts when validation loss stops improving (controlled by --patience). This saves compute time and cost on cloud jobs.
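For orientation, here is a sketch of what a helper like read_npz_any() can look like; the real implementation in the repo may differ in details:

PYTHON

```python
import io
import numpy as np

def read_npz_any(path):
    """Load an .npz from a local path or a gs:// URI (sketch of the pattern;
    the actual helper in train_nn.py may differ)."""
    if path.startswith("gs://"):
        from google.cloud import storage              # only needed for cloud paths
        bucket_name, blob_name = path[5:].split("/", 1)
        blob = storage.Client().bucket(bucket_name).blob(blob_name)
        return np.load(io.BytesIO(blob.download_as_bytes()))
    return np.load(path)                              # plain local file

# Local smoke test: write a tiny archive, then re-read it via the helper.
np.savez("tiny.npz", x=np.array([1, 2, 3]))
d = read_npz_any("tiny.npz")
print(d["x"])
```

The single if on the path prefix is what lets one script run unchanged on your laptop, the Workbench VM, and a Vertex AI training worker.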

To test this code, we can run the following:

PYTHON

# configure training hyperparameters to use in all model training runs downstream
MAX_EPOCHS = 500
LR =  0.001
PATIENCE = 50

# local training run
import time as t

start = t.time()

# Example: run your custom training script with args
!python /home/jupyter/Intro_GCP_for_ML/scripts/train_nn.py \
    --train /home/jupyter/train_data.npz \
    --val /home/jupyter/val_data.npz \
    --epochs $MAX_EPOCHS \
    --learning_rate $LR \
    --patience $PATIENCE

print(f"Total local runtime: {t.time() - start:.2f} seconds")
Callout

NumPy version mismatch?

If the cell above fails with a NumPy error (e.g., module 'numpy' has no attribute ...), run this fix and then re-run the training cell:

PYTHON

!pip install --upgrade --force-reinstall "numpy<2"

The PyTorch kernel occasionally ships with NumPy 2.x, which has breaking API changes.

Reproducibility test

Ensuring our experiments are reproducible is an essential part of applied ML/AI — without reproducibility, we can’t draw reliable conclusions about the efficacy of our methods. Let’s rerun the same training command as above and verify we get the same result.

  • Take a look near the top of Intro_GCP_for_ML/scripts/train_nn.py where we are setting multiple numpy and torch seeds to ensure reproducibility.
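As a sketch of that seeding (the exact calls in train_nn.py may differ), covering every RNG in play looks like this:

PYTHON

```python
import random
import numpy as np

def set_seeds(seed=42):
    """Seed the RNGs used during training (sketch; exact calls may differ)."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)   # seeds CPU (and, if present, CUDA) RNGs
    except ImportError:
        pass  # torch not installed in this environment; fine for the demo

# Same seed, same draws:
set_seeds(42)
a = np.random.rand(3)
set_seeds(42)
b = np.random.rand(3)
assert np.allclose(a, b)
```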

PYTHON

import time as t

start = t.time()

# Example: run your custom training script with args
!python /home/jupyter/Intro_GCP_for_ML/scripts/train_nn.py \
    --train /home/jupyter/train_data.npz \
    --val /home/jupyter/val_data.npz \
    --epochs $MAX_EPOCHS \
    --learning_rate $LR \
    --patience $PATIENCE

print(f"Total local runtime: {t.time() - start:.2f} seconds")

Please don’t use cloud resources for code that is not reproducible!

Evaluate the locally trained model on the validation data

Let’s load the model we just trained and run it against the validation set. This confirms the saved weights produce the expected accuracy before we move to cloud training.

PYTHON

import sys, torch, numpy as np
sys.path.append("/home/jupyter/Intro_GCP_for_ML/scripts")
from train_nn import TitanicNet

# load validation data
d = np.load("/home/jupyter/val_data.npz")
X_val, y_val = d["X_val"], d["y_val"]

# Convert to PyTorch tensors with the dtypes we need here:
#   - Features → float32: neural-network layers (Linear, BatchNorm) operate on floats.
#   - Labels   → long (int64): so they compare cleanly against the integer
#     predictions below. (During training, BCE-style losses such as
#     nn.BCEWithLogitsLoss expect *float* targets; the int64 cast here is
#     just for the accuracy comparison.)
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val, dtype=torch.long)

# rebuild model and load weights
m = TitanicNet()
state = torch.load("/home/jupyter/model.pt", map_location="cpu", weights_only=True)
m.load_state_dict(state)
m.eval()

with torch.no_grad():
    probs = m(X_val_t).squeeze(1)                # [N], sigmoid outputs in (0,1)
    preds_t = (probs >= 0.5).long()              # [N] int64
    correct = (preds_t == y_val_t).sum().item()
    acc = correct / y_val_t.shape[0]

print(f"Local model val accuracy: {acc:.4f}")

We should see an accuracy that matches our best epoch in the local training run. Note that in our setup, early stopping is based on validation loss, not accuracy.
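Early stopping itself is simple enough to sketch in a few lines. The logic in train_nn.py is along these lines (details such as min_delta handling may differ):

PYTHON

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:   # improvement: remember it, reset counter
            best = loss
            bad_epochs = 0
        else:                         # no improvement this epoch
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch          # patience exhausted: stop
    return len(val_losses) - 1        # never triggered

# Validation loss plateaus after epoch 2, so with patience=3 we stop at epoch 5:
print(early_stop_epoch([0.9, 0.7, 0.6, 0.6, 0.61, 0.62], patience=3))
```

On cloud jobs this is a cost control as much as a regularization tool: the job ends as soon as additional epochs stop paying for themselves.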

Launch the training job


In the previous episode, we trained an XGBoost model using Vertex AI’s CustomTrainingJob interface. Here, we’ll do the same for a PyTorch neural network. The structure is nearly identical — we define a training script, select a prebuilt container (CPU or GPU), and specify where to write all outputs in Google Cloud Storage (GCS). The main difference is that PyTorch requires us to save our own model weights and metrics inside the script rather than relying on Vertex to package a model automatically.

Set training job configuration vars

Callout

Check supported container versions

Container URI format matters. The container must be registered for python package training (used by CustomTrainingJob). Use the pytorch-xla variant with a Python-version suffix — e.g., pytorch-xla.2-4.py310:latest. The pytorch-cpu and pytorch-gpu variants may not be registered for python package training.

Google periodically retires older versions. If you see an INVALID_ARGUMENT error about an unsupported image, check the current list at Prebuilt containers for training and update the version number.

PYTHON

import datetime as dt
RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/pytorch/{RUN_ID}"
IMAGE = 'us-docker.pkg.dev/vertex-ai/training/pytorch-xla.2-4.py310:latest'
MACHINE = "n1-standard-4" # CPU fine for small datasets

print(f"RUN_ID = {RUN_ID}\nARTIFACT_DIR = {ARTIFACT_DIR}\nMACHINE = {MACHINE}")

Init the training job with configurations

PYTHON

DISPLAY_NAME = f"{LAST_NAME}_pytorch_nn_{RUN_ID}"
print(DISPLAY_NAME)

# Init the job. This does not consume resources until we call job.run().
job = aiplatform.CustomTrainingJob(
    display_name=DISPLAY_NAME,
    script_path="Intro_GCP_for_ML/scripts/train_nn.py",
    container_uri=IMAGE)

Run the job, paying for our MACHINE on-demand.

PYTHON

job.run(
    args=[
        f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
        f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
        f"--epochs={MAX_EPOCHS}",
        f"--learning_rate={LR}",
        f"--patience={PATIENCE}",
    ],
    replica_count=1,
    machine_type=MACHINE,
    base_output_dir=ARTIFACT_DIR,  # sets AIP_MODEL_DIR used by your script
    sync=True,
)
print("Artifacts folder:", ARTIFACT_DIR)

Monitoring training jobs in the Console

Why do I see both a Training Pipeline and a Custom Job? Under the hood, CustomTrainingJob.run() creates a TrainingPipeline resource, which in turn launches a CustomJob to do the actual compute work. This is normal — the pipeline is a thin wrapper that manages job lifecycle. You can monitor progress from either view, but Custom Jobs shows the most useful details (logs, machine type, status).

  1. Go to the Google Cloud Console.
  2. Navigate to Vertex AI > Training > Custom Jobs.
  3. Click on your job name to see status, logs, and output model artifacts.
  4. Cancel jobs from the console if needed (be careful not to stop jobs you don’t own in shared projects).

After the job completes, your training script writes several output files to the GCS artifact directory. Here’s what you’ll find in gs://…/artifacts/pytorch/<RUN_ID>/:

  • model.pt — PyTorch weights (state_dict).
  • metrics.json — final val loss, hyperparameters, dataset sizes, device, model URI.
  • eval_history.csv — per‑epoch validation loss (for plots/regression checks).
  • training.log — complete stdout/stderr for reproducibility and debugging.

Evaluate the Vertex-trained model on the validation data

Let’s check whether this model gives the same result as the “locally” trained model above. Following best practice, we’ll load the model straight from GCS into memory rather than downloading it to disk.

PYTHON

import sys, torch, numpy as np
sys.path.append("/home/jupyter/Intro_GCP_for_ML/scripts")
from train_nn import TitanicNet

# -----------------
# download model.pt straight into memory and load weights
# -----------------

ARTIFACT_PREFIX = f"artifacts/pytorch/{RUN_ID}/model"

MODEL_PATH = f"{ARTIFACT_PREFIX}/model.pt"
model_blob = bucket.blob(MODEL_PATH)
model_bytes = model_blob.download_as_bytes()

# load from bytes
model_pt = io.BytesIO(model_bytes)

# rebuild model and load weights
state = torch.load(model_pt, map_location="cpu", weights_only=True)
m = TitanicNet()
m.load_state_dict(state)
m.eval();

Evaluate using the same pattern as the local evaluation above — load the validation data from GCS, run predictions, and check accuracy. The results should match the local run, since we set random seeds.

PYTHON

# Read validation data from GCS (reuses val data from local eval above)
VAL_PATH = "data/val_data.npz"
val_blob = bucket.blob(VAL_PATH)
val_bytes = val_blob.download_as_bytes()
d = np.load(io.BytesIO(val_bytes))
X_val, y_val = d["X_val"], d["y_val"]
X_val_t = torch.tensor(X_val, dtype=torch.float32)   # features → float for network layers
y_val_t = torch.tensor(y_val, dtype=torch.long)      # labels → int64 to compare with integer predictions

with torch.no_grad():
    probs = m(X_val_t).squeeze(1)
    preds_t = (probs >= 0.5).long()
    correct = (preds_t == y_val_t).sum().item()
    acc = correct / y_val_t.shape[0]

print(f"Vertex model val accuracy: {acc:.4f}")

GPU-Accelerated Training on Vertex AI


Our CPU job above worked fine for this small dataset. In practice, you’d switch to a GPU when training takes too long on CPU — typically with larger models (millions of parameters) or larger datasets (hundreds of thousands of rows). For the Titanic dataset, the GPU will likely be slower end-to-end due to provisioning overhead, but we’ll run it here to learn the workflow.

The changes from CPU to GPU are minimal — this is one of the advantages of Vertex AI’s container-based approach:

  • The container image switches to the GPU-enabled version (pytorch-gpu.2-4.py310:latest), which includes CUDA and cuDNN.
  • The machine type (n1-standard-8) defines CPU and memory resources, while we add a GPU accelerator (NVIDIA_TESLA_T4, NVIDIA_L4, etc.). For guidance on selecting a machine type and accelerator, visit the Compute for ML resource.
  • The training script, arguments, and artifact handling all stay the same.
Callout

GPU quota unavailable?

If your job fails with a quota error, don’t worry — re-run using the CPU configuration from the previous section. You’ll get the same results, just more slowly. GPU quota requests can take 1–3 business days to process.

PYTHON

from google.cloud import aiplatform

RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")

# GCS folder where ALL artifacts (model.pt, metrics.json, eval_history.csv, training.log) will be saved.
# Your train_nn.py writes to AIP_MODEL_DIR, and base_output_dir (below) sets that variable for the job.
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/pytorch/{RUN_ID}"

# ---- Container image ----
# Use a prebuilt TRAINING image that has PyTorch + CUDA. This enables GPU at runtime.
IMAGE = "us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-4.py310:latest"

# ---- Machine vs Accelerator (important!) ----
# machine_type = the VM's CPU/RAM shape. It is NOT a GPU by itself.
# We often pick n1-standard-8 as a balanced baseline for single-GPU jobs.
MACHINE = "n1-standard-8"

# To actually get a GPU, you *attach* one via accelerator_type + accelerator_count.
# Common choices:
#   "NVIDIA_TESLA_T4" (cost-effective, widely available)
#   "NVIDIA_L4"       (newer, CUDA 12.x, good perf/$)
#   "NVIDIA_TESLA_V100" / "NVIDIA_A100_40GB" (high-end, pricey)
ACCELERATOR_TYPE = "NVIDIA_TESLA_T4"
ACCELERATOR_COUNT = 1  # Increase (2,4) only if your code supports multi-GPU (e.g., DDP)

# Alternative (GPU-bundled) machines:
# If you pick an A2 type like "a2-highgpu-1g", it already includes 1 A100 GPU.
# In that case, you can omit accelerator_type/accelerator_count entirely.
# Example:
# MACHINE = "a2-highgpu-1g"
# (and then remove the accelerator_* kwargs in job.run)

print(
    "RUN_ID =", RUN_ID,
    "\nARTIFACT_DIR =", ARTIFACT_DIR,
    "\nIMAGE =", IMAGE,
    "\nMACHINE =", MACHINE,
    "\nACCELERATOR_TYPE =", ACCELERATOR_TYPE,
    "\nACCELERATOR_COUNT =", ACCELERATOR_COUNT,
)

DISPLAY_NAME = f"{LAST_NAME}_pytorch_nn_{RUN_ID}"

job = aiplatform.CustomTrainingJob(
    display_name=DISPLAY_NAME,
    script_path="Intro_GCP_for_ML/scripts/train_nn.py",  # Your PyTorch trainer
    container_uri=IMAGE,  # Must be a *training* image (not prediction)
)

job.run(
    args=[
        f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
        f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
        f"--epochs={MAX_EPOCHS}",
        f"--learning_rate={LR}",
        f"--patience={PATIENCE}",
    ],
    replica_count=1,                 # One worker (simple, cheaper)
    machine_type=MACHINE,            # CPU/RAM shape of the VM (no GPU implied)
    accelerator_type=ACCELERATOR_TYPE,   # Attaches the selected GPU model
    accelerator_count=ACCELERATOR_COUNT, # Number of GPUs to attach
    base_output_dir=ARTIFACT_DIR,    # Sets AIP_MODEL_DIR used by your script for all artifacts
    sync=True,                       # Waits for job to finish so you can inspect outputs immediately
)

print("Artifacts folder:", ARTIFACT_DIR)

Just as we did for the CPU job, let’s evaluate the GPU-trained model to confirm it produces the same accuracy. We load the model weights directly from GCS into memory.

PYTHON

import sys, torch, numpy as np
sys.path.append("/home/jupyter/Intro_GCP_for_ML/scripts")
from train_nn import TitanicNet

# -----------------
# download model.pt straight into memory and load weights
# -----------------

ARTIFACT_PREFIX = f"artifacts/pytorch/{RUN_ID}/model"

MODEL_PATH = f"{ARTIFACT_PREFIX}/model.pt"
model_blob = bucket.blob(MODEL_PATH)
model_bytes = model_blob.download_as_bytes()

# load from bytes
model_pt = io.BytesIO(model_bytes)

# rebuild model and load weights
state = torch.load(model_pt, map_location="cpu", weights_only=True)
m = TitanicNet()
m.load_state_dict(state)
m.eval();

Evaluate the GPU model using the same pattern — results should match because we set random seeds in train_nn.py.

PYTHON

with torch.no_grad():
    probs = m(X_val_t).squeeze(1)
    preds_t = (probs >= 0.5).long()
    correct = (preds_t == y_val_t).sum().item()
    acc = correct / y_val_t.shape[0]

print(f"GPU model val accuracy: {acc:.4f}")
Challenge

Cloud workflow review

Now that you’ve run both a CPU and GPU training job, answer the following:

  1. Artifact location: Where did Vertex AI write your model artifacts? How does base_output_dir in job.run() relate to the AIP_MODEL_DIR environment variable inside the container?
  2. CPU vs. GPU job time: Compare the wall-clock times of your CPU and GPU jobs (visible in the Console under Vertex AI > Training > Custom Jobs). Which was faster? Why might the GPU job be slower for this dataset?
  3. Container choice: We used pytorch-xla.2-4.py310 for the CPU job and pytorch-gpu.2-4.py310 for the GPU job. What would happen if you used the CPU container but still passed accelerator_type and accelerator_count?
  4. Cost awareness: You used n1-standard-4 for CPU and n1-standard-8 + T4 for GPU. Using the Compute for ML resource, estimate the relative hourly cost difference between these configurations.
  1. base_output_dir tells the Vertex AI SDK to set the AIP_MODEL_DIR environment variable inside the training container. Your script reads os.environ.get("AIP_MODEL_DIR", ".") and writes all artifacts there. The result is everything lands under gs://<bucket>/artifacts/pytorch/<RUN_ID>/model/.
  2. For the small Titanic dataset (~700 training rows), the CPU job is typically faster end-to-end. GPU jobs incur extra overhead: provisioning the accelerator, loading CUDA libraries, and transferring data to the GPU. GPU acceleration pays off when training itself is the bottleneck (larger models, larger batches).
  3. The job would either fail or ignore the GPU. The CPU container doesn’t include CUDA/cuDNN, so even if a GPU is attached to the VM, PyTorch can’t use it. Always match your container image to your hardware configuration.
  4. Approximate on-demand rates (us-central1): n1-standard-4 is ~ $0.19/hr; n1-standard-8 + 1x T4 is ~ $0.54/hr (VM) + ~ $0.35/hr (T4) = ~ $0.89/hr total. The GPU configuration is roughly 4–5x more expensive per hour — worth it only when training speedup exceeds that cost ratio.
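The break-even arithmetic from point 4 is worth doing explicitly whenever you consider a GPU. A quick back-of-envelope calculation (rates are the approximations above and will drift over time):

PYTHON

```python
cpu_rate = 0.19          # $/hr, n1-standard-4 (approximate)
gpu_rate = 0.54 + 0.35   # $/hr, n1-standard-8 VM plus one T4 (approximate)

ratio = gpu_rate / cpu_rate
print(f"GPU config costs ~{ratio:.1f}x more per hour")

# The GPU only saves money if it trains more than `ratio` times faster.
cpu_hours = 10.0
break_even = cpu_hours / ratio
print(f"A {cpu_hours:.0f}-hour CPU job must finish in under {break_even:.1f} "
      "hours on the GPU to come out cheaper")
```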

GPU and scaling considerations

  • On small problems, GPU startup/transfer overhead can erase speedups — benchmark before you scale.
  • Stick to a single GPU unless your workload genuinely saturates it. Multi-GPU (data parallelism / DDP) and model parallelism exist for large-scale training but add significant complexity and cost — well beyond this workshop’s scope.

Clean up staging files


As in Episode 4, each job.run() call leaves a tarball under .vertex_staging/. Delete them to keep your bucket tidy:

PYTHON

!gsutil -m rm -r gs://{BUCKET_NAME}/.vertex_staging/

Additional resources


To learn more about PyTorch and Vertex AI integrations, visit the docs: cloud.google.com/vertex-ai/docs/start/pytorch

Key Points
  • Use CustomTrainingJob with a prebuilt PyTorch container; your script reads AIP_MODEL_DIR (set automatically by base_output_dir) to know where to write artifacts.
  • Keep artifacts together (model, metrics, history, log) in one GCS folder for reproducibility.
  • .npz is a compact, cloud-friendly format — one GCS read per file, preserves exact dtypes.
  • Start on CPU for small datasets; add a GPU only when training time justifies the extra provisioning overhead and cost.
  • staging_bucket is just for the SDK’s packaging tarball — base_output_dir is where your script’s actual artifacts go.

Content from Hyperparameter Tuning in Vertex AI: Neural Network Example


Last updated on 2026-03-05 | Edit this page

Overview

Questions

  • How can we efficiently manage hyperparameter tuning in Vertex AI?
  • How can we parallelize tuning jobs to optimize time without increasing costs?

Objectives

  • Set up and run a hyperparameter tuning job in Vertex AI.
  • Define search spaces using DoubleParameterSpec and IntegerParameterSpec.
  • Log and capture objective metrics for evaluating tuning success.
  • Optimize tuning setup to balance cost and efficiency, including parallelization.

In the previous episode (Episode 5) you submitted a single PyTorch training job to Vertex AI and inspected its artifacts. That gave you one model trained with one set of hyperparameters. In practice, choices like learning rate, early-stopping patience, and regularization thresholds can dramatically affect model quality — and the best combination is rarely obvious up front.

In this episode we’ll use Vertex AI’s Hyperparameter Tuning Jobs to systematically search for better settings. The key is defining a clear search space, ensuring metrics are properly logged, and keeping costs manageable by controlling the number of trials and level of parallelization.

Key steps for hyperparameter tuning

The overall process involves these steps:

  1. Prepare the training script and ensure metrics are logged.
  2. Define the hyperparameter search space.
  3. Configure a hyperparameter tuning job in Vertex AI.
  4. Set data paths and launch the tuning job.
  5. Monitor progress in the Vertex AI Console.
  6. Extract the best model and inspect recorded metrics.

Initial setup


1. Open pre-filled notebook

Navigate to /Intro_GCP_for_ML/notebooks/06-Hyperparameter-tuning.ipynb to begin this notebook. Select the PyTorch environment (kernel). Local PyTorch is only needed for local tests — your Vertex AI job uses the container specified by container_uri (e.g., pytorch-xla.2-4.py310), so it brings its own framework at run time.

2. CD to instance home directory

Change to your Jupyter home folder to keep paths consistent.

PYTHON

%cd /home/jupyter/

Prepare and configure the tuning job


3. Understand how the training script reports metrics

Your training script (train_nn.py) already includes hyperparameter tuning metric reporting — you don’t need to modify it. Here’s how it works:

The script uses the cloudml-hypertune library (pre-installed on Vertex AI training workers) to report metrics so the tuner can compare trials. A try/except block lets the same script run locally without crashing:

PYTHON

# Already in train_nn.py — initialization near the top:
try:
    from hypertune import HyperTune
    _hpt = HyperTune()
    _hpt_enabled = True
except Exception:
    _hpt = None
    _hpt_enabled = False

Inside the training loop, after computing validation metrics each epoch:

PYTHON

# Already in train_nn.py — inside the epoch loop:
if _hpt_enabled:
    _hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="validation_accuracy",
        metric_value=val_acc,
        global_step=ep,
    )

The critical detail: the hyperparameter_metric_tag string must exactly match the key you use in metric_spec when configuring the tuning job (e.g., "validation_accuracy"). If they don’t match, trials will show as INFEASIBLE.
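When you later build the tuning job, metric_spec maps that same tag to an optimization goal (this is the standard Vertex AI SDK shape). A tiny sanity check before submitting can save a wasted run:

PYTHON

```python
# The key must match hyperparameter_metric_tag in train_nn.py exactly.
metric_spec = {"validation_accuracy": "maximize"}

reported_tag = "validation_accuracy"  # what the script reports each epoch
assert reported_tag in metric_spec, "tag mismatch would make trials INFEASIBLE"
```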

4. Define hyperparameter search space

This step defines which parameters Vertex AI will vary across trials and their allowed ranges. The total number of combinations tested is set later via max_trial_count.

Vertex AI uses Bayesian optimization by default (internally listed as "ALGORITHM_UNSPECIFIED" in the API). If you don’t explicitly specify a search algorithm, Vertex AI applies an adaptive Bayesian strategy that balances exploration (trying new areas of the parameter space) with exploitation (focusing near the best results so far). Each completed trial helps the tuner model how your objective metric (for example, validation_accuracy) changes across parameter values, and subsequent trials sample combinations that are statistically more likely to improve performance. This usually yields better results than random or grid search — especially when max_trial_count is limited.

Vertex AI supports four parameter spec types. This episode uses the first two:

Spec type Use case Example
DoubleParameterSpec Continuous floats Learning rate 1e-4 to 1e-2
IntegerParameterSpec Whole numbers Patience 5 to 20
DiscreteParameterSpec Specific numeric values Batch size [32, 64, 128]
CategoricalParameterSpec Named options (strings) Optimizer [“adam”, “sgd”]

Include early-stopping parameters so the tuner can learn good stopping behavior for your dataset:

PYTHON

from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

parameter_spec = {
    "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-2, scale="log"),
    "patience": hpt.IntegerParameterSpec(min=5, max=20, scale="linear"),
    "min_delta": hpt.DoubleParameterSpec(min=1e-6, max=1e-3, scale="log"),
}

5. Initialize Vertex AI, project, and bucket

Initialize the Vertex AI SDK and set your staging and artifact locations in GCS.

PYTHON

from google.cloud import aiplatform, storage
import datetime as dt

client = storage.Client()
PROJECT_ID = client.project
REGION = "us-central1"
LAST_NAME = "DOE"  # change to your name or unique ID
BUCKET_NAME = "doe-titanic"  # replace with your bucket name

aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=f"gs://{BUCKET_NAME}/.vertex_staging",
)

6. Define runtime configuration

Create a unique run ID and set the container, machine type, and base output directory for artifacts. Each variable controls a different aspect of the training environment:

  • RUN_ID — a timestamp that uniquely identifies this tuning session, used to organize artifacts in GCS.
  • ARTIFACT_DIR — the GCS folder where all trial outputs (models, metrics, logs) will be written.
  • IMAGE — the prebuilt Docker container that includes PyTorch and its dependencies.
  • MACHINE — the VM shape (CPU/RAM) for each trial. Start small for testing.
  • ACCELERATOR_TYPE / ACCELERATOR_COUNT — set to unspecified/0 for CPU-only runs. As we saw in Episode 5, GPU overhead isn’t worth it for a dataset this small, and HP tuning launches multiple trials, so unnecessary GPUs multiply cost quickly. Change these to attach a GPU when your model or data genuinely benefits from one.

PYTHON

RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/pytorch_hpt/{RUN_ID}"

IMAGE = "us-docker.pkg.dev/vertex-ai/training/pytorch-xla.2-4.py310:latest"  # XLA container includes cloudml-hypertune
MACHINE = "n1-standard-4"
ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"
ACCELERATOR_COUNT = 0

7. Configure hyperparameter tuning job

When you use Vertex AI Hyperparameter Tuning Jobs, each trial needs a complete, runnable training configuration: the script, its arguments, the container image, and the compute environment.
Rather than defining these pieces inline each time, we create a CustomJob to hold that configuration.

The CustomJob acts as the blueprint for running a single training task — specifying exactly what to run and on what resources. The tuner then reuses that job definition across all trials, automatically substituting in new hyperparameter values for each run.

This approach has a few practical advantages:

  • You only define the environment once — machine type, accelerators, and output directories are all reused across trials.
  • The tuner can safely inject trial-specific parameters (those declared in parameter_spec) while leaving other arguments unchanged.
  • It provides a clean separation between what a single job does (CustomJob) and how many times to repeat it with new settings (HyperparameterTuningJob).
  • It avoids the extra abstraction layers of higher-level wrappers like CustomTrainingJob, which automatically package code and environments. Using CustomJob.from_local_script keeps the workflow predictable and explicit.

In short:
CustomJob defines how to run one training run.
HyperparameterTuningJob defines how to repeat it with different parameter sets and track results.

The total number of trials is set by max_trial_count, and the number of simultaneous trials is controlled by parallel_trial_count. Each trial’s output and metrics are logged under the GCS base_output_dir.

For a first pass, we’ll run 3 trials fully in parallel. With only 3 trials the adaptive optimizer has almost nothing to learn from, so running them simultaneously costs no search quality. This still validates that the full pipeline works end-to-end (metrics are reported, artifacts land in GCS, the tuner picks a best trial) while giving you a quick look at how results vary across different parameter combinations.

PYTHON

# Alternative objective (also reported by train_nn.py): {"validation_loss": "minimize"}
metric_spec = {"validation_accuracy": "maximize"}

custom_job = aiplatform.CustomJob.from_local_script(
    display_name=f"{LAST_NAME}_pytorch_hpt-trial_{RUN_ID}",
    script_path="Intro_GCP_for_ML/scripts/train_nn.py",
    container_uri=IMAGE,
    requirements=["python-json-logger>=2.0.7"],  # resolves a dependency conflict in the prebuilt container
    args=[
        f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
        f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
        "--learning_rate=0.001",        # HPT will override when sampling
        "--patience=10",                # HPT will override when sampling
        "--min_delta=0.001",            # HPT will override when sampling
    ],
    base_output_dir=ARTIFACT_DIR,
    machine_type=MACHINE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
)

DISPLAY_NAME = f"{LAST_NAME}_pytorch_hpt_{RUN_ID}"

# Start with a small batch of 3 trials, all in parallel.
# With so few trials the adaptive optimizer has nothing to learn from,
# so full parallelism costs no search quality — and finishes faster.
tuning_job = aiplatform.HyperparameterTuningJob(
    display_name=DISPLAY_NAME,
    custom_job=custom_job,                 # must be a CustomJob (not CustomTrainingJob)
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=3,                     # small initial sweep
    parallel_trial_count=3,                # all at once — adaptive search needs more data to help
    # search_algorithm="ALGORITHM_UNSPECIFIED",  # default = adaptive search (Bayesian)
    # search_algorithm="RANDOM_SEARCH",          # optional override
    # search_algorithm="GRID_SEARCH",            # optional override
)

tuning_job.run(sync=True)
print("HPT artifacts base:", ARTIFACT_DIR)

Run and analyze results


8. Monitor tuning job

Open Vertex AI → Training → Hyperparameter tuning jobs in the Cloud Console to track trials, parameters, and metrics. You can also stop jobs from the console if needed.

Note: Replace the project ID in the URL below with your own if you are not using the shared workshop project.

For the MLM25 workshop: Hyperparameter tuning jobs.

Callout

Troubleshooting common HPT issues

  • All trials show INFEASIBLE: The hyperparameter_metric_tag in your training script doesn’t match the key in metric_spec. Double-check spelling and case — "validation_accuracy" is not "val_accuracy".
  • Quota errors on launch: Your project may not have enough VM or GPU quota in the selected region. Check IAM & Admin → Quotas and request an increase or switch to a smaller MACHINE type.
  • Trial succeeds but metrics are empty: Make sure cloudml-hypertune is importable inside the container. The prebuilt PyTorch containers include it. If using a custom container, add cloudml-hypertune to your requirements.
  • Job stuck in PENDING: Another tuning or training job may be consuming your quota. Check Vertex AI → Training for running jobs.

9. Inspect best trial results

After completion, look up the best configuration and objective value from the SDK:

PYTHON

# tuning_job.trials lists trials in creation order, not best-first —
# select the trial with the highest objective metric explicitly:
best_trial = max(
    tuning_job.trials,
    key=lambda t: t.final_measurement.metrics[0].value if t.final_measurement else float("-inf"),
)
print("Best hyperparameters:", best_trial.parameters)
print("Best validation_accuracy:", best_trial.final_measurement.metrics)

10. Review recorded metrics in GCS

Your script writes a metrics.json (with keys such as final_val_accuracy, final_val_loss) to each trial’s output directory (under ARTIFACT_DIR). The snippet below aggregates those into a dataframe for side-by-side comparison.

PYTHON

from google.cloud import storage
import json, pandas as pd

def list_metrics_from_gcs(artifact_dir: str):
    client = storage.Client()
    bucket_name = artifact_dir.replace("gs://", "").split("/")[0]
    prefix = "/".join(artifact_dir.replace("gs://", "").split("/")[1:])
    blobs = client.list_blobs(bucket_name, prefix=prefix)

    records = []
    for blob in blobs:
        if blob.name.endswith("metrics.json"):
            # Path: …/{RUN_ID}/{trial_number}/model/metrics.json → [-3] = trial number
            trial_id = blob.name.split("/")[-3]
            data = json.loads(blob.download_as_text())
            data["trial_id"] = trial_id
            records.append(data)
    return pd.DataFrame(records)

df = list_metrics_from_gcs(ARTIFACT_DIR)
cols = ["trial_id","final_val_accuracy","final_val_loss","best_val_loss",
        "best_epoch","patience","min_delta","learning_rate"]
df_sorted = df[cols].sort_values("final_val_accuracy", ascending=False)
print(df_sorted)

11. Visualize trial comparison

A quick chart makes it easier to see which trials performed best and how learning rate relates to accuracy:

PYTHON

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart: accuracy per trial
axes[0].barh(df_sorted["trial_id"].astype(str), df_sorted["final_val_accuracy"])
axes[0].set_xlabel("Validation Accuracy")
axes[0].set_ylabel("Trial")
axes[0].set_title("Accuracy by Trial")

# Scatter: learning rate vs accuracy (color = patience)
sc = axes[1].scatter(
    df_sorted["learning_rate"], df_sorted["final_val_accuracy"],
    c=df_sorted["patience"], cmap="viridis", edgecolors="k", s=80,
)
axes[1].set_xscale("log")
axes[1].set_xlabel("Learning Rate (log scale)")
axes[1].set_ylabel("Validation Accuracy")
axes[1].set_title("LR vs. Accuracy (color = patience)")
plt.colorbar(sc, ax=axes[1], label="patience")

plt.tight_layout()
plt.show()
Challenge

Exercise 1: Widen the learning-rate search space

The current search space uses min=1e-4, max=1e-2 for learning rate. Suppose you suspect that slightly larger learning rates (up to 0.1) might converge faster with early stopping enabled.

  1. Update parameter_spec to widen the learning_rate range to max=0.1.
  2. Thinking question: Why does scale="log" make sense for learning rate but scale="linear" makes sense for patience?
  3. Do not run the job yet — just update the configuration.

PYTHON

parameter_spec = {
    "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
    "patience": hpt.IntegerParameterSpec(min=5, max=20, scale="linear"),
    "min_delta": hpt.DoubleParameterSpec(min=1e-6, max=1e-3, scale="log"),
}

Why log vs. linear? Learning rate values span several orders of magnitude (0.0001 to 0.1), so scale="log" ensures the tuner samples evenly across those orders rather than clustering near the high end. Patience is an integer (5–20) where each step is equally meaningful, so scale="linear" is appropriate.
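To see why, here is a quick pure-Python illustration (independent of Vertex AI) comparing log-uniform and linear-uniform sampling over the widened range:

```python
import math
import random

random.seed(0)  # reproducible illustration

def sample_log_uniform(lo, hi):
    # Uniform in log10 space, mapped back to the original scale.
    return 10 ** random.uniform(math.log10(lo), math.log10(hi))

n = 10_000
log_samples = [sample_log_uniform(1e-4, 1e-1) for _ in range(n)]
lin_samples = [random.uniform(1e-4, 1e-1) for _ in range(n)]

# Fraction below 1e-2 — two of the three decades in [1e-4, 1e-1]:
frac_log = sum(s < 1e-2 for s in log_samples) / n
frac_lin = sum(s < 1e-2 for s in lin_samples) / n
print(f"log-uniform below 1e-2:    {frac_log:.2f}")   # ~0.67
print(f"linear-uniform below 1e-2: {frac_lin:.2f}")   # ~0.10
```

With linear sampling, roughly 90% of trials would land in the top decade (0.01–0.1), starving the smaller learning rates of coverage.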

PYTHON

tuning_job = aiplatform.HyperparameterTuningJob(
    display_name=DISPLAY_NAME,
    custom_job=custom_job,
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=12,
    parallel_trial_count=3,
)

Cost estimate: 12 trials x 5 min each = 60 minutes of compute. At ~ $0.19/hr for n1-standard-4, that’s roughly $0.19 total. With parallel_trial_count=3, wall-clock time would be approximately 20 minutes (4 batches of 3 trials).
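The arithmetic behind that estimate, spelled out (the $0.19/hr rate is approximate):

```python
import math

max_trials = 12
parallel = 3
minutes_per_trial = 5
hourly_rate = 0.19  # approximate n1-standard-4 on-demand rate (USD/hr)

compute_minutes = max_trials * minutes_per_trial     # 60 billed minutes total
batches = math.ceil(max_trials / parallel)           # 4 sequential batches of 3
wall_clock_minutes = batches * minutes_per_trial     # ~20 minutes of waiting
total_cost = compute_minutes / 60 * hourly_rate      # ~$0.19

print(compute_minutes, wall_clock_minutes, round(total_cost, 2))  # 60 20 0.19
```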

Why not run all 12 in parallel? With 12 trials we have enough data for the adaptive optimizer to learn: after each batch of 3 completes, the tuner updates its model of which regions of the search space are promising and steers the next batch toward them. Running all 12 at once would turn the search into an expensive random sweep — every trial would be launched “blind” before any results come back.

Discussion

What is the effect of parallelism in tuning?

  • How might running 10 trials in parallel differ from running 2 at a time in terms of cost, time, and result quality?
  • When would you want to prioritize speed over adaptive search benefits?
Factor High parallelism (e.g., 10) Low parallelism (e.g., 2)
Wall-clock time Shorter Longer
Total cost ~Same (slightly more overhead) ~Same
Adaptive search quality Worse (tuner explores “blind”) Better (tuner learns between batches)
Best for Cheap/short trials, deadlines Expensive trials, small budgets

Why does parallelism hurt result quality? Vertex AI’s adaptive search learns from completed trials to choose better parameter combinations. With many trials in flight simultaneously, the tuner can’t incorporate results quickly — it explores “blind” for longer, often yielding slightly worse results for a fixed max_trial_count. With modest parallelism (2–4), the tuner can update beliefs and exploit promising regions between batches.

Guidelines:

  • Keep parallel_trial_count to ≤ 25–33% of max_trial_count when you care about adaptive quality.
  • Increase parallelism when trials are long and the search space is well-bounded.

Callout

When to prioritize speed vs. adaptive quality

Favor higher parallelism when you have strict deadlines, very cheap/short trials where startup time dominates, a non-adaptive search, or unused quota/credits.

Favor lower parallelism when trials are expensive or noisy, max_trial_count is small (≤ 10–20), early stopping is enabled, or you’re exploring many dimensions at once.

Practical recipe:

  • First run: max_trial_count=3, parallel_trial_count=3 (pipeline sanity check — too few trials for adaptive search to help, so run them all at once).
  • Main run: max_trial_count=10–20, parallel_trial_count=2–4 (enough trials for the optimizer to learn between batches).
  • Scale up parallelism only after the above completes cleanly and you confirm adaptive performance is acceptable.

Clean up staging files


HP tuning launches multiple trials, so staging tarballs accumulate even faster. Delete them when you’re done:

PYTHON

!gsutil -m rm -r gs://{BUCKET_NAME}/.vertex_staging/

What’s next: using your tuned model


After tuning, your best model’s weights sit in GCS under the best trial’s artifact directory. The most common next steps are:

  • Batch prediction (most common): Load the best model from GCS and run inference on a dataset — this is what we did in the evaluation sections of Episodes 4–5 when we loaded models from GCS into memory. For larger-scale batch prediction, Vertex AI offers Batch Prediction Jobs that handle provisioning and scaling automatically.
  • Experiment tracking: Vertex AI Experiments can log metrics, parameters, and artifacts across runs for systematic comparison. Consider integrating this into your workflow as your projects grow.
  • Online deployment: If you need real-time predictions via an API, Vertex AI Endpoints let you deploy your model — but endpoints bill continuously (~ $4.50/day for an n1-standard-4), so only deploy when you genuinely need a live API.
Key Points
  • Vertex AI Hyperparameter Tuning Jobs efficiently explore parameter spaces using adaptive strategies.
  • Define parameter ranges in parameter_spec; the number of settings tried is controlled later by max_trial_count.
  • The hyperparameter_metric_tag reported by cloudml-hypertune must exactly match the key in metric_spec.
  • Limit parallel_trial_count (2–4) to help adaptive search.
  • Use GCS for input/output and aggregate metrics.json across trials for detailed analysis.

Content from Retrieval-Augmented Generation (RAG) with Vertex AI


Last updated on 2026-03-05 | Edit this page

Overview

Questions

  • How do we go from “a pile of PDFs” to “ask a question and get a cited answer” using Google Cloud tools?
  • What are the key parts of a RAG system (chunking, embedding, retrieval, generation), and how do they map onto Vertex AI services?
  • How much does each part of this pipeline cost (VM time, embeddings, LLM calls), and where can we keep it cheap?

Objectives

  • Unpack the core RAG pipeline: ingest → chunk → embed → retrieve → answer.
  • Run a minimal, fully programmatic RAG loop on a Vertex AI Workbench VM using Google’s foundation models for embeddings and generation.
  • Answer questions using content from provided papers and return grounded answers backed by source text, not unverifiable claims.

Background concepts


This episode shifts from classical ML training (Episodes 4–6) to working with large language models (LLMs). If any of the following terms are new to you, here’s a quick primer:

  • Embeddings: A numerical vector (list of numbers) that represents the meaning of a piece of text. Texts with similar meanings have similar vectors. This lets us search “by meaning” rather than by keyword matching.
  • Cosine similarity: A measure of how similar two vectors are (1.0 = identical direction, 0.0 = unrelated). Used to find which stored text chunks are most relevant to a question.
  • Large Language Model (LLM): A model (like Gemini, GPT, or LLaMA) trained on massive text corpora that can generate coherent text given a prompt. In this episode, we use an LLM to answer questions based on retrieved text, not to train one from scratch.
  • Foundation model APIs: In this episode, we use the google-genai client library to access Google’s managed embedding and generation models. This is separate from the google-cloud-aiplatform SDK used for training jobs in earlier episodes.
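Cosine similarity is simple enough to compute directly. A toy example with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for embeddings of three words:
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
car = np.array([0.0, 0.2, 0.9])

print(round(cosine_similarity(cat, kitten), 2))  # ~0.98 — similar meaning
print(round(cosine_similarity(cat, car), 2))     # ~0.02 — unrelated
```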

Overview: What we’re building


Retrieval-Augmented Generation (RAG) is a pattern:

  1. You ask a question.
  2. The system retrieves relevant passages from your PDFs or data.
  3. An LLM answers using those passages only, with citations.

This approach is useful any time you need to ground an LLM’s answers in a specific corpus — research papers, policy documents, lab notebooks, etc. For example, a sustainability research team could use this pipeline to extract AI water and energy metrics from published papers, getting cited answers instead of generic LLM summaries.

Architecture diagram showing the RAG pipeline: a Workbench notebook orchestrates document chunking, embedding via the Gemini API, and retrieval-augmented generation, with documents and embeddings stored in a GCS bucket.
RAG pipeline with Gemini API

About the corpus

Our corpus is a curated bundle of 32 research papers on the environmental and economic costs of AI — topics like training energy, inference power consumption, water footprint, and carbon emissions. The papers span 2019–2025 and include titles such as “Green AI”, “Making AI Less Thirsty”, and “The ML.ENERGY Benchmark”. They’re shipped as data/pdfs_bundle.zip in the lesson repository so that everyone works with the same documents. You could swap in your own PDFs — the pipeline is corpus-agnostic.

Step 1: Set up the environment


Navigate to /Intro_GCP_for_ML/notebooks/07-Retrieval-augmented-generation.ipynb to begin this notebook. Select the Python 3 (ipykernel) kernel — this episode uses only the google-genai client library and scikit-learn, so no PyTorch or TensorFlow kernel is needed.

CD to instance home directory

To ensure we’re all in the same starting spot, change directory to your Jupyter home directory.

PYTHON

%cd /home/jupyter/

We need the pypdf library to extract text from PDF files.

PYTHON

!pip install --quiet --upgrade pypdf

Cost note: Installing packages is free; you’re only billed for VM runtime.

Initialize project

We initialize the vertexai SDK to give our notebook access to Google’s foundation models (embeddings and Gemini). Both the project ID and region are needed so API calls are billed to your project.

PYTHON

from vertexai import init as vertexai_init
import os

PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "<YOUR_PROJECT_ID>")
REGION = "us-central1"

vertexai_init(project=PROJECT_ID, location=REGION)
print("Initialized:", PROJECT_ID, REGION)

Step 2: Extract and chunk PDFs


Before we can search our documents, we need to break them into smaller pieces (“chunks”). Embedding models produce better vectors from focused passages than from entire papers, and LLMs have limited context windows. The code below extracts text from each PDF and splits it into overlapping chunks of roughly 1,200 characters.

PYTHON

import zipfile, pathlib, re, pandas as pd
from pypdf import PdfReader

ZIP_PATH = pathlib.Path("Intro_GCP_for_ML/data/pdfs_bundle.zip")
DOC_DIR  = pathlib.Path("/home/jupyter/docs")
DOC_DIR.mkdir(exist_ok=True)

# unzip
with zipfile.ZipFile(ZIP_PATH, "r") as zf:
    zf.extractall(DOC_DIR)

def chunk_text(text, max_chars=1200, overlap=150):
    for i in range(0, len(text), max_chars - overlap):
        yield text[i:i+max_chars]

rows = []
for pdf in DOC_DIR.glob("*.pdf"):
    txt = ""
    for page in PdfReader(str(pdf)).pages:
        txt += page.extract_text() or ""
    for i, chunk in enumerate(chunk_text(re.sub(r"\s+", " ", txt))):
        rows.append({"doc": pdf.name, "chunk_id": i, "text": chunk})

corpus_df = pd.DataFrame(rows)
print(len(corpus_df), "chunks created")

Cost note: Only VM runtime applies. Chunk size affects future embedding cost — fewer, larger chunks mean fewer API calls but potentially noisier embeddings.

Callout

Why these chunking parameters?

The max_chars=1200 / overlap=150 values are practical defaults, not magic numbers:

  • 1,200 characters (~200–300 tokens) keeps each chunk within a single focused idea while staying well under the embedding model’s 2,048-token input limit.
  • 150-character overlap ensures that sentences split across chunk boundaries are still captured in at least one chunk.
  • Character-based splitting is simple and predictable. Sentence-level or paragraph-level chunking can produce better results but requires an NLP tokenizer and more code.

Chunk size is a key tuning knob: smaller chunks give more precise retrieval but lose surrounding context; larger chunks preserve context but may dilute the embedding with irrelevant text. There’s no single best answer — experiment with your own corpus.
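You can see the cost side of this tradeoff directly by counting chunks. A self-contained sketch using the same splitting logic as Step 2 (the 12,000-character string is a stand-in for one extracted paper):

```python
def chunk_text(text, max_chars=1200, overlap=150):
    # Fixed character windows with overlap, as in the pipeline below.
    for i in range(0, len(text), max_chars - overlap):
        yield text[i:i + max_chars]

doc = "x" * 12_000  # stand-in for one extracted paper (~12k characters)

n_small = len(list(chunk_text(doc, max_chars=500)))
n_default = len(list(chunk_text(doc)))
n_large = len(list(chunk_text(doc, max_chars=2500)))

# Smaller chunks → more chunks → more embedding API calls (and cost):
print(n_small, n_default, n_large)  # 35 12 6
```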

Step 3: Embed the corpus with Vertex AI


Now we convert each text chunk into a numerical vector (an “embedding”) so we can search by meaning rather than keywords. We use Google’s gemini-embedding-001 model — currently the top-ranked Google embedding model on the MTEB leaderboard. It accepts up to 2,048 input tokens per text (~1,500 words), supports 100+ languages, and uses Matryoshka Representation Learning so you can choose your output dimensions (768, 1,536, or 3,072) without retraining — smaller dimensions save memory and speed up search, while larger ones preserve more semantic detail. See the Choosing an embedding model callout later in this episode for alternatives.

Initialize the Gen AI client

PYTHON

from google import genai
from google.genai.types import HttpOptions, EmbedContentConfig, GenerateContentConfig
import numpy as np

client = genai.Client(
    http_options=HttpOptions(api_version="v1"),
    vertexai=True,          # route calls through your GCP project for billing
    project=PROJECT_ID,
    location=REGION,
)

# Embedding model and dimensions
EMBED_MODEL_ID = "gemini-embedding-001"
EMBED_DIM = 1536   # valid choices: 768, 1536, 3072

Build the embedding helper

The helper below converts text strings into embedding vectors in batches. Notice the task_type parameter: the Gemini embedding model optimizes its vectors differently depending on whether the input is a document being indexed or a query being searched. Using RETRIEVAL_DOCUMENT for corpus chunks and RETRIEVAL_QUERY for user questions produces better retrieval accuracy than using a single task type for both.

PYTHON

def embed_texts(text_list, batch_size=32, dims=EMBED_DIM, task_type="RETRIEVAL_DOCUMENT"):
    """
    Embed a list of strings using gemini-embedding-001.
    Returns a NumPy array of shape (len(text_list), dims).

    task_type should be "RETRIEVAL_DOCUMENT" for corpus chunks
    and "RETRIEVAL_QUERY" for user questions.
    """
    vectors = []
    for start in range(0, len(text_list), batch_size):
        batch = text_list[start : start + batch_size]
        resp = client.models.embed_content(
            model=EMBED_MODEL_ID,
            contents=batch,
            config=EmbedContentConfig(
                task_type=task_type,
                output_dimensionality=dims,
            ),
        )
        for emb in resp.embeddings:
            vectors.append(emb.values)
    return np.array(vectors, dtype="float32")

Embed all chunks and build the retrieval index

We embed the full corpus, then build a nearest-neighbors index so that future queries are fast. Think of this as two separate stages:

  1. Embed & index (now) — We convert every chunk into a vector and hand the matrix to scikit-learn’s NearestNeighbors. Calling .fit() here doesn’t train a model — it organizes the vectors into a data structure optimized for similarity search (like building a phone book before anyone looks up a number).
  2. Query (later, in Step 4) — When a user question arrives, we embed that question and call .kneighbors() to find the corpus vectors closest to it by cosine similarity.

We set metric="cosine" so the index knows how to measure closeness when queries arrive. The n_neighbors=5 default means each query returns the 5 most relevant chunks — enough to give the LLM good context without overwhelming it with noise. You can tune this: fewer neighbors (3) gives more focused answers; more (10) gives broader coverage at the cost of including less-relevant text.

PYTHON

from sklearn.neighbors import NearestNeighbors

# Embed every chunk in the corpus
emb_matrix = embed_texts(corpus_df["text"].tolist(), dims=EMBED_DIM)
print("emb_matrix shape:", emb_matrix.shape)   # (num_chunks, EMBED_DIM)

# Build nearest-neighbors index
nn = NearestNeighbors(metric="cosine", n_neighbors=5)
nn.fit(emb_matrix)

Step 4: Retrieve and generate answers with Gemini


With embeddings indexed, we can now build the two remaining pieces of the RAG pipeline: a retrieve function that finds relevant chunks for a question, and an ask function that sends those chunks to Gemini for a grounded answer.

Retrieve relevant chunks

PYTHON

def retrieve(query, k=5):
    """
    Embed the user query and find the top-k most similar corpus chunks.
    Returns a DataFrame with a 'similarity' column.
    """
    query_vec = embed_texts(
        [query], dims=EMBED_DIM, task_type="RETRIEVAL_QUERY"
    )[0]

    distances, indices = nn.kneighbors([query_vec], n_neighbors=k, return_distance=True)

    result_df = corpus_df.iloc[indices[0]].copy()
    result_df["similarity"] = 1 - distances[0]   # cosine distance → similarity
    return result_df.sort_values("similarity", ascending=False)

Generate a grounded answer

The ask() function ties the full pipeline together: retrieve → build prompt → call Gemini. The temperature=0.2 setting keeps answers factual and deterministic. The prompt instructs Gemini to answer only from the provided context and cite the source chunks.

PYTHON

GENERATION_MODEL_ID = "gemini-2.5-pro"   # or "gemini-2.5-flash" for cheaper/faster

def ask(query, top_k=5, temperature=0.2):
    """
    Full RAG pipeline: retrieve context, build prompt, generate answer.
    """
    hits = retrieve(query, k=top_k)

    # Build context block with source tags for citation
    context_lines = [
        f"[{row.doc}#chunk-{row.chunk_id}] {row.text}"
        for _, row in hits.iterrows()
    ]
    context_block = "\n\n".join(context_lines)

    prompt = (
        "You are a research assistant. "
        "Use only the following context to answer the question. "
        "Cite your sources using the [doc#chunk] tags.\n\n"
        f"{context_block}\n\n"
        f"Q: {query}\n"
        "A:"
    )

    response = client.models.generate_content(
        model=GENERATION_MODEL_ID,
        contents=prompt,
        config=GenerateContentConfig(temperature=temperature),
    )
    return response.text

Test the pipeline end-to-end

PYTHON

print(ask("How much energy does it cost to train a large language model?"))
Challenge

Challenge 1: Explore chunk size tradeoffs

Change the max_chars parameter in chunk_text() to 500 and then to 2500. Re-run the chunking, embedding, and retrieval steps each time, then ask the same question.

  • How does the number of chunks change?
  • Does the answer quality improve or degrade?
  • Which chunk size gives the best balance of precision and context?

Smaller chunks (500 chars) produce more precise retrieval hits but each chunk has less context, so Gemini may struggle to synthesize a complete answer. Larger chunks (2,500 chars) preserve more context but may dilute the embedding with unrelated text, leading to less accurate retrieval. For most research-paper corpora, 800–1,500 characters is a practical sweet spot.

Challenge

Challenge 2: Test hallucination behavior

Ask a question that has no answer in the corpus — for example:

PYTHON

print(ask("What was the GDP of France in 2019?"))
  • Does Gemini refuse to answer, or does it hallucinate?
  • Try modifying the system prompt in ask() to add: “If the context does not contain enough information to answer, say ‘I don’t have enough information to answer this.’”
  • Does the modified prompt change the behavior?

Without the guardrail prompt, Gemini may produce a plausible-sounding answer from its training data, ignoring the “use only the following context” instruction. Adding an explicit refusal instruction significantly reduces hallucination. This is a key lesson: prompt engineering is part of RAG system design, not just model selection.
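One way to wire the guardrail in, as a sketch (the exact splice point inside ask()'s prompt is up to you; the context and query below are placeholders):

```python
# Sketch only — appends a refusal instruction to the system text from Step 4.
GUARDRAIL = (
    "If the context does not contain enough information to answer, "
    "say 'I don't have enough information to answer this.'"
)

context_block = "[paper.pdf#chunk-0] ...retrieved text..."  # placeholder context
query = "What was the GDP of France in 2019?"

prompt = (
    "You are a research assistant. "
    "Use only the following context to answer the question. "
    "Cite your sources using the [doc#chunk] tags. "
    f"{GUARDRAIL}\n\n"
    f"{context_block}\n\n"
    f"Q: {query}\nA:"
)

assert "enough information" in prompt  # guardrail present in every prompt
```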

Challenge

Challenge 3: Compare gemini-2.5-pro vs gemini-2.5-flash

Change GENERATION_MODEL_ID to "gemini-2.5-flash" and ask the same question.

PYTHON

# Change the generation model and re-run a query
GENERATION_MODEL_ID = "gemini-2.5-flash"
print(ask("How much energy does it cost to train a large language model?"))
  • Is the answer quality noticeably different?
  • How does response time compare?
  • Check the Vertex AI pricing page — what’s the cost difference per million tokens?

For well-grounded RAG queries (where the answer is clearly in the context), Flash often produces comparable answers at significantly lower cost and latency. Pro shines when the question requires more nuanced reasoning across multiple chunks. For workshop-scale workloads, Flash is usually sufficient and much cheaper.

Challenge

Challenge 4: Tune retrieval depth with top_k

Call ask() with top_k=2 and then with top_k=10. Compare the answers.

PYTHON

# Try different retrieval depths
print("--- top_k=2 ---")
print(ask("How much energy does it cost to train a large language model?", top_k=2))

print("\n--- top_k=10 ---")
print(ask("How much energy does it cost to train a large language model?", top_k=10))
  • With top_k=2, does Gemini miss relevant information?
  • With top_k=10, does the extra context help or introduce noise?
  • What value of top_k seems to work best for your question?

Lower top_k gives Gemini a tighter, more focused context — good when the answer is localized in one or two chunks. Higher top_k provides broader coverage but risks including irrelevant passages that can confuse the model or dilute the answer. A good default is 3–5 for most research-paper RAG tasks. For questions that span multiple sections of a paper, higher values help.

Challenge

Challenge 5: Try different questions

The quality of a RAG system depends heavily on the questions you ask. Try these queries — each tests a different aspect of retrieval and generation:

PYTHON

# Off-topic question — not covered by the corpus at all
print(ask("How much does an elephant weigh?"))

print("\n" + "="*60 + "\n")

# Comparative question — requires synthesizing across sources
print(ask("Is cloud computing more energy efficient than university HPC clusters?"))

print("\n" + "="*60 + "\n")

# Opinion/marketing question — may tempt the model to go beyond the corpus
print(ask("Is Google Cloud the best cloud provider option?"))

For each question, consider:

  • Does the answer cite specific numbers or papers from the corpus?
  • Does Gemini stay grounded in the retrieved context, or does it add outside knowledge?
  • Which question produces the most useful, well-supported answer?

The elephant-weight question is deliberately off-topic: the corpus covers the environmental costs of AI, not zoology, so a well-behaved RAG system should report that the context contains no relevant information rather than answer from general knowledge. The cloud-vs-HPC question forces the model to synthesize across sources; look for whether it hedges appropriately when papers disagree. The “best cloud provider” question is a different trap: the corpus says nothing about provider rankings, so the system should decline to give a definitive answer rather than produce marketing-style claims.

Step 5: Cost summary


Understanding the cost of each pipeline component helps you decide where to optimize. For a small workshop with a handful of PDFs, total costs are typically well under $1.

Step Resource Cost Driver Typical Range
VM runtime Vertex AI Workbench (n1-standard-4) Uptime (hourly) ~ $0.20/hr
Embeddings gemini-embedding-001 Tokens embedded (one-time) ~ $0.10 / 1M tokens
Retrieval Local NearestNeighbors CPU only Free
Generation gemini-2.5-pro Input + output tokens per query ~ $1.25–$10 / 1M tokens
Generation (alt) gemini-2.5-flash Input + output tokens per query ~ $0.15–$0.60 / 1M tokens

Tip: Embeddings are the best investment — compute them once, reuse them for every query. Generation is the ongoing cost; choosing Flash over Pro and keeping prompts concise are the two biggest levers.
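As a back-of-envelope check, you can turn the table’s rates into per-query numbers. A sketch, assuming the Pro range means roughly $1.25 per 1M input tokens and $10 per 1M output tokens (rates change; confirm on the Vertex AI pricing page):

PYTHON

```python
def query_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Estimated cost in USD of one generation call, given per-1M-token rates."""
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# One RAG query: ~3,000 prompt/context tokens in, ~500 tokens out
pro_cost = query_cost(3000, 500, in_rate=1.25, out_rate=10.00)
flash_cost = query_cost(3000, 500, in_rate=0.15, out_rate=0.60)

print(f"Pro:   ${pro_cost:.5f} per query")
print(f"Flash: ${flash_cost:.5f} per query")
print(f"Pro is ~{pro_cost / flash_cost:.0f}x the cost of Flash for this query")
```

Even at Pro rates a single query costs well under a cent; the gap only matters once query volume grows, which is why generation is the ongoing cost to watch.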

Callout

Common issues and troubleshooting

  • Rate limiting on the Gemini API: If you see 429 Resource Exhausted errors, wait 30–60 seconds and retry. For large corpora, add a short time.sleep(1) between embedding batches.
  • PDFs with no extractable text: Scanned documents or image-heavy PDFs will return empty strings from PdfReader. Check for empty chunks with corpus_df[corpus_df["text"].str.strip() == ""] and drop them before embedding.
  • Embeddings fail mid-batch: If an embedding call fails partway through, you’ll have partial results. Consider saving emb_matrix to disk after each batch so you can resume rather than re-embedding everything.
  • “Project not found” or permission errors: Make sure your PROJECT_ID matches the project where Vertex AI APIs are enabled. Run gcloud config get-value project in a terminal cell to verify.
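For the rate-limiting issue, a small retry wrapper with exponential backoff usually suffices. This is a hedged sketch: flaky_embed_batch and the simulated error are stand-ins for your real embedding call and its 429 exception type:

PYTHON

```python
import time

def with_retries(fn, max_attempts=4, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying with exponential backoff (1s, 2s, 4s, ...) on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: re-raise the last error
            time.sleep(base_delay * 2 ** attempt)

# Demo with a flaky stand-in that fails twice, then succeeds
calls = {"n": 0}
def flaky_embed_batch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Resource Exhausted")  # simulated rate-limit error
    return ["<embedding>"]

result = with_retries(flaky_embed_batch, base_delay=0.01, retry_on=(RuntimeError,))
print(result, "after", calls["n"], "calls")
```

In a real pipeline you would pass a lambda wrapping your batch embedding call and catch the SDK’s specific resource-exhausted exception rather than a bare RuntimeError.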
Callout

Choosing an embedding model

We use gemini-embedding-001 in this episode, but Vertex AI offers several alternatives in the Model Garden:

  • text-embedding-005 — older model, 768-dimensional output, still widely used.
  • multimodal-embedding-001 — supports image + text embeddings for richer use cases.
  • Third-party models (via Model Garden) — e.g., bge-large-en, cohere-embed-v3, all-MiniLM.

When choosing, consider: output dimensions (higher = more expressive but more memory), token limits, multilingual support, and pricing.
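One concrete way to reason about the “higher dimensions = more memory” tradeoff: an in-memory index stores one float per dimension per chunk. A quick estimate, assuming float32 storage (4 bytes per value):

PYTHON

```python
def index_size_mb(num_chunks, dims, bytes_per_float=4):
    """Approximate in-memory size of an embedding matrix, in megabytes."""
    return num_chunks * dims * bytes_per_float / (1024 ** 2)

# Compare a 768-dim model (e.g., text-embedding-005) with a 3072-dim one
for dims in (768, 3072):
    print(f"{dims:>4} dims, 10,000 chunks: {index_size_mb(10_000, dims):.1f} MB")
```

Both fit comfortably in memory at workshop scale; the difference starts to matter at millions of chunks, which is also when a managed vector database becomes attractive.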

Cleanup note

The embeddings and nearest-neighbors index in this episode are held in memory — they disappear when your notebook kernel restarts or your VM stops. No persistent cloud resources (endpoints, buckets, or managed indexes) were created, so there’s nothing extra to clean up beyond the VM itself. If you’re done for the day, stop your Workbench Instance to avoid ongoing charges (see Episode 9).

Key takeaways


  • Chunk → embed → retrieve → generate is the core RAG loop. Each step has its own tuning knobs.
  • Use Vertex AI managed embeddings and Gemini for a low-ops, cost-controlled pipeline.
  • Cache embeddings — computing them once and reusing them saves the most cost.
  • Prompt engineering matters — how you instruct the LLM to use (or refuse to use) the context directly affects answer quality and hallucination risk.
  • This workflow generalizes to any retrieval task — research papers, policy documents, lab notebooks, etc.
Callout

Hugging Face / open-model alternatives

You can replace the Google-managed APIs used in this episode with open-source models:

  • Embeddings: sentence-transformers/all-MiniLM-L6-v2, BAAI/bge-large-en-v1.5
  • Generators: google/gemma-2b-it, mistralai/Mistral-7B-Instruct, or tiiuae/falcon-7b-instruct

This requires a GPU VM (e.g., n1-standard-8 + T4) and manual model management. Rather than running a large GPU in Workbench, you can launch Vertex AI custom jobs that perform the embedding and generation steps — start with a PyTorch container image and add the HuggingFace libraries as requirements.

What’s next?


This episode built a minimal RAG pipeline from scratch. Here’s where to go from here depending on your goals:

  • Vertex AI Vector Search — Replace the in-memory NearestNeighbors index with a managed, scalable vector database for production workloads with millions of documents.
  • Vertex AI Agent Builder — Build managed RAG applications with built-in grounding, chunking, and retrieval — less code, more guardrails.
  • Evaluation and iteration — Measure retrieval quality (precision@k, recall@k) and generation quality (faithfulness, relevance) to systematically improve your pipeline.
  • Advanced chunking — Explore sentence-level splitting (with spaCy or nltk), recursive chunking, or document-structure-aware chunking for better retrieval on complex papers.
  • Deploying RAG in Bedrock vs. Local: WattBot 2025 Case Study — See how the same sustainability-paper corpus powers a production RAG system deployed on AWS Bedrock and local hardware, with comparisons of cost, latency, and model choice.
Key Points
  • RAG grounds LLM answers in your own data — retrieve first, then generate.
  • Vertex AI provides managed embedding and generation APIs that require minimal infrastructure.
  • Chunk size, retrieval depth (top_k), and prompt design are the primary tuning levers.
  • Always cite retrieved chunks for reproducibility and transparency.
  • Embeddings are computed once and reused; generation cost scales with query volume.

Content from Bonus: CLI Workflows Without Notebooks


Last updated on 2026-03-06 | Edit this page

Overview

Questions

  • How do I submit Vertex AI training jobs from the command line instead of a Jupyter notebook?
  • What does authentication look like when working outside of a Workbench VM?
  • Can I manage GCS buckets, training jobs, and endpoints entirely from a terminal?

Objectives

  • Authenticate with GCP and set a default project using the gcloud CLI.
  • Upload data to GCS and submit a Vertex AI custom training job from the terminal.
  • Monitor, cancel, and clean up jobs using gcloud ai commands.
  • Understand when CLI workflows are more practical than notebooks.
Callout

Bonus episode

This episode is not part of the standard workshop flow. It covers CLI alternatives to the notebook-based workflows from earlier episodes. Contributions and feedback are welcome — open an issue or pull request on the lesson repository.

Why use the CLI?


Throughout this workshop we used Jupyter notebooks on a Vertex AI Workbench VM as our control center. That setup is great for teaching, but it is not the only way — and sometimes it is not the best way. Common situations where a terminal-based workflow makes more sense:

  • Automation and CI/CD — You want a GitHub Actions workflow or a cron job to kick off training runs. Notebooks require manual interaction; shell scripts do not.
  • SSH into an HPC cluster or remote server — You already have a terminal session and do not want to spin up a Workbench VM just to submit a job.
  • Reproducibility — A shell script checked into version control is easier to review and reproduce than a notebook with hidden state.
  • Cost — If all you need is to submit a job, paying for a Workbench VM while you wait is unnecessary. You can submit from Cloud Shell (free) or your laptop.

Everything we did with the Python SDK in Episodes 4–6 has an equivalent gcloud command. This episode walks through the key ones.

Step 1: Install and authenticate


If you are on a Workbench VM, the gcloud CLI is already installed and authenticated via the VM’s service account. On your laptop or another machine you need to install and log in.

Install the gcloud CLI

Follow the instructions for your platform at cloud.google.com/sdk/docs/install. On most systems this is a single installer or package manager command:

BASH

# macOS (Homebrew)
brew install --cask google-cloud-sdk

# Ubuntu / Debian
sudo apt-get install google-cloud-cli

# Windows — download the installer from the link above

Authenticate

BASH

# Interactive browser-based login (laptop / desktop)
gcloud auth login

# Set your default project so you don't need --project on every command
gcloud config set project YOUR_PROJECT_ID

# Set a default region (optional but saves typing)
gcloud config set compute/region us-central1

On a Workbench VM these steps are already done for you — the VM’s attached service account provides credentials automatically. This is the authentication convenience mentioned in Episode 2.

Application Default Credentials

If you also want to use the Python SDK (e.g., aiplatform.init()) outside of a Workbench VM, you need Application Default Credentials (ADC):

BASH

gcloud auth application-default login

This stores a credential file locally that Google client libraries pick up automatically. Without it, Python SDK calls from your laptop will fail with an authentication error.

Step 2: Upload data to GCS


In Episode 3 we uploaded data through the Cloud Console. From the CLI the equivalent is:

BASH

# Create a bucket (if it doesn't already exist)
gcloud storage buckets create gs://doe-titanic \
    --location=us-central1

# Upload the Titanic CSV files
gcloud storage cp ~/Downloads/data/titanic_train.csv gs://doe-titanic/
gcloud storage cp ~/Downloads/data/titanic_test.csv  gs://doe-titanic/

# Verify
gcloud storage ls gs://doe-titanic/
Callout

gsutil vs gcloud storage

Older tutorials may reference gsutil. Google now recommends gcloud storage as the primary CLI for Cloud Storage. The commands are very similar (gsutil cp → gcloud storage cp), but gcloud storage is faster for large transfers and receives more active development.

Step 3: Submit a training job


In Episode 4 we used the Python SDK to create and run a CustomTrainingJob. The gcloud equivalent is gcloud ai custom-jobs create. You provide a JSON or YAML config file that describes the job.

Write a job config file

Create a file called xgb_job.yaml:

YAML

# xgb_job.yaml — Vertex AI custom training job config
# Note: display_name goes on the command line (--display-name), not in this file.
# The --config file describes the job *spec* only, using snake_case field names.
worker_pool_specs:
  - machine_spec:
      machine_type: n1-standard-4
    replica_count: 1
    container_spec:
      image_uri: us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.2-1:latest
      args:
        - "--train=gs://doe-titanic/titanic_train.csv"
        - "--max_depth=6"
        - "--eta=0.3"
        - "--subsample=0.8"
        - "--colsample_bytree=0.8"
        - "--num_round=100"
base_output_directory:
  output_uri_prefix: gs://doe-titanic/artifacts/xgb/cli-run/

Replace the bucket name and hyperparameters to match your setup.
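If you prefer to generate the config programmatically (for example, varying hyperparameters per run), the same spec can be written as JSON with the standard library; as noted above, gcloud ai custom-jobs create accepts a JSON or YAML config file. A sketch reusing the field names from the YAML above (bucket and image URI are the same placeholders):

PYTHON

```python
import json

# Same job spec as xgb_job.yaml, built as a Python dict
job_spec = {
    "worker_pool_specs": [
        {
            "machine_spec": {"machine_type": "n1-standard-4"},
            "replica_count": 1,
            "container_spec": {
                "image_uri": "us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.2-1:latest",
                "args": [
                    "--train=gs://doe-titanic/titanic_train.csv",
                    "--max_depth=6",
                    "--eta=0.3",
                    "--num_round=100",
                ],
            },
        }
    ],
    "base_output_directory": {
        "output_uri_prefix": "gs://doe-titanic/artifacts/xgb/cli-run/"
    },
}

with open("xgb_job.json", "w") as f:
    json.dump(job_spec, f, indent=2)

print("Wrote xgb_job.json; submit with:")
print("  gcloud ai custom-jobs create --region=us-central1 "
      "--display-name=cli-xgb-titanic --config=xgb_job.json")
```

A script like this makes hyperparameter sweeps easy: loop over values, write one config per run, and submit each with a distinct display name.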

Submit the job

BASH

gcloud ai custom-jobs create \
    --region=us-central1 \
    --display-name=cli-xgb-titanic \
    --config=xgb_job.yaml
Callout

Windows users — line continuation syntax

The \ at the end of each line is a Linux / macOS line continuation character. It does not work in the Windows Command Prompt. You have three options:

  1. Put the command on one line (easiest):

    gcloud ai custom-jobs create --region=us-central1 --display-name=cli-xgb-titanic --config=xgb_job.yaml
  2. Use the ^ continuation character (Windows CMD):

    gcloud ai custom-jobs create ^
        --region=us-central1 ^
        --display-name=cli-xgb-titanic ^
        --config=xgb_job.yaml
  3. Use the backtick continuation character (PowerShell):

    gcloud ai custom-jobs create `
        --region=us-central1 `
        --display-name=cli-xgb-titanic `
        --config=xgb_job.yaml

This applies to all multi-line commands in this episode, not just this one.

Vertex AI provisions a VM, runs your training container, and writes outputs to the base_output_directory. The job runs on GCP’s infrastructure, not on your machine — you can close your terminal and it keeps going.

GPU example (PyTorch)

For the PyTorch GPU job from Episode 5, the config adds accelerator_type and accelerator_count under machine_spec. Note that the argument names must match exactly what train_nn.py expects (--train, --val, --learning_rate, etc.):

YAML

# pytorch_gpu_job.yaml
worker_pool_specs:
  - machine_spec:
      machine_type: n1-standard-8
      accelerator_type: NVIDIA_TESLA_T4
      accelerator_count: 1
    replica_count: 1
    container_spec:
      image_uri: us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-4.py310:latest
      args:
        - "--train=gs://doe-titanic/data/train_data.npz"
        - "--val=gs://doe-titanic/data/val_data.npz"
        - "--epochs=500"
        - "--learning_rate=0.001"
        - "--patience=50"
base_output_directory:
  output_uri_prefix: gs://doe-titanic/artifacts/pytorch/cli-gpu-run/

Submit the same way:

BASH

gcloud ai custom-jobs create \
    --region=us-central1 \
    --display-name=cli-pytorch-titanic-gpu \
    --config=pytorch_gpu_job.yaml

Step 4: Monitor jobs


List jobs

BASH

gcloud ai custom-jobs list --region=us-central1

This prints a table with job ID, display name, state (JOB_STATE_RUNNING, JOB_STATE_SUCCEEDED, etc.), and creation time.

Stream logs

BASH

gcloud ai custom-jobs stream-logs JOB_ID --region=us-central1

This is the CLI equivalent of watching the log panel in a notebook — output streams to your terminal in real time.

Hyperparameter tuning jobs

The gcloud ai hp-tuning-jobs family works the same way:

BASH

gcloud ai hp-tuning-jobs list --region=us-central1
gcloud ai hp-tuning-jobs stream-logs JOB_ID --region=us-central1

Creating HP tuning jobs via YAML is more verbose — for complex tuning configs, the Python SDK (Episode 6) is often more readable.

Step 5: Check for running resources (don’t skip this)


The biggest risk with CLI workflows is submitting a job — or leaving a notebook VM running — and forgetting about it. Unlike a Workbench notebook where you can see tabs and running kernels, the CLI gives you no visual reminder that something is still billing you. Jobs and VMs keep running whether or not your terminal is open.

Get in the habit of checking before you walk away:

BASH

# Training jobs still running
gcloud ai custom-jobs list --region=us-central1 --filter="state=JOB_STATE_RUNNING"

# HP tuning jobs still running
gcloud ai hp-tuning-jobs list --region=us-central1 --filter="state=JOB_STATE_RUNNING"

# Endpoints still deployed (these bill 24/7, even when idle)
gcloud ai endpoints list --region=us-central1

# Workbench notebook VMs still running
gcloud workbench instances list --location=us-central1-a

If anything shows up that you don’t need, shut it down:

BASH

# Cancel a running training job
gcloud ai custom-jobs cancel JOB_ID --region=us-central1

# Undeploy a model from an endpoint (stops the per-hour charge)
gcloud ai endpoints undeploy-model ENDPOINT_ID \
    --region=us-central1 \
    --deployed-model-id=DEPLOYED_MODEL_ID

# Stop a Workbench notebook VM
gcloud workbench instances stop INSTANCE_NAME --location=us-central1-a
Callout

Cost leaks are silent

A forgotten endpoint bills ~ $1.50–$3/hour depending on machine type — that’s $36–$72/day doing nothing. A GPU training job you accidentally submitted twice burns money until you cancel it. There’s no pop-up warning; you’ll only find out on your billing dashboard or when you hit a quota.

Build the habit: every time you finish a CLI session, run the check commands above. For a more thorough cleanup checklist, see Episode 9.

Step 6: Download results


After a job succeeds, download artifacts from GCS:

BASH

# List what the job wrote
gcloud storage ls gs://doe-titanic/artifacts/xgb/cli-run/

# Download everything locally
gcloud storage cp -r gs://doe-titanic/artifacts/xgb/cli-run/ ./local_results/

You can then load the model and metrics in a local Python session for evaluation — no Workbench VM required.

Putting it all together: a shell script


Here is a minimal end-to-end script that submits a training job and waits for it to finish. You could check this into your repository or trigger it from CI.

BASH

#!/usr/bin/env bash
set -euo pipefail

PROJECT_ID="your-project-id"
REGION="us-central1"
BUCKET="doe-titanic"
RUN_ID=$(date +%Y%m%d-%H%M%S)

# Upload latest training data
gcloud storage cp ./data/titanic_train.csv gs://${BUCKET}/

# Submit the job
gcloud ai custom-jobs create \
    --region=${REGION} \
    --display-name="xgb-${RUN_ID}" \
    --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,container-image-uri=us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.2-1:latest \
    --args="--train=gs://${BUCKET}/titanic_train.csv,--max_depth=6,--eta=0.3,--num_round=100" \
    --base-output-dir=gs://${BUCKET}/artifacts/xgb/${RUN_ID}/

echo "Job submitted. Check status with:"
echo "  gcloud ai custom-jobs list --region=${REGION}"
Callout

Cloud Shell — free CLI access

If you do not want to install the gcloud CLI locally, you can use Cloud Shell directly in the Google Cloud Console. It gives you a free, temporary Linux VM with gcloud pre-installed and authenticated. Click the terminal icon (“>_”) in the top-right corner of the Cloud Console to open it.

Cloud Shell is a good option for one-off job submissions or quick resource checks without spinning up a Workbench instance.

Challenge

Challenge 1 — Submit a job from the CLI

Using the XGBoost YAML config shown above (adjusted for your bucket name), submit a training job from Cloud Shell or your local terminal. Verify it appears in the Vertex AI Console under Training > Custom Jobs.

BASH

# Edit xgb_job.yaml with your bucket name, then:
gcloud ai custom-jobs create --region=us-central1 --display-name=cli-xgb-titanic --config=xgb_job.yaml

# Confirm it's running:
gcloud ai custom-jobs list --region=us-central1
Challenge

Challenge 2 — Stream logs in real time

Find the job ID from the previous challenge and stream its logs to your terminal. Compare this experience to watching logs in the notebook.

BASH

# Get the job ID from the list output
gcloud ai custom-jobs list --region=us-central1

# Stream logs (replace JOB_ID with the actual ID)
gcloud ai custom-jobs stream-logs JOB_ID --region=us-central1
Challenge

Challenge 3 — Download and inspect artifacts

After your job completes, download the model and metrics files to your local machine. Load metrics.json in Python and verify the accuracy value.

BASH

gcloud storage cp -r gs://YOUR_BUCKET/artifacts/xgb/cli-run/ ./results/
python3 -c "import json; print(json.load(open('./results/model/metrics.json')))"

When to use notebooks vs. CLI


Notebooks CLI / scripts
Best for Exploration, teaching, visualization Automation, CI/CD, reproducibility
Auth setup Automatic on Workbench VMs Requires gcloud auth login or service account keys
Cost Pay for VM uptime while notebook is open Free from Cloud Shell; zero cost from laptop
State management Hidden state can cause issues Stateless scripts are easier to debug
Interactivity Rich (plots, widgets, markdown) Terminal only (or pipe to other tools)

Most real-world ML/AI projects use both: notebooks for early experimentation and CLI/scripts for production runs.

Key Points
  • Every Vertex AI operation available in the Python SDK has an equivalent gcloud CLI command.
  • gcloud ai custom-jobs create submits training jobs from any terminal — no notebook required.
  • Use gcloud auth login and gcloud auth application-default login to authenticate outside of Workbench VMs.
  • Cloud Shell provides free, pre-authenticated CLI access directly in the browser.
  • Shell scripts checked into version control are more reproducible than notebooks with hidden state.
  • CLI workflows give no visual reminder of running resources — always check for active jobs, endpoints, and VMs before walking away.
  • Notebooks and CLI workflows are complementary — use each where it fits best.

Content from Resource Management & Monitoring on Vertex AI (GCP)


Last updated on 2026-03-04 | Edit this page

Overview

Questions

  • How do I monitor and control Vertex AI, Workbench, and GCS costs day‑to‑day?
  • What specifically should I stop, delete, or schedule to avoid surprise charges?
  • How do I set budget alerts so cost leaks get caught quickly?

Objectives

  • Identify the major cost drivers across Vertex AI (training jobs, endpoints, Workbench notebooks) and GCS, with ballpark costs.
  • Practice safe cleanup for Workbench Instances, training/tuning jobs, batch predictions, models, and endpoints.
  • Set a budget alert and apply labels to keep costs visible and predictable.
  • Use gcloud commands for auditing and rapid cleanup.

You’ve now run training jobs, tuning jobs, built a RAG pipeline, and possibly explored CLI workflows across the previous episodes. Before closing your laptop, let’s make sure none of those resources are still billing you — and learn the habits that prevent surprise charges going forward.

Check your current spend first


Before cleaning anything up, find out where you stand. Open the Cloud Console and navigate to:

Billing → Reports

  • Set the time range to This month (or Today for workshop use).
  • Group by Service to see which GCP services are costing the most.
  • Look for Compute Engine (backs Workbench VMs and training jobs), Vertex AI, and Cloud Storage.

This is the single most important dashboard to bookmark. If you only learn one thing from this episode, it’s where to find this page.

You can also check from the CLI:

BASH

# Quick check: is my project accumulating Vertex AI resources right now?
gcloud ai endpoints list --region=us-central1
gcloud workbench instances list --location=us-central1-a
gcloud ai custom-jobs list --region=us-central1 --filter="state=JOB_STATE_RUNNING"

What costs you money on GCP (and how much)


Not all resources cost equally. Here are the main cost drivers you’ll encounter in this workshop, ordered from most to least dangerous:

Resource Billing model Ballpark cost Risk level
Vertex AI endpoints Per node‑hour, 24/7 while deployed ~ $4.50/day for one n1-standard-4 node High — bills even with zero traffic
Workbench Instances (running) Per VM‑hour + GPU ~ $0.19/hr CPU‑only (n1-standard-4); add ~ $0.35/hr per T4 GPU High — easy to forget overnight
Training / HPT jobs Per VM/GPU‑hour while running Same VM rates; auto‑stops when done Medium — usually self‑limiting
Workbench disks (stopped VM) Per GB‑month for persistent disk ~ $0.04/GB/month (~ $4/month for 100 GB) Low — small but adds up
GCS storage Per GB‑month + operations + egress ~ $0.02/GB/month (Standard) Low — cheap until multi‑TB
Network egress Per GB downloaded out of GCP ~ $0.12/GB Low — avoid large downloads to local

Rule of thumb: Endpoints left deployed and notebooks left running are the most common surprise bills in education and research settings.
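The table’s ballpark figures follow from simple per-hour arithmetic. A quick sanity check of the two high-risk rows, using the table’s illustrative rates (actual pricing varies by machine type and region):

PYTHON

```python
def monthly_cost(hourly_rate, hours_per_day=24, days=30):
    """Cost in USD of a resource left running continuously."""
    return hourly_rate * hours_per_day * days

endpoint_hr = 4.50 / 24          # ~$4.50/day for one n1-standard-4 endpoint node
workbench_hr = 0.19 + 0.35       # CPU VM plus one T4 GPU

print(f"Forgotten endpoint:     ~${monthly_cost(endpoint_hr):.0f}/month")
print(f"Forgotten GPU notebook: ~${monthly_cost(workbench_hr):.0f}/month")
```

Numbers in this range are why the next two sections focus on stopping notebooks and undeploying endpoints before anything else.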

Shutting down Workbench Instances


In Episode 2 we created a Workbench Instance — the currently recommended notebook environment. Here’s how to stop or delete it:

Stop via Console

Vertex AI → Workbench → Instances tab → select your instance → Stop.

Stop via CLI

BASH

# List all Workbench Instances in your zone
gcloud workbench instances list --location=us-central1-a

# Stop an instance (stops VM billing; disk charges continue)
gcloud workbench instances stop INSTANCE_NAME --location=us-central1-a

Delete when you’re done for good

BASH

# Permanently delete the instance and its disk
gcloud workbench instances delete INSTANCE_NAME --location=us-central1-a --quiet

You can configure your instance to auto‑stop after a period of inactivity, so you never accidentally leave it running overnight:

  • Console: Select your instance → Edit → set Idle shutdown to 60–120 minutes.
  • At creation time: Add --idle-shutdown-timeout=60 to your gcloud workbench instances create command.

Disks still cost money while the VM is stopped (~ $4/month for 100 GB). If you’re completely done with an instance, delete it rather than just stopping it.

Cleaning up training, tuning, and batch jobs


Training and HPT jobs automatically stop billing when they finish, but it’s good practice to audit for jobs stuck in RUNNING and to delete old jobs you no longer need.

Audit with CLI

BASH

# Custom training jobs
gcloud ai custom-jobs list --region=us-central1

# Hyperparameter tuning jobs
gcloud ai hp-tuning-jobs list --region=us-central1

# Batch prediction jobs
gcloud ai batch-prediction-jobs list --region=us-central1

Each command prints a table showing the job ID, display name, state (e.g., JOB_STATE_SUCCEEDED, JOB_STATE_RUNNING), and creation time. Look for any jobs stuck in RUNNING — those are still consuming resources.

Cancel or delete as needed

BASH

# Cancel a running job
gcloud ai custom-jobs cancel JOB_ID --region=us-central1

# Delete a completed job you no longer need
gcloud ai custom-jobs delete JOB_ID --region=us-central1

Tip: Keep one “golden” successful job per experiment for reference, then delete the rest to reduce console clutter.

Undeploy models and delete endpoints (major cost pitfall)


Deployed endpoints are billed per node‑hour 24/7, even with zero prediction traffic. A single forgotten endpoint can cost ~ $135/month. Always undeploy models before deleting the endpoint.

Find endpoints and deployed models

BASH

gcloud ai endpoints list --region=us-central1
gcloud ai endpoints describe ENDPOINT_ID --region=us-central1

Undeploy and delete

BASH

# Step 1: Undeploy the model (stops node-hour billing)
gcloud ai endpoints undeploy-model ENDPOINT_ID \
  --deployed-model-id=DEPLOYED_MODEL_ID \
  --region=us-central1 \
  --quiet

# Step 2: Delete the endpoint itself
gcloud ai endpoints delete ENDPOINT_ID \
  --region=us-central1 \
  --quiet

Model Registry note: Keeping a model registered (but not deployed to an endpoint) does not incur node‑hour charges. You only pay a small amount for the model artifact storage in GCS.

GCS housekeeping


Check bucket size

BASH

# Human-readable bucket size
gcloud storage du gs://YOUR_BUCKET --summarize --readable-sizes

# List top-level contents
gcloud storage ls gs://YOUR_BUCKET

Note: gsutil commands (e.g., gsutil du, gsutil ls) still work but are being replaced by gcloud storage. We use the newer syntax here.

Lifecycle policies

A lifecycle policy tells GCS to automatically delete or transition objects based on rules you define. This is useful for cleaning up temporary training outputs.

Save the following as lifecycle.json:

JSON

{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {"age": 7, "matchesPrefix": ["tmp/"]}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"numNewerVersions": 3}
      }
    ]
  }
}
  • Rule 1: Auto‑delete any object under tmp/ that is older than 7 days.
  • Rule 2: If versioning is enabled, keep only the 3 most recent versions.

Apply it:

BASH

gcloud storage buckets update gs://YOUR_BUCKET --lifecycle-file=lifecycle.json

# Verify
gcloud storage buckets describe gs://YOUR_BUCKET --format="yaml(lifecycle)"
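To preview what Rule 1 would do before applying it, you can simulate the age condition locally. A hedged sketch (object names and timestamps are made up; real GCS evaluates age from each object’s actual creation time):

PYTHON

```python
from datetime import datetime, timedelta, timezone

def rule1_would_delete(name, created, now=None, prefix="tmp/", max_age_days=7):
    """True if an object matches lifecycle Rule 1: under tmp/ and older than 7 days."""
    now = now or datetime.now(timezone.utc)
    return name.startswith(prefix) and (now - created) > timedelta(days=max_age_days)

now = datetime(2026, 3, 10, tzinfo=timezone.utc)
objects = [
    ("tmp/scratch.csv",        datetime(2026, 2, 20, tzinfo=timezone.utc)),  # old temp file
    ("tmp/today.csv",          datetime(2026, 3, 10, tzinfo=timezone.utc)),  # fresh temp file
    ("artifacts/model.joblib", datetime(2026, 1, 1,  tzinfo=timezone.utc)),  # old, but not tmp/
]

for name, created in objects:
    verdict = "DELETE" if rule1_would_delete(name, created, now=now) else "keep"
    print(f"{name:<24} -> {verdict}")
```

Only the old object under tmp/ matches; anything outside the prefix is untouched no matter how old, which is the behavior you want for training artifacts you intend to keep.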

Egress reminder

Downloading data out of GCP to your laptop costs ~ $0.12/GB. Prefer in‑cloud training and evaluation, and share results via GCS links rather than local downloads.

Labels and budgets


Standardize labels on all resources

Labels let you track costs per user, team, or experiment in billing reports. Apply them consistently:

  • Examples: name=firstname-lastname, purpose=workshop, dataset=titanic
  • The Vertex AI Python SDK supports labels on job creation; gcloud commands accept --labels=key=value,...

Set budget alerts (do this now)

This is the single most protective action you can take:

  1. Go to Billing → Budgets & alerts in the Cloud Console.
  2. Click Create budget.
  3. Set a budget amount (e.g., $10 or $25 for a workshop).
  4. Set alert thresholds at 50%, 80%, and 100%.
  5. Add forecast‑based alerts to catch trends before you hit the limit.
  6. Make sure email notifications go to all project maintainers, not just you.

For production use: You can export detailed billing data to BigQuery for cost analysis by service, label, or SKU. See the billing export documentation for setup instructions.

Common pitfalls and quick fixes


Pitfall Fix
Forgotten endpoint billing 24/7 Undeploy models → delete endpoint
Notebook left running over weekend Enable idle shutdown (60–120 min)
Duplicate datasets across buckets Consolidate to one bucket; set lifecycle to purge tmp/
Too many parallel HPT trials Cap parallel_trial_count to 2–4
Don’t know what’s costing money Check Billing → Reports; add labels to all resources
Callout

Going further: automating cleanup

Once you move from workshop use to regular research, consider automating resource cleanup:

  • Cloud Scheduler can run a nightly job to stop idle Workbench Instances via the Vertex AI API.
  • Cloud Functions or Cloud Run can periodically sweep for forgotten endpoints.
  • Budget alerts can trigger Pub/Sub messages that automatically shut down resources when spend exceeds a threshold.

These are beyond the scope of this workshop, but the Cloud Scheduler documentation is a good starting point.

Challenge

Challenge 1 — Check your spend and set a budget

  1. Navigate to Billing → Reports in the Cloud Console. Find your project’s current‑month spend grouped by service.
  2. Navigate to Billing → Budgets & alerts. Create a $10 budget with alert thresholds at 50% and 100%.
  1. In the Cloud Console, click the Navigation menu (☰) → Billing → Reports. Set time range to “This month” and group by “Service.” You should see Compute Engine, Vertex AI, and Cloud Storage if you’ve been running workshop exercises.

  2. Go to Billing → Budgets & alerts → Create budget. Set:

    • Name: workshop-budget
    • Amount: $10
    • Thresholds: 50% ($5) and 100% ($10)
    • Alerts to: your email address

Click Finish to activate the budget.

Challenge

Challenge 2 — Find and stop idle notebooks

List all running Workbench Instances in your zone and stop any you are not actively using.

BASH

gcloud workbench instances list --location=us-central1-a

BASH

# List instances — look for STATE=ACTIVE
gcloud workbench instances list --location=us-central1-a

# Stop an instance you're not using
gcloud workbench instances stop INSTANCE_NAME --location=us-central1-a

If the instance shows STATE=ACTIVE and you’re not currently working in it, stop it. You can restart it later with gcloud workbench instances start.

Challenge

Challenge 3 — Endpoint sweep

List all deployed endpoints in your region, undeploy any model you don’t need, and delete the endpoint.

BASH

# List all endpoints
gcloud ai endpoints list --region=us-central1

# Pick an endpoint ID from the list, then inspect it
gcloud ai endpoints describe ENDPOINT_ID --region=us-central1

# Undeploy the model (find DEPLOYED_MODEL_ID in the describe output)
gcloud ai endpoints undeploy-model ENDPOINT_ID \
  --deployed-model-id=DEPLOYED_MODEL_ID \
  --region=us-central1 \
  --quiet

# Delete the now-empty endpoint
gcloud ai endpoints delete ENDPOINT_ID \
  --region=us-central1 \
  --quiet
Challenge

Challenge 4 — Write and apply a lifecycle policy

Create a GCS lifecycle rule that deletes objects under tmp/ after 7 days and keeps only 3 versions of versioned objects. Apply it to your bucket.

Save the following as lifecycle.json:

JSON

{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {"age": 7, "matchesPrefix": ["tmp/"]}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"numNewerVersions": 3}
      }
    ]
  }
}

Apply and verify:

BASH

gcloud storage buckets update gs://YOUR_BUCKET --lifecycle-file=lifecycle.json
gcloud storage buckets describe gs://YOUR_BUCKET --format="yaml(lifecycle)"
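If you prefer to create and sanity-check the policy file in a single shell step, the following sketch writes the same JSON from a heredoc and fails fast on typos; `python3 -m json.tool` only checks JSON syntax, not GCS lifecycle semantics.

```shell
# Write the lifecycle policy from a heredoc, then validate it locally:
# python3 -m json.tool exits non-zero for malformed JSON.
cat > lifecycle.json <<'EOF'
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {"age": 7, "matchesPrefix": ["tmp/"]}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"numNewerVersions": 3}
      }
    ]
  }
}
EOF
python3 -m json.tool lifecycle.json > /dev/null && echo "lifecycle.json OK"
```

A valid file prints `lifecycle.json OK`; a missing comma or brace surfaces immediately instead of at `buckets update` time.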

Challenge

Challenge 5 — Full workshop teardown

If you are done with all episodes, perform a complete cleanup:

  1. Stop or delete your Workbench Instance.
  2. Verify no endpoints are deployed.
  3. Cancel any training/tuning jobs still running (completed jobs no longer accrue compute charges).
  4. Check your GCS bucket — remove any files you don’t want to keep, or delete the bucket entirely.

BASH

# 1. Delete your Workbench Instance
gcloud workbench instances delete INSTANCE_NAME \
  --location=us-central1-a --quiet

# 2. Confirm no endpoints remain
gcloud ai endpoints list --region=us-central1
# (If any appear, undeploy models and delete them as shown above)

# 3. Cancel any jobs still running (the gcloud surfaces offer cancel,
#    not delete; completed jobs stop billing on their own)
gcloud ai custom-jobs list --region=us-central1 --filter="state=JOB_STATE_RUNNING"
gcloud ai custom-jobs cancel JOB_ID --region=us-central1

gcloud ai hp-tuning-jobs list --region=us-central1 --filter="state=JOB_STATE_RUNNING"
gcloud ai hp-tuning-jobs cancel JOB_ID --region=us-central1

# 4. Remove your GCS bucket (WARNING: this deletes all data in the bucket)
gcloud storage rm -r gs://YOUR_BUCKET

After cleanup, check Billing → Reports one more time to confirm no services are still accumulating charges.

End‑of‑session checklist


Before you close your laptop, run through this quick checklist:

  1. Workbench Instances — stopped (or deleted if you’re done for good).
  2. Training / HPT jobs — no jobs stuck in RUNNING.
  3. Endpoints — all models undeployed; unused endpoints deleted.
  4. GCS — no large temporary files lingering; lifecycle policy in place.
  5. Budget alert — set and sending to your email.

Bookmark Billing → Reports and check it at the start of each session. A 10‑second glance can save you from a surprise bill.
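The checklist lends itself to a one-shot audit script. This is a sketch under assumptions: the region and zone match the workshop defaults, and `JOB_STATE_RUNNING` follows Vertex AI's job state enum naming.

```shell
# Hypothetical end-of-session audit. Region/zone are assumptions
# (workshop defaults); adjust to wherever you created resources.
session_audit() {
  echo "== Workbench instances (want STOPPED) =="
  gcloud workbench instances list --location=us-central1-a
  echo "== Running custom jobs (want none) =="
  gcloud ai custom-jobs list --region=us-central1 \
    --filter="state=JOB_STATE_RUNNING"
  echo "== Endpoints (want none) =="
  gcloud ai endpoints list --region=us-central1
}
```

Running `session_audit` before closing your laptop surfaces the three most common cost leaks in a few seconds; anything it prints under the last two headings deserves immediate attention.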

Key Points
  • Check Billing → Reports regularly — know what you’re spending before it surprises you.
  • Endpoints and running notebooks are the most common cost leaks; undeploy and stop first.
  • Set a budget alert — it’s the single most protective action you can take.
  • Configure idle shutdown on Workbench Instances so forgotten notebooks auto‑stop.
  • Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
  • Use labels on all resources so you can trace costs in billing reports.