Content from Overview of Google Cloud for Machine Learning


Last updated on 2025-10-24

Overview

Questions

  • What problem does GCP aim to solve for ML researchers?
  • How does using a notebook as a controller help organize ML workflows in the cloud?
  • How does GCP compare to AWS for ML workflows?

Objectives

  • Understand the basic role of GCP in supporting ML research.
  • Recognize how a notebook can serve as a controller for cloud resources.
  • Compare GCP and AWS approaches to building and managing ML workflows.

Google Cloud Platform (GCP) provides the basic building blocks researchers need to run machine learning (ML) experiments at scale. Instead of working only on your laptop or a high-performance computing (HPC) cluster, you can spin up compute resources on demand, store datasets in the cloud, and run low-cost notebooks that act as a “controller” for larger training and tuning jobs.

This workshop focuses on using a simple notebook environment as the control center for your ML workflow. We will not rely on Google’s fully managed Vertex AI platform, but instead show how to use core GCP services (Compute Engine, storage buckets, and SDKs) so you can build and run experiments from scratch.

Why use GCP for machine learning?

GCP provides several advantages that make it a strong option for applied ML:

  • Flexible compute: You can choose the hardware that fits your workload:

    • CPUs for lightweight models, preprocessing, or feature engineering.
    • GPUs (e.g., NVIDIA T4, V100, A100) for training deep learning models.
    • TPUs (Tensor Processing Units) for TensorFlow or JAX-based deep learning. TPUs are custom Google hardware optimized for matrix operations and can provide strong performance and energy efficiency for compatible workloads. Google has reported better performance-per-watt compared to GPUs in many TensorFlow benchmarks, though these gains depend heavily on model type and implementation.
      Historically, TPU support has been limited for PyTorch users, and while Google is improving PyTorch integration, the TPU ecosystem still works best for TensorFlow and JAX workflows.
  • Data storage and access: Google Cloud Storage (GCS) buckets act like S3 on AWS — an easy way to store and share datasets between experiments and collaborators.

  • From scratch workflows: Instead of depending on a fully managed ML service, you bring your own frameworks (PyTorch, TensorFlow, scikit-learn, etc.) and run your code the same way you would on your laptop or HPC cluster, but with scalable cloud resources.

  • Cost visibility: Billing dashboards and project-level budgets make it easier to track costs and stay within research budgets.

  • Sustainability focus: Google aims to operate entirely on carbon-free energy by 2030. Combined with the TPU’s focus on efficient matrix computation, this gives GCP a potential edge for researchers interested in energy-conscious ML — though real-world energy efficiency varies by workload and utilization.

In short, GCP provides infrastructure that you control from a notebook environment, allowing you to build and run ML workflows just as you would locally, but with access to scalable hardware and storage.

Callout

What about AWS?

In many respects, GCP and AWS offer comparable capabilities for ML research. Both provide scalable compute, storage, and tooling to support everything from quick experiments to production pipelines.
AWS typically offers a broader range of GPU and CPU instance types, along with mature managed services like SageMaker and tighter integration with enterprise infrastructure. GCP, on the other hand, emphasizes the use of TensorFlow and JAX, and the availability of TPUs — which may offer energy advantages for certain workloads.

Ultimately, the choice often comes down to framework preference, familiarity, and existing resources, rather than major functional differences between the two platforms.

Discussion

Comparing infrastructures

Think about your current research setup:
- Do you mostly use your laptop, HPC cluster, AWS, or GCP for ML experiments?
- Which environment feels most transparent for understanding costs and reproducibility?
- If you could offload one infrastructure challenge (e.g., installing GPU drivers, managing storage, or setting up environments), what would it be and why?

Take 3–5 minutes to discuss with a partner or share in the workshop chat.

Key Points
  • GCP and AWS both provide the essential components for running ML workloads at scale.
  • GCP emphasizes simplicity, open frameworks, and TPU access; AWS offers broader hardware and automation options.
  • TPUs are efficient for TensorFlow and JAX, but GPU-based workflows (common on AWS) remain more flexible across frameworks.
  • Both platforms now provide strong cost tracking and sustainability tools, with only minor differences in interface and ecosystem integration.
  • Using a notebook as a controller provides reproducibility and helps manage compute and storage resources consistently across clouds.

Content from Data Storage: Setting up GCS


Last updated on 2025-10-30

Overview

Questions

  • How can I store and manage data effectively in GCP for Vertex AI workflows?
  • What are the advantages of Google Cloud Storage (GCS) compared to local or VM storage for machine learning projects?

Objectives

  • Explain data storage options in GCP for machine learning projects.
  • Describe the advantages of GCS for large datasets and collaborative workflows.
  • Outline steps to set up a GCS bucket and manage data within Vertex AI.

Machine learning and AI projects rely on data, making efficient storage and management essential. Google Cloud offers several storage options, but the most common for ML workflows are Virtual Machine (VM) disks and Google Cloud Storage (GCS) buckets.

Consult your institution’s IT before handling sensitive data in GCP

As with AWS, do not upload restricted or sensitive data to GCP services unless explicitly approved by your institution’s IT or cloud security team. For regulated datasets (HIPAA, FERPA, proprietary), work with your institution to ensure encryption, restricted access, and compliance with policies.

Options for storage: VM Disks or GCS


What is a VM disk?

A VM disk is the storage volume attached to a Compute Engine VM or a Vertex AI Workbench notebook. It can store datasets and intermediate results, but it is tied to the lifecycle of the VM.

When to store data directly on a VM disk

  • Useful for small, temporary datasets processed interactively.
  • Data persists if the VM is stopped, but storage costs continue as long as the disk exists.
  • Not ideal for collaboration, scaling, or long-term dataset storage.
Callout

Limitations of VM disk storage

  • Scalability: Limited by disk size quota.
  • Sharing: Harder to share across projects or team members.
  • Cost: More expensive per GB compared to GCS for long-term storage.

What is a GCS bucket?

For most ML workflows in GCP, Google Cloud Storage (GCS) buckets are recommended. A GCS bucket is a container in Google’s object storage service where you can store an essentially unlimited number of files. Data in GCS can be accessed from Vertex AI training jobs, Workbench notebooks, and other GCP services using a GCS URI (e.g., gs://your-bucket-name/your-file.csv).


To upload our Titanic dataset to a GCS bucket, we’ll follow these steps:

  1. Log in to the Google Cloud Console.
  2. Create a new bucket (or use an existing one).
  3. Upload your dataset files.
  4. Use the GCS URI to reference your data in Vertex AI workflows.

1. Sign in to Google Cloud Console

2. Navigate to Cloud Storage

  • In the search bar, type Storage.
  • Click Cloud Storage > Buckets.

3. Create a new bucket

  • Click Create bucket.

3a. Getting Started (bucket name and tags)

  • Provide a bucket name: Enter a globally unique name. For this workshop, we can use the following naming convention to easily locate our buckets: teamname-firstlastname-dataname (e.g., sinkorswim-johndoe-titanic)
  • Add labels (tags) to track costs: Add labels to track resource usage and billing. If you’re working in a shared account, this step is mandatory. If not, it’s still recommended to help you track your own costs!
    • project = teamname (your team’s name)
    • name = firstname-lastname (your name)
    • purpose = bucket-dataname (include bucket- prefix followed by name of dataset)
Screenshot showing required tags for a GCS bucket
Example of Tags for a GCS Bucket
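Because bucket names must be globally unique and follow GCS naming rules (3–63 characters; lowercase letters, numbers, hyphens, dots, and underscores; starting and ending with a letter or number), it can help to build and validate names programmatically. A minimal sketch of the workshop convention (the helper name and regex are illustrative, based on GCS's published naming rules):

```python
import re

def make_bucket_name(team: str, person: str, dataset: str) -> str:
    """Build a bucket name following the workshop convention:
    teamname-firstlastname-dataname (lowercase, hyphen-separated)."""
    name = f"{team}-{person}-{dataset}".lower()
    # GCS bucket names: 3-63 chars of lowercase letters, digits, hyphens,
    # dots, or underscores; must start and end with a letter or number.
    if not re.fullmatch(r"[a-z0-9][a-z0-9\-_.]{1,61}[a-z0-9]", name):
        raise ValueError(f"Invalid GCS bucket name: {name}")
    return name

print(make_bucket_name("sinkorswim", "johndoe", "titanic"))
# sinkorswim-johndoe-titanic
```

Note this checks only the general naming rules; global uniqueness is verified by GCS when you actually create the bucket.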

3b. Choose where to store your data

When creating a storage bucket in Google Cloud, the best practice for most machine learning workflows is to use a regional bucket in the same region as your compute resources (for example, us-central1). This setup provides the lowest latency and avoids network egress charges when training jobs read from storage, while also keeping costs predictable. A multi-region bucket, on the other hand, can make sense if your primary goal is broad availability or if collaborators in different regions need reliable access to the same data; the trade-off is higher cost and the possibility of extra egress charges when pulling data into a specific compute region. For most research projects, a regional bucket with the Standard storage class, uniform access control, and public access prevention enabled offers a good balance of performance, security, and affordability.

  • Region (cheapest, good default). For instance, us-central1 (Iowa) costs $0.020 per GB-month.
  • Multi-region (higher redundancy, more expensive).
Choose where to store your data

3c. Choose how to store your data (storage class)

When creating a bucket, you’ll be asked to choose a storage class, which determines how much you pay for storing data and how often you’re allowed to access it without extra fees.

  • Standard – best for active ML/AI workflows. Training data is read and written often, so this is the safest default.
  • Nearline / Coldline / Archive – designed for backups or rarely accessed files. These cost less per GB to store, but you pay retrieval fees if you read them during training. Not recommended for most ML projects where data access is frequent.

You may see an option to “Enable hierarchical namespace”. GCP now offers an option to enable a hierarchical namespace for buckets, but this is mainly useful for large-scale analytics pipelines. For most ML workflows, the standard flat namespace is simpler and fully compatible—so it’s best to leave this option off.

3d. Choose how to control access to objects

For ML projects, you should prevent public access so that only authorized users can read or write data. This keeps research datasets private and avoids accidental exposure.

When prompted to choose an access control method, choose uniform access unless you have a very specific reason to manage object-level permissions.

  • Uniform access (recommended): Simplifies management by enforcing permissions at the bucket level using IAM roles. It’s the safer and more maintainable choice for teams and becomes permanent after 90 days.
  • Fine-grained access: Allows per-file permissions using ACLs, but adds complexity and is rarely needed outside of web hosting or mixed-access datasets.

3e. Choose how to protect object data

GCP automatically protects all stored data, but you can enable additional layers of protection depending on your project’s needs. For most ML or research workflows, you’ll want to balance safety with cost.

  • Soft delete policy (recommended for shared projects): Keeps deleted objects recoverable for a short period (default is 7 days). This is useful if team members might accidentally remove data. You can set a custom retention duration, but longer windows increase storage costs.
  • Object versioning: Creates new versions of files when they’re modified or overwritten. This can be helpful for tracking model outputs or experiment logs but may quickly increase costs. Enable only if you expect frequent overwrites and need rollback capability.
  • Retention policy (for compliance use only): Prevents deletion or modification of objects for a fixed time window. This is typically required for regulated data but should be avoided for active ML projects, since it can block normal cleanup and retraining workflows.

In short: keep the default soft delete unless you have specific compliance requirements. Use object versioning sparingly, and avoid retention locks unless mandated by policy.

Final check

After configuring all settings, your bucket settings preview should look similar to the screenshot below (with the bucket name adjusted for your name).

Recommended GCS bucket settings.
Final GCS Bucket Settings

Click Create if everything looks good.

4. Upload files to the bucket

  • If you haven’t yet, download the data for this workshop (Right-click → Save as):
    data.zip
    • Extract the zip folder contents (Right-click → Extract all on Windows; double-click on macOS).
  • In the bucket dashboard, click Upload Files.
  • Select your Titanic CSVs and upload.

Note the GCS URI for your data

After uploading, click on a file and find its gs:// URI (e.g., gs://sinkorswim-johndoe-titanic/titanic_test.csv). This URI will be used to access the data later.

Adjust bucket permissions


Return to the Google Cloud Console (where we created our bucket) and search for “Cloud Shell Editor”. Open a shell editor and run the commands below, replacing the bucket name with your bucket’s name:

SH

# Grant read permissions on the bucket
gcloud storage buckets add-iam-policy-binding gs://sinkorswim-johndoe-titanic \
  --member="serviceAccount:549047673858-compute@developer.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# Grant write permissions on the bucket
gcloud storage buckets add-iam-policy-binding gs://sinkorswim-johndoe-titanic \
  --member="serviceAccount:549047673858-compute@developer.gserviceaccount.com" \
  --role="roles/storage.objectCreator"

# (Only if you also need overwrite/delete)
gcloud storage buckets add-iam-policy-binding gs://sinkorswim-johndoe-titanic \
  --member="serviceAccount:549047673858-compute@developer.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"

This grants our future VMs permission to read and write objects in the bucket.

Data transfer & storage costs explained


GCS costs are based on storage class, data transfer, and operations (requests).

  • Standard storage: Data storage cost is based on region. In us-central1, the cost is ~$0.02 per GB per month.
  • Uploading data (ingress): Copying data into a GCS bucket from your laptop, campus HPC, or another provider is free.
  • Downloading data out of GCP (egress): Refers to data leaving Google’s network to the public internet, such as downloading files from GCS to your local machine. Typical cost is around $0.12 per GB to the U.S. and North America, more for other continents.
    • Cross-region access: If your bucket is in one region and your compute runs in another, you’ll pay an egress fee of about $0.01–0.02 per GB within North America, higher if crossing continents.
  • Reading (GET) requests: Each read or list operation incurs a small API request fee of roughly $0.004 per 10,000 requests.
    • Example: a training job that loads 10,000 image samples from GCS (one per batch) would make about 10,000 GET requests, costing around $0.004 total. Reading a large file such as a 1 GB CSV or TFRecord shard counts as a single GET request.
  • Writing (PUT/POST/LIST) requests: Uploading, creating, or modifying objects costs about $0.05 per 10,000 requests.
    • Example: saving one model checkpoint file (e.g., model-weights.h5 or model.pt) triggers one PUT request. A training pipeline that saves a few dozen checkpoints or logs would cost well under one cent in request fees.
  • Deleting data: Removing objects or buckets does not incur transfer costs. If you download data before deleting, you pay for the egress, but deleting directly in the console or CLI is free. For Nearline, Coldline, or Archive storage classes, deleting before the minimum storage duration (30, 90, or 365 days) triggers an early-deletion fee.
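The request-fee arithmetic above can be captured in a couple of helper functions (a sketch; the rates are the approximate ones quoted above and should be checked against current GCS pricing):

```python
def get_request_cost(n_requests: int, rate_per_10k: float = 0.004) -> float:
    """Approximate cost of GET/list operations at ~$0.004 per 10,000 requests."""
    return n_requests / 10_000 * rate_per_10k

def put_request_cost(n_requests: int, rate_per_10k: float = 0.05) -> float:
    """Approximate cost of PUT/POST/LIST operations at ~$0.05 per 10,000 requests."""
    return n_requests / 10_000 * rate_per_10k

# A training job making 10,000 GET requests (one per batch):
print(f"${get_request_cost(10_000):.4f}")  # $0.0040
# Saving 100 model checkpoints (one PUT each):
print(f"${put_request_cost(100):.4f}")     # $0.0005
```

As the outputs show, request fees are usually negligible next to storage and egress charges.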

For detailed pricing, see GCS Pricing Information.

Challenge

Challenge: Estimating Storage Costs

1. Estimate the total cost of storing 1 GB in GCS Standard storage (us-central1) for one month assuming:
- Storage duration: 1 month
- Dataset retrieved 100 times for model training and tuning
- Data is downloaded once out of GCP at the end of the project

Hints
- Storage cost: $0.02 per GB per month
- Egress (download out of GCP): $0.12 per GB
- GET requests: $0.004 per 10,000 requests (100 requests ≈ free for our purposes)

2. Repeat the above calculation for datasets of 10 GB, 100 GB, and 1 TB (1024 GB).

  1. 1 GB:
  • Storage: 1 GB × $0.02 = $0.02
  • Egress: 1 GB × $0.12 = $0.12
  • Requests: ~0 (100 reads well below pricing tier)
  • Total: $0.14
  2. 10 GB:
  • Storage: 10 GB × $0.02 = $0.20
  • Egress: 10 GB × $0.12 = $1.20
  • Requests: ~0
  • Total: $1.40
  3. 100 GB:
  • Storage: 100 GB × $0.02 = $2.00
  • Egress: 100 GB × $0.12 = $12.00
  • Requests: ~0
  • Total: $14.00
  4. 1 TB (1024 GB):
  • Storage: 1024 GB × $0.02 = $20.48
  • Egress: 1024 GB × $0.12 = $122.88
  • Requests: ~0
  • Total: $143.36
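These calculations can be wrapped in a small helper for quick estimates (a sketch using the hint prices; adjust the rates if pricing changes):

```python
def monthly_project_cost(size_gb: float,
                         storage_rate: float = 0.02,   # $/GB-month, Standard, us-central1
                         egress_rate: float = 0.12) -> float:  # $/GB out of GCP
    """One month of storage plus a single full download out of GCP.
    Request fees (~100 GETs) are negligible and ignored here."""
    return size_gb * storage_rate + size_gb * egress_rate

for gb in (1, 10, 100, 1024):
    print(f"{gb:>5} GB -> ${monthly_project_cost(gb):.2f}")
# 1 GB -> $0.14, 10 GB -> $1.40, 100 GB -> $14.00, 1024 GB -> $143.36
```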

Removing unused data (complete after the workshop)


After you are done using your data, remove unused files/buckets to stop costs:

  • Option 1: Delete files only – if you plan to reuse the bucket.
  • Option 2: Delete the bucket entirely – if you no longer need it.

When does BigQuery come into play?


BigQuery is Google Cloud’s managed data warehouse for storing and analyzing large tabular datasets using SQL. It’s designed for interactive querying and analytics rather than file storage. For most ML workflows—especially smaller projects or those focused on images, text, or modest tabular data—BigQuery isn’t needed. Cloud Storage (GCS) buckets are usually enough: they can store data efficiently and let you stream files directly into your training code without downloading them locally.

BigQuery becomes useful when you’re working with large, structured datasets that multiple team members need to query or explore collaboratively. Instead of reading entire files, you can use SQL to retrieve only the subset of data you need. Teams can share results through saved queries or views and control access at the dataset or table level with IAM. BigQuery also integrates with Vertex AI, allowing structured data stored there to connect directly to training pipelines. The main trade-off is cost: you pay for both storage and the amount of data scanned by queries.

In short, use GCS buckets for storing and streaming files into typical ML workflows, and consider BigQuery when you need a shared, queryable workspace for large tabular datasets.

Key Points
  • Use GCS for scalable, cost-effective, and persistent storage in GCP.
  • Persistent disks are suitable only for small, temporary datasets.
  • Track your storage, transfer, and request costs to manage expenses.
  • Regularly delete unused data or buckets to avoid ongoing costs.

Content from Notebooks as Controllers


Last updated on 2025-10-30

Overview

Questions

  • How do you set up and use Vertex AI Workbench notebooks for machine learning tasks?
  • How can you manage compute resources efficiently using a “controller” notebook approach in GCP?

Objectives

  • Describe how to use Vertex AI Workbench notebooks for ML workflows.
  • Set up a Jupyter-based Workbench instance as a controller to manage compute tasks.
  • Use the Vertex AI SDK to launch training and tuning jobs on scalable instances.

Setting up our notebook environment


Google Cloud Workbench provides JupyterLab-based environments that can be used to orchestrate machine learning workflows. In this workshop, we will use a Workbench Instance—the recommended option going forward, as other Workbench environments are being deprecated.

Workbench Instances come with JupyterLab 3 pre-installed and are configured with GPU-enabled ML frameworks (TensorFlow, PyTorch, etc.), making it easy to start experimenting without additional setup. Learn more in the Workbench Instances documentation.

Using the notebook as a controller


The notebook instance functions as a controller to manage more resource-intensive tasks. By selecting a modest machine type (e.g., n1-standard-4), you can perform lightweight operations locally in the notebook while using the Vertex AI Python SDK to launch compute-heavy jobs on larger machines (e.g., GPU-accelerated) when needed.

This approach minimizes costs while giving you access to scalable infrastructure for demanding tasks like model training, batch prediction, and hyperparameter tuning.

We will follow these steps to create our first Workbench Instance:

1. Navigate to Workbench

  • In the Google Cloud Console, search for “Workbench.”
  • Click the “Instances” tab (this is the supported path going forward).
  • Pin Workbench to your navigation bar for quick access.

2. Create a new Workbench Instance

Initial settings

  • Click Create New near the top of the Workbench page
  • Name: For this workshop, we can use the following naming convention to easily locate our notebooks: teamname-yourname-purpose (e.g., sinkorswim-johndoe-train)
  • Region: Choose the same region as your storage bucket (e.g., us-central1). This avoids cross-region transfer charges and keeps data access latency low.
    • If you are unsure, check your bucket’s location in the Cloud Storage console (click the bucket name → look under “Location”).
  • Zone: us-central1-a (or another zone in us-central1, like -b or -c)
    • If capacity or GPU availability is limited in one zone, switch to another zone in the same region.
  • NVIDIA T4 GPU: Leave unchecked for now
    • We will request GPUs for training jobs separately. Attaching here increases idle costs.
  • Apache Spark and BigQuery Kernels: Leave unchecked
    • Enable only if you specifically need Spark or BigQuery notebooks; otherwise, it adds unnecessary images.
  • Network in this project: Required selection
    • This option must be selected; shared environments do not allow using external or default networks.
    • This ensures your instance connects to the shared VPC for the workshop.
  • Network / Subnetwork: Leave as pre-filled.
Notebook settings (part 1)

Advanced settings: Details (tagging)

  • IMPORTANT: Open the “Advanced options” menu next.
    • Labels (required for cost tracking): Under the Details menu, add the following labels (all lowercase) so that you can track the total cost of your activity on GCP later:
      • project = teamname (your team’s name)
      • name = firstname-lastname (your name)
      • purpose = train (i.e., the notebook’s overall purpose: train, tune, RAG, etc.)
Screenshot showing required tags for notebook
Required tags for notebook.

Advanced Settings: Environment

While we won’t modify environment settings during this workshop, it’s useful to understand what these options control when creating or editing a Workbench Instance in Vertex AI Workbench.

All Workbench environments use JupyterLab 3 by default, with the latest NVIDIA GPU drivers, CUDA libraries, and Intel optimizations preinstalled. You can optionally select JupyterLab 4 (currently in preview) or provide a custom container image to run your own environment (for example, a Docker image containing specialized ML frameworks or dependencies). If needed, you can also specify a post-startup script stored in Cloud Storage (gs://path/to/script.sh) to automatically configure the instance (install packages, mount buckets, etc.) when it boots.

See: Vertex AI Workbench release notes for supported versions and base images.

Advanced settings: Machine Type

  • Machine type: Select a small machine (e.g., n2-standard-2) to act as the controller.
    • This keeps costs low while you delegate heavy lifting to training jobs.
    • For guidance on common machine types for ML, refer to Instances for ML on GCP.
  • Set idle shutdown: To save on costs when you aren’t doing anything in your notebook, lower the default idle shutdown time to 60 (minutes).
Set Idle Shutdown
Enable Idle Shutdown

Advanced Settings: Disks

Each Vertex AI Workbench instance uses Persistent Disks (PDs) to store your system files and data. You’ll configure two disks when creating a notebook: a boot disk and a data disk. We’ll leave these at their default settings, but it’s useful to understand the settings for future work.

Boot Disk

Keep this fixed at 100 GB (Balanced PD) — the default minimum.
It holds the OS, JupyterLab, and ML libraries but not your datasets.
Estimated cost: about $10 / month (~$0.014 / hr).
You rarely need to resize this, though you can increase to 150–200 GB if you plan to install large environments, custom CUDA builds, or multiple frameworks.

Data Disk

This is where your datasets, checkpoints, and outputs live.
Use a Balanced PD by default, or an SSD PD only for high-I/O workloads.
A good rule of thumb is to allocate ≈ 2× your dataset size, with a minimum of 150 GB and a maximum of 1 TB.
For example: - 20 GB dataset → 150 GB data disk (minimum)
- 100 GB dataset → 200 GB data disk
- Larger datasets → keep the full dataset in Cloud Storage (gs://) and copy only subsets locally.

Persistent Disks can be resized anytime without downtime, so it’s best to start small and expand when needed.
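The sizing rule of thumb above can be written down directly (a hypothetical helper; the 2×, 150 GB minimum, and 1 TB cap come from the guidance above):

```python
def recommend_data_disk_gb(dataset_gb: float,
                           min_gb: int = 150,
                           max_gb: int = 1024) -> int:
    """Rule of thumb: ~2x the dataset size, clamped to [150 GB, 1 TB].
    For datasets near or above the cap, keep the full dataset in
    Cloud Storage (gs://) and copy only subsets locally."""
    return int(min(max(2 * dataset_gb, min_gb), max_gb))

print(recommend_data_disk_gb(20))   # 150  (minimum applies)
print(recommend_data_disk_gb(100))  # 200
print(recommend_data_disk_gb(800))  # 1024 (capped at 1 TB)
```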

Deletion behavior

The ‘Delete to trash’ option is unchecked by default, which is what you want.
When left unchecked, deleted files are removed immediately, freeing up disk space right away.
If you check this box, files will move to the system trash instead — meaning they still take up space (and cost) until you empty it.

Keep this unchecked to avoid paying for deleted files that remain in the trash.

Cost awareness

Persistent Disks are fast but cost more than Cloud Storage.
Typical rates:
- Balanced PD: ~$0.10–$0.12 / GB / month
- SSD PD: ~$0.17–$0.20 / GB / month
- Cloud Storage (Standard): ~$0.026 / GB / month

Rule of thumb: use PDs only for active work; store everything else in Cloud Storage.
Example: a 200 GB dataset costs ~$24/month on a PD but only ~$5/month in Cloud Storage.

Check the latest pricing here:
- Persistent Disk & Image pricing
- Cloud Storage pricing

Advanced settings: Networking - External IP Access

  • Assign External IP address: Leave this option checked — you need an external IP.

Create notebook

  • Click Create to create the instance. Your notebook instance will start in a few minutes. When its status is “Running,” you can open JupyterLab and begin working.

Managing training and tuning with the controller notebook

In the following episodes, we will use the Vertex AI Python SDK (google-cloud-aiplatform) from this notebook to submit compute-heavy tasks on more powerful machines. Examples include:

  • Training a model on a GPU-backed instance.
  • Running hyperparameter tuning jobs managed by Vertex AI.

This pattern keeps costs low by running your notebook on a modest VM while only incurring charges for larger resources when they are actively in use.
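To make the controller pattern concrete, here is a sketch of the worker pool specification that a Vertex AI custom job accepts. The machine type, GPU, image URI, and bucket path are placeholders, and the actual submission call (shown in comments) requires an authenticated GCP project, so only the spec itself is built here:

```python
# Sketch of a Vertex AI custom training job spec (all values are placeholders).
# The dict below follows the worker_pool_specs format accepted by
# google.cloud.aiplatform.CustomJob.
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",        # larger VM for training
            "accelerator_type": "NVIDIA_TESLA_T4",  # request a GPU
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            # Placeholder image; Google publishes prebuilt training containers.
            "image_uri": "us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-1:latest",
            "command": ["python", "train.py"],
            "args": ["--data", "gs://sinkorswim-johndoe-titanic/titanic_train.csv"],
        },
    }
]
print(worker_pool_specs[0]["machine_spec"]["machine_type"])

# On an authenticated Workbench instance, you would then submit the job:
# from google.cloud import aiplatform
# aiplatform.init(project="your-project", location="us-central1",
#                 staging_bucket="gs://your-bucket")
# job = aiplatform.CustomJob(display_name="titanic-train",
#                            worker_pool_specs=worker_pool_specs)
# job.run()  # the notebook stays small; training runs on the GPU machine
```

The key point is that the notebook only constructs and submits this spec; the heavy computation happens on the machine it describes.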

Challenge

Challenge: Notebook Roles

Your university provides different compute options: laptops, on-prem HPC, and GCP.

  • What role does a Workbench Instance notebook play compared to an HPC login node or a laptop-based JupyterLab?
  • Which tasks should stay in the notebook (lightweight control, visualization) versus being launched to larger cloud resources?

The notebook serves as a lightweight control plane.
- Like an HPC login node, it is not meant for heavy computation.
- Suitable for small preprocessing, visualization, and orchestrating jobs.
- Resource-intensive tasks (training, tuning, batch jobs) should be submitted to scalable cloud resources (GPU/large VM instances) via the Vertex AI SDK.

Load pre-filled Jupyter notebooks

Once your newly created instance shows as Active (green checkmark), click Open JupyterLab to open the instance in Jupyter Lab. From there, we can create as many Jupyter notebooks as we would like within the instance environment.

We will then select the standard python3 environment to start our first .ipynb notebook (Jupyter notebook). We can use this environment since we aren’t doing any training/tuning just yet.

Within the Jupyter notebook, run the following command to clone the lesson repo into our Jupyter environment:

SH

!git clone https://github.com/qualiaMachine/Intro_GCP_for_ML.git

Then, navigate to /Intro_GCP_for_ML/notebooks/04-Accessing-and-managing-data.ipynb to begin the first notebook.

Key Points
  • Use a small Workbench Instance notebook as a controller to manage larger, resource-intensive tasks.
  • Always navigate to the “Instances” tab in Workbench, since older notebook types are deprecated.
  • Choose the same region for your Workbench Instance and storage bucket to avoid extra transfer costs.
  • Submit training and tuning jobs to scalable instances using the Vertex AI SDK.
  • Labels help track costs effectively, especially in shared or multi-project environments.
  • Workbench Instances come with JupyterLab 3 and GPU frameworks preinstalled, making them an easy entry point for ML workflows.
  • Enable idle auto-stop to avoid unexpected charges when notebooks are left running.

Content from Accessing and Managing Data in GCS with Vertex AI Notebooks


Last updated on 2025-10-27

Overview

Questions

  • How can I load data from GCS into a Vertex AI Workbench notebook?
  • How do I monitor storage usage and costs for my GCS bucket?
  • What steps are involved in pushing new data back to GCS from a notebook?

Objectives

  • Read data directly from a GCS bucket into memory in a Vertex AI notebook.
  • Check storage usage and estimate costs for data in a GCS bucket.
  • Upload new files from the Vertex AI environment back to the GCS bucket.

Initial setup


Load pre-filled Jupyter notebooks

If you haven’t opened your newly created VM from the last episode yet, click Open JupyterLab to open the instance in JupyterLab. From there, we can create as many Jupyter notebooks as we would like within the instance environment.

We will then select the standard python3 environment to start our first .ipynb notebook (Jupyter notebook). We can use this environment since we aren’t doing any training/tuning just yet.

Within the Jupyter notebook, run the following command to clone the lesson repo into our Jupyter environment:

SH

!git clone https://github.com/qualiaMachine/Intro_GCP_for_ML.git

Then, navigate to /Intro_GCP_for_ML/notebooks/04-Accessing-and-managing-data.ipynb to begin the first notebook.

Set up GCP environment

Before interacting with GCS, we need to authenticate and initialize the client libraries. This ensures our notebook can talk to GCP securely.

PYTHON

from google.cloud import storage
client = storage.Client()
print("Project:", client.project)

Reading data from Google Cloud Storage (GCS)


Similar to other cloud vendors, we can either (A) read data directly from Google Cloud Storage (GCS) into memory, or (B) download a copy into your notebook VM. Since we’re using notebooks as controllers rather than training environments, the recommended approach is reading directly from GCS into memory.

A) Reading data directly into memory

PYTHON

import pandas as pd
import io

bucket_name = "sinkorswim-johndoe-titanic" # ADJUST to your bucket's name

bucket = client.bucket(bucket_name)
blob = bucket.blob("titanic_train.csv")
train_data = pd.read_csv(io.BytesIO(blob.download_as_bytes()))
print(train_data.shape)
train_data.head()

If you get an error, return to the Google Cloud Console (where we created our bucket and VM) and search for “Cloud Shell Editor”. Open a shell editor and run the commands below, replacing the bucket name with your bucket’s name:

SH

# Grant read permissions on the bucket
gcloud storage buckets add-iam-policy-binding gs://sinkorswim-johndoe-titanic \
  --member="serviceAccount:549047673858-compute@developer.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# Grant write permissions on the bucket
gcloud storage buckets add-iam-policy-binding gs://sinkorswim-johndoe-titanic \
  --member="serviceAccount:549047673858-compute@developer.gserviceaccount.com" \
  --role="roles/storage.objectCreator"

# (Only if you also need overwrite/delete)
gcloud storage buckets add-iam-policy-binding gs://sinkorswim-johndoe-titanic \
  --member="serviceAccount:549047673858-compute@developer.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"

B) Downloading a local copy

If you prefer, you can download the file from your bucket to the notebook VM’s local disk. This makes repeated reads faster within our notebook environment, but note that each download counts as a “GET” request and may incur a small data transfer (egress) cost if the bucket and VM are in different regions. If both are in the same region, there are no transfer fees — only standard request costs (typically fractions of a cent).

Let’s verify our current working directory first.

PYTHON

!pwd

PYTHON

blob_name = "titanic_train.csv"
local_path = "/home/jupyter/titanic_train.csv"

bucket = client.bucket(bucket_name)
blob = bucket.blob(blob_name)
blob.download_to_filename(local_path)

!ls -lh /home/jupyter/

Checking storage usage of a bucket


PYTHON

total_size_bytes = 0
bucket = client.bucket(bucket_name)

for blob in client.list_blobs(bucket_name):
    total_size_bytes += blob.size

total_size_mb = total_size_bytes / (1024**2)
print(f"Total size of bucket '{bucket_name}': {total_size_mb:.2f} MB")

Estimating storage costs


PYTHON

storage_price_per_gb = 0.02  # $/GB/month for Standard storage
total_size_gb = total_size_bytes / (1024**3)
monthly_cost = total_size_gb * storage_price_per_gb

print(f"Estimated monthly cost: ${monthly_cost:.4f}")
print(f"Estimated annual cost: ${monthly_cost*12:.4f}")

For updated prices, see GCS Pricing.

Writing output files to GCS


Create a sample file on the notebook VM’s storage.

PYTHON

# Create a sample file locally on the notebook VM
file_path = "/home/jupyter/Notes.txt"
with open(file_path, "w") as f:
    f.write("This is a test note for GCS.")

!ls /home/jupyter

Upload file.

PYTHON

# Point to the right bucket
bucket = client.bucket(bucket_name)

# Create a *Blob* object, which represents a path inside the bucket
# (here it will end up as gs://<bucket_name>/docs/Notes.txt)
blob = bucket.blob("docs/Notes.txt")

# Upload the local file into that blob (object) in GCS
blob.upload_from_filename(file_path)

print("File uploaded successfully.")

List bucket contents:

PYTHON

for blob in client.list_blobs(bucket_name):
    print(blob.name)
Challenge

Challenge: Estimating GCS Costs

Suppose you store 50 GB of data in Standard storage (us-central1) for one month.
- Estimate the monthly storage cost.
- Then estimate the cost if you download (egress) the entire dataset once at the end of the month.

Hints
- Storage: $0.02 per GB-month
- Egress: $0.12 per GB

  • Storage cost: 50 GB × $0.02 = $1.00
  • Egress cost: 50 GB × $0.12 = $6.00
  • Total cost: $7.00 for one month including one full download
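The arithmetic above can be double-checked with a few lines of Python. The prices used are the hint values from this challenge, not necessarily current GCS rates:

```python
# Challenge arithmetic using the hint prices (check the GCS pricing page
# for current rates; these are illustrative workshop values).
STORAGE_PRICE_PER_GB_MONTH = 0.02  # Standard storage
EGRESS_PRICE_PER_GB = 0.12         # one-time internet egress

data_gb = 50
storage_cost = data_gb * STORAGE_PRICE_PER_GB_MONTH  # monthly storage cost
egress_cost = data_gb * EGRESS_PRICE_PER_GB          # one full download
total_cost = storage_cost + egress_cost

print(f"Storage: ${storage_cost:.2f} | Egress: ${egress_cost:.2f} | Total: ${total_cost:.2f}")
```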
Key Points
  • Load data from GCS into memory to avoid managing local copies when possible.
  • Periodically check storage usage and costs to manage your GCS budget.
  • Use Vertex AI Workbench notebooks to upload analysis results back to GCS, keeping workflows organized and reproducible.

Content from Using a GitHub Personal Access Token (PAT) to Push/Pull from a Vertex AI Notebook


Last updated on 2025-10-24

Overview

Questions

  • How can I securely push/pull code to and from GitHub within a Vertex AI Workbench notebook?
  • What steps are necessary to set up a GitHub PAT for authentication in GCP?
  • How can I convert notebooks to .py files and ignore .ipynb files in version control?

Objectives

  • Configure Git in a Vertex AI Workbench notebook to use a GitHub Personal Access Token (PAT) for HTTPS-based authentication.
  • Securely handle credentials in a notebook environment using getpass.
  • Convert .ipynb files to .py files for better version control practices in collaborative projects.

Step 0: Initial setup


In the previous episode, we cloned our forked repository as part of the workshop setup. In this episode, we’ll see how to push our code to this fork. Complete these three setup steps before moving forward.

  1. Clone the fork if you haven’t already. See previous episode.

  2. Start a new Jupyter notebook, and name it something like Interacting-with-git.ipynb. We can use the default Python 3 kernel in Vertex AI Workbench.

  3. Change directory to the workspace where your repository is located. In Vertex AI Workbench, notebooks usually live under /home/jupyter/.

PYTHON

%cd /home/jupyter/

Step 1: Using a GitHub personal access token (PAT) to push/pull from a Vertex AI notebook


When working in Vertex AI Workbench notebooks, you may often need to push code updates to GitHub repositories. Since Workbench VMs may be stopped and restarted, configurations like SSH keys may not persist. HTTPS-based authentication with a GitHub Personal Access Token (PAT) is a practical solution. PATs provide flexibility for authentication and enable seamless interaction with both public and private repositories directly from your notebook.

Important Note: Personal access tokens are powerful credentials. Select the minimum necessary permissions and handle the token carefully.

Generate a personal access token (PAT) on GitHub

  1. Go to Settings in GitHub.
  2. Click Developer settings at the bottom of the left sidebar.
  3. Select Personal access tokens, then click Tokens (classic).
  4. Click Generate new token (classic).
  5. Give your token a descriptive name and set an expiration date if desired.
  6. Select minimum permissions:
    • Public repos: public_repo
    • Private repos: repo
  7. Click Generate token and copy it immediately—you won’t be able to see it again.

Caution: Treat your PAT like a password. Don’t share it or expose it in your code. Use a password manager to store it.

Use getpass to prompt for username and PAT

PYTHON

import getpass

# Prompt for GitHub username and PAT securely
username = input("GitHub Username: ")
token = getpass.getpass("GitHub Personal Access Token (PAT): ")

This way credentials aren’t hard-coded into your notebook.

Step 2: Configure Git settings


PYTHON

!git config --global user.name "Your Name" 
!git config --global user.email your_email@wisc.edu
  • user.name: Will appear in the commit history.
  • user.email: Must match your GitHub account so commits are linked to your profile.

Step 3: Convert .ipynb notebooks to .py


Tracking .py files instead of .ipynb helps with cleaner version control. Notebooks store outputs and metadata, which makes diffs noisy. .py files are lighter and easier to review.

  1. Install Jupytext.

PYTHON

!pip install jupytext
  2. Convert a notebook to .py.

PYTHON

!jupytext --to py Interacting-with-GCS.ipynb
  3. Convert all notebooks in the current directory.

PYTHON

import subprocess, os

for nb in [f for f in os.listdir() if f.endswith('.ipynb')]:
    pyfile = nb.replace('.ipynb', '.py')
    subprocess.run(["jupytext", "--to", "py", nb, "--output", pyfile])
    print(f"Converted {nb} to {pyfile}")

Step 4: Add and commit .py files


PYTHON

%cd /home/jupyter/your-repo
!git status
!git add .
!git commit -m "Converted notebooks to .py files for version control"

Step 5: Add .ipynb to .gitignore


PYTHON

!touch .gitignore
with open(".gitignore", "a") as gitignore:
    gitignore.write("\n# Ignore Jupyter notebooks\n*.ipynb\n")
!cat .gitignore

Add other temporary files too:

PYTHON

with open(".gitignore", "a") as gitignore:
    gitignore.write("\n# Ignore cache and temp files\n__pycache__/\n*.tmp\n*.log\n")

Commit the .gitignore:

PYTHON

!git add .gitignore
!git commit -m "Add .ipynb and temp files to .gitignore"

Step 6: Syncing with GitHub


First, pull the latest changes:

PYTHON

!git config pull.rebase false
!git pull origin main

If conflicts occur, resolve manually before committing.

Then push with your PAT credentials. Note that the token is interpolated into the command, so clear the notebook’s output afterwards rather than committing it:

PYTHON

github_url = f'github.com/{username}/your-repo.git'
!git push https://{username}:{token}@{github_url} main

Step 7: Convert .py back to notebooks (optional)


To convert .py files back to .ipynb after pulling updates:

PYTHON

!jupytext --to notebook Interacting-with-GCS.py --output Interacting-with-GCS.ipynb
Challenge

Challenge: GitHub PAT Workflow

  • Why might you prefer using a PAT with HTTPS instead of SSH keys in Vertex AI Workbench?
  • What are the benefits of converting .ipynb files to .py before committing to a shared repo?
  • PATs with HTTPS are easier to set up in temporary environments where SSH configs don’t persist.
  • Converting notebooks to .py results in cleaner diffs, easier code review, and smaller repos without stored outputs/metadata.
Key Points
  • Use a GitHub PAT for HTTPS-based authentication in Vertex AI Workbench notebooks.
  • Securely enter sensitive information in notebooks using getpass.
  • Converting .ipynb files to .py files helps with cleaner version control.
  • Adding .ipynb files to .gitignore keeps your repository organized.

Content from Training Models in Vertex AI: Intro


Last updated on 2025-10-30

Overview

Questions

  • What are the differences between training locally in a Vertex AI notebook and using Vertex AI-managed training jobs?
  • How do custom training jobs in Vertex AI streamline the training process for various frameworks?
  • How does Vertex AI handle scaling across CPUs, GPUs, and TPUs?

Objectives

  • Understand the difference between local training in a Vertex AI Workbench notebook and submitting managed training jobs.
  • Learn to configure and use Vertex AI custom training jobs for different frameworks (e.g., XGBoost, PyTorch, SKLearn).
  • Understand scaling options in Vertex AI, including when to use CPUs, GPUs, or TPUs.
  • Compare performance, cost, and setup between custom scripts and pre-built containers in Vertex AI.
  • Conduct training with data stored in GCS and monitor training job status using the Google Cloud Console.

Initial setup


1. Open pre-filled notebook

Navigate to /Intro_GCP_for_ML/notebooks/06-Training-models-in-VertexAI.ipynb to begin this notebook.

2. CD to instance home directory

To ensure we’re all in the same starting spot, change directory to your Jupyter home directory.

PYTHON

%cd /home/jupyter/

3. Set environment variables

This code initializes the Vertex AI environment by importing the Python SDK, setting the project, region, and defining a GCS bucket for input/output data.

  • PROJECT_ID: Identifies your GCP project.
  • REGION: Determines where training jobs run (choose a region close to your data).

PYTHON

from google.cloud import storage
client = storage.Client()
PROJECT_ID = client.project
REGION = "us-central1"
BUCKET_NAME = "sinkorswim-johndoe-titanic" # ADJUST to your bucket's name
LAST_NAME = "DOE" # ADJUST to your last name or name
print(f"project = {PROJECT_ID}\nregion = {REGION}\nbucket = {BUCKET_NAME}")

Testing train_xgboost.py locally in the notebook


Discussion

Understanding the XGBoost Training Script (GCP version)

Take a moment to review the train_xgboost.py script we’re using on GCP, found in Intro_GCP_for_ML/scripts/train_xgboost.py. This script handles preprocessing, training, and saving an XGBoost model, supports both local paths and GCS (gs://) paths, and adapts to Vertex AI conventions (e.g., AIP_MODEL_DIR).

Try answering the following questions:

  1. Data preprocessing: What transformations are applied to the dataset before training?
  2. Training function: What does the train_model() function do? Why print the training time?
  3. Command-line arguments: What is the purpose of argparse in this script? How would you change the number of training rounds?
  4. Handling local vs. GCP runs: How does the script let you run the same code locally, in Workbench, or as a Vertex AI job? Which environment variable controls where the model artifact is written?
  5. Training and saving the model: What format is the dataset converted to before training, and why? How does the script save to a local path vs. a gs:// destination?

After reviewing, discuss any questions or observations with your group.

  1. Data preprocessing: The script fills missing values (Age with median, Embarked with mode), maps categorical fields to numeric (Sex → {male:1, female:0}, Embarked → {S:0, C:1, Q:2}), and drops non-predictive columns (Name, Ticket, Cabin).
  2. Training function: train_model() constructs and fits an XGBoost model with the provided parameters and prints wall-clock training time. Timing helps compare runs and make sensible scaling choices.
  3. Command-line arguments: argparse lets you set hyperparameters and file paths without editing code (e.g., --max_depth, --eta, --num_round, --train). To change rounds: python train_xgboost.py --num_round 200
  4. Handling local vs. GCP runs:
    • Input: You pass --train as either a local path (train.csv) or a GCS URI (gs://bucket/path.csv). The script automatically detects gs:// and reads the file directly from Cloud Storage using the Python client.
    • Output: If the environment variable AIP_MODEL_DIR is set (as it is in Vertex AI CustomJobs), the trained model is written there—often a gs:// path. Otherwise, the model is saved in the current working directory, which works seamlessly in both local and Workbench environments.
  5. Training and saving the model:
    The training data is converted into an XGBoost DMatrix, an optimized format that speeds up training and reduces memory use. The trained model is serialized with joblib. When saving locally, the file is written directly to disk. If saving to a Cloud Storage path (gs://...), the model is first saved to a temporary file and then uploaded to the specified bucket.

Before scaling training jobs onto managed resources, it’s essential to test your training script locally. This prevents wasting GPU/TPU time on bugs or misconfigured code.

Guidelines for testing ML pipelines before scaling

  • Run tests locally first with small datasets.
  • Use a subset of your dataset (1–5%) for fast checks.
  • Start with minimal compute before moving to larger accelerators.
  • Log key metrics such as loss curves and runtimes.
  • Verify correctness first before scaling up.

What tests should we do before scaling?

Before scaling to multiple or more powerful instances (e.g., GPUs or TPUs), it’s important to run a few sanity checks. Skipping these can lead to: silent data bugs, runtime blowups at scale, inefficient experiments, or broken model artifacts.

Here is a non-exhaustive list of suggested tests to perform before scaling up your compute needs.

  • Reproducibility - do you get the same result each time you run your code? If not, set seeds controlling randomness.
  • Data loads correctly – dataset loads without errors, expected columns exist, missing values handled.
  • Overfitting check – train on a tiny dataset (e.g., 100 rows). If it doesn’t overfit, something is off.
  • Loss behavior – verify training loss decreases and doesn’t diverge.
  • Runtime estimate – get a rough sense of training time on small data.
  • Memory estimate – check approximate memory use.
  • Save & reload – ensure model saves, reloads, and infers without errors.
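As a concrete example of the overfitting and reproducibility checks, here is a self-contained sketch using a pure-NumPy logistic regression as a stand-in for your actual model and data:

```python
import numpy as np

# Tiny overfitting check: a model with enough capacity should drive
# training error to ~0 on a small, learnable dataset.
rng = np.random.default_rng(42)               # fixed seed -> reproducible runs
X = rng.normal(size=(100, 7))                 # 100 rows, 7 features (Titanic-sized)
w_true = rng.normal(size=7)
y = (X @ w_true > 0).astype(float)            # linearly separable labels

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -30, 30)))  # clip to avoid overflow

w = np.zeros(7)
for _ in range(2000):                         # plain gradient descent
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)
    w -= 0.5 * grad

train_acc = ((sigmoid(X @ w) > 0.5) == y).mean()
print(f"Tiny-set training accuracy: {train_acc:.2f}")
```

If the training accuracy stays well below 1.0 here, something in the pipeline (labels, preprocessing, loss) deserves a closer look before any scaling.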

Download data into notebook environment


Sometimes it’s helpful to keep a copy of data in your notebook VM for quick iteration, even though GCS is the preferred storage location.

PYTHON

from google.cloud import storage

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

blob = bucket.blob("titanic_train.csv")
blob.download_to_filename("/home/jupyter/titanic_train.csv")

print("Downloaded titanic_train.csv")

Local test run of train_xgboost.py


Outside of this workshop, you should run these kinds of tests on your local laptop or lab PC when possible. We’re using the Workbench VM here only for convenience, though it does incur a small fee while the VM runs.

  • For large datasets, test locally on a small representative sample, just enough to verify the code runs and the model can nearly perfectly overfit it after enough epochs.
  • For larger models, test locally with a smaller equivalent (e.g., 100M instead of 7B parameters).

PYTHON

# We need to add xgboost to our VM before running the script
!pip install xgboost

PYTHON

# Training configuration parameters for XGBoost
MAX_DEPTH = 3         # maximum depth of each decision tree (controls model complexity)
ETA = 0.1             # learning rate (how much each tree contributes to the overall model)
SUBSAMPLE = 0.8       # fraction of training samples used per boosting round (prevents overfitting)
COLSAMPLE = 0.8       # fraction of features (columns) sampled per tree (adds randomness and diversity)
NUM_ROUND = 100       # number of boosting iterations (trees) to train

import time as t
start = t.time()

# Run the custom training script with hyperparameters defined above
!python Intro_GCP_for_ML/scripts/train_xgboost.py \
    --max_depth $MAX_DEPTH \
    --eta $ETA \
    --subsample $SUBSAMPLE \
    --colsample_bytree $COLSAMPLE \
    --num_round $NUM_ROUND \
    --train titanic_train.csv

print(f"Total local runtime: {t.time() - start:.2f} seconds")

Training on this small dataset should take <1 minute. Log runtime as a baseline. You should see the following output file:

  • xgboost-model # Python-serialized XGBoost model (Booster) via joblib; load with joblib.load for reuse.

Evaluate the trained model on validation data


Now that we’ve trained and saved an XGBoost model, we want to do the most important sanity check:
Does this model make reasonable predictions on unseen data?

This step:

  1. Loads the serialized model artifact that was written by train_xgboost.py
  2. Loads a test set of Titanic passenger data
  3. Applies the same preprocessing as training
  4. Generates predictions
  5. Computes simple accuracy

First, we’ll download the test data

PYTHON

blob = bucket.blob("titanic_test.csv")
blob.download_to_filename("titanic_test.csv")

print("Downloaded titanic_test.csv")

Then, we apply the same preprocessing function used by our training script before applying the model to our data.

PYTHON

import pandas as pd
import xgboost as xgb
import joblib
from sklearn.metrics import accuracy_score
from Intro_GCP_for_ML.scripts.train_xgboost import preprocess_data  # reuse same preprocessing

# Load test data
test_df = pd.read_csv("titanic_test.csv")

# Apply same preprocessing from training
X_test, y_test = preprocess_data(test_df)

# Load trained model from local file
model = joblib.load("xgboost-model")

# Predict on test data
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)
y_pred_binary = (y_pred > 0.5).astype(int)

# Compute accuracy
acc = accuracy_score(y_test, y_pred_binary)
print(f"Test accuracy: {acc:.3f}")

Training via Vertex AI custom training job


Unlike “local” training using our notebook’s VM, this next approach launches a managed training job that runs on scalable compute. Vertex AI handles provisioning, scaling, logging, and saving outputs to GCS.

Which machine type to start with?

Start with a small CPU machine like n1-standard-4. Only scale up to GPUs/TPUs once you’ve verified your script. See Instances for ML on GCP for guidance.

PYTHON

MACHINE = 'n1-standard-4'

Creating a custom training job with the SDK

We’ll first initialize the Vertex AI platform with our environment variables. We’ll also set a RUN_ID and ARTIFACT_DIR to help store outputs.

PYTHON

from google.cloud import aiplatform
import datetime as dt
RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/xgb/{RUN_ID}/"  # all job outputs will live under this path
print(f"project = {PROJECT_ID}\nregion = {REGION}\nbucket = {BUCKET_NAME}\nartifact_dir = {ARTIFACT_DIR}")

# Staging bucket is only for the SDK's temp code tarball (aiplatform-*.tar.gz)
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}/.vertex_staging")

This next section defines a custom training job in Vertex AI, specifying how and where the training code will run. It points to your training script (train_xgboost.py), uses Google’s prebuilt XGBoost training container image, and installs any extra dependencies your script needs (in this case, google-cloud-storage for accessing GCS). The display_name sets a readable name for tracking the job in the Vertex AI console.

Prebuilt containers for training

Vertex AI provides prebuilt Docker container images for model training. These containers are organized by machine learning frameworks and framework versions and include common dependencies that you might want to use in your training code. To learn more about prebuilt training containers, see Prebuilt containers for custom training.

PYTHON


job = aiplatform.CustomTrainingJob(
    display_name=f"{LAST_NAME}_xgb_{RUN_ID}",
    script_path="Intro_GCP_for_ML/scripts/train_xgboost.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.2-1:latest",
)

Finally, this next block launches the custom training job on Vertex AI using the configuration defined earlier. We won’t be charged for the selected MACHINE until we run the code below via job.run(), which marks the point when our script actually begins executing remotely on the Vertex training infrastructure. Once job.run() is called, Vertex handles packaging your training script, transferring it to the managed training environment, provisioning the requested compute instance, and monitoring the run. The job’s status and logs can be viewed directly in the Vertex AI Console under Training → Custom jobs.

If you need to cancel or modify a job mid-run, you can do so from the console or via the SDK by calling job.cancel(). When the job completes, Vertex automatically tears down the compute resources so you only pay for the active training time.

  • The args list passes command-line parameters directly into your training script, including hyperparameters and the path to the training data in GCS.
  • base_output_dir specifies where all outputs (model, metrics, logs) will be written in Cloud Storage
  • machine_type controls the compute resources used for training.
  • When sync=True, the notebook waits until the job finishes before continuing, making it easier to inspect results immediately after training.

PYTHON

job.run(
    args=[
        f"--train=gs://{BUCKET_NAME}/titanic_train.csv",
        f"--max_depth={MAX_DEPTH}",
        f"--eta={ETA}",
        f"--subsample={SUBSAMPLE}",
        f"--colsample_bytree={COLSAMPLE}",
        f"--num_round={NUM_ROUND}",
    ],
    replica_count=1,
    machine_type=MACHINE, # MACHINE variable defined above; adjust to something more powerful when needed
    base_output_dir=ARTIFACT_DIR,  # sets AIP_MODEL_DIR for your script
    sync=True,
)

print("Model + logs folder:", ARTIFACT_DIR)

This launches a managed training job with Vertex AI. It should take 2-5 minutes for the training job to complete.

Understanding the training output message

After your job finishes, you may see a message like: Training did not produce a Managed Model returning None. This is expected when running a CustomTrainingJob without specifying deployment parameters. Vertex AI supports two modes:

  • CustomTrainingJob (research/development) – You control training and save models/logs to Cloud Storage via AIP_MODEL_DIR. This is ideal for experimentation and cost control.
  • TrainingPipeline (for deployment) – You include model_serving_container_image_uri and model_display_name, and Vertex automatically registers a Managed Model in the Model Registry for deployment to an endpoint.

In our setup, we’re intentionally using the simpler CustomTrainingJob path. Your trained model is safely stored under your specified artifact directory (e.g., gs://{BUCKET_NAME}/artifacts/xgb/{RUN_ID}/), and you can later register or deploy it manually when ready.
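If you later decide to register the trained model, the upload call would look roughly like the following. This is a hypothetical sketch: the display name, bucket placeholders, and the serving-container tag are all illustrative, so check Vertex AI’s list of prebuilt prediction containers for a tag matching your XGBoost version before running it.

```python
# Hypothetical later step: register the saved artifact as a Managed Model.
# artifact_uri must point at the directory *containing* the model file
# (the model/ subfolder Vertex created under your base_output_dir).
upload_kwargs = dict(
    display_name="titanic-xgb",                                       # illustrative name
    artifact_uri="gs://<your-bucket>/artifacts/xgb/<RUN_ID>/model/",  # e.g. ARTIFACT_DIR + "model/"
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.2-1:latest"  # illustrative tag
    ),
)

# Uncomment to actually register (requires GCP credentials):
# from google.cloud import aiplatform
# registered_model = aiplatform.Model.upload(**upload_kwargs)
```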

Monitoring training jobs in the Console


  1. Go to the Google Cloud Console.
  2. Navigate to Vertex AI > Training > Custom Jobs.
  3. Click on your job name to see status, logs, and output model artifacts.
  4. Cancel jobs from the console if needed (be careful not to stop jobs you don’t own in shared projects).

Training artifacts


After the training run completes, we can inspect our bucket using the Google Cloud Console or run the code below.

PYTHON

total_size_bytes = 0

for blob in client.list_blobs(BUCKET_NAME):
    total_size_bytes += blob.size
    print(blob.name)

total_size_mb = total_size_bytes / (1024**2)
print(f"Total size of bucket '{BUCKET_NAME}': {total_size_mb:.2f} MB")

Training Artifacts → ARTIFACT_DIR

This is your intended output location, set via base_output_dir.
It contains everything your training script explicitly writes. In our case, this includes:

  • {BUCKET_NAME}/artifacts/xgb/{RUN_ID}/model/xgboost-model — Serialized XGBoost model (Booster) saved via joblib (Vertex AI appends model/ to base_output_dir when setting AIP_MODEL_DIR); reload later with joblib.load() for reuse or deployment.

System-Generated Files

Additional system-generated files (e.g., Vertex’s .tar.gz code package or executor_output.json) will appear under .vertex_staging/ and can be safely ignored or auto-deleted via lifecycle rules.

Evaluate the trained model stored on GCS


PYTHON

import io
# Load test data directly into memory
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob("titanic_test.csv")
test_df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))


# Apply same preprocessing logic used during training
X_test, y_test = preprocess_data(test_df)

# -------------------------
# Load the model artifact from GCS
# -------------------------
MODEL_BLOB_PATH = f"artifacts/xgb/{RUN_ID}/model/xgboost-model"
model_blob = bucket.blob(MODEL_BLOB_PATH)
model_bytes = model_blob.download_as_bytes()
model = joblib.load(io.BytesIO(model_bytes))

# -------------------------
# Run predictions and compute accuracy
# -------------------------
dtest = xgb.DMatrix(X_test)
y_pred_prob = model.predict(dtest)
y_pred = (y_pred_prob >= 0.5).astype(int)

acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy (model from Vertex job): {acc:.3f}")

When training takes too long

Two main options in Vertex AI:

Option 1: Upgrade to more powerful machine types
- The simplest way to reduce training time is to use a larger machine or add GPUs (e.g., T4, V100, A100).
- This works best for small or medium datasets (<10 GB) and avoids the coordination overhead of distributed training.
- GPUs and TPUs can accelerate deep learning workloads significantly.

Option 2: Use distributed training with multiple replicas
- Vertex AI supports distributed training for many frameworks.
- The dataset is split across replicas, each training a portion of the data with synchronized gradient updates.
- This approach is most useful for large datasets and long-running jobs.

When distributed training makes sense
- Dataset size exceeds 10–50 GB.
- Training on a single machine takes more than 10 hours.
- The model is a deep learning workload that scales naturally across GPUs or TPUs.

We will explore both options more in depth in the next episode when we train a neural network.

Key Points
  • Environment initialization: Use aiplatform.init() to set defaults for project, region, and bucket.
  • Local vs managed training: Test locally before scaling into managed jobs.
  • Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
  • Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
  • Monitoring: Track job logs and artifacts in the Vertex AI Console.

Content from Training Models in Vertex AI: PyTorch Example


Last updated on 2025-10-30

Overview

Questions

  • When should you consider a GPU (or TPU) instance for PyTorch training in Vertex AI, and what are the trade‑offs for small vs. large workloads?
  • How do you launch a script‑based training job and write all artifacts (model, metrics, logs) next to each other in GCS without deploying a managed model?

Objectives

  • Prepare the Titanic dataset and save train/val arrays to compressed .npz files in GCS.
  • Submit a CustomTrainingJob that runs a PyTorch script and explicitly writes outputs to a chosen gs://…/artifacts/.../ folder.
  • Co‑locate artifacts: model.pt (or .joblib), metrics.json, eval_history.csv, and training.log for reproducibility.
  • Choose CPU vs. GPU instances sensibly; understand when distributed training is (not) worth it.

Initial setup


1. Open pre-filled notebook

Navigate to /Intro_GCP_for_ML/notebooks/06-Training-models-in-VertexAI-GPUs.ipynb to begin this notebook, and select the PyTorch environment (kernel). Local PyTorch is only needed for local tests; your Vertex AI job uses the container specified by container_uri (e.g., pytorch-cpu.2-1 or pytorch-gpu.2-1), so it brings its own framework at run time.

2. CD to instance home directory

To ensure we’re all in the same starting spot, change directory to your Jupyter home directory.

PYTHON

%cd /home/jupyter/

3. Set environment variables

This code initializes the Vertex AI environment by importing the Python SDK, setting the project, region, and defining a GCS bucket for input/output data.

PYTHON

from google.cloud import aiplatform, storage
client = storage.Client()
PROJECT_ID = client.project
REGION = "us-central1"
BUCKET_NAME = "sinkorswim-johndoe-titanic" # ADJUST to your bucket's name
LAST_NAME = 'DOE' # ADJUST to your last name. Since we're in a shared account environment, this will help us track down jobs in the Console

print(f"project = {PROJECT_ID}\nregion = {REGION}\nbucket = {BUCKET_NAME}")

# Initialize the Vertex AI environment with the correct project and location.
# The staging bucket stores the compressed code packages (tarballs) that the SDK uploads for training/tuning jobs.
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}/.vertex_staging")

Prepare data as .npz


Why .npz? NumPy’s .npz files are compressed binary containers that can store multiple arrays (e.g., features and labels) together in a single file. They offer numerous benefits:

  • Smaller, faster I/O than CSV for arrays.
  • One file can hold multiple arrays (X_train, y_train).
  • Natural fit for torch.utils.data.Dataset / DataLoader.
  • Cloud-friendly: compressed .npz files reduce upload and download times and minimize GCS egress costs. Because each .npz is a single binary object, reading it from Google Cloud Storage (GCS) requires only one network call—much faster and cheaper than streaming many small CSVs or images individually.
  • Efficient data movement: when you launch a Vertex AI training job, GCS objects referenced in your script (for example, gs://.../train_data.npz) are automatically staged to the job’s VM or container at runtime. Vertex copies these objects into its local scratch disk before execution, so subsequent reads (e.g., np.load(...)) occur from local storage rather than directly over the network. For small-to-medium datasets, this happens transparently and incurs minimal startup delay.
  • Reproducible binary format: unlike CSV, .npz preserves exact dtypes and shapes, ensuring identical results across different environments and containers.
  • Each GCS object read or listing request incurs a small per-request cost; using a single .npz reduces both the number of API calls and associated latency.

PYTHON

import pandas as pd
import io
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load the Titanic CSV directly from our GCS bucket (previously uploaded)
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob("titanic_train.csv")
df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))

# Minimal preprocessing to numeric arrays
sex_enc = LabelEncoder().fit(df["Sex"])            # Fit label encoder on 'Sex' column (male/female)
df["Sex"] = sex_enc.transform(df["Sex"])           # Convert 'Sex' to numeric values (e.g., male=1, female=0)
df["Embarked"] = df["Embarked"].fillna("S")       # Replace missing embarkation ports with most common ('S')
emb_enc = LabelEncoder().fit(df["Embarked"])       # Fit label encoder on 'Embarked' column (S/C/Q)
df["Embarked"] = emb_enc.transform(df["Embarked"]) # Convert embarkation categories to numeric codes
df["Age"] = df["Age"].fillna(df["Age"].median())   # Fill missing ages with median (robust to outliers)
df["Fare"] = df["Fare"].fillna(df["Fare"].median())# Fill missing fares with median to avoid NaNs

X = df[["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]].values  # Select numeric feature columns as input
y = df["Survived"].values                                                # Target variable (1=survived, 0=did not survive)

scaler = StandardScaler()                                                # Initialize standard scaler for standardization (best practice for neural net training)
X = scaler.fit_transform(X)                                              # Scale features to mean=0, std=1 for stable training

X_train, X_val, y_train, y_val = train_test_split(                       # Split dataset into training and validation sets
    X, y, test_size=0.2, random_state=42)                                # 80% training, 20% validation (fixed random seed)

np.savez("/home/jupyter/train_data.npz", X_train=X_train, y_train=y_train)             # Save training arrays to compressed .npz file
np.savez("/home/jupyter/val_data.npz",   X_val=X_val,   y_val=y_val)                   # Save validation arrays to compressed .npz file

We can then upload the files to our GCS bucket.

PYTHON

# Upload to GCS
bucket.blob("data/train_data.npz").upload_from_filename("/home/jupyter/train_data.npz")
bucket.blob("data/val_data.npz").upload_from_filename("/home/jupyter/val_data.npz")
print("Uploaded: gs://%s/data/train_data.npz and val_data.npz" % BUCKET_NAME)

To check our work (bucket contents), we can again use the following code:

PYTHON

total_size_bytes = 0
# bucket = client.bucket(BUCKET_NAME)

for blob in client.list_blobs(BUCKET_NAME):
    total_size_bytes += blob.size
    print(blob.name)

total_size_mb = total_size_bytes / (1024**2)
print(f"Total size of bucket '{BUCKET_NAME}': {total_size_mb:.2f} MB")

Minimal PyTorch training script (train_nn.py) - local test


Outside of this workshop, you should run these kinds of tests on your local laptop or lab PC when possible. We’re using the Workbench VM here only for convenience in this workshop setting, but this does incur a small fee for our running VM.

  • For large datasets, test locally with a small representative sample of the full dataset (i.e., just to verify that the code runs and that the model can overfit it nearly perfectly after enough epochs)
  • For larger models, use smaller model equivalents (e.g., 100M vs 7B params) when testing locally

Find this file in our repo: Intro_GCP_for_ML/scripts/train_nn.py. It does three things: 1) loads .npz data from local disk or GCS, 2) trains a tiny multilayer perceptron (MLP), and 3) writes all outputs side-by-side (model + metrics + eval history + training.log) to the same --model_out folder.
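As an illustration of the DataLoader point from the list above, a .npz file can be wrapped in a Dataset in a few lines (the NpzDataset class here is our own sketch, not code from train_nn.py):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NpzDataset(Dataset):
    """Serve (features, label) pairs straight from a .npz file."""
    def __init__(self, path, x_key="X_train", y_key="y_train"):
        d = np.load(path)
        self.X = torch.tensor(d[x_key], dtype=torch.float32)
        self.y = torch.tensor(d[y_key], dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Usage (path assumes the file saved earlier in this episode):
# loader = DataLoader(NpzDataset("/home/jupyter/train_data.npz"),
#                     batch_size=32, shuffle=True)
```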

To test this code, we can run the following:

PYTHON

# configure training hyperparameters to use in all model training runs downstream
MAX_EPOCHS = 500
LR =  0.001
PATIENCE = 50

# local training run
import time as t

start = t.time()

# Example: run your custom training script with args
!python /home/jupyter/Intro_GCP_for_ML/scripts/train_nn.py \
    --train /home/jupyter/train_data.npz \
    --val /home/jupyter/val_data.npz \
    --epochs $MAX_EPOCHS \
    --learning_rate $LR \
    --patience $PATIENCE

print(f"Total local runtime: {t.time() - start:.2f} seconds")

If you hit a NumPy version mismatch error, uncomment the cell below and run it (select the lines and press Ctrl+/ to toggle comments across multiple lines).

PYTHON

# # Fix numpy mismatch
# !pip install --upgrade --force-reinstall "numpy<2"

# # Then, rerun:

# import time as t

# start = t.time()

# # Example: run your custom training script with args
# !python /home/jupyter/Intro_GCP_for_ML/scripts/train_nn.py \
#     --train /home/jupyter/train_data.npz \
#     --val /home/jupyter/val_data.npz \
#     --epochs $MAX_EPOCHS \
#     --learning_rate $LR \
#     --patience $PATIENCE


# print(f"Total local runtime: {t.time() - start:.2f} seconds")

Reproducibility test

An essential component of applied ML/AI is ensuring our experiments are reproducible; without reproducibility, we can't draw reliable conclusions about the efficacy of our methods. Let's first rerun the same code as above to verify we get the same result.

  • Take a look near the top of Intro_GCP_for_ML/scripts/train_nn.py where we are setting multiple numpy and torch seeds to ensure reproducibility.
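The pattern looks roughly like this (a sketch of typical seed handling, not the file's exact code):

```python
import random
import numpy as np
import torch

def set_seeds(seed: int = 42):
    """Seed every RNG that affects training so reruns are directly comparable."""
    random.seed(seed)          # Python's built-in RNG
    np.random.seed(seed)       # NumPy (data splits, shuffles)
    torch.manual_seed(seed)    # PyTorch CPU (also seeds the default CUDA generator)
    torch.cuda.manual_seed_all(seed)  # all GPUs; a no-op on CPU-only machines

set_seeds(42)
```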

PYTHON

import time as t

start = t.time()

# Example: run your custom training script with args
!python /home/jupyter/Intro_GCP_for_ML/scripts/train_nn.py \
    --train /home/jupyter/train_data.npz \
    --val /home/jupyter/val_data.npz \
    --epochs $MAX_EPOCHS \
    --learning_rate $LR \
    --patience $PATIENCE

print(f"Total local runtime: {t.time() - start:.2f} seconds")

Please don’t use cloud resources for code that is not reproducible!

Evaluate the locally trained model on the validation data

PYTHON

import sys, torch, numpy as np
sys.path.append("/home/jupyter/Intro_GCP_for_ML/scripts")
from train_nn import TitanicNet

# load validation data
d = np.load("/home/jupyter/val_data.npz")
X_val, y_val = d["X_val"], d["y_val"]

# tensors
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val, dtype=torch.long)

# rebuild model and load weights
m = TitanicNet()
state = torch.load("/home/jupyter/model.pt", map_location="cpu")
m.load_state_dict(state)
m.eval()

with torch.no_grad():
    probs = m(X_val_t).squeeze(1)                # [N], sigmoid outputs in (0,1)
    preds_t = (probs >= 0.5).long()              # [N] int64
    correct = (preds_t == y_val_t).sum().item()
    acc = correct / y_val_t.shape[0]

print(f"Local model val accuracy: {acc:.4f}")

We should see an accuracy that matches our best epoch from the local training run. Note that in our setup, early stopping is based on validation loss, not accuracy.
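The patience-based early stopping used throughout this lesson follows a standard pattern; here is a simplified sketch of the logic (our own illustration; train_nn.py's implementation may differ in details):

```python
def should_stop(val_losses, patience=50, min_delta=0.0):
    """Return True once val loss hasn't improved by min_delta for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])   # best loss seen before the window
    recent_best = min(val_losses[-patience:])   # best loss within the window
    return recent_best > best_before - min_delta

# Example: loss plateaus after epoch 3, so training would stop
history = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63]
print(should_stop(history, patience=3))  # True: no improvement in the last 3 epochs
```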

Launch the training job


In the previous episode, we trained an XGBoost model using Vertex AI’s CustomTrainingJob interface. Here, we’ll do the same for a PyTorch neural network. The structure is nearly identical — we define a training script, select a prebuilt container (CPU or GPU), and specify where to write all outputs in Google Cloud Storage (GCS). The main difference is that PyTorch requires us to save our own model weights and metrics inside the script rather than relying on Vertex to package a model automatically.

Set training job configuration vars

For our image, we can find the corresponding PyTorch image by visiting: cloud.google.com/vertex-ai/docs/training/pre-built-containers#pytorch

PYTHON

import datetime as dt
RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/pytorch/{RUN_ID}"
IMAGE = 'us-docker.pkg.dev/vertex-ai/training/pytorch-xla.2-4.py310:latest' # cpu-only version
MACHINE = "n1-standard-4" # CPU fine for small datasets

print(f"RUN_ID = {RUN_ID}\nARTIFACT_DIR = {ARTIFACT_DIR}\nMACHINE = {MACHINE}")

Init the training job with configurations

PYTHON

# init job (this does not consume any resources)
DISPLAY_NAME = f"{LAST_NAME}_pytorch_nn_{RUN_ID}" 
print(DISPLAY_NAME)

# init the job. This does not consume resources until we run job.run()
job = aiplatform.CustomTrainingJob(
    display_name=DISPLAY_NAME,
    script_path="Intro_GCP_for_ML/scripts/train_nn.py",
    container_uri=IMAGE)

Run the job, paying for our MACHINE on-demand.

PYTHON

job.run(
    args=[
        f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
        f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
        f"--epochs={MAX_EPOCHS}",
        f"--learning_rate={LR}",
        f"--patience={PATIENCE}",
    ],
    replica_count=1,
    machine_type=MACHINE,
    base_output_dir=ARTIFACT_DIR,  # sets AIP_MODEL_DIR used by your script
    sync=True,
)
print("Artifacts folder:", ARTIFACT_DIR)

Monitoring training jobs in the Console

  1. Go to the Google Cloud Console.
  2. Navigate to Vertex AI > Training > Custom Jobs.
  3. Click on your job name to see status, logs, and output model artifacts.
  4. Cancel jobs from the console if needed (be careful not to stop jobs you don’t own in shared projects).

Quick link: https://console.cloud.google.com/vertex-ai/training/training-pipelines?hl=en&project=doit-rci-mlm25-4626

Check our bucket contents to verify expected outputs are there.

PYTHON

total_size_bytes = 0
# bucket = client.bucket(BUCKET_NAME)

for blob in client.list_blobs(BUCKET_NAME):
    total_size_bytes += blob.size
    print(blob.name)

total_size_mb = total_size_bytes / (1024**2)
print(f"Total size of bucket '{BUCKET_NAME}': {total_size_mb:.2f} MB")

What you’ll see in gs://…/artifacts/pytorch/<RUN_ID>/:

  • model.pt — PyTorch weights (state_dict).
  • metrics.json — final val loss, hyperparameters, dataset sizes, device, model URI.
  • eval_history.csv — per‑epoch validation loss (for plots/regression checks).
  • training.log — complete stdout/stderr for reproducibility and debugging.

Evaluate the Vertex-trained model on the validation data

We can check our work to see whether this model gives the same result as our "locally" trained model above.

To follow best practices, we will simply load this model into memory from GCS.

PYTHON

import sys, torch, numpy as np
sys.path.append("/home/jupyter/Intro_GCP_for_ML/scripts")
from train_nn import TitanicNet

# -----------------
# download model.pt straight into memory and load weights
# -----------------

ARTIFACT_PREFIX = f"artifacts/pytorch/{RUN_ID}/model"

MODEL_PATH = f"{ARTIFACT_PREFIX}/model.pt"
model_blob = bucket.blob(MODEL_PATH)
model_bytes = model_blob.download_as_bytes()

# load from bytes
model_pt = io.BytesIO(model_bytes)

# rebuild model and load weights
state = torch.load(model_pt, map_location="cpu")
m = TitanicNet()
m.load_state_dict(state)
m.eval(); # set model to eval mode

# -----------------
# ALT: download copy of model into VM (costs extra storage)
# -----------------
# # Copy model.pt from GCS (replace RUN_ID with your run folder)
# !gsutil cp {ARTIFACT_DIR}/model/model.pt /home/jupyter/model_vertex.pt
# !ls
# # rebuild model and load weights
# m = TitanicNet()
# state = torch.load("/home/jupyter/model_vertex.pt", map_location="cpu")  
# m.load_state_dict(state)
# m.eval()

As before, we can run our model evaluation code with this model.

To follow best practices, we will read our validation data from GCS and avoid having a copy in our VM.

PYTHON

# read validation data into memory
VAL_PATH = "data/val_data.npz"
val_blob = bucket.blob(VAL_PATH)
val_bytes = val_blob.download_as_bytes()
d = np.load(io.BytesIO(val_bytes))
X_val, y_val = d["X_val"], d["y_val"]
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val, dtype=torch.long)   # needed below when computing accuracy

# get predictions
with torch.no_grad():
    probs = m(X_val_t).squeeze(1)         # [N], sigmoid outputs in (0,1)
    preds_t = (probs >= 0.5).long()       # threshold at 0.5 -> class label 0/1
    correct = (preds_t == y_val_t).sum().item()
    acc = correct / y_val_t.shape[0]

print(f"Vertex model val accuracy: {acc:.4f}")

GPU-Accelerated Training on Vertex AI


In the previous example, we ran our PyTorch training job on a CPU-only machine using the prebuilt CPU container (pytorch-xla.2-4.py310 in our setup). That setup works well for small models or quick tests since CPU instances are cheaper and start faster.

In this section, we’ll attach a GPU to our Vertex AI training job to speed up heavier workloads. The workflow is nearly identical to the CPU version, except for a few changes:

  • The container image switches to the GPU-enabled version (pytorch-gpu.2-4.py310:latest), which includes CUDA and cuDNN.
  • The machine type (n1-standard-8) defines CPU and memory resources, while we now add a GPU accelerator (NVIDIA_TESLA_T4, NVIDIA_L4, etc.). For guidance on selecting a machine type and accelerator, visit the Compute for ML resource.
  • The training script, arguments, and artifact handling all stay the same.

This makes it easy to start with a CPU run for testing, then scale up to GPU training by changing only the image and adding accelerator parameters.
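Inside the training script, taking advantage of the attached GPU usually comes down to a device switch like the following (a sketch; train_nn.py may already contain the equivalent):

```python
import torch

# Pick the GPU when the container sees one (CUDA image + attached accelerator), else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Then move the model and each batch onto that device, e.g.:
# model = TitanicNet().to(device)
# X_batch, y_batch = X_batch.to(device), y_batch.to(device)
print(f"Training on: {device}")
```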

PYTHON

from google.cloud import aiplatform

LAST_NAME = "DOE"  # Your last name goes in the job display name so it's easy to find in the Console
RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")

# GCS folder where ALL artifacts (model.pt, metrics.json, eval_history.csv, training.log) will be saved.
# Your train_nn.py writes to AIP_MODEL_DIR, and base_output_dir (below) sets that variable for the job.
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/pytorch/{RUN_ID}"

# ---- Container image ----
# Use a prebuilt TRAINING image that has PyTorch + CUDA. This enables GPU at runtime.
IMAGE = "us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-4.py310:latest"

# ---- Machine vs Accelerator (important!) ----
# machine_type = the VM's CPU/RAM shape. It is NOT a GPU by itself.
# We often pick n1-standard-8 as a balanced baseline for single-GPU jobs.
MACHINE = "n1-standard-8"

# To actually get a GPU, you *attach* one via accelerator_type + accelerator_count.
# Common choices:
#   "NVIDIA_TESLA_T4" (cost-effective, widely available)
#   "NVIDIA_L4"       (newer, CUDA 12.x, good perf/$)
#   "NVIDIA_TESLA_V100" / "NVIDIA_A100_40GB" (high-end, pricey)
ACCELERATOR_TYPE = "NVIDIA_TESLA_T4"
ACCELERATOR_COUNT = 1  # Increase (2,4) only if your code supports multi-GPU (e.g., DDP)

# Alternative (GPU-bundled) machines:
# If you pick an A2 type like "a2-highgpu-1g", it already includes 1 A100 GPU.
# In that case, you can omit accelerator_type/accelerator_count entirely.
# Example:
# MACHINE = "a2-highgpu-1g"
# (and then remove the accelerator_* kwargs in job.run)

print(
    "RUN_ID =", RUN_ID,
    "\nARTIFACT_DIR =", ARTIFACT_DIR,
    "\nIMAGE =", IMAGE,
    "\nMACHINE =", MACHINE,
    "\nACCELERATOR_TYPE =", ACCELERATOR_TYPE,
    "\nACCELERATOR_COUNT =", ACCELERATOR_COUNT,
)

DISPLAY_NAME = f"{LAST_NAME}_pytorch_nn_{RUN_ID}"

job = aiplatform.CustomTrainingJob(
    display_name=DISPLAY_NAME,
    script_path="Intro_GCP_for_ML/scripts/train_nn.py",  # Your PyTorch trainer
    container_uri=IMAGE,  # Must be a *training* image (not prediction)
)

job.run(
    args=[
        f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
        f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
        f"--epochs={MAX_EPOCHS}",
        f"--learning_rate={LR}",
        f"--patience={PATIENCE}",
    ],
    replica_count=1,                 # One worker (simple, cheaper)
    machine_type=MACHINE,            # CPU/RAM shape of the VM (no GPU implied)
    accelerator_type=ACCELERATOR_TYPE,   # Attaches the selected GPU model
    accelerator_count=ACCELERATOR_COUNT, # Number of GPUs to attach
    base_output_dir=ARTIFACT_DIR,    # Sets AIP_MODEL_DIR used by your script for all artifacts
    sync=True,                       # Waits for job to finish so you can inspect outputs immediately
)

print("Artifacts folder:", ARTIFACT_DIR)

PYTHON

import sys, torch, numpy as np
sys.path.append("/home/jupyter/Intro_GCP_for_ML/scripts")
from train_nn import TitanicNet

# -----------------
# download model.pt straight into memory and load weights
# -----------------

ARTIFACT_PREFIX = f"artifacts/pytorch/{RUN_ID}/model"

MODEL_PATH = f"{ARTIFACT_PREFIX}/model.pt"
model_blob = bucket.blob(MODEL_PATH)
model_bytes = model_blob.download_as_bytes()

# load from bytes
model_pt = io.BytesIO(model_bytes)

# rebuild model and load weights
state = torch.load(model_pt, map_location="cpu")
m = TitanicNet()
m.load_state_dict(state)
m.eval(); # set model to eval mode

# -----------------
# ALT: download copy of model into VM (costs extra storage)
# -----------------
# # Copy model.pt from GCS (replace RUN_ID with your run folder)
# !gsutil cp {ARTIFACT_DIR}/model/model.pt /home/jupyter/model_vertex.pt
# !ls
# # rebuild model and load weights
# m = TitanicNet()
# state = torch.load("/home/jupyter/model_vertex.pt", map_location="cpu")  
# m.load_state_dict(state)
# m.eval()

PYTHON

# get predictions (reuses X_val_t and y_val_t from the earlier evaluation cells)
with torch.no_grad():
    probs = m(X_val_t).squeeze(1)         # [N], sigmoid outputs in (0,1)
    preds_t = (probs >= 0.5).long()       # threshold at 0.5 -> class label 0/1
    correct = (preds_t == y_val_t).sum().item()
    acc = correct / y_val_t.shape[0]

print(f"Vertex model val accuracy: {acc:.4f}")

GPU tips:

  • On small problems, GPU startup/transfer overhead can erase speedups; benchmark before you scale.
  • Stick to a single replica unless your batch sizes and dataset really warrant data parallelism.

Distributed training (when to consider)


  • Data parallelism (DDP) helps when a single GPU is saturated by batch size/throughput. For most workshop‑scale models, a single machine/GPU is simpler and cheaper.
  • Model parallelism is for very large networks that don’t fit on one device—overkill for this lesson.

Additional resources


To learn more about PyTorch and Vertex AI integrations, visit the docs: docs.cloud.google.com/vertex-ai/docs/start/pytorch

Key Points
  • Use CustomTrainingJob with a prebuilt PyTorch container; let your script control outputs via --model_out.
  • Keep artifacts together (model, metrics, history, log) in one folder for reproducibility.
  • .npz speeds up loading and plays nicely with PyTorch.
  • Start on CPU for small datasets; use GPU only when profiling shows a clear win.
  • Set base_output_dir to control where artifacts land (it defines AIP_MODEL_DIR for your script); the staging bucket is only for the SDK's packaged-code tarball.

Content from Hyperparameter Tuning in Vertex AI: Neural Network Example


Last updated on 2025-10-30 | Edit this page

Overview

Questions

  • How can we efficiently manage hyperparameter tuning in Vertex AI?
  • How can we parallelize tuning jobs to optimize time without increasing costs?

Objectives

  • Set up and run a hyperparameter tuning job in Vertex AI.
  • Define search spaces for ContinuousParameter and CategoricalParameter.
  • Log and capture objective metrics for evaluating tuning success.
  • Optimize tuning setup to balance cost and efficiency, including parallelization.

To conduct efficient hyperparameter tuning with neural networks (or any model) in Vertex AI, we’ll use Vertex AI’s Hyperparameter Tuning Jobs. The key is defining a clear search space, ensuring metrics are properly logged, and keeping costs manageable by controlling the number of trials and level of parallelization.

Key steps for hyperparameter tuning

The overall process involves these steps:

  1. Prepare the training script and ensure metrics are logged.
  2. Define the hyperparameter search space.
  3. Configure a hyperparameter tuning job in Vertex AI.
  4. Set data paths and launch the tuning job.
  5. Monitor progress in the Vertex AI Console.
  6. Extract the best model and inspect recorded metrics.

0. Initial setup

Navigate to /Intro_GCP_for_ML/notebooks/08-Hyperparameter-tuning.ipynb to begin this notebook, and select the PyTorch environment (kernel). A local PyTorch install is only needed for local tests: your Vertex AI job uses the container specified by container_uri (e.g., pytorch-cpu.2-1 or pytorch-gpu.2-1), so it brings its own framework at run time.

Change to your Jupyter home folder to keep paths consistent.

PYTHON

%cd /home/jupyter/

1. Prepare training script with metric logging

Your training script (train_nn.py) should report validation metrics in a way Vertex AI can track during hyperparameter tuning.
Near the top of the script, add the following setup so metric reporting degrades gracefully when the cloudml-hypertune package isn't installed (e.g., on local runs):

PYTHON

# Enable Vertex HPT metric reporting
try:
    # cloudml-hypertune is installed on Vertex AI workers; on local it might not be.
    from hypertune import HyperTune
    _hpt = HyperTune()          # instance scoped to this run
    _hpt_enabled = True
except Exception:
    # If the package isn't available (e.g., local run), we simply no-op.
    _hpt = None
    _hpt_enabled = False

With this guard in place, the script can call the HyperTune reporting API whenever it runs on Vertex, so Vertex AI records the validation metric for each trial and can rank configurations by performance automatically.
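Then, inside the epoch loop, the script reports the metric right after computing val_acc. Here is a sketch using the cloudml-hypertune API (the helper function is our own wrapper; `hpt` and `enabled` correspond to the _hpt / _hpt_enabled variables set up above, and the metric tag must match the metric_spec key):

```python
# Call this right after computing val_acc each epoch.
def report_metric(hpt, enabled, val_acc, epoch):
    if enabled:
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag="validation_accuracy",  # must match metric_spec
            metric_value=float(val_acc),
            global_step=epoch,
        )

# On local runs (hpt unavailable), this is a harmless no-op:
report_metric(None, False, 0.83, epoch=10)
```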

2. Define hyperparameter search space

This step defines which parameters Vertex AI will vary across trials and their allowed ranges. The number of total settings tested is determined later using max_trial_count.

Vertex AI uses Bayesian optimization by default (internally listed as "ALGORITHM_UNSPECIFIED" in the API). That means if you don’t explicitly specify a search algorithm, Vertex AI automatically applies an adaptive Bayesian strategy to balance exploration (trying new areas of the parameter space) and exploitation (focusing near the best results so far). Each completed trial helps the tuner model how your objective metric (for example, validation_accuracy) changes across parameter values. Subsequent trials then sample new parameter combinations that are statistically more likely to improve performance, which usually yields better results than random or grid search—especially when max_trial_count is limited.

Include early-stopping parameters so the tuner can learn good stopping behavior for your dataset:

PYTHON

from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

parameter_spec = {
    "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-2, scale="log"),
    "patience": hpt.IntegerParameterSpec(min=5, max=20, scale="linear"),
    "min_delta": hpt.DoubleParameterSpec(min=1e-6, max=1e-3, scale="log"),
}

3. Initialize Vertex AI, project, and bucket

Initialize the Vertex AI SDK and set your staging and artifact locations in GCS.

PYTHON

from google.cloud import aiplatform, storage
import datetime as dt

client = storage.Client()
PROJECT_ID = client.project
REGION = "us-central1"
LAST_NAME = "DOE"  # change to your name or unique ID
BUCKET_NAME = "sinkorswim-johndoe-titanic"  # replace with your bucket name

aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=f"gs://{BUCKET_NAME}/.vertex_staging",
)

4. Define runtime configuration

Create a unique run ID and set the container, machine type, and base output directory for artifacts.

PYTHON

RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/pytorch_hpt/{RUN_ID}"

IMAGE = "us-docker.pkg.dev/vertex-ai/training/pytorch-xla.2-4.py310:latest"  # CPU example
MACHINE = "n1-standard-4"
ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"
ACCELERATOR_COUNT = 0

5. Configure hyperparameter tuning job

When you use Vertex AI Hyperparameter Tuning Jobs, each trial needs a complete, runnable training configuration: the script, its arguments, the container image, and the compute environment.
Rather than defining these pieces inline each time, we create a CustomJob to hold that configuration.

The CustomJob acts as the blueprint for running a single training task — specifying exactly what to run and on what resources. The tuner then reuses that job definition across all trials, automatically substituting in new hyperparameter values for each run.

This approach has a few practical advantages:

  • You only define the environment once — machine type, accelerators, and output directories are all reused across trials.
  • The tuner can safely inject trial-specific parameters (those declared in parameter_spec) while leaving other arguments unchanged.
  • It provides a clean separation between what a single job does (CustomJob) and how many times to repeat it with new settings (HyperparameterTuningJob).
  • It avoids the extra abstraction layers of higher-level wrappers like CustomTrainingJob, which automatically package code and environments. Using CustomJob.from_local_script keeps the workflow predictable and explicit.

In short:
CustomJob defines how to run one training run.
HyperparameterTuningJob defines how to repeat it with different parameter sets and track results.

The number of total runs is set by max_trial_count, and the number of simultaneous runs is controlled by parallel_trial_count. Each trial’s output and metrics are logged under the GCS base_output_dir. ALWAYS START WITH 1 trial before scaling up max_trial_count.

PYTHON

# metric_spec = {"validation_loss": "minimize"} - also stored by train_nn.py
metric_spec = {"validation_accuracy": "maximize"}

custom_job = aiplatform.CustomJob.from_local_script(
    display_name=f"{LAST_NAME}_pytorch_hpt-trial_{RUN_ID}",
    script_path="/home/jupyter/Intro_GCP_for_ML/scripts/train_nn.py",
    container_uri=IMAGE,
    requirements=["python-json-logger>=2.0.7"],
    args=[
        f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
        f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
        "--learning_rate=0.001",        # HPT will override when sampling
        "--patience=10",                # HPT will override when sampling
    ],
    base_output_dir=ARTIFACT_DIR,
    machine_type=MACHINE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
)

DISPLAY_NAME = f"{LAST_NAME}_pytorch_hpt_{RUN_ID}"

# ALWAYS START WITH 1 trial before scaling up `max_trial_count`
tuning_job = aiplatform.HyperparameterTuningJob(
    display_name=DISPLAY_NAME,
    custom_job=custom_job,                 # must be a CustomJob (not CustomTrainingJob)
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=1,                    # controls how many configurations are tested
    parallel_trial_count=1,                # how many run concurrently (keep small for adaptive search)
    # search_algorithm="ALGORITHM_UNSPECIFIED",  # default = adaptive search (Bayesian)
    # search_algorithm="RANDOM_SEARCH",          # optional override
    # search_algorithm="GRID_SEARCH",            # optional override
)

tuning_job.run(sync=True)
print("HPT artifacts base:", ARTIFACT_DIR)

6. Monitor tuning job

Open Vertex AI → Training → Hyperparameter tuning jobs to track trials, parameters, and metrics. You can also stop jobs from the console if needed. For MLM25, the following link should work: https://console.cloud.google.com/vertex-ai/training/hyperparameter-tuning-jobs?hl=en&project=doit-rci-mlm25-4626.

7. Inspect best trial results

After completion, look up the best configuration and objective value from the SDK:

PYTHON

# Trials are listed in creation order, not ranked; pick the best by the objective metric
best_trial = max(
    tuning_job.trials,
    key=lambda t: t.final_measurement.metrics[0].value,
)
print("Best hyperparameters:", best_trial.parameters)
print("Best validation_accuracy:", best_trial.final_measurement.metrics)

8. Review recorded metrics in GCS

Your script writes a metrics.json (with keys such as final_val_accuracy, final_val_loss) to each trial’s output directory (under ARTIFACT_DIR). The snippet below aggregates those into a dataframe for side-by-side comparison.

PYTHON

from google.cloud import storage
import json, pandas as pd

def list_metrics_from_gcs(ARTIFACT_DIR: str):
    client = storage.Client()
    bucket_name = ARTIFACT_DIR.replace("gs://", "").split("/")[0]
    prefix = "/".join(ARTIFACT_DIR.replace("gs://", "").split("/")[1:])
    blobs = client.list_blobs(bucket_name, prefix=prefix)

    records = []
    for blob in blobs:
        if blob.name.endswith("metrics.json"):
            trial_id = blob.name.split("/")[-2]
            data = json.loads(blob.download_as_text())
            data["trial_id"] = trial_id
            records.append(data)
    return pd.DataFrame(records)

df = list_metrics_from_gcs(ARTIFACT_DIR)
print(df[["trial_id","final_val_accuracy","final_val_loss","best_val_loss","best_epoch","patience","min_delta","learning_rate"]].sort_values("final_val_accuracy", ascending=False))

Discussion

What is the effect of parallelism in tuning?

  • How might running 10 trials in parallel differ from running 2 at a time in terms of cost, time, and result quality?
  • When would you want to prioritize speed over adaptive search benefits?

Cost:
- If you run the same total number of trials, total cost is roughly unchanged; you’re paying for the same amount of compute, just compressed into a shorter wall-clock window.
- Parallelism can raise short-term spend rate (more machines running at once) and may increase idle/overhead if trials start/finish unevenly.

Time:
- Higher parallel_trial_count reduces wall-clock time almost linearly until you hit queue, quota, or data/IO bottlenecks.
- Startup overhead (image pull, environment setup) is paid for each concurrent trial; with many short trials, this overhead can become a larger fraction of runtime.

Result quality (adaptive search):
- Vertex AI’s adaptive search benefits from learning from early trials.
- With many trials in flight simultaneously, the tuner can’t incorporate results quickly, so it explores “blind” for longer. This often yields slightly worse final results for a fixed max_trial_count.
- With modest parallelism (e.g., 2–4), the tuner can still update beliefs and exploit promising regions sooner.

Guidelines:
- Start small: parallel_trial_count in the range 2–4 is a good default.
- Keep parallelism to ≤ 25–33% of max_trial_count when you care about adaptive quality.
- Increase parallelism when your trials are long and you’re confident the search space is well-bounded (less need for rapid adaptation).
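The 25–33% rule of thumb is easy to encode as a small helper (our own convenience function, not part of any SDK):

```python
def suggested_parallelism(max_trials: int, frac: float = 0.25) -> int:
    """Cap parallel_trial_count at ~25% of max_trial_count (and at 4), at least 1."""
    return max(1, min(4, int(max_trials * frac)))

print(suggested_parallelism(1))    # 1  - sanity-check run
print(suggested_parallelism(12))   # 3
print(suggested_parallelism(40))   # 4  - capped so adaptive search can still learn
```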

When to prioritize speed (higher parallelism):
- Strict deadlines or demo timelines.
- Very cheap/short trials where startup time dominates.
- You’re using a non-adaptive or nearly random search space.
- You have unused quota/credits and want faster iteration.

When to prioritize adaptive quality (lower parallelism):
- Trials are expensive, noisy, or have high variance; learning from early wins saves budget.
- Small max_trial_count (e.g., ≤ 10–20).
- Early stopping is enabled and you want the tuner to exploit promising regions quickly.
- You’re adding new dimensions (e.g., LR + patience + min_delta) and want the search to refine intelligently.

Practical recipe:
- First run: max_trial_count=1, parallel_trial_count=1 (pipeline sanity check).
- Main run: max_trial_count=10–20, parallel_trial_count=2–4.
- Scale up parallelism only after the above completes cleanly and you confirm adaptive performance is acceptable.

Key Points
  • Vertex AI Hyperparameter Tuning Jobs efficiently explore parameter spaces using adaptive strategies.
  • Define parameter ranges in parameter_spec; the number of settings tried is controlled later by max_trial_count.
  • Keep the printed metric name consistent with metric_spec (here: validation_accuracy).
  • Limit parallel_trial_count (2–4) to help adaptive search.
  • Use GCS for input/output and aggregate metrics.json across trials for detailed analysis.

Content from Resource Management & Monitoring on Vertex AI (GCP)


Last updated on 2025-10-24 | Edit this page

Overview

Questions

  • How do I monitor and control Vertex AI, Workbench, and GCS costs day‑to‑day?
  • What specifically should I stop, delete, or schedule to avoid surprise charges?
  • How can I automate cleanup and set alerting so leaks get caught quickly?

Objectives

  • Identify all major cost drivers across Vertex AI (training jobs, endpoints, Workbench notebooks, batch prediction) and GCS.
  • Practice safe cleanup for Managed and User‑Managed Workbench notebooks, training/tuning jobs, batch predictions, models, endpoints, and artifacts.
  • Configure budgets, labels, and basic lifecycle policies to keep costs predictable.
  • Use gcloud/gsutil commands for auditing and rapid cleanup; understand when to prefer the Console.
  • Draft simple automation patterns (Cloud Scheduler + gcloud) to enforce idle shutdown.

What costs you money on GCP (quick map)


  • Vertex AI training jobs (Custom Jobs, Hyperparameter Tuning Jobs) — billed per VM/GPU hour while running.
  • Vertex AI endpoints (online prediction) — billed per node‑hour 24/7 while deployed, even if idle.
  • Vertex AI batch prediction jobs — billed for the job’s compute while running.
  • Vertex AI Workbench notebooks — the backing VM and disk bill while running (and disks bill even when stopped).
  • GCS buckets — storage class, object count/size, versioning, egress, and request ops.
  • Artifact Registry (containers, models) — storage for images and large artifacts.
  • Network egress — downloading data out of GCP (e.g., to your laptop) incurs cost.
  • Logging/Monitoring — high‑volume logs/metrics can add up (rare in small workshops, real in prod).

Rule of thumb: Endpoints left deployed and notebooks left running are the most common surprise bills in education/research settings.

A daily “shutdown checklist” (use now, automate later)


  1. Workbench notebooks — stop the runtime/instance when you’re done.
  2. Custom/HPT jobs — confirm no jobs stuck in RUNNING.
  3. Endpoints — undeploy models and delete unused endpoints.
  4. Batch predictions — ensure no jobs queued or running.
  5. Artifacts — delete large intermediate artifacts you won’t reuse.
  6. GCS — keep only one “source of truth”; avoid duplicate datasets in multiple buckets/regions.

Shutting down Vertex AI Workbench notebooks


Vertex AI has two notebook flavors; follow the matching steps:

Managed Notebooks

  • Console: Vertex AI → Workbench → Managed notebooks → select runtime → Stop.

  • Idle shutdown: Edit runtime → enable Idle shutdown (e.g., 60–120 min).

  • CLI:

    BASH

    # List managed runtimes (adjust region)
    gcloud notebooks runtimes list --location=us-central1
    # Stop a runtime
    gcloud notebooks runtimes stop RUNTIME_NAME --location=us-central1

User‑Managed Notebooks

  • Console: Vertex AI → Workbench → User‑managed notebooks → select instance → Stop.

  • CLI:

    BASH

    # List user-managed instances (adjust zone)
    gcloud notebooks instances list --location=us-central1-b
    # Stop an instance
    gcloud notebooks instances stop INSTANCE_NAME --location=us-central1-b

Disks still cost money while the VM is stopped. Delete old runtimes/instances and their disks if you’re done with them.

Cleaning up training, tuning, and batch jobs


Audit with CLI

BASH

# Custom training jobs
gcloud ai custom-jobs list --region=us-central1
# Hyperparameter tuning jobs
gcloud ai hp-tuning-jobs list --region=us-central1
# Batch prediction jobs
gcloud ai batch-prediction-jobs list --region=us-central1

Stop/delete as needed

BASH

# Example: cancel a custom job
gcloud ai custom-jobs cancel JOB_ID --region=us-central1
# Delete a completed job you no longer need to retain
gcloud ai custom-jobs delete JOB_ID --region=us-central1

Tip: Keep one “golden” successful job per experiment, then remove the rest to reduce console clutter and artifact storage.

Undeploy models and delete endpoints (major cost pitfall)


Find endpoints and deployed models

BASH

gcloud ai endpoints list --region=us-central1
gcloud ai endpoints describe ENDPOINT_ID --region=us-central1

Undeploy and delete

BASH

# Undeploy the model from the endpoint (stops node-hour charges)
# Undeploy the model from the endpoint (stops node-hour charges)
gcloud ai endpoints undeploy-model ENDPOINT_ID \
  --deployed-model-id=DEPLOYED_MODEL_ID \
  --region=us-central1 \
  --quiet

# Delete the endpoint if you no longer need it
gcloud ai endpoints delete ENDPOINT_ID --region=us-central1 --quiet

Model Registry: If you keep models registered but don’t serve them, you won’t pay endpoint node‑hours. Periodically prune stale model versions to reduce storage.

GCS housekeeping (lifecycle policies, versioning, egress)


Quick size & contents

BASH

# Human-readable bucket size
gsutil du -sh gs://YOUR_BUCKET
# List recursively
gsutil ls -r gs://YOUR_BUCKET/** | head -n 50

Lifecycle policy example

Keep workshop artifacts tidy by auto‑deleting temporary outputs and capping old versions.

  1. Save as lifecycle.json:

JSON

{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 7, "matchesPrefix": ["tmp/"]}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"numNewerVersions": 3}
    }
  ]
}
  2. Apply to bucket:

BASH

gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET
gsutil lifecycle get gs://YOUR_BUCKET
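If you prefer to keep the policy in code, it can be generated with the standard library rather than hand-edited. This sketch writes the same two rules shown above to `lifecycle.json`, ready for `gsutil lifecycle set`:

```python
import json

# Build the same lifecycle policy shown above programmatically and
# write it to lifecycle.json for use with `gsutil lifecycle set`.
policy = {
    "rule": [
        {
            "action": {"type": "Delete"},
            "condition": {"age": 7, "matchesPrefix": ["tmp/"]},
        },
        {
            "action": {"type": "Delete"},
            "condition": {"numNewerVersions": 3},
        },
    ]
}

with open("lifecycle.json", "w") as f:
    json.dump(policy, f, indent=2)

print("wrote lifecycle.json with", len(policy["rule"]), "rules")
```

Generating the file this way makes it easy to template the prefix or retention counts per bucket.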

Egress reminder

Downloading out of GCP (to local machines) incurs egress charges. Prefer in‑cloud training/evaluation and share results via GCS links.

Labels, budgets, and cost visibility


Standardize labels on all resources

Use the same labels everywhere (notebooks, jobs, buckets) so billing exports can attribute costs.

  • Examples: owner=yourname, team=ml-workshop, purpose=titanic-demo, env=dev

  • CLI examples:

    BASH

    # Add labels to a custom job on creation (Python SDK supports labels, too)
    # gcloud example when applicable:
    gcloud ai custom-jobs create --labels=owner=yourname,purpose=titanic-demo ...

Set budgets & alerts

  • In Billing → Budgets & alerts, create a budget for your project with thresholds (e.g., 50%, 80%, 100%).
  • Add forecast‑based alerts to catch trends early (e.g., projected to exceed budget).
  • Send email to multiple maintainers (not just you).

Enable billing export (optional but powerful)

  • Export billing to BigQuery to slice by service, label, or SKU.
  • Build a simple Data Studio/Looker Studio dashboard for workshop visibility.

Monitoring and alerts (catch leaks quickly)


  • Cloud Monitoring dashboards: Track notebook VM uptime, endpoint deployment counts, and job error rates.
  • Alerting policies: Trigger notifications when:
    • A Workbench runtime has been running > N hours outside workshop hours.
    • An endpoint node count > 0 for > 60 minutes after a workshop ends.
    • Spend forecast exceeds budget threshold.

Keep alerts few and actionable. Route to email or Slack (via webhook) where your team will see them.

Quotas and guardrails


  • Quotas (IAM & Admin → Quotas): cap GPU count, custom job limits, and endpoint nodes to protect budgets.
  • IAM: least privilege for service accounts used by notebooks and jobs; avoid wide Editor grants.
  • Org policies (if available): disallow costly regions/accelerators you don’t plan to use.

Automating the boring parts


Nightly auto‑stop for idle notebooks

Use Cloud Scheduler to run a daily command that stops notebooks after hours.

BASH

# Cloud Scheduler job (runs daily 22:00) to stop a specific managed runtime
gcloud scheduler jobs create http stop-runtime-job \
  --schedule="0 22 * * *" \
  --uri="https://notebooks.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/runtimes/RUNTIME_NAME:stop" \
  --http-method=POST \
  --oidc-service-account-email=SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com

Alternative: call gcloud notebooks runtimes list in a small Cloud Run job, filter by last_active_time, and stop any runtime idle > 2h.
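The filtering half of that alternative can be sketched in plain Python. The record shape and the `last_active_time` field name here are assumptions for illustration — adapt them to whatever the Notebooks API actually returns in your Cloud Run job:

```python
from datetime import datetime, timedelta, timezone

# Given runtime records with a last-activity timestamp, return the names
# of runtimes idle longer than `max_idle`. Field names are illustrative;
# map them onto the real Notebooks API response in production.
def find_idle_runtimes(runtimes, now, max_idle=timedelta(hours=2)):
    idle = []
    for rt in runtimes:
        last_active = datetime.fromisoformat(rt["last_active_time"])
        if now - last_active > max_idle:
            idle.append(rt["name"])
    return idle

now = datetime(2025, 10, 24, 22, 0, tzinfo=timezone.utc)
runtimes = [
    {"name": "runtime-a", "last_active_time": "2025-10-24T14:00:00+00:00"},  # idle 8h
    {"name": "runtime-b", "last_active_time": "2025-10-24T21:30:00+00:00"},  # idle 30m
]
print(find_idle_runtimes(runtimes, now))  # -> ['runtime-a']
```

Each name returned would then be passed to the corresponding `stop` call.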

Weekly endpoint sweep

  • List endpoints; undeploy any with zero recent traffic (check logs/metrics), then delete stale endpoints.
  • Scriptable with gcloud ai endpoints list/describe in Cloud Run or Cloud Functions on a schedule.

Common pitfalls and quick fixes


  • Forgotten endpoints → Undeploy models; delete endpoints you don’t need.
  • Notebook left running all weekend → Enable Idle shutdown; schedule nightly stop.
  • Duplicate datasets across buckets/regions → consolidate; set lifecycle to purge tmp/.
  • Too many parallel HPT trials → cap parallel_trial_count (2–4) and increase max_trial_count gradually.
  • Orphaned artifacts in Artifact Registry/GCS → prune old images/artifacts after promoting a single “golden” run.
Challenge

Challenge 1 — Find and stop idle notebooks

List your notebooks and identify any runtime/instance that has likely been idle for >2 hours. Stop it via CLI.

Hints: gcloud notebooks runtimes list, gcloud notebooks instances list, ... stop

Use gcloud notebooks runtimes list --location=REGION (Managed) or gcloud notebooks instances list --location=ZONE (User‑Managed) to find candidates, then stop them with the corresponding ... stop command.

Challenge

Challenge 2 — Write a lifecycle policy

Create and apply a lifecycle rule that (a) deletes objects under tmp/ after 7 days, and (b) retains only 3 versions of any object.

Hint: gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET

Use the JSON policy shown above, then run gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET and verify with gsutil lifecycle get ....

Challenge

Challenge 3 — Endpoint sweep

List deployed endpoints in your region, undeploy any model you don’t need, and delete the endpoint if it’s no longer required.

Hints: gcloud ai endpoints list, ... describe, ... undeploy-model, ... delete

gcloud ai endpoints list --region=REGION → pick ENDPOINT_ID → gcloud ai endpoints undeploy-model ENDPOINT_ID --deployed-model-id=DEPLOYED_MODEL_ID --region=REGION --quiet → if not needed, gcloud ai endpoints delete ENDPOINT_ID --region=REGION --quiet.

Key Points
  • Endpoints and running notebooks are the most common cost leaks; undeploy/stop first.
  • Prefer Managed Notebooks with Idle shutdown; schedule nightly auto‑stop.
  • Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
  • Standardize labels, set budgets, and enable billing export for visibility.
  • Use gcloud/gsutil to audit and clean quickly; automate with Scheduler + Cloud Run/Functions.

Content from Retrieval-Augmented Generation (RAG) with Vertex AI


Last updated on 2025-10-30 | Edit this page

Overview

Questions

  • How do we go from “a pile of PDFs” to “ask a question and get a cited answer” using Google Cloud tools?
  • What are the key parts of a RAG system (chunking, embedding, retrieval, generation), and how do they map onto Vertex AI services?
  • How much does each part of this pipeline cost (VM time, embeddings, LLM calls), and where can we keep it cheap?
  • Can we use open models / Hugging Face instead of Google models, and what does that change?

Objectives

  • Unpack the core RAG pipeline: ingest → chunk → embed → retrieve → answer.
  • Run a minimal, fully programmatic RAG loop on a Vertex AI Workbench VM using Google’s own foundation models (for embeddings + generation).
  • Understand how to substitute open-source / Hugging Face models if you want to avoid managed API costs.
  • Answer questions using content from provided papers and return citations instead of vibes.

Overview: What we’re building


Retrieval-Augmented Generation (RAG) is a pattern:

  1. You ask a question.
  2. The system retrieves relevant passages from your PDFs or data.
  3. An LLM answers using those passages only, with citations.

This approach powers sustainability-related projects like WattBot, which extracts AI water and energy metrics from research papers.

Cost mindset:
- VM cost: pay for Workbench instance uptime. Stop when not in use.
- Embedding cost: pay per character embedded — only once per doc.
- Generation cost: pay per token for input + output. Shorter prompts = cheaper.
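To make that cost mindset concrete, here is a back-of-envelope estimator. The per-token prices and the chars-per-token ratio are placeholder assumptions — check current Vertex AI pricing before relying on the numbers:

```python
# Rough cost estimator for a small RAG corpus. Prices are placeholder
# assumptions; look up current Vertex AI pricing before trusting them.
PRICE_PER_M_EMBED_TOKENS = 0.10   # $ per 1M tokens embedded (assumed)
PRICE_PER_M_GEN_TOKENS = 0.25     # $ per 1M input+output tokens (assumed)
CHARS_PER_TOKEN = 4               # common rule of thumb for English text

def embedding_cost(total_chars):
    """One-time cost to embed the whole corpus."""
    tokens = total_chars / CHARS_PER_TOKEN
    return tokens / 1e6 * PRICE_PER_M_EMBED_TOKENS

def generation_cost(prompt_tokens, output_tokens, n_queries):
    """Recurring cost: every question pays for its prompt + answer."""
    total_tokens = (prompt_tokens + output_tokens) * n_queries
    return total_tokens / 1e6 * PRICE_PER_M_GEN_TOKENS

# Example: 200 chunks of 1,200 chars; 50 questions with ~6k-token prompts
print(f"embedding: ${embedding_cost(200 * 1200):.4f}")
print(f"generation: ${generation_cost(6000, 500, 50):.4f}")
```

Note how generation dominates once you ask many questions, while embedding is paid once — which is why caching embeddings matters.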

Hugging Face alternatives:
You can replace Google-managed APIs with open models such as:
- Embeddings: sentence-transformers/all-MiniLM-L6-v2, BAAI/bge-large-en-v1.5
- Generators: google/gemma-2b-it, mistralai/Mistral-7B-Instruct, or tiiuae/falcon-7b-instruct
However, this requires a GPU or large CPU VM (e.g., n1-standard-8 + T4) and manual model management. Rather than use a very expensive machine and GPU in Workbench, you can launch custom jobs that perform the embedding and generation steps. Start with a PyTorch image and add HuggingFace as a requirement.

Step 1: Setup environment


PYTHON

!pip install --quiet --upgrade pypdf google-genai scikit-learn

Cost note: Installing packages is free; you’re only billed for VM runtime.

Initialize project

PYTHON

from google.cloud import aiplatform
from vertexai import init as vertexai_init
import os

PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "<YOUR_PROJECT_ID>")
REGION = "us-central1"

aiplatform.init(project=PROJECT_ID, location=REGION)
vertexai_init(project=PROJECT_ID, location=REGION)
print("Initialized:", PROJECT_ID, REGION)

Step 2: Extract and chunk PDFs


PYTHON

import zipfile, pathlib, re, pandas as pd
from pypdf import PdfReader

ZIP_PATH = pathlib.Path("/home/jupyter/Intro_GCP_for_ML/data/pdfs_bundle.zip")
DOC_DIR = pathlib.Path("/home/jupyter/docs")
DOC_DIR.mkdir(exist_ok=True)

# unzip
with zipfile.ZipFile(ZIP_PATH, "r") as zf:
    zf.extractall(DOC_DIR)

def chunk_text(text, max_chars=1200, overlap=150):
    for i in range(0, len(text), max_chars - overlap):
        yield text[i:i+max_chars]

rows = []
for pdf in DOC_DIR.glob("*.pdf"):
    txt = ""
    for page in PdfReader(str(pdf)).pages:
        txt += page.extract_text() or ""
    for i, chunk in enumerate(chunk_text(re.sub(r"\s+", " ", txt))):
        rows.append({"doc": pdf.name, "chunk_id": i, "text": chunk})

corpus_df = pd.DataFrame(rows)
print(len(corpus_df), "chunks created")

Cost note: Only VM runtime applies. Chunk size affects future embedding cost.

Step 3: Embed text using Vertex AI


Choosing an embedding and generator model

Vertex AI currently offers multiple managed embedding models under the Text Embeddings API family (for example, text-embedding-004, a general-purpose model optimized for semantic similarity, retrieval, and clustering). The code in this exercise uses gemini-embedding-001, which also lets you choose the output dimensionality; pick one dimensionality and use it consistently for both the corpus and your queries.

Why this kind of model?
- Produces dense vectors suitable for cosine or dot-product similarity.
- Handles long passages (up to ~8,000 tokens) and multilingual content.
- Tuned for retrieval tasks like RAG, document search, and clustering.
- Cost-efficient for classroom-scale workloads (fractions of a cent per document).

If you’d like to explore other options:
- Open the Vertex AI Model Garden → Text Embeddings in your GCP console.
- You’ll find specialized alternatives such as:
  - text-embedding-005 (experimental) – larger model, higher precision on longer documents.
  - multimodal-embedding-001 – supports image + text embeddings for richer use cases.
  - Third-party embeddings (via Model Garden) – e.g., bge-large-en, cohere-embed-v3, all-MiniLM.

PYTHON

#############################################
# 1. Imports and client setup
#############################################

from google import genai
from google.genai.types import HttpOptions, EmbedContentConfig, GenerateContentConfig
import numpy as np
from sklearn.neighbors import NearestNeighbors

# We'll assume you already have:
#   corpus_df  -> pandas DataFrame with columns: 'text', 'doc', 'chunk_id'
# If not, you'll need to define/load that before running this cell.


#############################################
# 2. Initialize the Gen AI client
#############################################

# vertexai=True = bill/govern in your GCP project instead of the public endpoint
client = genai.Client(
    http_options=HttpOptions(api_version="v1"),
    vertexai=True,
    project=PROJECT_ID,   # set in Step 1; replace if running this cell standalone
    location=REGION,
)

# Generation model for answering questions
GENERATION_MODEL_ID = "gemini-2.5-pro"        # or "gemini-2.5-flash" for cheaper/faster

# Embedding model for retrieval
EMBED_MODEL_ID = "gemini-embedding-001"

# Pick an embedding dimensionality and stick to it across corpus + queries.
EMBED_DIM = 1536  # valid typical choices: 768, 1536, 3072

PYTHON

#############################################
# 3. Helper: get embeddings for a list of texts
#############################################

def embed_texts(text_list, batch_size=32, dims=EMBED_DIM):
    """
    Convert a list of text strings into embedding vectors using gemini-embedding-001.
    Returns a NumPy array of shape (len(text_list), dims).
    """
    vectors = []

    # batch to avoid huge single requests
    for start in range(0, len(text_list), batch_size):
        batch = text_list[start:start+batch_size]

        resp = client.models.embed_content(
            model=EMBED_MODEL_ID,
            contents=batch,
            config=EmbedContentConfig(
                task_type="RETRIEVAL_DOCUMENT",   # optimize embeddings for retrieval/use as chunks
                output_dimensionality=dims,       # must match EMBED_DIM everywhere
            ),
        )

        # resp.embeddings is aligned with 'batch'
        for emb in resp.embeddings:
            vectors.append(emb.values)

    return np.array(vectors, dtype="float32")

PYTHON

#############################################
# 4. Embed the corpus and build the NN index
#############################################

# Create embeddings for every text chunk in the corpus
emb_matrix = embed_texts(corpus_df["text"].tolist(), dims=EMBED_DIM)
print("emb_matrix shape:", emb_matrix.shape)   # (num_chunks, EMBED_DIM)

# Fit NearestNeighbors on those embeddings once
nn = NearestNeighbors(
    metric="cosine",   # cosine distance is standard for semantic similarity
    n_neighbors=5,     # default neighborhood size; can override at query time
)
nn.fit(emb_matrix)


#############################################
# 5. Retrieval: given a query string, get top-k relevant chunks
#############################################

def retrieve(query, k=5):
    """
    Embed the user query with the SAME embedding model/dim,
    then find the top-k most similar corpus chunks.
    Returns a DataFrame of the top matches with a 'similarity' column.
    """

    # Embed the query into the same dimension space as emb_matrix.
    # (Note: embed_texts uses task_type="RETRIEVAL_DOCUMENT"; for queries,
    # task_type="RETRIEVAL_QUERY" is the recommended counterpart.)
    query_vec = embed_texts([query], dims=EMBED_DIM)[0]   # shape (EMBED_DIM,)

    # Find nearest neighbors using cosine distance
    distances, indices = nn.kneighbors([query_vec], n_neighbors=k, return_distance=True)

    # Grab those rows from the original corpus
    result_df = corpus_df.iloc[indices[0]].copy()

    # Convert cosine distance -> cosine similarity (1 - distance)
    result_df["similarity"] = 1 - distances[0]

    # Sort by similarity descending (highest similarity first)
    result_df = result_df.sort_values("similarity", ascending=False)

    return result_df

PYTHON

#############################################
# 6. ask(): build grounded prompt + call Gemini to answer
#############################################

def ask(query, top_k=5, temperature=0.2):
    """
    Retrieval-Augmented Generation:
    - retrieve context chunks relevant to `query`
    - stuff those chunks into a prompt
    - ask Gemini to answer ONLY using that context
    """

    # Get top_k most relevant text chunks
    hits = retrieve(query, k=top_k)

    # Build a context block with provenance tags like [doc#chunk-id]
    context_lines = [
        f"[{row.doc}#chunk-{row.chunk_id}] {row.text}"
        for _, row in hits.iterrows()
    ]
    context_block = "\n\n".join(context_lines)

    # Instruction prompt for the model
    prompt = (
        "You are a sustainability analyst. "
        "Use only the following context to answer the question.\n\n"
        f"{context_block}\n\n"
        f"Q: {query}\n"
        "A:"
    )

    # Call the generative model
    response = client.models.generate_content(
        model=GENERATION_MODEL_ID,
        contents=prompt,
        config=GenerateContentConfig(
            temperature=temperature,  # lower = more deterministic, factual
        ),
    )

    # Return the model's answer text
    return response.text

Step 4: Generate answers using Gemini


PYTHON


#############################################
# 7. Test the pipeline end-to-end
#############################################

print(
    ask(
        "What is the name of the benchmark suite presented in a recent paper "
        "for measuring inference energy consumption?"
    )
)
# Expected answer: "ML.ENERGY Benchmark"

Step 5: Cost summary


| Step | Resource | Example Component | Cost Driver | Typical Range |
|------|----------|-------------------|-------------|---------------|
| VM runtime | Vertex AI Workbench | n1-standard-4 | Uptime (hourly) | ~$0.20/hr |
| Embeddings | text-embedding-004 | Managed API | Tokens embedded | ~$0.10 / 1M tokens |
| Retrieval | Local NN | CPU only | None | Free |
| Generation | gemini-2.5-flash-001 | Managed API | Input/output tokens | ~$0.25 / 1M tokens |
| Hugging Face alt | T4 VM | Local model inference | GPU uptime | ~$0.35/hr + egress |

(Optional) Hugging Face local substitution


To avoid managed API costs, you can instead use Hugging Face models.

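As a sketch, a local drop-in replacement for the managed `embed_texts` might look like the following. The `sentence-transformers` lines are commented out so the cell runs offline; the stand-in embedder below them is a deterministic dummy (not semantic!) used only to demonstrate that the retrieval math is unchanged. Model names are suggestions, not requirements:

```python
import numpy as np

# Sketch: swapping the managed embedding API for a local Hugging Face model.
# Uncomment after `pip install sentence-transformers` (needs a decent CPU/GPU):
#
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# def embed_texts_local(texts):
#     return model.encode(texts, normalize_embeddings=True)

# Stand-in embedder so the retrieval logic is demonstrable offline:
# hash each text into a deterministic pseudo-embedding (NOT semantic).
def embed_texts_local(texts, dim=384):
    vectors = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        v = rng.standard_normal(dim)
        vectors.append(v / np.linalg.norm(v))
    return np.array(vectors, dtype="float32")

corpus = ["water usage of data centers", "energy benchmarks for inference"]
emb = embed_texts_local(corpus)
query = embed_texts_local(["water usage of data centers"])[0]
scores = emb @ query                   # cosine similarity (unit vectors)
print(corpus[int(np.argmax(scores))])  # the identical text scores highest
```

With a real model swapped in, the NearestNeighbors index and `retrieve()`/`ask()` functions from earlier work unchanged — only the embedding and generation calls move on-VM.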

Key takeaways


  • Use Vertex AI managed embeddings and Gemini Flash for lightweight, cost-controlled RAG.
  • Cache embeddings; reusing them saves most cost.
  • For open alternatives, use Hugging Face models on GPU VMs (higher cost, more control).
  • This workflow generalizes to any retrieval task — not just sustainability papers.
  • GCP’s managed tools lower barrier for experimentation while keeping enterprise security and IAM intact.
Key Points
  • Vertex AI’s RAG stack = low-op, cost-predictable.
  • Hugging Face = high control, high GPU cost.
  • Keep data local or in GCS to manage egress and compliance.
  • Always cite retrieved chunks for reproducibility and transparency.