Content from Overview of Google Cloud for Machine Learning


Last updated on 2025-09-22

Estimated time: 11 minutes

Overview

Questions

  • What problem does GCP aim to solve for ML researchers?
  • How does using a notebook as a controller help organize ML workflows in the cloud?

Objectives

  • Understand the basic role of GCP in supporting ML research.
  • Recognize how a notebook can serve as a controller for cloud resources.

Google Cloud Platform (GCP) provides the basic building blocks researchers need to run machine learning (ML) experiments at scale. Instead of working only on your laptop or a high-performance computing (HPC) cluster, you can spin up compute resources on demand, store datasets in the cloud, and run notebooks that act as a “controller” for larger training and tuning jobs.

This workshop focuses on using a simple notebook environment as the control center for your ML workflow. Rather than relying on Vertex AI’s fully managed, no-code offerings (such as AutoML), we show how to use core GCP services (Workbench notebooks, Cloud Storage buckets, and the Vertex AI SDK) so you can build and run experiments from scratch.

Why use GCP for machine learning?

GCP provides several advantages that make it a strong option for applied ML:

  • Flexible compute: You can choose the hardware that fits your workload:

    • CPUs for lightweight models, preprocessing, or feature engineering.
    • GPUs (e.g., NVIDIA T4, V100, A100) for training deep learning models.
    • High-memory machines for workloads that need large datasets in memory.
  • Data storage and access: Google Cloud Storage (GCS) buckets act like S3 on AWS — an easy way to store and share datasets between experiments and collaborators.

  • From scratch workflows: Instead of depending on a fully managed ML service, you bring your own frameworks (PyTorch, TensorFlow, scikit-learn, etc.) and run your code the same way you would on your laptop or HPC cluster, but with scalable cloud resources.

  • Cost visibility: Billing dashboards and project-level budgets make it easier to track costs and stay within research budgets.

In short, GCP provides infrastructure that you control from a notebook environment, allowing you to build and run ML workflows just as you would locally, but with access to scalable hardware and storage.

Discussion

Comparing infrastructures

Think about your current research setup:
- Do you mostly use your laptop, HPC cluster, or cloud for ML experiments?
- What benefits would running a cloud-based notebook controller give you?
- If you could offload one infrastructure challenge (e.g., installing GPU drivers, managing storage, or setting up environments), what would it be and why?

Take 3–5 minutes to discuss with a partner or share in the workshop chat.

Key Points
  • GCP provides the core building blocks (compute, storage, networking) for ML research.
  • A notebook can act as a controller to organize cloud workflows and keep experiments reproducible.
  • Using raw infrastructure instead of a fully managed platform gives researchers flexibility while still benefiting from scalable cloud resources.

Content from Data Storage: Setting up GCS


Last updated on 2025-09-22

Estimated time: 20 minutes

Overview

Questions

  • How can I store and manage data effectively in GCP for Vertex AI workflows?
  • What are the advantages of Google Cloud Storage (GCS) compared to local or VM storage for machine learning projects?

Objectives

  • Explain data storage options in GCP for machine learning projects.
  • Describe the advantages of GCS for large datasets and collaborative workflows.
  • Outline steps to set up a GCS bucket and manage data within Vertex AI.

Storing data on GCP


Machine learning and AI projects rely on data, making efficient storage and management essential. Google Cloud offers several storage options, but the most common for ML workflows are persistent disks (attached to Compute Engine VMs or Vertex AI Workbench) and Google Cloud Storage (GCS) buckets.

Consult your institution’s IT before handling sensitive data in GCP

As with AWS, do not upload restricted or sensitive data to GCP services unless explicitly approved by your institution’s IT or cloud security team. For regulated datasets (HIPAA, FERPA, proprietary), work with your institution to ensure encryption, restricted access, and compliance with policies.

Options for storage: VM Disks or GCS


What is a VM persistent disk?

A persistent disk is the storage volume attached to a Compute Engine VM or a Vertex AI Workbench notebook. It can store datasets and intermediate results, but it is tied to the lifecycle of the VM.

When to store data directly on a persistent disk

  • Useful for small, temporary datasets processed interactively.
  • Data persists if the VM is stopped, but storage costs continue as long as the disk exists.
  • Not ideal for collaboration, scaling, or long-term dataset storage.
Callout

Limitations of persistent disk storage

  • Scalability: Limited by disk size quota.
  • Sharing: Harder to share across projects or team members.
  • Cost: More expensive per GB compared to GCS for long-term storage.

What is a GCS bucket?

For most ML workflows in Vertex AI, Google Cloud Storage (GCS) buckets are recommended. A GCS bucket is a container in Google’s object storage service where you can store an essentially unlimited number of files. Data in GCS can be accessed from Vertex AI training jobs, Workbench notebooks, and other GCP services using a GCS URI (e.g., gs://your-bucket-name/your-file.csv).


To upload our Titanic dataset to a GCS bucket, we’ll follow these steps:

  1. Log in to the Google Cloud Console.
  2. Create a new bucket (or use an existing one).
  3. Upload your dataset files.
  4. Use the GCS URI to reference your data in Vertex AI workflows.

Detailed procedure

1. Sign in to Google Cloud Console
2. Navigate to Cloud Storage
  • In the search bar, type Storage.
  • Click Cloud Storage > Buckets.
3. Create a new bucket
  • Click Create bucket.
  • Provide a bucket name: Enter a globally unique name. For this workshop, we can use the following naming convention to easily locate our buckets: lastname_titanic
  • Labels (tags): Add labels to track resource usage and billing. If you’re working in a shared account, this step is mandatory. If not, it’s still recommended to help you track your own costs!
    • purpose=workshop
    • data=titanic
    • owner=lastname_firstname
  • Choose a location type: When creating a storage bucket in Google Cloud, the best practice for most machine learning workflows is to use a regional bucket in the same region as your compute resources (for example, us-central1). This setup provides the lowest latency and avoids network egress charges when training jobs read from storage, while also keeping costs predictable. A multi-region bucket, on the other hand, can make sense if your primary goal is broad availability or if collaborators in different regions need reliable access to the same data; the trade-off is higher cost and the possibility of extra egress charges when pulling data into a specific compute region. For most research projects, a regional bucket with the Standard storage class, uniform access control, and public access prevention enabled offers a good balance of performance, security, and affordability.
    • Region (cheapest, good default). For instance, us-central1 (Iowa) costs $0.020 per GB-month.
    • Multi-region (higher redundancy, more expensive).
  • Choose storage class: When creating a bucket, you’ll be asked to choose a storage class, which determines how much you pay for storing data and how often you’re allowed to access it without extra fees.
    • Standard – best for active ML workflows. Training data is read and written often, so this is the safest default.
    • Nearline / Coldline / Archive – designed for backups or rarely accessed files. These cost less per GB to store, but you pay retrieval fees if you read them during training. Not recommended for most ML projects where data access is frequent.
    • Autoclass – automatically moves objects between Standard and lower-cost classes based on activity. Useful if your usage is unpredictable, but can make cost tracking harder.
  • Choose how to control access to objects: By default, you should prevent public access to buckets used for ML projects. This ensures that only people you explicitly grant permissions to can read or write objects, which is almost always the right choice for research, hackathons, or internal collaboration. Public buckets are mainly for hosting datasets or websites that are intentionally shared with the world.
4. Upload files to the bucket
  • If you haven’t downloaded the Titanic CSV files yet, right-click each dataset link and save it as a .csv file.
  • In the bucket dashboard, click Upload Files.
  • Select your Titanic CSVs and upload.

Note the GCS URI for your data: After uploading, click on a file and find its gs:// URI (e.g., gs://yourname_titanic/titanic_train.csv). This URI will be used to access the data later.
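If you prefer to script these steps instead of clicking through the console, the same bucket creation and upload can be done with the google-cloud-storage client. This is a minimal sketch; the bucket name, labels, and file name are placeholders you should replace with your own values.

PYTHON

from google.cloud import storage

client = storage.Client()

# Create a regional Standard-class bucket with labels (name must be globally unique)
bucket = client.bucket("lastname_titanic")
bucket.storage_class = "STANDARD"
bucket.labels = {"purpose": "workshop", "data": "titanic", "owner": "lastname_firstname"}
new_bucket = client.create_bucket(bucket, location="us-central1")

# Upload a local CSV and print its GCS URI
blob = new_bucket.blob("titanic_train.csv")
blob.upload_from_filename("titanic_train.csv")
print(f"Uploaded to gs://{new_bucket.name}/{blob.name}")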

GCS bucket costs


GCS costs are based on storage class, data transfer, and operations (requests).

Storage costs

  • Standard storage (us-central1): ~$0.02 per GB per month.
  • Other classes (Nearline, Coldline, Archive) are cheaper but with retrieval costs.

Data transfer costs explained

  • Uploading data (ingress): Copying data into a GCS bucket from your laptop, campus HPC, or another provider is free.
  • Accessing data in the same region: If your bucket and your compute resources (VMs, Vertex AI jobs) are in the same region, you can read and stream data with no transfer fees. You only pay the storage cost per GB-month.
  • Cross-region access: If your bucket is in one region and your compute runs in another, you’ll pay an egress fee (about $0.01–0.02 per GB within North America, higher if crossing continents).
  • Downloading data out of GCP (egress): This refers to data leaving Google’s network to the public internet, such as downloading files to your laptop. Typical cost is around $0.12 per GB to the U.S. and North America, more for other continents.
  • Deleting data: Removing objects or buckets does not incur transfer costs. If you download data before deleting, you pay for the egress, but simply deleting in the console or CLI is free. For Nearline/Coldline/Archive storage classes, deleting before the minimum storage duration (30, 90, or 365 days) triggers an early deletion fee.

Request costs

  • GET (read) requests: ~$0.004 per 10,000 requests.
  • PUT (write) requests: ~$0.05 per 10,000 requests.

For detailed pricing, see GCS Pricing Information.

Challenge

Challenge: Estimating Storage Costs

1. Estimate the total cost of storing 1 GB in GCS Standard storage (us-central1) for one month assuming:
- Storage duration: 1 month
- Dataset retrieved 100 times for model training and tuning
- Data is downloaded once out of GCP at the end of the project

Hints
- Storage cost: $0.02 per GB per month
- Egress (download out of GCP): $0.12 per GB
- GET requests: $0.004 per 10,000 requests (100 requests ≈ free for our purposes)

2. Repeat the above calculation for datasets of 10 GB, 100 GB, and 1 TB (1024 GB).

  1. 1 GB:
  • Storage: 1 GB × $0.02 = $0.02
  • Egress: 1 GB × $0.12 = $0.12
  • Requests: ~0 (100 reads well below pricing tier)
  • Total: $0.14
  2. 10 GB:
  • Storage: 10 GB × $0.02 = $0.20
  • Egress: 10 GB × $0.12 = $1.20
  • Requests: ~0
  • Total: $1.40
  3. 100 GB:
  • Storage: 100 GB × $0.02 = $2.00
  • Egress: 100 GB × $0.12 = $12.00
  • Requests: ~0
  • Total: $14.00
  4. 1 TB (1024 GB):
  • Storage: 1024 GB × $0.02 = $20.48
  • Egress: 1024 GB × $0.12 = $122.88
  • Requests: ~0
  • Total: $143.36
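To check these numbers yourself (or estimate costs for a different dataset size), a small calculator like the sketch below can help. The rates are the approximate us-central1 Standard-class prices quoted above; check current GCS pricing before relying on them.

PYTHON

def estimate_gcs_cost(size_gb, months=1, downloads_out_of_gcp=1,
                      storage_per_gb_month=0.02, egress_per_gb=0.12):
    """Rough GCS cost estimate: Standard storage plus internet egress."""
    storage = size_gb * storage_per_gb_month * months
    egress = size_gb * egress_per_gb * downloads_out_of_gcp
    return storage + egress

for size in [1, 10, 100, 1024]:
    print(f"{size:>5} GB: ${estimate_gcs_cost(size):.2f}")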

Removing unused data (complete after the workshop)


After you are done using your data, remove unused files/buckets to stop costs:

  • Option 1: Delete files only – if you plan to reuse the bucket.
  • Option 2: Delete the bucket entirely – if you no longer need it.

When does BigQuery come into play?


For many ML workflows, especially smaller projects or those centered on image, text, or modest tabular datasets, BigQuery is overkill. GCS buckets are usually enough to store and access your data for training jobs. That said, BigQuery can be valuable when you are working with large tabular datasets and need a shared environment for exploration or collaboration. Instead of every team member downloading the same CSVs, BigQuery lets everyone query the data in place with SQL, share results through saved queries or views, and control access at the dataset or table level with IAM. BigQuery also integrates with Vertex AI, so if your data is already structured and stored there, you can connect it directly to training pipelines. The trade-off is cost: you pay not only for storage but also for the amount of data scanned by queries. For many ML research projects this is unnecessary, but when teams need a centralized, queryable workspace for large tabular data, BigQuery can simplify collaboration.

Key Points
  • Use GCS for scalable, cost-effective, and persistent storage in GCP.
  • Persistent disks are suitable only for small, temporary datasets.
  • Track your storage, transfer, and request costs to manage expenses.
  • Regularly delete unused data or buckets to avoid ongoing costs.

Content from Notebooks as Controllers


Last updated on 2025-09-22

Estimated time: 30 minutes

Overview

Questions

  • How do you set up and use Vertex AI Workbench notebooks for machine learning tasks?
  • How can you manage compute resources efficiently using a “controller” notebook approach in GCP?

Objectives

  • Describe how to use Vertex AI Workbench notebooks for ML workflows.
  • Set up a Jupyter-based Workbench instance as a controller to manage compute tasks.
  • Use the Vertex AI SDK to launch training and tuning jobs on scalable instances.

Setting up our notebook environment


Vertex AI Workbench provides JupyterLab-based environments that can be used to orchestrate machine learning workflows. In this workshop, we will use a Workbench Instance—the recommended option going forward, as other Workbench environments are being deprecated.

Workbench Instances come with JupyterLab 3 pre-installed and are configured with GPU-enabled ML frameworks (TensorFlow, PyTorch, etc.), making it easy to start experimenting without additional setup. Learn more in the Workbench Instances documentation.

Using the notebook as a controller

The notebook instance functions as a controller to manage more resource-intensive tasks. By selecting a modest machine type (e.g., n1-standard-4), you can perform lightweight operations locally in the notebook while using the Vertex AI Python SDK to launch compute-heavy jobs on larger machines (e.g., GPU-accelerated) when needed.

This approach minimizes costs while giving you access to scalable infrastructure for demanding tasks like model training, batch prediction, and hyperparameter tuning.

We will follow these steps to create our first Workbench Instance:

1. Navigate to Workbench

  • In the Google Cloud Console, search for “Workbench.”
  • Click the “Instances” tab (this is the supported path going forward).
  • Pin Workbench to your navigation bar for quick access.

2. Create a new Workbench Instance

  • Click “Create New” under Instances.
  • Notebook name: For this workshop, we can use the following naming convention to easily locate our notebooks: lastname-titanic
  • Region: Choose the same region as your storage bucket (e.g., us-central1).
    • This avoids cross-region transfer charges and keeps data access latency low.
  • GPUs: Leave disabled for now (training jobs will request them separately).
  • Labels: Add labels for cost tracking
    • purpose=workshop
    • owner=lastname_firstname
  • Machine type: Select a small machine (e.g., e2-standard-2) to act as the controller.
    • This keeps costs low while you delegate heavy lifting to training jobs.
    • For guidance on common machine types for ML, refer to Instances for ML on GCP.
  • Click Create to create the instance. Your notebook instance will start in a few minutes. When its status is “Running,” you can open JupyterLab and begin working.

Managing training and tuning with the controller notebook

In the following episodes, we will use the Vertex AI Python SDK (google-cloud-aiplatform) from this notebook to submit compute-heavy tasks on more powerful machines. Examples include:

  • Training a model on a GPU-backed instance.
  • Running hyperparameter tuning jobs managed by Vertex AI.

This pattern keeps costs low by running your notebook on a modest VM while only incurring charges for larger resources when they are actively in use.
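As a preview of what this looks like in practice, the sketch below shows the general pattern we will walk through in later episodes: initialize the SDK from the controller notebook, then submit a script to run on a larger machine. The project, bucket, script name, and container image here are placeholders.

PYTHON

from google.cloud import aiplatform

# Runs in the small controller notebook; the heavy lifting happens elsewhere
aiplatform.init(project="your-gcp-project-id", location="us-central1",
                staging_bucket="gs://your-bucket")

job = aiplatform.CustomTrainingJob(
    display_name="example-training-job",
    script_path="train.py",  # placeholder training script
    container_uri="us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.2-1:latest",
)

# The job runs on its own n1-standard-4 VM; the notebook just waits for results
job.run(machine_type="n1-standard-4", replica_count=1, sync=True)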

Challenge

Challenge: Notebook Roles

Your university provides different compute options: laptops, on-prem HPC, and GCP.

  • What role does a Workbench Instance notebook play compared to an HPC login node or a laptop-based JupyterLab?
  • Which tasks should stay in the notebook (lightweight control, visualization) versus being launched to larger cloud resources?

The notebook serves as a lightweight control plane.
- Like an HPC login node, it is not meant for heavy computation.
- Suitable for small preprocessing, visualization, and orchestrating jobs.
- Resource-intensive tasks (training, tuning, batch jobs) should be submitted to scalable cloud resources (GPU/large VM instances) via the Vertex AI SDK.

Key Points
  • Use a small Workbench Instance notebook as a controller to manage larger, resource-intensive tasks.
  • Always navigate to the “Instances” tab in Workbench, since older notebook types are deprecated.
  • Choose the same region for your Workbench Instance and storage bucket to avoid extra transfer costs.
  • Submit training and tuning jobs to scalable instances using the Vertex AI SDK.
  • Labels help track costs effectively, especially in shared or multi-project environments.
  • Workbench Instances come with JupyterLab 3 and GPU frameworks preinstalled, making them an easy entry point for ML workflows.
  • Enable idle auto-stop to avoid unexpected charges when notebooks are left running.

Content from Accessing and Managing Data in GCS with Vertex AI Notebooks


Last updated on 2025-09-22

Estimated time: 30 minutes

Overview

Questions

  • How can I load data from GCS into a Vertex AI Workbench notebook?
  • How do I monitor storage usage and costs for my GCS bucket?
  • What steps are involved in pushing new data back to GCS from a notebook?

Objectives

  • Read data directly from a GCS bucket into memory in a Vertex AI notebook.
  • Check storage usage and estimate costs for data in a GCS bucket.
  • Upload new files from the Vertex AI environment back to the GCS bucket.

Initial setup


Open JupyterLab notebook

Once your Vertex AI Workbench notebook instance shows as Running, open it in JupyterLab. Create a new Python 3 notebook and rename it to: Interacting-with-GCS.ipynb.

Set up GCP environment

Before interacting with GCS, we need to authenticate and initialize the client libraries. This ensures our notebook can talk to GCP securely.

PYTHON

from google.cloud import storage
import pandas as pd
import io

# Workbench instances are already authenticated with a service account,
# so storage.Client() picks up those credentials automatically.
client = storage.Client()
print("Project:", client.project)
# Note: _credentials is a private attribute; this works for service-account
# credentials on Workbench but may not for other credential types.
print("Credentials:", client._credentials.service_account_email)

Reading data from GCS


As with S3, you can either (A) read data directly from GCS into memory, or (B) download a copy into your notebook VM. Since we’re using notebooks as controllers rather than training environments, the recommended approach is reading directly from GCS.

A) Reading data directly into memory

PYTHON


bucket_name = "yourname_titanic"
bucket = client.bucket(bucket_name)
blob = bucket.blob("titanic_train.csv")
train_data = pd.read_csv(io.BytesIO(blob.download_as_bytes()))
print(train_data.shape)
train_data.head()

B) Downloading a local copy

PYTHON

bucket_name = "yourname-titanic-gcs"
blob_name = "titanic_train.csv"
local_path = "/home/jupyter/titanic_train.csv"

bucket = client.bucket(bucket_name)
blob = bucket.blob(blob_name)
blob.download_to_filename(local_path)

!ls -lh /home/jupyter/

Checking storage usage of a bucket


PYTHON

total_size_bytes = 0
bucket = client.bucket(bucket_name)

for blob in client.list_blobs(bucket_name):
    total_size_bytes += blob.size

total_size_mb = total_size_bytes / (1024**2)
print(f"Total size of bucket '{bucket_name}': {total_size_mb:.2f} MB")

Estimating storage costs


PYTHON

storage_price_per_gb = 0.02  # $/GB/month for Standard storage
total_size_gb = total_size_bytes / (1024**3)
monthly_cost = total_size_gb * storage_price_per_gb

print(f"Estimated monthly cost: ${monthly_cost:.4f}")
print(f"Estimated annual cost: ${monthly_cost*12:.4f}")

For updated prices, see GCS Pricing.

Writing output files to GCS


PYTHON

# Create a sample file locally on the notebook VM
with open("Notes.txt", "w") as f:
    f.write("This is a test note for GCS.")

# Point to the right bucket
bucket = client.bucket(bucket_name)

# Create a *Blob* object, which represents a path inside the bucket
# (here it will end up as gs://<bucket_name>/docs/Notes.txt)
blob = bucket.blob("docs/Notes.txt")

# Upload the local file into that blob (object) in GCS
blob.upload_from_filename("Notes.txt")

print("File uploaded successfully.")

List bucket contents:

PYTHON

for blob in client.list_blobs(bucket_name):
    print(blob.name)
Challenge

Challenge: Estimating GCS Costs

Suppose you store 50 GB of data in Standard storage (us-central1) for one month.
- Estimate the monthly storage cost.
- Then estimate the cost if you download (egress) the entire dataset once at the end of the month.

Hints
- Storage: $0.02 per GB-month
- Egress: $0.12 per GB

  • Storage cost: 50 GB × $0.02 = $1.00
  • Egress cost: 50 GB × $0.12 = $6.00
  • Total cost: $7.00 for one month including one full download
Key Points
  • Load data from GCS into memory to avoid managing local copies when possible.
  • Periodically check storage usage and costs to manage your GCS budget.
  • Use Vertex AI Workbench notebooks to upload analysis results back to GCS, keeping workflows organized and reproducible.

Content from Using a GitHub Personal Access Token (PAT) to Push/Pull from a Vertex AI Notebook


Last updated on 2025-08-27

Estimated time: 35 minutes

Overview

Questions

  • How can I securely push/pull code to and from GitHub within a Vertex AI Workbench notebook?
  • What steps are necessary to set up a GitHub PAT for authentication in GCP?
  • How can I convert notebooks to .py files and ignore .ipynb files in version control?

Objectives

  • Configure Git in a Vertex AI Workbench notebook to use a GitHub Personal Access Token (PAT) for HTTPS-based authentication.
  • Securely handle credentials in a notebook environment using getpass.
  • Convert .ipynb files to .py files for better version control practices in collaborative projects.

Step 0: Initial setup


In the previous episode, we cloned our forked repository as part of the workshop setup. In this episode, we’ll see how to push our code to this fork. Complete these three setup steps before moving forward.

  1. Clone the fork if you haven’t already. See previous episode.

  2. Start a new Jupyter notebook, and name it something like Interacting-with-git.ipynb. We can use the default Python 3 kernel in Vertex AI Workbench.

  3. Change directory to the workspace where your repository is located. In Vertex AI Workbench, notebooks usually live under /home/jupyter/.

PYTHON

%cd /home/jupyter/

Step 1: Using a GitHub personal access token (PAT) to push/pull from a Vertex AI notebook


When working in Vertex AI Workbench notebooks, you may often need to push code updates to GitHub repositories. Since Workbench VMs may be stopped and restarted, configurations like SSH keys may not persist. HTTPS-based authentication with a GitHub Personal Access Token (PAT) is a practical solution. PATs provide flexibility for authentication and enable seamless interaction with both public and private repositories directly from your notebook.

Important Note: Personal access tokens are powerful credentials. Select the minimum necessary permissions and handle the token carefully.

Generate a personal access token (PAT) on GitHub

  1. Go to Settings in GitHub.
  2. Click Developer settings at the bottom of the left sidebar.
  3. Select Personal access tokens, then click Tokens (classic).
  4. Click Generate new token (classic).
  5. Give your token a descriptive name and set an expiration date if desired.
  6. Select minimum permissions:
    • Public repos: public_repo
    • Private repos: repo
  7. Click Generate token and copy it immediately—you won’t be able to see it again.

Caution: Treat your PAT like a password. Don’t share it or expose it in your code. Use a password manager to store it.

Use getpass to prompt for username and PAT

PYTHON

import getpass

# Prompt for GitHub username and PAT securely
username = input("GitHub Username: ")
token = getpass.getpass("GitHub Personal Access Token (PAT): ")

This way credentials aren’t hard-coded into your notebook.

Step 2: Configure Git settings


PYTHON

!git config --global user.name "Your Name" 
!git config --global user.email your_email@wisc.edu
  • user.name: Will appear in the commit history.
  • user.email: Must match your GitHub account so commits are linked to your profile.

Step 3: Convert .ipynb notebooks to .py


Tracking .py files instead of .ipynb helps with cleaner version control. Notebooks store outputs and metadata, which makes diffs noisy. .py files are lighter and easier to review.

  1. Install Jupytext.

PYTHON

!pip install jupytext
  2. Convert a notebook to .py.

PYTHON

!jupytext --to py Interacting-with-GCS.ipynb
  3. Convert all notebooks in the current directory.

PYTHON

import subprocess, os

for nb in [f for f in os.listdir() if f.endswith('.ipynb')]:
    pyfile = nb.replace('.ipynb', '.py')
    subprocess.run(["jupytext", "--to", "py", nb, "--output", pyfile])
    print(f"Converted {nb} to {pyfile}")

Step 4: Add and commit .py files


PYTHON

%cd /home/jupyter/your-repo
!git status
!git add .
!git commit -m "Converted notebooks to .py files for version control"

Step 5: Add .ipynb to .gitignore


PYTHON

!touch .gitignore
with open(".gitignore", "a") as gitignore:
    gitignore.write("\n# Ignore Jupyter notebooks\n*.ipynb\n")
!cat .gitignore

Add other temporary files too:

PYTHON

with open(".gitignore", "a") as gitignore:
    gitignore.write("\n# Ignore cache and temp files\n__pycache__/\n*.tmp\n*.log\n")

Commit the .gitignore:

PYTHON

!git add .gitignore
!git commit -m "Add .ipynb and temp files to .gitignore"

Step 6: Syncing with GitHub


First, pull the latest changes:

PYTHON

!git config pull.rebase false
!git pull origin main

If conflicts occur, resolve manually before committing.

Then push with your PAT credentials:

PYTHON

github_url = f'github.com/{username}/your-repo.git'
!git push https://{username}:{token}@{github_url} main

Step 7: Convert .py back to notebooks (optional)


To convert .py files back to .ipynb after pulling updates:

PYTHON

!jupytext --to notebook Interacting-with-GCS.py --output Interacting-with-GCS.ipynb
Challenge

Challenge: GitHub PAT Workflow

  • Why might you prefer using a PAT with HTTPS instead of SSH keys in Vertex AI Workbench?
  • What are the benefits of converting .ipynb files to .py before committing to a shared repo?
  • PATs with HTTPS are easier to set up in temporary environments where SSH configs don’t persist.
  • Converting notebooks to .py results in cleaner diffs, easier code review, and smaller repos without stored outputs/metadata.
Key Points
  • Use a GitHub PAT for HTTPS-based authentication in Vertex AI Workbench notebooks.
  • Securely enter sensitive information in notebooks using getpass.
  • Converting .ipynb files to .py files helps with cleaner version control.
  • Adding .ipynb files to .gitignore keeps your repository organized.

Content from Training Models in Vertex AI: Intro


Last updated on 2025-09-24

Estimated time: 30 minutes

Overview

Questions

  • What are the differences between training locally in a Vertex AI notebook and using Vertex AI-managed training jobs?
  • How do custom training jobs in Vertex AI streamline the training process for various frameworks?
  • How does Vertex AI handle scaling across CPUs, GPUs, and TPUs?

Objectives

  • Understand the difference between local training in a Vertex AI Workbench notebook and submitting managed training jobs.
  • Learn to configure and use Vertex AI custom training jobs for different frameworks (e.g., XGBoost, PyTorch, SKLearn).
  • Understand scaling options in Vertex AI, including when to use CPUs, GPUs, or TPUs.
  • Compare performance, cost, and setup between custom scripts and pre-built containers in Vertex AI.
  • Conduct training with data stored in GCS and monitor training job status using the Google Cloud Console.

Initial setup


1. Open a new .ipynb notebook

Open a fresh Jupyter notebook inside your Vertex AI Workbench instance. You can name it something along the lines of Training-models.ipynb.

2. CD to instance home directory

So we all can reference helper functions consistently, change directory to your Jupyter home directory.

PYTHON

%cd /home/jupyter/

3. Initialize Vertex AI environment

This code initializes the Vertex AI environment by importing the Python SDK, setting the project, region, and defining a GCS bucket for input/output data.

PYTHON

from google.cloud import aiplatform
import pandas as pd

# Set your project and region (replace with your values)
PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"
BUCKET_NAME = "your-gcs-bucket"

# Initialize Vertex AI client
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")
  • aiplatform.init(): Sets defaults for project, region, and staging bucket.
  • PROJECT_ID: Identifies your GCP project.
  • REGION: Determines where training jobs run (choose a region close to your data).
  • staging_bucket: A GCS bucket for storing datasets, model artifacts, and job outputs.

4. Get code from GitHub repo (skip if already completed)

If you didn’t complete earlier episodes, clone our code repo before moving forward. Check to make sure we’re in our Jupyter home folder first.

PYTHON

%cd /home/jupyter/

PYTHON

!git clone https://github.com/qualiaMachine/Intro_GCP_for_ML.git

Testing train.py locally in the notebook


Before scaling training jobs onto managed resources, it’s essential to test your training script locally. This prevents wasting GPU/TPU time on bugs or misconfigured code.

Guidelines for testing ML pipelines before scaling

  • Run tests locally first with small datasets.
  • Use a subset of your dataset (1–5%) for fast checks (see the sketch after this list).
  • Start with minimal compute before moving to larger accelerators.
  • Log key metrics such as loss curves and runtimes.
  • Verify correctness first before scaling up.
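For example, a quick way to create a small subset for smoke tests is to sample a few percent of the training CSV before pointing your script at it. This is a minimal sketch, assuming the Titanic CSV has already been downloaded to the notebook (as shown in the next section).

PYTHON

import pandas as pd

df = pd.read_csv("titanic_train.csv")

# Keep ~5% of rows for a fast end-to-end check of the training script
small = df.sample(frac=0.05, random_state=42)
small.to_csv("titanic_train_small.csv", index=False)
print(f"Subset rows: {len(small)} of {len(df)}")
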
Discussion

What tests should we do before scaling?

Before scaling to multiple or more powerful instances (e.g., GPUs or TPUs), it’s important to run a few sanity checks. In your group, discuss:

  • Which checks do you think are most critical before scaling up?
  • What potential issues might we miss if we skip this step?
  • Data loads correctly – dataset loads without errors, expected columns exist, missing values handled.
  • Overfitting check – train on a tiny dataset (e.g., 100 rows). If it doesn’t overfit, something is off.
  • Loss behavior – verify training loss decreases and doesn’t diverge.
  • Runtime estimate – get a rough sense of training time on small data.
  • Memory estimate – check approximate memory use.
  • Save & reload – ensure model saves, reloads, and infers without errors.

Skipping these can lead to: silent data bugs, runtime blowups at scale, inefficient experiments, or broken model artifacts.

Download data into notebook environment


Sometimes it’s helpful to keep a copy of data in your notebook VM for quick iteration, even though GCS is the preferred storage location.

PYTHON

from google.cloud import storage

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)  # defined during the Vertex AI setup above

blob = bucket.blob("titanic_train.csv")
blob.download_to_filename("titanic_train.csv")

print("Downloaded titanic_train.csv")

Local test run of train.py


PYTHON

import time as t

start = t.time()

# Example: run your custom training script with args
!python GCP_helpers/train_xgboost.py --max_depth 3 --eta 0.1 --subsample 0.8 --colsample_bytree 0.8 --num_round 100 --train titanic_train.csv

print(f"Total local runtime: {t.time() - start:.2f} seconds")

Training on this small dataset should take <1 minute. Log runtime as a baseline. You should see the following output files:

  • xgboost-model.joblib # Python-serialized XGBoost model (Booster) via joblib; load with joblib.load for reuse.
  • eval_history.csv # Per-iteration validation metrics; columns: iter,val_logloss (good for plotting learning curves).
  • training.log # Full stdout/stderr from the run (params, dataset sizes, timings, warnings/errors) for audit/debug.
  • metrics.json # Structured summary: final_val_logloss, num_boost_round, params, train_rows/val_rows, features[], model_uri.

Training via Vertex AI custom training job


Unlike “local” training, this launches a managed training job that runs on scalable compute. Vertex AI handles provisioning, scaling, logging, and saving outputs to GCS.

Which machine type to start with?

Start with a small CPU machine like n1-standard-4. Only scale up to GPUs/TPUs once you’ve verified your script. See Instances for ML on GCP for guidance.

Creating a custom training job with the SDK

PYTHON

from google.cloud import aiplatform
import datetime as dt

PROJECT = "doit-rci-mlm25-4626"
REGION = "us-central1"
BUCKET = BUCKET_NAME  # e.g., "endemann_titanic" (same region as REGION)

RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
MODEL_URI = f"gs://{BUCKET}/artifacts/xgb/{RUN_ID}/model.joblib"  # everything will live beside this

# Staging bucket is only for the SDK's temp code tarball (aiplatform-*.tar.gz)
aiplatform.init(project=PROJECT, location=REGION, staging_bucket=f"gs://{BUCKET}")

job = aiplatform.CustomTrainingJob(
    display_name=f"endemann_xgb_{RUN_ID}",
    script_path="Intro_GCP_VertexAI/code/train_xgboost.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.2-1:latest",
    requirements=["gcsfs"],  # script writes gs://MODEL_URI and sidecar files
)

job.run(
    args=[
        f"--train=gs://{BUCKET}/titanic_train.csv",
        f"--model_out={MODEL_URI}",      # model, metrics.json, eval_history.csv, training.log all go here
        "--max_depth=3",
        "--eta=0.1",
        "--subsample=0.8",
        "--colsample_bytree=0.8",
        "--num_round=100",
    ],
    replica_count=1,
    machine_type="n1-standard-4",
    sync=True,
)

print("Model + logs folder:", MODEL_URI.rsplit("/", 1)[0])

This launches a managed training job with Vertex AI.

Monitoring training jobs in the Console


  1. Go to the Google Cloud Console.
  2. Navigate to Vertex AI > Training > Custom Jobs.
  3. Click on your job name to see status, logs, and output model artifacts.
  4. Cancel jobs from the console if needed (be careful not to stop jobs you don’t own in shared projects).

Visit “training pipelines” to verify it’s running. It may take around 5 minutes to finish.

https://console.cloud.google.com/vertex-ai/training/training-pipelines?hl=en&project=doit-rci-mlm25-4626
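You can also check job status from the controller notebook with the SDK. This is a minimal sketch; listing may be slow in projects with many jobs.

PYTHON

from google.cloud import aiplatform

# List custom training jobs in this project/region and show their states
for job in aiplatform.CustomJob.list():
    print(job.display_name, job.state)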

Should output the following files:

  • endemann_titanic/artifacts/xgb/20250924-154740/xgboost-model.joblib # Python-serialized XGBoost model (Booster) via joblib; load with joblib.load for reuse.
  • endemann_titanic/artifacts/xgb/20250924-154740/eval_history.csv # Per-iteration validation metrics; columns: iter,val_logloss (good for plotting learning curves).
  • endemann_titanic/artifacts/xgb/20250924-154740/training.log # Full stdout/stderr from the run (params, dataset sizes, timings, warnings/errors) for audit/debug.
  • endemann_titanic/artifacts/xgb/20250924-154740/metrics.json # Structured summary: final_val_logloss, num_boost_round, params, train_rows/val_rows, features[], model_uri.

When training takes too long


Two main options in Vertex AI:

  • Option 1: Upgrade to more powerful machine types (e.g., add GPUs like T4, V100, A100).
  • Option 2: Use distributed training with multiple replicas.

Option 1: Upgrade machine type (preferred first step)

  • Works best for small/medium datasets (<10 GB).
  • Avoids the coordination overhead of distributed training.
  • GPUs/TPUs accelerate deep learning tasks significantly.

Option 2: Distributed training with multiple replicas

  • Supported in Vertex AI for many frameworks.
  • Split data across replicas, each trains a portion, gradients synchronized.
  • More beneficial for very large datasets and long-running jobs.

When distributed training makes sense

  • Dataset >10–50 GB.
  • Training time >10 hours on single machine.
  • Deep learning workloads that naturally parallelize across GPUs/TPUs.
Key Points
  • Environment initialization: Use aiplatform.init() to set defaults for project, region, and bucket.
  • Local vs managed training: Test locally before scaling into managed jobs.
  • Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
  • Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
  • Monitoring: Track job logs and artifacts in the Vertex AI Console.

Content from Training Models in Vertex AI: PyTorch Example


Last updated on 2025-09-24

Estimated time: 30 minutes

Overview

Questions

  • When should you consider a GPU (or TPU) instance for PyTorch training in Vertex AI, and what are the trade‑offs for small vs. large workloads?
  • How do you launch a script‑based training job and write all artifacts (model, metrics, logs) next to each other in GCS without deploying a managed model?

Objectives

  • Prepare the Titanic dataset and save train/val arrays to compressed .npz files in GCS.
  • Submit a CustomTrainingJob that runs a PyTorch script and explicitly writes outputs to a chosen gs://…/artifacts/.../ folder.
  • Co‑locate artifacts: model.pt (or .joblib), metrics.json, eval_history.csv, and training.log for reproducibility.
  • Choose CPU vs. GPU instances sensibly; understand when distributed training is (not) worth it.

Initial setup (controller notebook)


Open a fresh Jupyter notebook in Vertex AI Workbench (Instances tab) and initialize:

PYTHON

from google.cloud import aiplatform, storage
import datetime as dt

PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"
BUCKET_NAME = "your-bucket"  # same region as REGION

# Only used for the SDK's small packaging tarball.
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")

Select the PyTorch environment (kernel)

  • In JupyterLab, click the kernel name (top‑right) and switch to a PyTorch‑ready kernel. On Workbench Instances this is usually available out‑of‑the‑box; if import torch fails, install locally:

    BASH

    pip install torch torchvision --upgrade
  • Quick check that your kernel can see PyTorch (and optionally CUDA if your VM has a GPU):

    PYTHON

    import torch
    print("torch:", torch.__version__, "cuda:", torch.cuda.is_available())
  • Note: local PyTorch is only needed for local tests. Your Vertex AI job uses the container specified by container_uri (e.g., pytorch-cpu.2-1 or pytorch-gpu.2-1), so it brings its own framework at run time.

Notes:

  • The staging bucket only stores the SDK’s temporary tar.gz of your training code.
  • We will not use base_output_dir; your script will write everything under a single gs://…/artifacts/.../ path.

Prepare data as .npz


PYTHON

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load Titanic CSV (from local or GCS you've already downloaded to the notebook)
df = pd.read_csv("titanic_train.csv")

# Minimal preprocessing to numeric arrays
sex_enc = LabelEncoder().fit(df["Sex"])  
df["Sex"] = sex_enc.transform(df["Sex"])  
df["Embarked"] = df["Embarked"].fillna("S")
emb_enc = LabelEncoder().fit(df["Embarked"])  
df["Embarked"] = emb_enc.transform(df["Embarked"])  
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

X = df[["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]].values
y = df["Survived"].values

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

np.savez("train_data.npz", X_train=X_train, y_train=y_train)
np.savez("val_data.npz",   X_val=X_val,   y_val=y_val)

# Upload to GCS
client = storage.Client()
bucket = client.bucket(BUCKET_NAME)
bucket.blob("data/train_data.npz").upload_from_filename("train_data.npz")
bucket.blob("data/val_data.npz").upload_from_filename("val_data.npz")
print("Uploaded: gs://%s/data/train_data.npz and val_data.npz" % BUCKET_NAME)
Callout

Why .npz?

  • Smaller, faster I/O than CSV for arrays.
  • Natural fit for torch.utils.data.Dataset / DataLoader (see the sketch below).
  • One file can hold multiple arrays (X_train, y_train).
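For example, the arrays saved above can be wrapped in a DataLoader with just a few lines (a minimal sketch; the batch size is arbitrary):

PYTHON

import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

data = np.load("train_data.npz")
dataset = TensorDataset(
    torch.from_numpy(data["X_train"].astype("float32")),
    torch.from_numpy(data["y_train"].astype("float32")).view(-1, 1),
)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
print("Batches per epoch:", len(loader))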

Minimal PyTorch training script (train_nn.py)


Place this file in your repo (e.g., GCP_helpers/train_nn.py). It does three things: 1) loads .npz from local or GCS, 2) trains a tiny MLP, 3) writes all outputs side‑by‑side (model + metrics + eval history + training.log) to the same --model_out folder.

PYTHON

# GCP_helpers/train_nn.py
import argparse, io, json, os, sys
import numpy as np
import torch, torch.nn as nn
from time import time

# --- small helpers for GCS/local I/O ---
def _parent_dir(p):
    return p.rsplit("/", 1)[0] if p.startswith("gs://") else (os.path.dirname(p) or ".")

def _write_bytes(path: str, data: bytes):
    if path.startswith("gs://"):
        try:
            import fsspec
            with fsspec.open(path, "wb") as f:
                f.write(data)
        except Exception:
            from google.cloud import storage
            b, k = path[5:].split("/", 1)
            storage.Client().bucket(b).blob(k).upload_from_string(data)
    else:
        os.makedirs(_parent_dir(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

def _write_text(path: str, text: str):
    _write_bytes(path, text.encode("utf-8"))

# --- tiny MLP ---
class MLP(nn.Module):
    def __init__(self, d_in):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)

class _Tee:
    def __init__(self, *s): self.s = s
    def write(self, d):
        for x in self.s: x.write(d); x.flush()
    def flush(self):
        for x in self.s: x.flush()

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--train", required=True)
    ap.add_argument("--val",   required=True)
    ap.add_argument("--epochs", type=int, default=100)
    ap.add_argument("--learning_rate", type=float, default=1e-3)
    ap.add_argument("--model_out", required=True, help="gs://…/artifacts/.../model.pt")
    args = ap.parse_args()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # All artifacts will sit next to model_out
    model_path = args.model_out
    art_dir = _parent_dir(model_path)

    # capture stdout/stderr
    buf = io.StringIO()
    orig_out, orig_err = sys.stdout, sys.stderr
    sys.stdout = _Tee(sys.stdout, buf)
    sys.stderr = _Tee(sys.stderr, buf)
    log_path = f"{art_dir}/training.log"

    try:
        # Load npz (supports gs:// via fsspec)
        def _npz_load(p):
            if p.startswith("gs://"):
                import fsspec
                with fsspec.open(p, "rb") as f:
                    by = f.read()
                return np.load(io.BytesIO(by))
            else:
                return np.load(p)
        train = _npz_load(args.train)
        val   = _npz_load(args.val)
        Xtr, ytr = train["X_train"].astype("float32"), train["y_train"].astype("float32")
        Xva, yva = val["X_val"].astype("float32"),   val["y_val"].astype("float32")

        Xtr_t = torch.from_numpy(Xtr).to(device)
        ytr_t = torch.from_numpy(ytr).view(-1,1).to(device)
        Xva_t = torch.from_numpy(Xva).to(device)
        yva_t = torch.from_numpy(yva).view(-1,1).to(device)

        model = MLP(Xtr.shape[1]).to(device)
        opt = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
        loss_fn = nn.BCELoss()

        hist = []
        t0 = time()
        for ep in range(1, args.epochs+1):
            model.train()
            opt.zero_grad()
            pred = model(Xtr_t)
            loss = loss_fn(pred, ytr_t)
            loss.backward(); opt.step()

            model.eval()
            with torch.no_grad():
                val_loss = loss_fn(model(Xva_t), yva_t).item()
            hist.append(val_loss)
            if ep % 10 == 0 or ep == 1:
                print(f"epoch={ep} val_loss={val_loss:.4f}")
        print(f"Training time: {time()-t0:.2f}s on {device}")

        # save model (serialize in memory first so gs:// paths work via _write_bytes)
        model_buf = io.BytesIO()
        torch.save(model.state_dict(), model_buf)
        _write_bytes(model_path, model_buf.getvalue())
        print(f"[INFO] Saved model: {model_path}")

        # metrics.json and eval_history.csv
        import json
        metrics = {
            "final_val_loss": float(hist[-1]) if hist else None,
            "epochs": int(args.epochs),
            "learning_rate": float(args.learning_rate),
            "train_rows": int(Xtr.shape[0]),
            "val_rows": int(Xva.shape[0]),
            "features": list(range(Xtr.shape[1])),
            "model_uri": model_path,
            "device": str(device),
        }
        _write_text(f"{art_dir}/metrics.json", json.dumps(metrics, indent=2))
        csv = "iter,val_loss\n" + "\n".join(f"{i+1},{v}" for i, v in enumerate(hist))
        _write_text(f"{art_dir}/eval_history.csv", csv)
    finally:
        # persist log and restore streams
        try:
            _write_text(log_path, buf.getvalue())
        except Exception as e:
            print(f"[WARN] could not write log: {e}")
        sys.stdout, sys.stderr = orig_out, orig_err

if __name__ == "__main__":
    main()

Launch the training job (no base_output_dir)


PYTHON

RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/pytorch/{RUN_ID}"
MODEL_URI = f"{ARTIFACT_DIR}/model.pt"   # model + metrics + logs will live here together

job = aiplatform.CustomTrainingJob(
    display_name=f"pytorch_nn_{RUN_ID}",
    script_path="GCP_helpers/train_nn.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-cpu.2-1:latest",  # or pytorch-gpu.2-1
    requirements=["torch", "numpy", "fsspec", "gcsfs"],
)

job.run(
    args=[
        f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
        f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
        f"--epochs=200",
        f"--learning_rate=0.001",
        f"--model_out={MODEL_URI}",   # drives where *all* artifacts go
    ],
    replica_count=1,
    machine_type="n1-standard-4",  # CPU fine for small datasets
    sync=True,
)

print("Artifacts folder:", ARTIFACT_DIR)

What you’ll see in gs://…/artifacts/pytorch/<RUN_ID>/:

  • model.pt — PyTorch weights (state_dict).
  • metrics.json — final val loss, hyperparameters, dataset sizes, device, model URI.
  • eval_history.csv — per‑epoch validation loss (for plots/regression checks).
  • training.log — complete stdout/stderr for reproducibility and debugging.
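Once the job finishes, you can pull these artifacts back into the controller notebook for inspection. A minimal sketch, assuming fsspec/gcsfs are installed in the notebook kernel and ARTIFACT_DIR is the folder printed above:

PYTHON

import io, json
import fsspec
import torch

# Inspect the run's summary metrics
with fsspec.open(f"{ARTIFACT_DIR}/metrics.json", "r") as f:
    print(json.load(f))

# Reload the trained weights (re-create the MLP from train_nn.py to use them)
with fsspec.open(f"{ARTIFACT_DIR}/model.pt", "rb") as f:
    state_dict = torch.load(io.BytesIO(f.read()), map_location="cpu")
print("Loaded tensors:", list(state_dict.keys()))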

Optional: GPU training


For larger models or heavier data:

PYTHON

job = aiplatform.CustomTrainingJob(
    display_name=f"pytorch_nn_gpu_{RUN_ID}",
    script_path="GCP_helpers/train_nn.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-1:latest",
    requirements=["torch", "numpy", "fsspec", "gcsfs"],
)

job.run(
    args=[
        f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
        f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
        f"--epochs=200",
        f"--learning_rate=0.001",
        f"--model_out={MODEL_URI}",
    ],
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    sync=True,
)

GPU tips:

  • On small problems, GPU startup/transfer overhead can erase speedups—benchmark before you scale.
  • Stick to a single replica unless your batch sizes and dataset really warrant data parallelism.

Distributed training (when to consider)


  • Data parallelism (DDP) helps when a single GPU is saturated by batch size/throughput. For most workshop‑scale models, a single machine/GPU is simpler and cheaper.
  • Model parallelism is for very large networks that don’t fit on one device—overkill for this lesson.

Monitoring jobs & finding outputs


  • Console → Vertex AI → Training → Custom Jobs → your run → “Output directory” shows the container logs and the environment’s AIP_MODEL_DIR.
  • Your script writes model + metrics + eval history + training.log next to --model_out, e.g., gs://<bucket>/artifacts/pytorch/<RUN_ID>/.
Key Points
  • Use CustomTrainingJob with a prebuilt PyTorch container; let your script control outputs via --model_out.
  • Keep artifacts together (model, metrics, history, log) in one folder for reproducibility.
  • .npz speeds up loading and plays nicely with PyTorch.
  • Start on CPU for small datasets; use GPU only when profiling shows a clear win.
  • Skip base_output_dir unless you specifically want Vertex’s default run directory; staging bucket is just for the SDK packaging tarball.

Content from Hyperparameter Tuning in Vertex AI: Neural Network Example


Last updated on 2025-08-27

Estimated time: 60 minutes

Overview

Questions

  • How can we efficiently manage hyperparameter tuning in Vertex AI?
  • How can we parallelize tuning jobs to optimize time without increasing costs?

Objectives

  • Set up and run a hyperparameter tuning job in Vertex AI.
  • Define search spaces for ContinuousParameter and CategoricalParameter.
  • Log and capture objective metrics for evaluating tuning success.
  • Optimize tuning setup to balance cost and efficiency, including parallelization.

To conduct efficient hyperparameter tuning with neural networks (or any model) in Vertex AI, we’ll use Vertex AI’s Hyperparameter Tuning Jobs. The key is defining a clear search space, ensuring metrics are properly logged, and keeping costs manageable by controlling the number of trials and level of parallelization.

Key steps for hyperparameter tuning

The overall process involves these steps:

  1. Prepare training script and ensure metrics are logged.
  2. Define hyperparameter search space.
  3. Configure a hyperparameter tuning job in Vertex AI.
  4. Set data paths and launch the tuning job.
  5. Monitor progress in the Vertex AI Console.
  6. Extract best model and evaluate.

0. Directory setup

Change directory to your Jupyter home folder.

PYTHON

%cd /home/jupyter/

1. Prepare training script with metric logging

Your training script (train_nn.py) should periodically report validation accuracy in a way Vertex AI can capture. Vertex AI hyperparameter tuning reads objective metrics through the cloudml-hypertune package, so add a small reporting call inside the training loop.

PYTHON

import hypertune  # pip install cloudml-hypertune

hpt_reporter = hypertune.HyperTune()
if (epoch + 1) % 100 == 0 or epoch == epochs - 1:
    hpt_reporter.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="validation_accuracy",
        metric_value=val_accuracy,
        global_step=epoch,
    )

The hyperparameter_metric_tag ("validation_accuracy") must match the key used in metric_spec when configuring the tuning job.

2. Define hyperparameter search space

In Vertex AI, you specify hyperparameter ranges when configuring the tuning job. You can define both discrete and continuous ranges.

PYTHON

from google.cloud.aiplatform import hyperparameter_tuning as hpt

parameter_spec = {
    "epochs": hpt.IntegerParameterSpec(min=100, max=1000, scale="linear"),
    "learning_rate": hpt.DoubleParameterSpec(min=0.001, max=0.1, scale="log"),
}
  • IntegerParameterSpec: Defines integer ranges.
  • DoubleParameterSpec: Defines continuous ranges, with optional scaling.

3. Configure hyperparameter tuning job

PYTHON

from google.cloud import aiplatform

# Each trial runs this custom job. Machine type, accelerators, and the fixed
# data arguments are defined here; the tuning service appends --epochs and
# --learning_rate to each trial based on parameter_spec.
custom_job = aiplatform.CustomJob.from_local_script(
    display_name="pytorch-train-hpt",
    script_path="GCP_helpers/train_nn.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",
    requirements=["torch", "pandas", "numpy", "scikit-learn", "cloudml-hypertune"],
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    args=[
        f"--train=gs://{BUCKET_NAME}/train_data.npz",
        f"--val=gs://{BUCKET_NAME}/val_data.npz",
    ],
)

hpt_job = aiplatform.HyperparameterTuningJob(
    display_name="pytorch-hpt-job",
    custom_job=custom_job,
    metric_spec={"validation_accuracy": "maximize"},
    parameter_spec=parameter_spec,
    max_trial_count=4,
    parallel_trial_count=2,
)

4. Launch the hyperparameter tuning job

PYTHON

# Machine type, accelerators, and the fixed --train/--val arguments were set on
# the CustomJob above; the tuning service supplies --epochs and --learning_rate per trial.
hpt_job.run(sync=True)
  • max_trial_count: Total number of configurations tested.
  • parallel_trial_count: Number of trials run at once (recommend ≤4 to let adaptive search improve).

5. Monitor tuning job in Vertex AI Console

  1. Navigate to Vertex AI > Training > Hyperparameter tuning jobs.
  2. View trial progress, logs, and metrics.
  3. Cancel jobs from the console if needed.

6. Extract and evaluate the best model

PYTHON

# Trials are returned in ID order, not sorted by objective, so pick the best explicitly
best_trial = max(hpt_job.trials, key=lambda t: t.final_measurement.metrics[0].value)
print("Best hyperparameters:", best_trial.parameters)
print("Best objective value:", best_trial.final_measurement.metrics[0].value)

You can then load the best model artifact from the associated GCS path and evaluate on test data.

Discussion

What is the effect of parallelism in tuning?

  • How might running 10 trials in parallel differ from running 2 at a time in terms of cost, time, and quality of results?
  • When would you want to prioritize speed over adaptive search benefits?
Key Points
  • Vertex AI Hyperparameter Tuning Jobs let you efficiently explore parameter spaces using adaptive strategies.
  • Always test with max_trial_count=1 first to confirm your setup works.
  • Limit parallel_trial_count to a small number (2–4) to benefit from adaptive search.
  • Use GCS for input/output and monitor jobs through the Vertex AI Console.

Content from Resource Management & Monitoring on Vertex AI (GCP)


Last updated on 2025-08-27

Estimated time: 65 minutes

Overview

Questions

  • How do I monitor and control Vertex AI, Workbench, and GCS costs day‑to‑day?
  • What specifically should I stop, delete, or schedule to avoid surprise charges?
  • How can I automate cleanup and set alerting so leaks get caught quickly?

Objectives

  • Identify all major cost drivers across Vertex AI (training jobs, endpoints, Workbench notebooks, batch prediction) and GCS.
  • Practice safe cleanup for Managed and User‑Managed Workbench notebooks, training/tuning jobs, batch predictions, models, endpoints, and artifacts.
  • Configure budgets, labels, and basic lifecycle policies to keep costs predictable.
  • Use gcloud/gsutil commands for auditing and rapid cleanup; understand when to prefer the Console.
  • Draft simple automation patterns (Cloud Scheduler + gcloud) to enforce idle shutdown.

What costs you money on GCP (quick map)


  • Vertex AI training jobs (Custom Jobs, Hyperparameter Tuning Jobs) — billed per VM/GPU hour while running.
  • Vertex AI endpoints (online prediction) — billed per node‑hour 24/7 while deployed, even if idle.
  • Vertex AI batch prediction jobs — billed for the job’s compute while running.
  • Vertex AI Workbench notebooks — the backing VM and disk bill while running (and disks bill even when stopped).
  • GCS buckets — storage class, object count/size, versioning, egress, and request ops.
  • Artifact Registry (containers, models) — storage for images and large artifacts.
  • Network egress — downloading data out of GCP (e.g., to your laptop) incurs cost.
  • Logging/Monitoring — high‑volume logs/metrics can add up (rare in small workshops, real in prod).

Rule of thumb: Endpoints left deployed and notebooks left running are the most common surprise bills in education/research settings.

A daily “shutdown checklist” (use now, automate later)


  1. Workbench notebooks — stop the runtime/instance when you’re done.
  2. Custom/HPT jobs — confirm no jobs stuck in RUNNING.
  3. Endpoints — undeploy models and delete unused endpoints.
  4. Batch predictions — ensure no jobs queued or running.
  5. Artifacts — delete large intermediate artifacts you won’t reuse.
  6. GCS — keep only one “source of truth”; avoid duplicate datasets in multiple buckets/regions.
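
Items 2–4 of this checklist can be audited in one go from your controller notebook. A rough read-only sketch with the Vertex AI Python SDK (project and region are placeholders):

PYTHON

# List anything that is still running or still deployed so you know what to stop.
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")

jobs = (aiplatform.CustomJob.list()
        + aiplatform.HyperparameterTuningJob.list()
        + aiplatform.BatchPredictionJob.list())
for job in jobs:
    if job.state.name in ("JOB_STATE_PENDING", "JOB_STATE_RUNNING"):
        print("Still active:", job.display_name, job.resource_name)

for endpoint in aiplatform.Endpoint.list():
    deployed = endpoint.list_models()   # each deployed model bills per node-hour
    if deployed:
        print("Endpoint still serving:", endpoint.display_name, f"({len(deployed)} model(s))")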

Shutting down Vertex AI Workbench notebooks


Vertex AI has two notebook flavors; follow the matching steps:

Managed Notebooks

  • Console: Vertex AI → Workbench → Managed notebooks → select runtime → Stop.

  • Idle shutdown: Edit runtime → enable Idle shutdown (e.g., 60–120 min).

  • CLI:

    BASH

    # List managed runtimes (adjust region)
    gcloud notebooks runtimes list --location=us-central1
    # Stop a runtime
    gcloud notebooks runtimes stop RUNTIME_NAME --location=us-central1

User‑Managed Notebooks

  • Console: Vertex AI → Workbench → User‑managed notebooks → select instance → Stop.

  • CLI:

    BASH

    # List user-managed instances (adjust zone)
    gcloud notebooks instances list --location=us-central1-b
    # Stop an instance
    gcloud notebooks instances stop INSTANCE_NAME --location=us-central1-b

Disks still cost money while the VM is stopped. Delete old runtimes/instances and their disks if you’re done with them.

Cleaning up training, tuning, and batch jobs


Audit with CLI

BASH

# Custom training jobs
gcloud ai custom-jobs list --region=us-central1
# Hyperparameter tuning jobs
gcloud ai hp-tuning-jobs list --region=us-central1
# Batch prediction jobs
gcloud ai batch-prediction-jobs list --region=us-central1

Stop/delete as needed

BASH

# Example: cancel a custom job
gcloud ai custom-jobs cancel JOB_ID --region=us-central1
# Delete a completed job you no longer need to retain
gcloud ai custom-jobs delete JOB_ID --region=us-central1

Tip: Keep one “golden” successful job per experiment, then remove the rest to reduce console clutter and artifact storage.
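
The same cleanup can be done from a notebook. A minimal Python SDK sketch (the display name below is a placeholder — match on whatever you used when submitting):

PYTHON

# Cancel a custom job that is still running; optionally delete its record afterwards.
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")

for job in aiplatform.CustomJob.list():
    if job.display_name == "titanic-train" and job.state.name == "JOB_STATE_RUNNING":
        job.cancel()      # stops the job's VMs/GPUs (and their billing)
        # job.delete()    # optional: remove the job record once cancellation completes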

Undeploy models and delete endpoints (major cost pitfall)


Find endpoints and deployed models

BASH

gcloud ai endpoints list --region=us-central1
gcloud ai endpoints describe ENDPOINT_ID --region=us-central1

Undeploy and delete

BASH

# Undeploy the model from the endpoint (stops node-hour charges)
gcloud ai endpoints undeploy-model ENDPOINT_ID \
  --deployed-model-id=DEPLOYED_MODEL_ID \
  --region=us-central1 \
  --quiet

# Delete the endpoint if you no longer need it
gcloud ai endpoints delete ENDPOINT_ID --region=us-central1 --quiet

Model Registry: If you keep models registered but don’t serve them, you won’t pay endpoint node‑hours. Periodically prune stale model versions to reduce storage.

GCS housekeeping (lifecycle policies, versioning, egress)


Quick size & contents

BASH

# Human-readable bucket size
gsutil du -sh gs://YOUR_BUCKET
# List recursively
gsutil ls -r gs://YOUR_BUCKET/** | head -n 50
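
If you'd rather explore from the controller notebook, a small Python sketch (bucket name is a placeholder) that surfaces the largest objects — usually the best candidates for pruning:

PYTHON

# Print the ten largest objects in a bucket, biggest first.
from google.cloud import storage

blobs = list(storage.Client().list_blobs("YOUR_BUCKET"))
for blob in sorted(blobs, key=lambda b: b.size or 0, reverse=True)[:10]:
    print(f"{(blob.size or 0) / 1e6:10.1f} MB  gs://YOUR_BUCKET/{blob.name}")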

Lifecycle policy example

Keep workshop artifacts tidy by auto‑deleting temporary outputs and capping old versions.

  1. Save as lifecycle.json:

JSON

{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 7, "matchesPrefix": ["tmp/"]}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"numNewerVersions": 3}
    }
  ]
}
  2. Apply to bucket:

BASH

gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET
gsutil lifecycle get gs://YOUR_BUCKET

Egress reminder

Downloading out of GCP (to local machines) incurs egress charges. Prefer in‑cloud training/evaluation and share results via GCS links.

Labels, budgets, and cost visibility


Standardize labels on all resources

Use the same labels everywhere (notebooks, jobs, buckets) so billing exports can attribute costs.

  • Examples: owner=yourname, team=ml-workshop, purpose=titanic-demo, env=dev

  • CLI examples:

    BASH

    # Add labels to a custom job on creation (Python SDK supports labels, too)
    # gcloud example when applicable:
    gcloud ai custom-jobs create --labels=owner=yourname,purpose=titanic-demo ...
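
The Python SDK accepts the same labels when resources are created, and buckets can carry them too. A minimal sketch (the display name, bucket, and training image URI are placeholders):

PYTHON

# Attach consistent labels from Python so billing exports can attribute costs.
from google.cloud import aiplatform, storage

LABELS = {"owner": "yourname", "team": "ml-workshop", "purpose": "titanic-demo", "env": "dev"}

aiplatform.init(project="PROJECT_ID", location="us-central1", staging_bucket="gs://YOUR_BUCKET")
job = aiplatform.CustomJob(
    display_name="titanic-train",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "YOUR_TRAINING_IMAGE_URI"},
    }],
    labels=LABELS,   # attached at creation time
)
# job.run()          # submit when ready

# The same labels on a GCS bucket:
bucket = storage.Client().get_bucket("YOUR_BUCKET")
bucket.labels = LABELS
bucket.patch()       # push the label update to the bucket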

Set budgets & alerts

  • In Billing → Budgets & alerts, create a budget for your project with thresholds (e.g., 50%, 80%, 100%).
  • Add forecast‑based alerts to catch trends early (e.g., projected to exceed budget).
  • Send email to multiple maintainers (not just you).

Enable billing export (optional but powerful)

  • Export billing to BigQuery to slice by service, label, or SKU.
  • Build a simple Data Studio/Looker Studio dashboard for workshop visibility.
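
Once the export lands in BigQuery, costs can be sliced by label with a short query. A sketch assuming the standard usage cost export schema (the table name is a placeholder for your own export table):

PYTHON

# Sum the last 30 days of cost per "owner" label from the billing export table.
from google.cloud import bigquery

query = """
SELECT l.value AS owner, ROUND(SUM(cost), 2) AS total_cost
FROM `PROJECT_ID.billing.gcp_billing_export_v1_XXXXXX`, UNNEST(labels) AS l
WHERE l.key = 'owner'
  AND usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY owner
ORDER BY total_cost DESC
"""
for row in bigquery.Client().query(query).result():
    print(row.owner, row.total_cost)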

Monitoring and alerts (catch leaks quickly)


  • Cloud Monitoring dashboards: Track notebook VM uptime, endpoint deployment counts, and job error rates.
  • Alerting policies: Trigger notifications when:
    • A Workbench runtime has been running > N hours outside workshop hours.
    • An endpoint node count > 0 for > 60 minutes after a workshop ends.
    • Spend forecast exceeds budget threshold.

Keep alerts few and actionable. Route to email or Slack (via webhook) where your team will see them.

Quotas and guardrails


  • Quotas (IAM & Admin → Quotas): cap GPU count, custom job limits, and endpoint nodes to protect budgets.
  • IAM: least privilege for service accounts used by notebooks and jobs; avoid wide Editor grants.
  • Org policies (if available): disallow costly regions/accelerators you don’t plan to use.

Automating the boring parts


Nightly auto‑stop for idle notebooks

Use Cloud Scheduler to run a daily command that stops notebooks after hours.

BASH

# Cloud Scheduler job (runs daily at 22:00) to stop a specific managed runtime
gcloud scheduler jobs create http stop-runtime-job \
  --schedule="0 22 * * *" \
  --uri="https://notebooks.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/runtimes/RUNTIME_NAME:stop" \
  --http-method=POST \
  --oidc-service-account-email=SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com

Alternative: call gcloud notebooks runtimes list in a small Cloud Run job, filter by last_active_time, and stop any runtime idle > 2h.

Weekly endpoint sweep

  • List endpoints; undeploy any with zero recent traffic (check logs/metrics), then delete stale endpoints.
  • Scriptable with gcloud ai endpoints list/describe in Cloud Run or Cloud Functions on a schedule.
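
As a starting point for that script, a rough Python SDK sketch (checking for "zero recent traffic" against Cloud Monitoring is omitted — review the printed list before deleting anything you still need):

PYTHON

# Weekly sweep: report every endpoint, delete the ones with nothing deployed.
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")

for endpoint in aiplatform.Endpoint.list():
    models = endpoint.list_models()
    print(endpoint.display_name, "-", len(models), "deployed model(s)")
    if not models:
        endpoint.delete()    # empty endpoints incur no node-hour charges, but add clutter
    # To retire an endpoint that is still serving, after review:
    # endpoint.undeploy_all(); endpoint.delete()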

Common pitfalls and quick fixes


  • Forgotten endpoints → Undeploy models; delete endpoints you don’t need.
  • Notebook left running all weekend → Enable Idle shutdown; schedule nightly stop.
  • Duplicate datasets across buckets/regions → consolidate; set lifecycle to purge tmp/.
  • Too many parallel HPT trials → cap parallel_trial_count (2–4) and increase max_trial_count gradually.
  • Orphaned artifacts in Artifact Registry/GCS → prune old images/artifacts after promoting a single “golden” run.
Challenge

Challenge 1 — Find and stop idle notebooks

List your notebooks and identify any runtime/instance that has likely been idle for >2 hours. Stop it via CLI.

Hints: gcloud notebooks runtimes list, gcloud notebooks instances list, ... stop

Use gcloud notebooks runtimes list --location=REGION (Managed) or gcloud notebooks instances list --location=ZONE (User‑Managed) to find candidates, then stop them with the corresponding ... stop command.

Challenge

Challenge 2 — Write a lifecycle policy

Create and apply a lifecycle rule that (a) deletes objects under tmp/ after 7 days, and (b) retains only 3 versions of any object.

Hint: gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET

Use the JSON policy shown above, then run gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET and verify with gsutil lifecycle get ....

Challenge

Challenge 3 — Endpoint sweep

List deployed endpoints in your region, undeploy any model you don’t need, and delete the endpoint if it’s no longer required.

Hints: gcloud ai endpoints list, ... describe, ... undeploy-model, ... delete

gcloud ai endpoints list --region=REGION → pick ENDPOINT_ID → gcloud ai endpoints undeploy-model ENDPOINT_ID --deployed-model-id=DEPLOYED_MODEL_ID --region=REGION --quiet → if not needed, gcloud ai endpoints delete ENDPOINT_ID --region=REGION --quiet.

Key Points
  • Endpoints and running notebooks are the most common cost leaks; undeploy/stop first.
  • Prefer Managed Notebooks with Idle shutdown; schedule nightly auto‑stop.
  • Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
  • Standardize labels, set budgets, and enable billing export for visibility.
  • Use gcloud/gsutil to audit and clean quickly; automate with Scheduler + Cloud Run/Functions.