Content from Overview of Google Cloud Vertex AI


Last updated on 2025-08-27

Google Cloud Vertex AI is a unified machine learning (ML) platform that enables users to build, train, tune, and deploy models at scale without needing to manage underlying infrastructure. By integrating data storage, training, tuning, and deployment workflows into one managed environment, Vertex AI supports researchers and practitioners in focusing on their ML models while leveraging Google Cloud’s compute and storage resources.

Overview

Questions

  • What problem does Google Cloud Vertex AI aim to solve?
  • How does Vertex AI simplify machine learning workflows compared to running them on your own?

Objectives

  • Understand the basic purpose of Vertex AI in the ML lifecycle.
  • Recognize how Vertex AI reduces infrastructure and orchestration overhead.

Why use Vertex AI for machine learning?

Vertex AI provides several advantages that make it an attractive option for research and applied ML:

  • Streamlined ML/AI Pipelines: Traditional HPC/HTC environments often require researchers to split workflows into many batch jobs, manually handling dependencies and orchestration. Vertex AI reduces this overhead by managing the end-to-end pipeline (data prep, training, evaluation, tuning, and deployment) within a single environment, making it easier to iterate and scale ML experiments.

  • Flexible compute options: Vertex AI lets you select the right hardware for your workload:

    • CPU (e.g., n1-standard-4, e2-standard-8): Good for small datasets, feature engineering, and inference tasks.
    • GPU (e.g., NVIDIA T4, V100, A100): Optimized for deep learning training and large-scale experimentation.
    • Memory-optimized machine types (e.g., m1-ultramem): Useful for workloads requiring large in-memory datasets, such as transformer models.
  • Parallelized training and tuning: Vertex AI supports distributed training across multiple nodes and automated hyperparameter tuning (Bayesian or grid search). This makes it easier to explore many configurations with minimal custom code while leveraging scalable infrastructure.

  • Custom training support: Vertex AI includes built-in algorithms and frameworks (e.g., scikit-learn, XGBoost, TensorFlow, PyTorch), but it also supports custom containers. Researchers can bring their own scripts or Docker images to run specialized workflows with full control.

  • Cost management and monitoring: Google Cloud provides detailed cost tracking and monitoring via the Billing console and Vertex AI dashboard. Vertex AI also integrates with Cloud Monitoring to help track resource usage. With careful configuration, training 100 small-to-medium models (logistic regression, random forests, or lightweight neural networks on datasets under 10GB) can cost under $20, similar to AWS.

In summary, Vertex AI is Google Cloud’s managed machine learning platform that simplifies the end-to-end ML lifecycle. It eliminates the need for manual orchestration in research computing environments by offering integrated workflows, scalable compute, and built-in monitoring. With flexible options for CPUs, GPUs, and memory-optimized hardware, plus strong support for both built-in and custom training, Vertex AI enables researchers to move quickly from experimentation to production while keeping costs predictable and manageable.

Discussion

Infrastructure Choices for ML

At your institution (or in your own work), what infrastructure options are currently available for running ML experiments?
- Do you typically use a laptop/desktop, HPC cluster, or cloud?
- What are the advantages and drawbacks of your current setup compared to a managed service like Vertex AI?
- If you could offload one infrastructure challenge (e.g., provisioning GPUs, handling dependencies, monitoring costs), what would it be and why?

Take 3–5 minutes to discuss with a partner or share in the workshop chat.

Key Points
  • Vertex AI simplifies ML workflows by integrating data, training, tuning, and deployment in one managed platform.
  • It reduces the need for manual orchestration compared to traditional research computing environments.
  • Cost monitoring and resource tracking help keep cloud usage affordable for research projects.

Content from Data Storage: Setting up GCS


Last updated on 2025-08-27

Overview

Questions

  • How can I store and manage data effectively in GCP for Vertex AI workflows?
  • What are the advantages of Google Cloud Storage (GCS) compared to local or VM storage for machine learning projects?

Objectives

  • Explain data storage options in GCP for machine learning projects.
  • Describe the advantages of GCS for large datasets and collaborative workflows.
  • Outline steps to set up a GCS bucket and manage data within Vertex AI.

Storing data on GCP


Machine learning and AI projects rely on data, making efficient storage and management essential. Google Cloud offers several storage options, but the most common for ML workflows are persistent disks (attached to Compute Engine VMs or Vertex AI Workbench) and Google Cloud Storage (GCS) buckets.

Consult your institution’s IT before handling sensitive data in GCP

As with AWS, do not upload restricted or sensitive data to GCP services unless explicitly approved by your institution’s IT or cloud security team. For regulated datasets (HIPAA, FERPA, proprietary), work with your institution to ensure encryption, restricted access, and compliance with policies.

Options for storage: VM Disks or GCS


What is a VM persistent disk?

A persistent disk is the storage volume attached to a Compute Engine VM or a Vertex AI Workbench notebook. It can store datasets and intermediate results, but it is tied to the lifecycle of the VM.

When to store data directly on a persistent disk

  • Useful for small, temporary datasets processed interactively.
  • Data persists if the VM is stopped, but storage costs continue as long as the disk exists.
  • Not ideal for collaboration, scaling, or long-term dataset storage.
Callout

Limitations of persistent disk storage

  • Scalability: Limited by disk size quota.
  • Sharing: Harder to share across projects or team members.
  • Cost: More expensive per GB compared to GCS for long-term storage.

What is a GCS bucket?

For most ML workflows in Vertex AI, Google Cloud Storage (GCS) buckets are recommended. A GCS bucket is a container in Google’s object storage service where you can store an essentially unlimited number of files. Data in GCS can be accessed from Vertex AI training jobs, Workbench notebooks, and other GCP services using a GCS URI (e.g., gs://your-bucket-name/your-file.csv).


To upload our Titanic dataset to a GCS bucket, we’ll follow these steps:

  1. Log in to the Google Cloud Console.
  2. Create a new bucket (or use an existing one).
  3. Upload your dataset files.
  4. Use the GCS URI to reference your data in Vertex AI workflows.

Detailed procedure

1. Sign in to Google Cloud Console
2. Navigate to Cloud Storage
  • In the search bar, type Storage.
  • Click Cloud Storage > Buckets.
3. Create a new bucket
  • Click Create bucket.
  • Enter a globally unique name (e.g., yourname-titanic-gcs).
  • Choose a location type:
    • Region (cheapest, good default).
    • Multi-region (higher redundancy, more expensive).
  • Access Control: Recommended: Uniform access with IAM.
  • Public Access: Block public access unless explicitly needed.
  • Versioning: Disable unless you want to keep multiple versions of files.
  • Labels (tags): Add labels to track project usage (e.g., purpose=titanic-dataset, owner=yourname).
4. Set bucket permissions
  • By default, only project members can access.
  • To grant Vertex AI service accounts access, assign the Storage Object Admin or Storage Object Viewer role at the bucket level.
5. Upload files to the bucket
  • If you haven’t already downloaded the Titanic CSV files, save them locally as .csv files first.
  • In the bucket dashboard, click Upload Files.
  • Select your Titanic CSVs and upload.
6. Note the GCS URI for your data
  • After uploading, click on a file and find its gs:// URI (e.g., gs://yourname-titanic-gcs/titanic_train.csv).
  • This URI will be used when launching Vertex AI training jobs.
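
If you prefer the command line, the same setup can be sketched with gsutil (for example, from Cloud Shell). This is an optional alternative to the Console walkthrough above; the bucket name is a placeholder and must be globally unique.

BASH

# Create a regional bucket with uniform bucket-level access (name must be globally unique)
gsutil mb -l us-central1 -b on gs://yourname-titanic-gcs

# Upload the Titanic training CSV (repeat for the test CSV)
gsutil cp titanic_train.csv gs://yourname-titanic-gcs/

# Confirm the upload and note the gs:// URIs
gsutil ls gs://yourname-titanic-gcs/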

GCS bucket costs


GCS costs are based on storage class, data transfer, and operations (requests).

Storage costs

  • Standard storage (us-central1): ~$0.02 per GB per month.
  • Other classes (Nearline, Coldline, Archive) are cheaper but with retrieval costs.

Data transfer costs

  • Uploading data into GCS is free.
  • Downloading data out of GCP costs ~$0.12 per GB.
  • Accessing data within the same region is free.

Request costs

  • GET (read) requests: ~$0.004 per 10,000 requests.
  • PUT (write) requests: ~$0.05 per 10,000 requests.

For detailed pricing, see GCS Pricing Information.

Challenge

Challenge: Estimating Storage Costs

1. Estimate the total cost of storing 1 GB in GCS Standard storage (us-central1) for one month assuming:
- Storage duration: 1 month
- Dataset retrieved 100 times for model training and tuning
- Data is downloaded once out of GCP at the end of the project

Hints
- Storage cost: $0.02 per GB per month
- Egress (download out of GCP): $0.12 per GB
- GET requests: $0.004 per 10,000 requests (100 requests ≈ free for our purposes)

2. Repeat the above calculation for datasets of 10 GB, 100 GB, and 1 TB (1024 GB).

  1. 1 GB:
  • Storage: 1 GB × $0.02 = $0.02
  • Egress: 1 GB × $0.12 = $0.12
  • Requests: ~0 (100 reads well below pricing tier)
  • Total: $0.14
  2. 10 GB:
  • Storage: 10 GB × $0.02 = $0.20
  • Egress: 10 GB × $0.12 = $1.20
  • Requests: ~0
  • Total: $1.40
  3. 100 GB:
  • Storage: 100 GB × $0.02 = $2.00
  • Egress: 100 GB × $0.12 = $12.00
  • Requests: ~0
  • Total: $14.00
  4. 1 TB (1024 GB):
  • Storage: 1024 GB × $0.02 = $20.48
  • Egress: 1024 GB × $0.12 = $122.88
  • Requests: ~0
  • Total: $143.36

Removing unused data (complete after the workshop)


After you are done using your data, remove unused files/buckets to stop costs:

  • Option 1: Delete files only – if you plan to reuse the bucket.
  • Option 2: Delete the bucket entirely – if you no longer need it.
Key Points
  • Use GCS for scalable, cost-effective, and persistent storage in GCP.
  • Persistent disks are suitable only for small, temporary datasets.
  • Track your storage, transfer, and request costs to manage expenses.
  • Regularly delete unused data or buckets to avoid ongoing costs.

Content from Notebooks as Controllers


Last updated on 2025-08-27

Overview

Questions

  • How do you set up and use Vertex AI Workbench notebooks for machine learning tasks?
  • How can you manage compute resources efficiently using a “controller” notebook approach in GCP?

Objectives

  • Describe how to use Vertex AI Workbench notebooks for ML workflows.
  • Set up a Jupyter-based Workbench instance as a controller to manage compute tasks.
  • Use the Vertex AI SDK to launch training and tuning jobs on scalable instances.

Setting up our notebook environment


Google Cloud Vertex AI provides a managed environment for building, training, and deploying machine learning models. In this episode, we’ll set up a Vertex AI Workbench notebook instance—a Jupyter-based environment hosted on GCP that integrates seamlessly with other Vertex AI services.

Using the notebook as a controller

The notebook instance functions as a controller to manage more resource-intensive tasks. By selecting a modest machine type (e.g., n1-standard-4), you can perform lightweight operations locally in the notebook while using the Vertex AI Python SDK to launch compute-heavy jobs on larger machines (e.g., GPU-accelerated) when needed.

This approach minimizes costs while giving you access to scalable infrastructure for demanding tasks like model training, batch prediction, and hyperparameter tuning.

We’ll follow these steps to create our first Vertex AI Workbench notebook:

1. Navigate to Vertex AI Workbench

  • In the Google Cloud Console, search for Vertex AI Workbench.
  • Pin it to your navigation bar for quick access.

2. Create a new notebook instance

  • Click New Notebook.
  • Choose Managed Notebooks (recommended for workshops and shared environments).
  • Notebook name: Use a naming convention like yourname-explore-vertexai.
  • Machine type: Select a small machine (e.g., n1-standard-4) to act as the controller.
    • This keeps costs low while you delegate heavy lifting to Vertex AI training jobs.
    • For guidance on common machine types for ML procedures, refer to our supplemental Instances for ML on GCP.
  • GPUs: Leave disabled for now (training jobs will request them separately).
  • Permissions: The project’s default service account is usually sufficient. It must have access to GCS and Vertex AI.
  • Networking and encryption: Leave default unless required by your institution.
  • Labels: Add labels for cost tracking (e.g., purpose=workshop, owner=yourname).

Once created, your notebook instance will start in a few minutes. When its status is Running, you can open JupyterLab and begin working.

Managing training and tuning with the controller notebook

In the following episodes, we’ll use the Vertex AI Python SDK (google-cloud-aiplatform) from this notebook to submit compute-heavy tasks on more powerful machines. Examples include:

  • Training a model: Submit a training job to Vertex AI with a higher-powered instance (e.g., n1-highmem-32 or GPU-backed machines).
  • Hyperparameter tuning: Configure and submit a tuning job, allowing Vertex AI to manage multiple parallel trials automatically.

This pattern keeps costs low by running your notebook on a modest VM while only incurring charges for larger resources when they’re actively in use.
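
For a rough preview of this pattern (covered in detail in the training episodes), here is a minimal sketch using the SDK. The project, bucket, and training script names are placeholders, and the container image is just one of the pre-built training images used later in the lesson.

PYTHON

from google.cloud import aiplatform

# Placeholder values -- substitute your own project, region, and bucket
aiplatform.init(
    project="your-gcp-project-id",
    location="us-central1",
    staging_bucket="gs://your-gcs-bucket",
)

# The notebook only defines and submits the job; the heavy lifting
# runs on the machine type requested below, not on this small VM.
job = aiplatform.CustomTrainingJob(
    display_name="example-train",
    script_path="train.py",  # hypothetical training script
    container_uri="us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.1-5:latest",
)
job.run(machine_type="n1-standard-8", replica_count=1)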

Challenge

Challenge: Notebook Roles

Your university provides different compute options: laptops, on-prem HPC, and GCP.

  • What role does a Vertex AI Workbench notebook play compared to an HPC login node or a laptop-based JupyterLab?
  • Which tasks should stay in the notebook (lightweight control, visualization) versus being launched to larger cloud resources?

The notebook serves as a lightweight control plane.
- Like an HPC login node, it’s not meant for heavy computation.
- Suitable for small preprocessing, visualization, and orchestrating jobs.
- Resource-intensive tasks (training, tuning, batch jobs) should be submitted to scalable cloud resources (GPU/large VM instances) via the Vertex AI SDK.

Key Points
  • Use a small Vertex AI Workbench notebook instance as a controller to manage larger, resource-intensive tasks.
  • Submit training and tuning jobs to scalable instances using the Vertex AI SDK.
  • Labels help track costs effectively, especially in shared or multi-project environments.
  • Vertex AI Workbench integrates directly with GCS and Vertex AI services, making it a hub for ML workflows.

Content from Accessing and Managing Data in GCS with Vertex AI Notebooks


Last updated on 2025-08-27

Overview

Questions

  • How can I load data from GCS into a Vertex AI Workbench notebook?
  • How do I monitor storage usage and costs for my GCS bucket?
  • What steps are involved in pushing new data back to GCS from a notebook?

Objectives

  • Read data directly from a GCS bucket into memory in a Vertex AI notebook.
  • Check storage usage and estimate costs for data in a GCS bucket.
  • Upload new files from the Vertex AI environment back to the GCS bucket.

Initial setup


Open JupyterLab notebook

Once your Vertex AI Workbench notebook instance shows as Running, open it in JupyterLab. Create a new Python 3 notebook and rename it to: Interacting-with-GCS.ipynb.

Set up GCP environment

Before interacting with GCS, we need to authenticate and initialize the client libraries. This ensures our notebook can talk to GCP securely.

PYTHON

from google.cloud import storage
import pandas as pd

# In Vertex AI Workbench, the instance's service account supplies
# Application Default Credentials automatically, so no interactive login is needed.
# (If you are running this in Colab instead, authenticate first with:
#   from google.colab import auth; auth.authenticate_user())

# Step 1: Initialize a GCS client
client = storage.Client()

# Step 2: List buckets in your current project to confirm access
buckets = list(client.list_buckets())
print("Buckets in project:")
for b in buckets:
    print("-", b.name)

Explanation of the pieces:
- Authentication: In Vertex AI Workbench, the notebook runs as the instance’s service account, so the client libraries pick up credentials automatically (Application Default Credentials). An explicit auth.authenticate_user() call is only needed in Colab.
- storage.Client(): Creates a connection to Google Cloud Storage. All read/write actions will use this client.
- list_buckets(): Confirms which storage buckets your account can see in the current project.

This setup block prepares the notebook environment to efficiently interact with GCS resources.

Reading data from GCS


As with S3, you can either (A) read data directly from GCS into memory, or (B) download a copy into your notebook VM. Since we’re using notebooks as controllers rather than training environments, the recommended approach is reading directly from GCS.

A) Reading data directly into memory

PYTHON

import io

bucket_name = "yourname-titanic-gcs"
blob_name = "titanic_train.csv"

bucket = client.bucket(bucket_name)
blob = bucket.blob(blob_name)
data_bytes = blob.download_as_bytes()
train_data = pd.read_csv(io.BytesIO(data_bytes))

print(train_data.shape)
train_data.head()

B) Downloading a local copy

PYTHON

bucket_name = "yourname-titanic-gcs"
blob_name = "titanic_train.csv"
local_path = "/home/jupyter/titanic_train.csv"

bucket = client.bucket(bucket_name)
blob = bucket.blob(blob_name)
blob.download_to_filename(local_path)

!ls -lh /home/jupyter/

Checking storage usage of a bucket


PYTHON

total_size_bytes = 0
bucket = client.bucket(bucket_name)

for blob in client.list_blobs(bucket_name):
    total_size_bytes += blob.size

total_size_mb = total_size_bytes / (1024**2)
print(f"Total size of bucket '{bucket_name}': {total_size_mb:.2f} MB")

Estimating storage costs


PYTHON

storage_price_per_gb = 0.02  # $/GB/month for Standard storage
total_size_gb = total_size_bytes / (1024**3)
monthly_cost = total_size_gb * storage_price_per_gb

print(f"Estimated monthly cost: ${monthly_cost:.4f}")
print(f"Estimated annual cost: ${monthly_cost*12:.4f}")

For updated prices, see GCS Pricing.

Writing output files to GCS


PYTHON

# Create a sample file
with open("Notes.txt", "w") as f:
    f.write("This is a test note for GCS.")

# Upload to bucket/docs/
bucket = client.bucket(bucket_name)
blob = bucket.blob("docs/Notes.txt")
blob.upload_from_filename("Notes.txt")

print("File uploaded successfully.")

List bucket contents:

PYTHON

for blob in client.list_blobs(bucket_name):
    print(blob.name)
Challenge

Challenge: Estimating GCS Costs

Suppose you store 50 GB of data in Standard storage (us-central1) for one month.
- Estimate the monthly storage cost.
- Then estimate the cost if you download (egress) the entire dataset once at the end of the month.

Hints
- Storage: $0.02 per GB-month
- Egress: $0.12 per GB

  • Storage cost: 50 GB × $0.02 = $1.00
  • Egress cost: 50 GB × $0.12 = $6.00
  • Total cost: $7.00 for one month including one full download
Key Points
  • Load data from GCS into memory to avoid managing local copies when possible.
  • Periodically check storage usage and costs to manage your GCS budget.
  • Use Vertex AI Workbench notebooks to upload analysis results back to GCS, keeping workflows organized and reproducible.

Content from Using a GitHub Personal Access Token (PAT) to Push/Pull from a Vertex AI Notebook


Last updated on 2025-08-27

Overview

Questions

  • How can I securely push/pull code to and from GitHub within a Vertex AI Workbench notebook?
  • What steps are necessary to set up a GitHub PAT for authentication in GCP?
  • How can I convert notebooks to .py files and ignore .ipynb files in version control?

Objectives

  • Configure Git in a Vertex AI Workbench notebook to use a GitHub Personal Access Token (PAT) for HTTPS-based authentication.
  • Securely handle credentials in a notebook environment using getpass.
  • Convert .ipynb files to .py files for better version control practices in collaborative projects.

Step 0: Initial setup


In the previous episode, we cloned our forked repository as part of the workshop setup. In this episode, we’ll see how to push our code to this fork. Complete these three setup steps before moving forward.

  1. Clone the fork if you haven’t already. See previous episode.

  2. Start a new Jupyter notebook, and name it something like Interacting-with-git.ipynb. We can use the default Python 3 kernel in Vertex AI Workbench.

  3. Change directory to the workspace where your repository is located. In Vertex AI Workbench, notebooks usually live under /home/jupyter/.

PYTHON

%cd /home/jupyter/

Step 1: Using a GitHub personal access token (PAT) to push/pull from a Vertex AI notebook


When working in Vertex AI Workbench notebooks, you may often need to push code updates to GitHub repositories. Since Workbench VMs may be stopped and restarted, configurations like SSH keys may not persist. HTTPS-based authentication with a GitHub Personal Access Token (PAT) is a practical solution. PATs provide flexibility for authentication and enable seamless interaction with both public and private repositories directly from your notebook.

Important Note: Personal access tokens are powerful credentials. Select the minimum necessary permissions and handle the token carefully.

Generate a personal access token (PAT) on GitHub

  1. Go to Settings in GitHub.
  2. Click Developer settings at the bottom of the left sidebar.
  3. Select Personal access tokens, then click Tokens (classic).
  4. Click Generate new token (classic).
  5. Give your token a descriptive name and set an expiration date if desired.
  6. Select minimum permissions:
    • Public repos: public_repo
    • Private repos: repo
  7. Click Generate token and copy it immediately—you won’t be able to see it again.

Caution: Treat your PAT like a password. Don’t share it or expose it in your code. Use a password manager to store it.

Use getpass to prompt for username and PAT

PYTHON

import getpass

# Prompt for GitHub username and PAT securely
username = input("GitHub Username: ")
token = getpass.getpass("GitHub Personal Access Token (PAT): ")

This way credentials aren’t hard-coded into your notebook.

Step 2: Configure Git settings


PYTHON

!git config --global user.name "Your Name" 
!git config --global user.email your_email@wisc.edu
  • user.name: Will appear in the commit history.
  • user.email: Must match your GitHub account so commits are linked to your profile.

Step 3: Convert .ipynb notebooks to .py


Tracking .py files instead of .ipynb helps with cleaner version control. Notebooks store outputs and metadata, which makes diffs noisy. .py files are lighter and easier to review.

  1. Install Jupytext.

PYTHON

!pip install jupytext
  2. Convert a notebook to .py.

PYTHON

!jupytext --to py Interacting-with-GCS.ipynb
  3. Convert all notebooks in the current directory.

PYTHON

import subprocess, os

for nb in [f for f in os.listdir() if f.endswith('.ipynb')]:
    pyfile = nb.replace('.ipynb', '.py')
    subprocess.run(["jupytext", "--to", "py", nb, "--output", pyfile])
    print(f"Converted {nb} to {pyfile}")

Step 4: Add and commit .py files


PYTHON

%cd /home/jupyter/your-repo
!git status
!git add .
!git commit -m "Converted notebooks to .py files for version control"

Step 5: Add .ipynb to .gitignore


PYTHON

!touch .gitignore
with open(".gitignore", "a") as gitignore:
    gitignore.write("\n# Ignore Jupyter notebooks\n*.ipynb\n")
!cat .gitignore

Add other temporary files too:

PYTHON

with open(".gitignore", "a") as gitignore:
    gitignore.write("\n# Ignore cache and temp files\n__pycache__/\n*.tmp\n*.log\n")

Commit the .gitignore:

PYTHON

!git add .gitignore
!git commit -m "Add .ipynb and temp files to .gitignore"

Step 6: Syncing with GitHub


First, pull the latest changes:

PYTHON

!git config pull.rebase false
!git pull origin main

If conflicts occur, resolve manually before committing.

Then push with your PAT credentials:

PYTHON

github_url = f'github.com/{username}/your-repo.git'
!git push https://{username}:{token}@{github_url} main

Step 7: Convert .py back to notebooks (optional)


To convert .py files back to .ipynb after pulling updates:

PYTHON

!jupytext --to notebook Interacting-with-GCS.py --output Interacting-with-GCS.ipynb
Challenge

Challenge: GitHub PAT Workflow

  • Why might you prefer using a PAT with HTTPS instead of SSH keys in Vertex AI Workbench?
  • What are the benefits of converting .ipynb files to .py before committing to a shared repo?
  • PATs with HTTPS are easier to set up in temporary environments where SSH configs don’t persist.
  • Converting notebooks to .py results in cleaner diffs, easier code review, and smaller repos without stored outputs/metadata.
Key Points
  • Use a GitHub PAT for HTTPS-based authentication in Vertex AI Workbench notebooks.
  • Securely enter sensitive information in notebooks using getpass.
  • Converting .ipynb files to .py files helps with cleaner version control.
  • Adding .ipynb files to .gitignore keeps your repository organized.

Content from Training Models in Vertex AI: Intro


Last updated on 2025-08-27

Overview

Questions

  • What are the differences between training locally in a Vertex AI notebook and using Vertex AI-managed training jobs?
  • How do custom training jobs in Vertex AI streamline the training process for various frameworks?
  • How does Vertex AI handle scaling across CPUs, GPUs, and TPUs?

Objectives

  • Understand the difference between local training in a Vertex AI Workbench notebook and submitting managed training jobs.
  • Learn to configure and use Vertex AI custom training jobs for different frameworks (e.g., XGBoost, PyTorch, SKLearn).
  • Understand scaling options in Vertex AI, including when to use CPUs, GPUs, or TPUs.
  • Compare performance, cost, and setup between custom scripts and pre-built containers in Vertex AI.
  • Conduct training with data stored in GCS and monitor training job status using the Google Cloud Console.

Initial setup


1. Open a new .ipynb notebook

Open a fresh Jupyter notebook inside your Vertex AI Workbench instance. You can name it something like Training-models.ipynb.

2. CD to instance home directory

So that we can all reference helper functions consistently, change to your Jupyter home directory.

PYTHON

%cd /home/jupyter/

3. Initialize Vertex AI environment

This code initializes the Vertex AI environment by importing the Python SDK, setting the project, region, and defining a GCS bucket for input/output data.

PYTHON

from google.cloud import aiplatform
import pandas as pd

# Set your project and region (replace with your values)
PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"
BUCKET_NAME = "your-gcs-bucket"

# Initialize Vertex AI client
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")
  • aiplatform.init(): Sets defaults for project, region, and staging bucket.
  • PROJECT_ID: Identifies your GCP project.
  • REGION: Determines where training jobs run (choose a region close to your data).
  • staging_bucket: A GCS bucket for storing datasets, model artifacts, and job outputs.

4. Get code from GitHub repo (skip if already completed)

If you didn’t complete earlier episodes, clone our code repo before moving forward. Check to make sure we’re in our Jupyter home folder first.

PYTHON

%cd /home/jupyter/

PYTHON

# Uncomment below line only if you still need to download the code repo (replace username with your GitHub username)
#!git clone https://github.com/username/GCP_helpers.git

Testing train.py locally in the notebook


Before scaling training jobs onto managed resources, it’s essential to test your training script locally. This prevents wasting GPU/TPU time on bugs or misconfigured code.

Guidelines for testing ML pipelines before scaling

  • Run tests locally first with small datasets.
  • Use a subset of your dataset (1–5%) for fast checks.
  • Start with minimal compute before moving to larger accelerators.
  • Log key metrics such as loss curves and runtimes.
  • Verify correctness first before scaling up.
Discussion

What tests should we do before scaling?

Before scaling to multiple or more powerful instances (e.g., GPUs or TPUs), it’s important to run a few sanity checks. In your group, discuss:

  • Which checks do you think are most critical before scaling up?
  • What potential issues might we miss if we skip this step?
  • Data loads correctly – dataset loads without errors, expected columns exist, missing values handled.
  • Overfitting check – train on a tiny dataset (e.g., 100 rows). If it doesn’t overfit, something is off.
  • Loss behavior – verify training loss decreases and doesn’t diverge.
  • Runtime estimate – get a rough sense of training time on small data.
  • Memory estimate – check approximate memory use.
  • Save & reload – ensure model saves, reloads, and infers without errors.

Skipping these can lead to: silent data bugs, runtime blowups at scale, inefficient experiments, or broken model artifacts.

Download data into notebook environment


Sometimes it’s helpful to keep a copy of data in your notebook VM for quick iteration, even though GCS is the preferred storage location.

PYTHON

from google.cloud import storage

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

blob = bucket.blob("titanic_train.csv")
blob.download_to_filename("titanic_train.csv")

print("Downloaded titanic_train.csv")

Repeat for the test dataset as needed.
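
With a local copy in place, you can also carve out a small subset for the quick sanity checks described above. A minimal sketch (the 5% fraction and the output filename are arbitrary choices):

PYTHON

import pandas as pd

# Keep a small random sample of the training data for fast local smoke tests
df = pd.read_csv("titanic_train.csv")
df.sample(frac=0.05, random_state=42).to_csv("titanic_train_small.csv", index=False)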

Logging runtime & instance info


When comparing runtimes later, it’s useful to know what instance type you ran on. For Workbench:

PYTHON

!curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/machine-type

This queries the Compute Engine metadata server and prints the machine type backing your notebook (the final path segment, e.g., n1-standard-4).

Local test run of train.py


PYTHON

import time as t

start = t.time()

# Example: run your custom training script with args
!python GCP_helpers/train_xgboost.py --max_depth 3 --eta 0.1 --subsample 0.8 --colsample_bytree 0.8 --num_round 100 --train titanic_train.csv

print(f"Total local runtime: {t.time() - start:.2f} seconds")

Training on this small dataset should take <1 minute. Log runtime as a baseline.

Training via Vertex AI custom training job


Unlike “local” training, this launches a managed training job that runs on scalable compute. Vertex AI handles provisioning, scaling, logging, and saving outputs to GCS.

Which machine type to start with?

Start with a small CPU machine like n1-standard-4. Only scale up to GPUs/TPUs once you’ve verified your script. See Instances for ML on GCP for guidance.

Creating a custom training job with the SDK

PYTHON

from google.cloud import aiplatform

job = aiplatform.CustomTrainingJob(
    display_name="xgboost-train",
    script_path="GCP_helpers/train_xgboost.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.1-5:latest",
    requirements=["pandas", "scikit-learn", "joblib"],
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-5:latest",
)

# Run the training job; script arguments are passed to run(), not the constructor
model = job.run(
    replica_count=1,
    machine_type="n1-standard-4",
    args=[
        "--max_depth=3",
        "--eta=0.1",
        "--subsample=0.8",
        "--colsample_bytree=0.8",
        "--num_round=100",
        "--train=gs://{}/titanic_train.csv".format(BUCKET_NAME),
    ],
)

This launches a managed training job with Vertex AI. Logs and trained models are automatically stored in your GCS bucket.
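
For the trained model to end up in GCS, the training script itself has to write its artifacts to the location Vertex AI provides. A hedged sketch of what the end of a script like train_xgboost.py might look like, assuming the fitted estimator is named model and joblib is used for serialization:

PYTHON

import os
import joblib
from google.cloud import storage

# Vertex AI exposes the managed output location as a gs:// URI in AIP_MODEL_DIR
model_dir = os.environ.get("AIP_MODEL_DIR", "")  # e.g., gs://your-bucket/.../model/

joblib.dump(model, "model.joblib")  # `model` is the fitted estimator created earlier in the script (assumption)

if model_dir.startswith("gs://"):
    bucket_name, _, prefix = model_dir[len("gs://"):].partition("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    storage.Client().bucket(bucket_name).blob(prefix + "model.joblib").upload_from_filename("model.joblib")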

Monitoring training jobs in the Console


  1. Go to the Google Cloud Console.
  2. Navigate to Vertex AI > Training > Custom Jobs.
  3. Click on your job name to see status, logs, and output model artifacts.
  4. Cancel jobs from the console if needed (be careful not to stop jobs you don’t own in shared projects).

When training takes too long


Two main options in Vertex AI:

  • Option 1: Upgrade to more powerful machine types (e.g., add GPUs like T4, V100, A100).
  • Option 2: Use distributed training with multiple replicas.

Option 1: Upgrade machine type (preferred first step)

  • Works best for small/medium datasets (<10 GB).
  • Avoids the coordination overhead of distributed training.
  • GPUs/TPUs accelerate deep learning tasks significantly.
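
If you defined your job with CustomTrainingJob as above, Option 1 is mostly a matter of changing the run() arguments, provided your training container supports GPUs (the accelerator values here are examples):

PYTHON

model = job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",  # or NVIDIA_TESLA_V100 / NVIDIA_TESLA_A100
    accelerator_count=1,
    args=[
        "--num_round=100",
        "--train=gs://{}/titanic_train.csv".format(BUCKET_NAME),
    ],
)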

Option 2: Distributed training with multiple replicas

  • Supported in Vertex AI for many frameworks.
  • Split data across replicas, each trains a portion, gradients synchronized.
  • More beneficial for very large datasets and long-running jobs.

When distributed training makes sense

  • Dataset >10–50 GB.
  • Training time >10 hours on single machine.
  • Deep learning workloads that naturally parallelize across GPUs/TPUs.
Key Points
  • Environment initialization: Use aiplatform.init() to set defaults for project, region, and bucket.
  • Local vs managed training: Test locally before scaling into managed jobs.
  • Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
  • Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
  • Monitoring: Track job logs and artifacts in the Vertex AI Console.

Content from Training Models in Vertex AI: PyTorch Example


Last updated on 2025-08-27

Overview

Questions

  • When should you consider using a GPU or TPU instance for training neural networks in Vertex AI, and what are the benefits and limitations?
  • How does Vertex AI handle distributed training, and which approaches are suitable for typical neural network training?

Objectives

  • Preprocess the Titanic dataset for efficient training using PyTorch.
  • Save and upload training and validation data in .npz format to GCS.
  • Understand the trade-offs between CPU, GPU, and TPU training for smaller datasets.
  • Deploy a PyTorch model to Vertex AI and evaluate instance types for training performance.
  • Differentiate between data parallelism and model parallelism, and determine when each is appropriate in Vertex AI.

Initial setup


Open a fresh Jupyter notebook in your Vertex AI Workbench environment (e.g., Training-part2.ipynb). Then initialize your environment:

PYTHON

from google.cloud import aiplatform, storage
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"
BUCKET_NAME = "your-gcs-bucket"

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")
  • aiplatform.init(): Initializes Vertex AI with project, region, and staging bucket.
  • storage.Client(): Used to upload training data to GCS.

Preparing the data (compressed npz files)


We’ll prepare the Titanic dataset and save as .npz files for efficient PyTorch loading.

PYTHON

# Load and preprocess Titanic dataset
df = pd.read_csv("titanic_train.csv")

df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
df['Embarked'] = df['Embarked'].fillna('S')
df['Embarked'] = LabelEncoder().fit_transform(df['Embarked'])
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].values
y = df['Survived'].values

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

np.savez('train_data.npz', X_train=X_train, y_train=y_train)
np.savez('val_data.npz', X_val=X_val, y_val=y_val)

Upload data to GCS

PYTHON

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

bucket.blob("train_data.npz").upload_from_filename("train_data.npz")
bucket.blob("val_data.npz").upload_from_filename("val_data.npz")

print("Files uploaded to GCS.")
Callout

Why use .npz?

  • Optimized data loading: Compressed binary format reduces I/O overhead.
  • Batch compatibility: Works seamlessly with PyTorch DataLoader.
  • Consistency: Keeps train/validation arrays structured and organized.
  • Multiple arrays: Stores multiple arrays (X_train, y_train) in one file.
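
Building on that, here is a minimal sketch of how a training script such as train_nn.py might load these files into a PyTorch DataLoader (variable names and batch size are assumptions):

PYTHON

import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Load the compressed arrays saved above
data = np.load("train_data.npz")
X_train = torch.tensor(data["X_train"], dtype=torch.float32)
y_train = torch.tensor(data["y_train"], dtype=torch.float32)

# Wrap them in a DataLoader for mini-batch training
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)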

Testing locally in notebook


Before scaling up, test your script locally with fewer epochs:

PYTHON

import torch
import time as t

epochs = 100
learning_rate = 0.001

start_time = t.time()
%run GCP_helpers/train_nn.py --train train_data.npz --val val_data.npz --epochs {epochs} --learning_rate {learning_rate}
print(f"Local training time: {t.time() - start_time:.2f} seconds")

Training via Vertex AI with PyTorch


Vertex AI supports custom training jobs with PyTorch containers.

PYTHON

job = aiplatform.CustomTrainingJob(
    display_name="pytorch-train",
    script_path="GCP_helpers/train_nn.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",
    requirements=["torch", "pandas", "numpy", "scikit-learn"],
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.1-13:latest",
)

# Script arguments are passed to run(); start on a small CPU machine first
model = job.run(
    replica_count=1,
    machine_type="n1-standard-4",
    args=[
        "--train=gs://{}/train_data.npz".format(BUCKET_NAME),
        "--val=gs://{}/val_data.npz".format(BUCKET_NAME),
        "--epochs=1000",
        "--learning_rate=0.001",
    ],
)

GPU Training in Vertex AI


For small datasets, GPUs may not help. But for larger models/datasets, GPUs (e.g., T4, V100, A100) can reduce training time.

In your training script (train_nn.py), ensure GPU support:

PYTHON

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Then move models and tensors to device.
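
For example (model and train_loader are assumed to be defined earlier in the script):

PYTHON

model = model.to(device)  # move model parameters to the GPU if one is available

for X_batch, y_batch in train_loader:
    # move each mini-batch to the same device before the forward pass
    X_batch, y_batch = X_batch.to(device), y_batch.to(device)
    outputs = model(X_batch)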

Distributed training in Vertex AI


Vertex AI supports data and model parallelism.

  • Data parallelism: Common for neural nets; dataset split across replicas; gradients synced.
  • Model parallelism: Splits model across devices, used for very large models.

PYTHON

model = job.run(replica_count=2, machine_type="n1-standard-8", accelerator_type="NVIDIA_TESLA_T4", accelerator_count=1)

Monitoring jobs


  • In the Console: Vertex AI > Training > Custom Jobs.
  • Check logs, runtime, and outputs.
  • Cancel jobs as needed.
Key Points
  • .npz files streamline PyTorch data handling and reduce I/O overhead.
  • GPUs may not speed up small models/datasets due to overhead.
  • Vertex AI supports both CPU and GPU training, with scaling via multiple replicas.
  • Data parallelism splits data, model parallelism splits layers — choose based on model size.
  • Test locally first before launching expensive training jobs.

Content from Hyperparameter Tuning in Vertex AI: Neural Network Example


Last updated on 2025-08-27

Overview

Questions

  • How can we efficiently manage hyperparameter tuning in Vertex AI?
  • How can we parallelize tuning jobs to optimize time without increasing costs?

Objectives

  • Set up and run a hyperparameter tuning job in Vertex AI.
  • Define search spaces for ContinuousParameter and CategoricalParameter.
  • Log and capture objective metrics for evaluating tuning success.
  • Optimize tuning setup to balance cost and efficiency, including parallelization.

To conduct efficient hyperparameter tuning with neural networks (or any model) in Vertex AI, we’ll use Vertex AI’s Hyperparameter Tuning Jobs. The key is defining a clear search space, ensuring metrics are properly logged, and keeping costs manageable by controlling the number of trials and level of parallelization.

Key steps for hyperparameter tuning

The overall process involves these steps:

  1. Prepare training script and ensure metrics are logged.
  2. Define hyperparameter search space.
  3. Configure a hyperparameter tuning job in Vertex AI.
  4. Set data paths and launch the tuning job.
  5. Monitor progress in the Vertex AI Console.
  6. Extract best model and evaluate.

0. Directory setup

Change directory to your Jupyter home folder.

PYTHON

%cd /home/jupyter/

1. Prepare training script with metric logging

Your training script (train_nn.py) should report validation accuracy to Vertex AI so the tuning service can compare trials. Vertex AI reads metrics reported through the cloudml-hypertune package (rather than parsing printed log output).

PYTHON

import hypertune

if (epoch + 1) % 100 == 0 or epoch == epochs - 1:
    hypertune.HyperTune().report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="validation_accuracy",
        metric_value=val_accuracy,
        global_step=epoch,
    )

The hyperparameter_metric_tag must match the metric name declared in the tuning job’s metric_spec (see step 3).

2. Define hyperparameter search space

In Vertex AI, you specify hyperparameter ranges when configuring the tuning job. You can define both discrete and continuous ranges.

PYTHON

from google.cloud.aiplatform import hyperparameter_tuning as hpt

parameter_spec = {
    "epochs": hpt.IntegerParameterSpec(min=100, max=1000, scale="linear"),
    "learning_rate": hpt.DoubleParameterSpec(min=0.001, max=0.1, scale="log"),
}
  • IntegerParameterSpec: Defines integer ranges.
  • DoubleParameterSpec: Defines continuous ranges, with optional linear or log scaling.

3. Configure hyperparameter tuning job

PYTHON

from google.cloud import aiplatform

# Define the underlying training job. Machine configuration and the fixed
# data-path arguments live here; the tuned hyperparameters are appended per trial.
custom_job = aiplatform.CustomJob.from_local_script(
    display_name="pytorch-train-hpt",
    script_path="GCP_helpers/train_nn.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",
    requirements=["torch", "pandas", "numpy", "scikit-learn", "cloudml-hypertune"],
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    args=[
        "--train=gs://{}/train_data.npz".format(BUCKET_NAME),
        "--val=gs://{}/val_data.npz".format(BUCKET_NAME),
    ],
)

hpt_job = aiplatform.HyperparameterTuningJob(
    display_name="pytorch-hpt-job",
    custom_job=custom_job,
    metric_spec={"validation_accuracy": "maximize"},
    parameter_spec=parameter_spec,
    max_trial_count=4,
    parallel_trial_count=2,
)

4. Launch the hyperparameter tuning job

PYTHON

hpt_job.run()
  • Machine type, accelerators, and the fixed data-path arguments were already set on the underlying CustomJob above.
  • Vertex AI appends each trial’s hyperparameter values (e.g., --epochs, --learning_rate) to the command line automatically, so your training script only needs to accept them as arguments.
  • max_trial_count: Total number of configurations tested.
  • parallel_trial_count: Number of trials run at once (recommend ≤4 so the adaptive search can learn from earlier trials).

5. Monitor tuning job in Vertex AI Console

  1. Navigate to Vertex AI > Training > Hyperparameter tuning jobs.
  2. View trial progress, logs, and metrics.
  3. Cancel jobs from the console if needed.

6. Extract and evaluate the best model

PYTHON

# Trials are not guaranteed to be ordered by performance; pick the best by the objective metric
best_trial = max(hpt_job.trials, key=lambda t: t.final_measurement.metrics[0].value)
print("Best hyperparameters:", best_trial.parameters)
print("Best objective value:", best_trial.final_measurement.metrics[0].value)

You can then load the best model artifact from the associated GCS path and evaluate on test data.

Discussion

What is the effect of parallelism in tuning?

  • How might running 10 trials in parallel differ from running 2 at a time in terms of cost, time, and quality of results?
  • When would you want to prioritize speed over adaptive search benefits?
Key Points
  • Vertex AI Hyperparameter Tuning Jobs let you efficiently explore parameter spaces using adaptive strategies.
  • Always test with max_trial_count=1 first to confirm your setup works.
  • Limit parallel_trial_count to a small number (2–4) to benefit from adaptive search.
  • Use GCS for input/output and monitor jobs through the Vertex AI Console.

Content from Resource Management & Monitoring on Vertex AI (GCP)


Last updated on 2025-08-27

Overview

Questions

  • How do I monitor and control Vertex AI, Workbench, and GCS costs day‑to‑day?
  • What specifically should I stop, delete, or schedule to avoid surprise charges?
  • How can I automate cleanup and set alerting so leaks get caught quickly?

Objectives

  • Identify all major cost drivers across Vertex AI (training jobs, endpoints, Workbench notebooks, batch prediction) and GCS.
  • Practice safe cleanup for Managed and User‑Managed Workbench notebooks, training/tuning jobs, batch predictions, models, endpoints, and artifacts.
  • Configure budgets, labels, and basic lifecycle policies to keep costs predictable.
  • Use gcloud/gsutil commands for auditing and rapid cleanup; understand when to prefer the Console.
  • Draft simple automation patterns (Cloud Scheduler + gcloud) to enforce idle shutdown.

What costs you money on GCP (quick map)


  • Vertex AI training jobs (Custom Jobs, Hyperparameter Tuning Jobs) — billed per VM/GPU hour while running.
  • Vertex AI endpoints (online prediction) — billed per node‑hour 24/7 while deployed, even if idle.
  • Vertex AI batch prediction jobs — billed for the job’s compute while running.
  • Vertex AI Workbench notebooks — the backing VM and disk bill while running (and disks bill even when stopped).
  • GCS buckets — storage class, object count/size, versioning, egress, and request ops.
  • Artifact Registry (containers, models) — storage for images and large artifacts.
  • Network egress — downloading data out of GCP (e.g., to your laptop) incurs cost.
  • Logging/Monitoring — high‑volume logs/metrics can add up (rare in small workshops, real in prod).

Rule of thumb: Endpoints left deployed and notebooks left running are the most common surprise bills in education/research settings.

A daily “shutdown checklist” (use now, automate later)


  1. Workbench notebooks — stop the runtime/instance when you’re done.
  2. Custom/HPT jobs — confirm no jobs stuck in RUNNING.
  3. Endpoints — undeploy models and delete unused endpoints.
  4. Batch predictions — ensure no jobs queued or running.
  5. Artifacts — delete large intermediate artifacts you won’t reuse.
  6. GCS — keep only one “source of truth”; avoid duplicate datasets in multiple buckets/regions.

Shutting down Vertex AI Workbench notebooks


Vertex AI has two notebook flavors; follow the matching steps:

Managed Notebooks

  • Console: Vertex AI → Workbench → Managed notebooks → select runtime → Stop.

  • Idle shutdown: Edit runtime → enable Idle shutdown (e.g., 60–120 min).

  • CLI:

    BASH

    # List managed runtimes (adjust region)
    gcloud notebooks runtimes list --location=us-central1
    # Stop a runtime
    gcloud notebooks runtimes stop RUNTIME_NAME --location=us-central1

User‑Managed Notebooks

  • Console: Vertex AI → Workbench → User‑managed notebooks → select instance → Stop.

  • CLI:

    BASH

    # List user-managed instances (adjust zone)
    gcloud notebooks instances list --location=us-central1-b
    # Stop an instance
    gcloud notebooks instances stop INSTANCE_NAME --location=us-central1-b

Disks still cost money while the VM is stopped. Delete old runtimes/instances and their disks if you’re done with them.
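
When you are completely done with a notebook, deleting it (not just stopping it) also removes its disks. A minimal sketch (names and locations are placeholders; attached data disks are removed unless configured otherwise):

BASH

# Delete a managed runtime you no longer need
gcloud notebooks runtimes delete RUNTIME_NAME --location=us-central1

# Delete a user-managed instance you no longer need
gcloud notebooks instances delete INSTANCE_NAME --location=us-central1-b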

Cleaning up training, tuning, and batch jobs


Audit with CLI

BASH

# Custom training jobs
gcloud ai custom-jobs list --region=us-central1
# Hyperparameter tuning jobs
gcloud ai hp-tuning-jobs list --region=us-central1
# Batch prediction jobs
gcloud ai batch-prediction-jobs list --region=us-central1

Stop/delete as needed

BASH

# Example: cancel a custom job
gcloud ai custom-jobs cancel JOB_ID --region=us-central1
# Delete a completed job you no longer need to retain
gcloud ai custom-jobs delete JOB_ID --region=us-central1

Tip: Keep one “golden” successful job per experiment, then remove the rest to reduce console clutter and artifact storage.

Undeploy models and delete endpoints (major cost pitfall)


Find endpoints and deployed models

BASH

gcloud ai endpoints list --region=us-central1
gcloud ai endpoints describe ENDPOINT_ID --region=us-central1

Undeploy and delete

BASH

# Undeploy the model from the endpoint (stops node-hour charges)
gcloud ai endpoints undeploy-model ENDPOINT_ID   --deployed-model-id=DEPLOYED_MODEL_ID   --region=us-central1   --quiet

# Delete the endpoint if you no longer need it
gcloud ai endpoints delete ENDPOINT_ID --region=us-central1 --quiet

Model Registry: If you keep models registered but don’t serve them, you won’t pay endpoint node‑hours. Periodically prune stale model versions to reduce storage.
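
A quick way to audit what is registered (region and model ID are placeholders):

BASH

# List registered models in the region
gcloud ai models list --region=us-central1

# Delete a stale model you no longer need
gcloud ai models delete MODEL_ID --region=us-central1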

GCS housekeeping (lifecycle policies, versioning, egress)


Quick size & contents

BASH

# Human-readable bucket size
gsutil du -sh gs://YOUR_BUCKET
# List recursively
gsutil ls -r gs://YOUR_BUCKET/** | head -n 50

Lifecycle policy example

Keep workshop artifacts tidy by auto‑deleting temporary outputs and capping old versions.

  1. Save as lifecycle.json:

JSON

{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 7, "matchesPrefix": ["tmp/"]}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"numNewerVersions": 3}
    }
  ]
}
  2. Apply to bucket:

BASH

gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET
gsutil lifecycle get gs://YOUR_BUCKET

Egress reminder

Downloading out of GCP (to local machines) incurs egress charges. Prefer in‑cloud training/evaluation and share results via GCS links.

Labels, budgets, and cost visibility


Standardize labels on all resources

Use the same labels everywhere (notebooks, jobs, buckets) so billing exports can attribute costs.

  • Examples: owner=yourname, team=ml-workshop, purpose=titanic-demo, env=dev

  • CLI examples:

    BASH

    # Add labels to a custom job on creation (Python SDK supports labels, too)
    # gcloud example when applicable:
    gcloud ai custom-jobs create --labels=owner=yourname,purpose=titanic-demo ...

Set budgets & alerts

  • In Billing → Budgets & alerts, create a budget for your project with thresholds (e.g., 50%, 80%, 100%).
  • Add forecast‑based alerts to catch trends early (e.g., projected to exceed budget).
  • Send email to multiple maintainers (not just you).

Enable billing export (optional but powerful)

  • Export billing to BigQuery to slice by service, label, or SKU.
  • Build a simple Data Studio/Looker Studio dashboard for workshop visibility.

Monitoring and alerts (catch leaks quickly)


  • Cloud Monitoring dashboards: Track notebook VM uptime, endpoint deployment counts, and job error rates.
  • Alerting policies: Trigger notifications when:
    • A Workbench runtime has been running > N hours outside workshop hours.
    • An endpoint node count > 0 for > 60 minutes after a workshop ends.
    • Spend forecast exceeds budget threshold.

Keep alerts few and actionable. Route to email or Slack (via webhook) where your team will see them.

Quotas and guardrails


  • Quotas (IAM & Admin → Quotas): cap GPU count, custom job limits, and endpoint nodes to protect budgets.
  • IAM: least privilege for service accounts used by notebooks and jobs; avoid wide Editor grants.
  • Org policies (if available): disallow costly regions/accelerators you don’t plan to use.

Automating the boring parts


Nightly auto‑stop for idle notebooks

Use Cloud Scheduler to run a daily command that stops notebooks after hours.

BASH

# Cloud Scheduler job (runs daily 22:00) to stop a specific managed runtime
gcloud scheduler jobs create http stop-runtime-job \
  --schedule="0 22 * * *" \
  --uri="https://notebooks.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/runtimes/RUNTIME_NAME:stop" \
  --http-method=POST \
  --oidc-service-account-email=SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com

Alternative: call gcloud notebooks runtimes list in a small Cloud Run job, filter by last_active_time, and stop any runtime idle > 2h.

Weekly endpoint sweep

  • List endpoints; undeploy any with zero recent traffic (check logs/metrics), then delete stale endpoints.
  • Scriptable with gcloud ai endpoints list/describe in Cloud Run or Cloud Functions on a schedule.

Common pitfalls and quick fixes


  • Forgotten endpoints → Undeploy models; delete endpoints you don’t need.
  • Notebook left running all weekend → Enable Idle shutdown; schedule nightly stop.
  • Duplicate datasets across buckets/regions → consolidate; set lifecycle to purge tmp/.
  • Too many parallel HPT trials → cap parallel_trial_count (2–4) and increase max_trial_count gradually.
  • Orphaned artifacts in Artifact Registry/GCS → prune old images/artifacts after promoting a single “golden” run.
Challenge

Challenge 1 — Find and stop idle notebooks

List your notebooks and identify any runtime/instance that has likely been idle for >2 hours. Stop it via CLI.

Hints: gcloud notebooks runtimes list, gcloud notebooks instances list, ... stop

Use gcloud notebooks runtimes list --location=REGION (Managed) or gcloud notebooks instances list --location=ZONE (User‑Managed) to find candidates, then stop them with the corresponding ... stop command.

Challenge

Challenge 2 — Write a lifecycle policy

Create and apply a lifecycle rule that (a) deletes objects under tmp/ after 7 days, and (b) retains only 3 versions of any object.

Hint: gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET

Use the JSON policy shown above, then run gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET and verify with gsutil lifecycle get ....

Challenge

Challenge 3 — Endpoint sweep

List deployed endpoints in your region, undeploy any model you don’t need, and delete the endpoint if it’s no longer required.

Hints: gcloud ai endpoints list, ... describe, ... undeploy-model, ... delete

gcloud ai endpoints list --region=REGION → pick ENDPOINT_ID → gcloud ai endpoints undeploy-model ENDPOINT_ID --deployed-model-id=DEPLOYED_MODEL_ID --region=REGION --quiet → if not needed, gcloud ai endpoints delete ENDPOINT_ID --region=REGION --quiet.

Key Points
  • Endpoints and running notebooks are the most common cost leaks; undeploy/stop first.
  • Prefer Managed Notebooks with Idle shutdown; schedule nightly auto‑stop.
  • Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
  • Standardize labels, set budgets, and enable billing export for visibility.
  • Use gcloud/gsutil to audit and clean quickly; automate with Scheduler + Cloud Run/Functions.