Content from Overview of Google Cloud for Machine Learning
Last updated on 2025-09-22
Overview
Questions
- What problem does GCP aim to solve for ML researchers?
- How does using a notebook as a controller help organize ML workflows in the cloud?
Objectives
- Understand the basic role of GCP in supporting ML research.
- Recognize how a notebook can serve as a controller for cloud resources.
Google Cloud Platform (GCP) provides the basic building blocks researchers need to run machine learning (ML) experiments at scale. Instead of working only on your laptop or a high-performance computing (HPC) cluster, you can spin up compute resources on demand, store datasets in the cloud, and run notebooks that act as a “controller” for larger training and tuning jobs.
This workshop focuses on using a simple notebook environment as the control center for your ML workflow. Rather than relying on Google’s fully managed, no-code offerings, we show how to use core GCP services (Workbench notebooks running on Compute Engine, storage buckets, and the Vertex AI SDK) so you can build and run experiments from scratch.
Why use GCP for machine learning?
GCP provides several advantages that make it a strong option for applied ML:
- Flexible compute: You can choose the hardware that fits your workload:
  - CPUs for lightweight models, preprocessing, or feature engineering.
  - GPUs (e.g., NVIDIA T4, V100, A100) for training deep learning models.
  - High-memory machines for workloads that need large datasets in memory.
- Data storage and access: Google Cloud Storage (GCS) buckets act like S3 on AWS — an easy way to store and share datasets between experiments and collaborators.
- From scratch workflows: Instead of depending on a fully managed ML service, you bring your own frameworks (PyTorch, TensorFlow, scikit-learn, etc.) and run your code the same way you would on your laptop or HPC cluster, but with scalable cloud resources.
- Cost visibility: Billing dashboards and project-level budgets make it easier to track costs and stay within research budgets.
In short, GCP provides infrastructure that you control from a notebook environment, allowing you to build and run ML workflows just as you would locally, but with access to scalable hardware and storage.
Comparing infrastructures
Think about your current research setup:
- Do you mostly use your laptop, HPC cluster, or cloud for ML
experiments?
- What benefits would running a cloud-based notebook controller give
you?
- If you could offload one infrastructure challenge (e.g., installing
GPU drivers, managing storage, or setting up environments), what would
it be and why?
Take 3–5 minutes to discuss with a partner or share in the workshop chat.
- GCP provides the core building blocks (compute, storage, networking)
for ML research.
- A notebook can act as a controller to organize cloud workflows and
keep experiments reproducible.
- Using raw infrastructure instead of a fully managed platform gives researchers flexibility while still benefiting from scalable cloud resources.
Content from Data Storage: Setting up GCS
Last updated on 2025-09-22
Overview
Questions
- How can I store and manage data effectively in GCP for Vertex AI
workflows?
- What are the advantages of Google Cloud Storage (GCS) compared to local or VM storage for machine learning projects?
Objectives
- Explain data storage options in GCP for machine learning
projects.
- Describe the advantages of GCS for large datasets and collaborative
workflows.
- Outline steps to set up a GCS bucket and manage data within Vertex AI.
Storing data on GCP
Machine learning and AI projects rely on data, making efficient storage and management essential. Google Cloud offers several storage options, but the most common for ML workflows are persistent disks (attached to Compute Engine VMs or Vertex AI Workbench) and Google Cloud Storage (GCS) buckets.
Consult your institution’s IT before handling sensitive data in GCP
As with AWS, do not upload restricted or sensitive data to GCP services unless explicitly approved by your institution’s IT or cloud security team. For regulated datasets (HIPAA, FERPA, proprietary), work with your institution to ensure encryption, restricted access, and compliance with policies.
Options for storage: VM Disks or GCS
What is a VM persistent disk?
A persistent disk is the storage volume attached to a Compute Engine VM or a Vertex AI Workbench notebook. It can store datasets and intermediate results, but it is tied to the lifecycle of the VM.
When to store data directly on a persistent disk
- Useful for small, temporary datasets processed interactively.
- Data persists if the VM is stopped, but storage costs continue as
long as the disk exists.
- Not ideal for collaboration, scaling, or long-term dataset storage.
Limitations of persistent disk storage
- Scalability: Limited by disk size quota.
- Sharing: Harder to share across projects or team members.
- Cost: More expensive per GB compared to GCS for long-term storage.
What is a GCS bucket?
For most ML workflows in Vertex AI, Google Cloud Storage (GCS) buckets are recommended. A GCS bucket is a container in Google’s object storage service where you can store an essentially unlimited number of files. Data in GCS can be accessed from Vertex AI training jobs, Workbench notebooks, and other GCP services using a GCS URI (e.g., gs://your-bucket-name/your-file.csv).
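As a quick illustration of how such a URI is used, here is a minimal sketch (the bucket and file names are placeholders; reading gs:// paths with pandas assumes the gcsfs package is installed):
PYTHON
import pandas as pd

# Read a CSV straight from its GCS URI (requires gcsfs for gs:// paths)
df = pd.read_csv("gs://your-bucket-name/your-file.csv")
print(df.head())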
Benefits of using GCS (recommended for ML workflows)
- Separation of storage and compute: Data remains available even if VMs or notebooks are deleted.
- Easy sharing: Buckets can be accessed by collaborators with the right IAM roles.
- Integration with Vertex AI and BigQuery: Read and write data directly in pipelines.
- Scalability: Handles datasets of any size without disk limits.
- Cost efficiency: Lower cost than persistent disks for long-term storage.
- Data persistence: Durable and highly available across regions.
Recommended approach: GCS buckets
To upload our Titanic dataset to a GCS bucket, we’ll follow these steps:
- Log in to the Google Cloud Console.
- Create a new bucket (or use an existing one).
- Upload your dataset files.
- Use the GCS URI to reference your data in Vertex AI workflows.
Detailed procedure
1. Sign in to Google Cloud Console
- Go to console.cloud.google.com and log in with your credentials.
2. Create a new bucket
- Click Create bucket.
- Provide a bucket name: Enter a globally unique name. For this workshop, we can use the following naming convention to easily locate our buckets: lastname_titanic
- Labels (tags): Add labels to track resource usage and billing. If you’re working in a shared account, this step is mandatory. If not, it’s still recommended to help you track your own costs!
  - purpose=workshop
  - data=titanic
  - owner=lastname_firstname
- Choose a location type: When creating a storage bucket in Google Cloud, the best practice for most machine learning workflows is to use a regional bucket in the same region as your compute resources (for example, us-central1). This setup provides the lowest latency and avoids network egress charges when training jobs read from storage, while also keeping costs predictable. A multi-region bucket, on the other hand, can make sense if your primary goal is broad availability or if collaborators in different regions need reliable access to the same data; the trade-off is higher cost and the possibility of extra egress charges when pulling data into a specific compute region. For most research projects, a regional bucket with the Standard storage class, uniform access control, and public access prevention enabled offers a good balance of performance, security, and affordability.
  - Region (cheapest, good default). For instance, us-central1 (Iowa) costs $0.020 per GB-month.
  - Multi-region (higher redundancy, more expensive).
- Choose storage class: When creating a bucket, you’ll be asked to choose a storage class, which determines how much you pay for storing data and how often you’re allowed to access it without extra fees.
- Standard – best for active ML workflows. Training data is read and written often, so this is the safest default.
- Nearline / Coldline / Archive – designed for backups or rarely accessed files. These cost less per GB to store, but you pay retrieval fees if you read them during training. Not recommended for most ML projects where data access is frequent.
- Autoclass – automatically moves objects between Standard and lower-cost classes based on activity. Useful if your usage is unpredictable, but can make cost tracking harder.
- Choose how to control access to objects: By default, you should prevent public access to buckets used for ML projects. This ensures that only people you explicitly grant permissions to can read or write objects, which is almost always the right choice for research, hackathons, or internal collaboration. Public buckets are mainly for hosting datasets or websites that are intentionally shared with the world.
3. Upload files to the bucket
- If you haven’t downloaded the Titanic CSV files yet, right-click the dataset links and save them as .csv.
- In the bucket dashboard, click Upload Files.
- Select your Titanic CSVs and upload.
Note the GCS URI for your data: After uploading, click on a file and find its gs:// URI (e.g., gs://yourname-titanic-gcs/titanic_train.csv). This URI will be used to access the data later.
GCS bucket costs
GCS costs are based on storage class, data transfer, and operations (requests).
Storage costs
- Standard storage (us-central1): ~$0.02 per GB per month.
- Other classes (Nearline, Coldline, Archive) are cheaper but with retrieval costs.
Data transfer costs explained
- Uploading data (ingress): Copying data into a GCS bucket from your laptop, campus HPC, or another provider is free.
- Accessing data in the same region: If your bucket and your compute resources (VMs, Vertex AI jobs) are in the same region, you can read and stream data with no transfer fees. You only pay the storage cost per GB-month.
- Cross-region access: If your bucket is in one region and your compute runs in another, you’ll pay an egress fee (about $0.01–0.02 per GB within North America, higher if crossing continents).
- Downloading data out of GCP (egress): This refers to data leaving Google’s network to the public internet, such as downloading files to your laptop. Typical cost is around $0.12 per GB to the U.S. and North America, more for other continents.
- Deleting data: Removing objects or buckets does not incur transfer costs. If you download data before deleting, you pay for the egress, but simply deleting in the console or CLI is free. For Nearline/Coldline/Archive storage classes, deleting before the minimum storage duration (30, 90, or 365 days) triggers an early deletion fee.
Request costs
- GET (read) requests: ~$0.004 per 10,000 requests.
- PUT (write) requests: ~$0.05 per 10,000 requests.
For detailed pricing, see GCS Pricing Information.
Challenge: Estimating Storage Costs
1. Estimate the total cost of storing 1 GB in GCS Standard
storage (us-central1) for one month assuming:
- Storage duration: 1 month
- Dataset retrieved 100 times for model training and tuning
- Data is downloaded once out of GCP at the end of the project
Hints
- Storage cost: $0.02 per GB per month
- Egress (download out of GCP): $0.12 per GB
- GET requests: $0.004 per 10,000 requests (100 requests ≈ free for our purposes)
2. Repeat the above calculation for datasets of 10 GB, 100 GB, and 1 TB (1024 GB).
- 1 GB:
  - Storage: 1 GB × $0.02 = $0.02
  - Egress: 1 GB × $0.12 = $0.12
  - Requests: ~0 (100 reads well below pricing tier)
  - Total: $0.14
- 10 GB:
  - Storage: 10 GB × $0.02 = $0.20
  - Egress: 10 GB × $0.12 = $1.20
  - Requests: ~0
  - Total: $1.40
- 100 GB:
  - Storage: 100 GB × $0.02 = $2.00
  - Egress: 100 GB × $0.12 = $12.00
  - Requests: ~0
  - Total: $14.00
- 1 TB (1024 GB):
  - Storage: 1024 GB × $0.02 = $20.48
  - Egress: 1024 GB × $0.12 = $122.88
  - Requests: ~0
  - Total: $143.36
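If you want to reproduce these estimates programmatically, here is a small sketch using the prices from the hints (request costs are ignored, since 100 reads are effectively free at this scale):
PYTHON
def estimate_gcs_cost(size_gb, storage_per_gb=0.02, egress_per_gb=0.12, downloads=1):
    """Rough monthly estimate: Standard storage plus one full download out of GCP."""
    return size_gb * storage_per_gb + size_gb * egress_per_gb * downloads

for size_gb in [1, 10, 100, 1024]:
    print(f"{size_gb} GB: ${estimate_gcs_cost(size_gb):.2f}")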
Removing unused data (complete after the workshop)
After you are done using your data, remove unused files/buckets to stop costs:
- Option 1: Delete files only – if you plan to reuse the bucket.
- Option 2: Delete the bucket entirely – if you no longer need it.
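If you prefer to do this from a notebook rather than the Console, here is a rough sketch with the Python client (the bucket and object names are placeholders):
PYTHON
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("yourname-titanic-gcs")  # replace with your bucket name

# Option 1: delete individual objects but keep the bucket
bucket.blob("titanic_train.csv").delete()

# Option 2: delete the bucket and any remaining objects
# bucket.delete(force=True)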
When does BigQuery come into play?
For many ML workflows, especially smaller projects or those centered on image, text, or modest tabular datasets, BigQuery is overkill. GCS buckets are usually enough to store and access your data for training jobs. That said, BigQuery can be valuable when you are working with large tabular datasets and need a shared environment for exploration or collaboration. Instead of every team member downloading the same CSVs, BigQuery lets everyone query the data in place with SQL, share results through saved queries or views, and control access at the dataset or table level with IAM. BigQuery also integrates with Vertex AI, so if your data is already structured and stored there, you can connect it directly to training pipelines. The trade-off is cost: you pay not only for storage but also for the amount of data scanned by queries. For many ML research projects this is unnecessary, but when teams need a centralized, queryable workspace for large tabular data, BigQuery can simplify collaboration.
- Use GCS for scalable, cost-effective, and persistent storage in
GCP.
- Persistent disks are suitable only for small, temporary
datasets.
- Track your storage, transfer, and request costs to manage
expenses.
- Regularly delete unused data or buckets to avoid ongoing costs.
Content from Notebooks as Controllers
Last updated on 2025-09-22
Overview
Questions
- How do you set up and use Vertex AI Workbench notebooks for machine
learning tasks?
- How can you manage compute resources efficiently using a “controller” notebook approach in GCP?
Objectives
- Describe how to use Vertex AI Workbench notebooks for ML
workflows.
- Set up a Jupyter-based Workbench instance as a controller to manage
compute tasks.
- Use the Vertex AI SDK to launch training and tuning jobs on scalable instances.
Setting up our notebook environment
Google Cloud Workbench provides JupyterLab-based environments that can be used to orchestrate machine learning workflows. In this workshop, we will use a Workbench Instance—the recommended option going forward, as other Workbench environments are being deprecated.
Workbench Instances come with JupyterLab 3 pre-installed and are configured with GPU-enabled ML frameworks (TensorFlow, PyTorch, etc.), making it easy to start experimenting without additional setup. Learn more in the Workbench Instances documentation.
Using the notebook as a controller
The notebook instance functions as a controller to manage more resource-intensive tasks. By selecting a modest machine type (e.g., n1-standard-4), you can perform lightweight operations locally in the notebook while using the Vertex AI Python SDK to launch compute-heavy jobs on larger machines (e.g., GPU-accelerated) when needed.
This approach minimizes costs while giving you access to scalable infrastructure for demanding tasks like model training, batch prediction, and hyperparameter tuning.
We will follow these steps to create our first Workbench Instance:
1. Navigate to Workbench
- In the Google Cloud Console, search for “Workbench.”
- Click the “Instances” tab (this is the supported path going
forward).
- Pin Workbench to your navigation bar for quick access.
2. Create a new Workbench Instance
- Click “Create New” under Instances.
- Notebook name: For this workshop, we can use the following naming convention to easily locate our notebooks: lastname-titanic
- Region: Choose the same region as your storage bucket (e.g., us-central1). This avoids cross-region transfer charges and keeps data access latency low.
- GPUs: Leave disabled for now (training jobs will request them separately).
- Labels: Add labels for cost tracking:
  - purpose=workshop
  - owner=lastname_firstname
- Machine type: Select a small machine (e.g., e2-standard-2) to act as the controller. This keeps costs low while you delegate heavy lifting to training jobs. For guidance on common machine types for ML, refer to Instances for ML on GCP.
- Click Create to create the instance. Your notebook instance will start in a few minutes. When its status is “Running,” you can open JupyterLab and begin working.
Managing training and tuning with the controller notebook
In the following episodes, we will use the Vertex AI Python SDK (google-cloud-aiplatform) from this notebook to submit compute-heavy tasks on more powerful machines. Examples include:
- Training a model on a GPU-backed instance.
- Running hyperparameter tuning jobs managed by Vertex AI.
This pattern keeps costs low by running your notebook on a modest VM while only incurring charges for larger resources when they are actively in use.
Challenge: Notebook Roles
Your university provides different compute options: laptops, on-prem HPC, and GCP.
- What role does a Workbench Instance notebook play
compared to an HPC login node or a laptop-based JupyterLab?
- Which tasks should stay in the notebook (lightweight control, visualization) versus being launched to larger cloud resources?
The notebook serves as a lightweight control plane.
- Like an HPC login node, it is not meant for heavy computation.
- Suitable for small preprocessing, visualization, and orchestrating
jobs.
- Resource-intensive tasks (training, tuning, batch jobs) should be
submitted to scalable cloud resources (GPU/large VM instances) via the
Vertex AI SDK.
- Use a small Workbench Instance notebook as a controller to manage
larger, resource-intensive tasks.
- Always navigate to the “Instances” tab in Workbench, since older
notebook types are deprecated.
- Choose the same region for your Workbench Instance and storage
bucket to avoid extra transfer costs.
- Submit training and tuning jobs to scalable instances using the
Vertex AI SDK.
- Labels help track costs effectively, especially in shared or
multi-project environments.
- Workbench Instances come with JupyterLab 3 and GPU frameworks
preinstalled, making them an easy entry point for ML workflows.
- Enable idle auto-stop to avoid unexpected charges when notebooks are left running.
Content from Accessing and Managing Data in GCS with Vertex AI Notebooks
Last updated on 2025-09-22
Overview
Questions
- How can I load data from GCS into a Vertex AI Workbench
notebook?
- How do I monitor storage usage and costs for my GCS bucket?
- What steps are involved in pushing new data back to GCS from a notebook?
Objectives
- Read data directly from a GCS bucket into memory in a Vertex AI
notebook.
- Check storage usage and estimate costs for data in a GCS
bucket.
- Upload new files from the Vertex AI environment back to the GCS bucket.
Initial setup
Open JupyterLab notebook
Once your Vertex AI Workbench notebook instance shows as Running, open it in JupyterLab. Create a new Python 3 notebook and rename it to: Interacting-with-GCS.ipynb.
Reading data from GCS
As with S3, you can either (A) read data directly from GCS into memory, or (B) download a copy into your notebook VM. Since we’re using notebooks as controllers rather than training environments, the recommended approach is reading directly from GCS.
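A minimal sketch of option (A), assuming the Titanic CSV sits at the top level of your bucket (replace the bucket name with your own). pandas reads gs:// paths through the gcsfs package (install it with pip if it is missing); the client and bucket_name defined here are reused in the later snippets:
PYTHON
import pandas as pd
from google.cloud import storage

bucket_name = "yourname-titanic-gcs"  # replace with your bucket name
client = storage.Client()             # uses the notebook's default credentials

# (A) Read directly from GCS into memory (requires gcsfs for gs:// paths)
train_df = pd.read_csv(f"gs://{bucket_name}/titanic_train.csv")
print(train_df.shape)

# (B) Alternatively, download a copy to the notebook VM first
# client.bucket(bucket_name).blob("titanic_train.csv").download_to_filename("titanic_train.csv")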
Checking storage usage of a bucket
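A short sketch that totals object sizes with the client created above; the result (total_size_bytes) feeds the cost estimate below:
PYTHON
# Sum the sizes of all objects in the bucket
total_size_bytes = sum(blob.size for blob in client.list_blobs(bucket_name))
print(f"Total size: {total_size_bytes / (1024**2):.2f} MB")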
Estimating storage costs
PYTHON
storage_price_per_gb = 0.02 # $/GB/month for Standard storage
total_size_gb = total_size_bytes / (1024**3)
monthly_cost = total_size_gb * storage_price_per_gb
print(f"Estimated monthly cost: ${monthly_cost:.4f}")
print(f"Estimated annual cost: ${monthly_cost*12:.4f}")
For updated prices, see GCS Pricing.
Writing output files to GCS
PYTHON
# Create a sample file locally on the notebook VM
with open("Notes.txt", "w") as f:
f.write("This is a test note for GCS.")
# Point to the right bucket
bucket = client.bucket(bucket_name)
# Create a *Blob* object, which represents a path inside the bucket
# (here it will end up as gs://<bucket_name>/docs/Notes.txt)
blob = bucket.blob("docs/Notes.txt")
# Upload the local file into that blob (object) in GCS
blob.upload_from_filename("Notes.txt")
print("File uploaded successfully.")
List bucket contents:
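For example, reusing the client and bucket name defined earlier:
PYTHON
# List every object in the bucket with its size
for blob in client.list_blobs(bucket_name):
    print(blob.name, f"({blob.size} bytes)")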
Challenge: Estimating GCS Costs
Suppose you store 50 GB of data in Standard storage
(us-central1) for one month.
- Estimate the monthly storage cost.
- Then estimate the cost if you download (egress) the entire dataset
once at the end of the month.
Hints
- Storage: $0.02 per GB-month
- Egress: $0.12 per GB
- Storage cost: 50 GB × $0.02 = $1.00
- Egress cost: 50 GB × $0.12 = $6.00
- Total cost: $7.00 for one month including one full download
- Load data from GCS into memory to avoid managing local copies when
possible.
- Periodically check storage usage and costs to manage your GCS
budget.
- Use Vertex AI Workbench notebooks to upload analysis results back to GCS, keeping workflows organized and reproducible.
Content from Using a GitHub Personal Access Token (PAT) to Push/Pull from a Vertex AI Notebook
Last updated on 2025-08-27
Overview
Questions
- How can I securely push/pull code to and from GitHub within a Vertex
AI Workbench notebook?
- What steps are necessary to set up a GitHub PAT for authentication
in GCP?
- How can I convert notebooks to .py files and ignore .ipynb files in version control?
Objectives
- Configure Git in a Vertex AI Workbench notebook to use a GitHub
Personal Access Token (PAT) for HTTPS-based authentication.
- Securely handle credentials in a notebook environment using getpass.
- Convert .ipynb files to .py files for better version control practices in collaborative projects.
Step 0: Initial setup
In the previous episode, we cloned our forked repository as part of the workshop setup. In this episode, we’ll see how to push our code to this fork. Complete these three setup steps before moving forward.
1. Clone the fork if you haven’t already. See previous episode.
2. Start a new Jupyter notebook, and name it something like Interacting-with-git.ipynb. We can use the default Python 3 kernel in Vertex AI Workbench.
3. Change directory to the workspace where your repository is located. In Vertex AI Workbench, notebooks usually live under /home/jupyter/.
Step 1: Using a GitHub personal access token (PAT) to push/pull from a Vertex AI notebook
When working in Vertex AI Workbench notebooks, you may often need to push code updates to GitHub repositories. Since Workbench VMs may be stopped and restarted, configurations like SSH keys may not persist. HTTPS-based authentication with a GitHub Personal Access Token (PAT) is a practical solution. PATs provide flexibility for authentication and enable seamless interaction with both public and private repositories directly from your notebook.
Important Note: Personal access tokens are powerful credentials. Select the minimum necessary permissions and handle the token carefully.
Generate a personal access token (PAT) on GitHub
- Go to Settings in GitHub.
- Click Developer settings at the bottom of the left
sidebar.
- Select Personal access tokens, then click
Tokens (classic).
- Click Generate new token (classic).
- Give your token a descriptive name and set an expiration date if
desired.
- Select minimum permissions:
  - Public repos: public_repo
  - Private repos: repo
- Click Generate token and copy it immediately—you won’t be able to see it again.
Caution: Treat your PAT like a password. Don’t share it or expose it in your code. Use a password manager to store it.
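A minimal sketch for capturing credentials at runtime with getpass, so the token never appears in the notebook file (the variable names are placeholders and are reused in the push step later in this episode):
PYTHON
import getpass

github_username = input("GitHub username: ")
github_token = getpass.getpass("GitHub PAT: ")  # typed input is hidden and not stored in the notebook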
Step 2: Configure Git settings
PYTHON
!git config --global user.name "Your Name"
!git config --global user.email your_email@wisc.edu
- user.name: Will appear in the commit history.
- user.email: Must match your GitHub account so commits are linked to your profile.
Step 3: Convert .ipynb notebooks to .py
Tracking .py files instead of .ipynb helps with cleaner version control. Notebooks store outputs and metadata, which makes diffs noisy. .py files are lighter and easier to review.
- Install Jupytext.
- Convert a notebook to .py.
- Convert all notebooks in the current directory (see the sketch below).
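A sketch of these three steps as notebook cells (using the notebook created earlier in this episode as the example):
PYTHON
# Install Jupytext
!pip install jupytext

# Convert a single notebook to a .py script
!jupytext --to py Interacting-with-git.ipynb

# Convert all notebooks in the current directory
!jupytext --to py *.ipynb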
Step 4: Add and commit .py files
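For example (the commit message is just a placeholder):
PYTHON
!git add *.py
!git commit -m "Add .py versions of notebooks"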
Step 5: Add .ipynb to .gitignore
PYTHON
!touch .gitignore
with open(".gitignore", "a") as gitignore:
gitignore.write("\n# Ignore Jupyter notebooks\n*.ipynb\n")
!cat .gitignore
Add other temporary files too:
PYTHON
with open(".gitignore", "a") as gitignore:
gitignore.write("\n# Ignore cache and temp files\n__pycache__/\n*.tmp\n*.log\n")
Commit the .gitignore:
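PYTHON
!git add .gitignore
!git commit -m "Add .gitignore"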
Step 6: Syncing with GitHub
First, pull the latest changes:
If conflicts occur, resolve manually before committing.
Then push with your PAT credentials:
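A sketch using the username and token captured earlier with getpass; the branch and fork names are placeholders, and IPython expands {variables} inside shell commands:
PYTHON
# Pull the latest changes from your fork; resolve any conflicts before committing
!git pull origin main

# Push over HTTPS with your PAT (replace your-fork-name with your repository name)
!git push https://{github_username}:{github_token}@github.com/{github_username}/your-fork-name.git main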
Step 7: Convert .py back to notebooks (optional)
To convert .py files back to .ipynb after pulling updates:
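For example, to rebuild the notebook converted earlier:
PYTHON
!jupytext --to notebook Interacting-with-git.py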
Challenge: GitHub PAT Workflow
- Why might you prefer using a PAT with HTTPS instead of SSH keys in
Vertex AI Workbench?
- What are the benefits of converting .ipynb files to .py before committing to a shared repo?
- PATs with HTTPS are easier to set up in temporary environments where
SSH configs don’t persist.
- Converting notebooks to .py results in cleaner diffs, easier code review, and smaller repos without stored outputs/metadata.
- Use a GitHub PAT for HTTPS-based authentication in Vertex AI
Workbench notebooks.
- Securely enter sensitive information in notebooks using getpass.
- Converting .ipynb files to .py files helps with cleaner version control.
- Adding .ipynb files to .gitignore keeps your repository organized.
Content from Training Models in Vertex AI: Intro
Last updated on 2025-09-24
Overview
Questions
- What are the differences between training locally in a Vertex AI
notebook and using Vertex AI-managed training jobs?
- How do custom training jobs in Vertex AI streamline the training
process for various frameworks?
- How does Vertex AI handle scaling across CPUs, GPUs, and TPUs?
Objectives
- Understand the difference between local training in a Vertex AI
Workbench notebook and submitting managed training jobs.
- Learn to configure and use Vertex AI custom training jobs for
different frameworks (e.g., XGBoost, PyTorch, SKLearn).
- Understand scaling options in Vertex AI, including when to use CPUs,
GPUs, or TPUs.
- Compare performance, cost, and setup between custom scripts and
pre-built containers in Vertex AI.
- Conduct training with data stored in GCS and monitor training job status using the Google Cloud Console.
Initial setup
1. Open a new .ipynb notebook
Open a fresh Jupyter notebook inside your Vertex AI Workbench instance. You can name it something along the lines of Training-models.ipynb.
2. CD to instance home directory
So we all can reference helper functions consistently, change directory to your Jupyter home directory.
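For example, from a notebook cell (Workbench notebooks typically live under /home/jupyter/, as noted in the earlier Git episode):
PYTHON
%cd /home/jupyter/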
3. Initialize Vertex AI environment
This code initializes the Vertex AI environment by importing the Python SDK, setting the project, region, and defining a GCS bucket for input/output data.
PYTHON
from google.cloud import aiplatform
import pandas as pd
# Set your project and region (replace with your values)
PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"
BUCKET_NAME = "your-gcs-bucket"
# Initialize Vertex AI client
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")
- aiplatform.init(): Sets defaults for project, region, and staging bucket.
- PROJECT_ID: Identifies your GCP project.
- REGION: Determines where training jobs run (choose a region close to your data).
- staging_bucket: A GCS bucket for storing datasets, model artifacts, and job outputs.
Testing train.py locally in the notebook
Before scaling training jobs onto managed resources, it’s essential to test your training script locally. This prevents wasting GPU/TPU time on bugs or misconfigured code.
Guidelines for testing ML pipelines before scaling
- Run tests locally first with small datasets.
- Use a subset of your dataset (1–5%) for fast checks.
- Start with minimal compute before moving to larger accelerators.
- Log key metrics such as loss curves and runtimes.
- Verify correctness first before scaling up.
What tests should we do before scaling?
Before scaling to multiple or more powerful instances (e.g., GPUs or TPUs), it’s important to run a few sanity checks. In your group, discuss:
- Which checks do you think are most critical before scaling up?
- What potential issues might we miss if we skip this step?
- Data loads correctly – dataset loads without errors, expected columns exist, missing values handled.
- Overfitting check – train on a tiny dataset (e.g., 100 rows). If it doesn’t overfit, something is off.
- Loss behavior – verify training loss decreases and doesn’t diverge.
- Runtime estimate – get a rough sense of training time on small data.
- Memory estimate – check approximate memory use.
- Save & reload – ensure model saves, reloads, and infers without errors.
Skipping these can lead to: silent data bugs, runtime blowups at scale, inefficient experiments, or broken model artifacts.
Download data into notebook environment
Sometimes it’s helpful to keep a copy of data in your notebook VM for quick iteration, even though GCS is the preferred storage location.
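A sketch using gsutil from a notebook cell, assuming the Titanic CSV was uploaded to the top level of your bucket in the storage episode (IPython expands {BUCKET_NAME}):
PYTHON
# Copy the training CSV from GCS to the notebook VM
!gsutil cp gs://{BUCKET_NAME}/titanic_train.csv .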
Local test run of train.py
PYTHON
import time as t
start = t.time()
# Example: run your custom training script with args
!python GCP_helpers/train_xgboost.py --max_depth 3 --eta 0.1 --subsample 0.8 --colsample_bytree 0.8 --num_round 100 --train titanic_train.csv
print(f"Total local runtime: {t.time() - start:.2f} seconds")
Training on this small dataset should take <1 minute. Log runtime as a baseline. You should see the following output files:
- xgboost-model.joblib # Python-serialized XGBoost model (Booster) via joblib; load with joblib.load for reuse.
- eval_history.csv # Per-iteration validation metrics; columns: iter,val_logloss (good for plotting learning curves).
- training.log # Full stdout/stderr from the run (params, dataset sizes, timings, warnings/errors) for audit/debug.
- metrics.json # Structured summary: final_val_logloss, num_boost_round, params, train_rows/val_rows, features[], model_uri.
Training via Vertex AI custom training job
Unlike “local” training, this launches a managed training job that runs on scalable compute. Vertex AI handles provisioning, scaling, logging, and saving outputs to GCS.
Which machine type to start with?
Start with a small CPU machine like n1-standard-4. Only scale up to GPUs/TPUs once you’ve verified your script. See Instances for ML on GCP for guidance.
Creating a custom training job with the SDK
PYTHON
from google.cloud import aiplatform
import datetime as dt
PROJECT = "doit-rci-mlm25-4626"
REGION = "us-central1"
BUCKET = BUCKET_NAME  # e.g., "endemann_titanic" (same region as REGION)
RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
MODEL_URI = f"gs://{BUCKET}/artifacts/xgb/{RUN_ID}/model.joblib" # everything will live beside this
# Staging bucket is only for the SDK's temp code tarball (aiplatform-*.tar.gz)
aiplatform.init(project=PROJECT, location=REGION, staging_bucket=f"gs://{BUCKET}")
job = aiplatform.CustomTrainingJob(
display_name=f"endemann_xgb_{RUN_ID}",
script_path="Intro_GCP_VertexAI/code/train_xgboost.py",
container_uri="us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.2-1:latest",
requirements=["gcsfs"], # script writes gs://MODEL_URI and sidecar files
)
job.run(
args=[
f"--train=gs://{BUCKET}/titanic_train.csv",
f"--model_out={MODEL_URI}", # model, metrics.json, eval_history.csv, training.log all go here
"--max_depth=3",
"--eta=0.1",
"--subsample=0.8",
"--colsample_bytree=0.8",
"--num_round=100",
],
replica_count=1,
machine_type="n1-standard-4",
sync=True,
)
print("Model + logs folder:", MODEL_URI.rsplit("/", 1)[0])
This launches a managed training job with Vertex AI.
Monitoring training jobs in the Console
- Go to the Google Cloud Console.
- Navigate to Vertex AI > Training > Custom
Jobs.
- Click on your job name to see status, logs, and output model
artifacts.
- Cancel jobs from the console if needed (be careful not to stop jobs you don’t own in shared projects).
Visit “training pipelines” to verify it’s running. It may take around 5 minutes to finish.
Should output the following files:
- endemann_titanic/artifacts/xgb/20250924-154740/xgboost-model.joblib # Python-serialized XGBoost model (Booster) via joblib; load with joblib.load for reuse.
- endemann_titanic/artifacts/xgb/20250924-154740/eval_history.csv # Per-iteration validation metrics; columns: iter,val_logloss (good for plotting learning curves).
- endemann_titanic/artifacts/xgb/20250924-154740/training.log # Full stdout/stderr from the run (params, dataset sizes, timings, warnings/errors) for audit/debug.
- endemann_titanic/artifacts/xgb/20250924-154740/metrics.json # Structured summary: final_val_logloss, num_boost_round, params, train_rows/val_rows, features[], model_uri.
When training takes too long
Two main options in Vertex AI:
- Option 1: Upgrade to more powerful machine types (e.g., add GPUs like T4, V100, A100).
- Option 2: Use distributed training with multiple replicas.
Option 1: Upgrade machine type (preferred first step)
- Works best for small/medium datasets (<10 GB).
- Avoids the coordination overhead of distributed training.
- GPUs/TPUs accelerate deep learning tasks significantly.
Option 2: Distributed training with multiple replicas
- Supported in Vertex AI for many frameworks.
- Split data across replicas, each trains a portion, gradients
synchronized.
- More beneficial for very large datasets and long-running jobs.
When distributed training makes sense
- Dataset >10–50 GB.
- Training time >10 hours on single machine.
- Deep learning workloads that naturally parallelize across GPUs/TPUs.
- Environment initialization: Use aiplatform.init() to set defaults for project, region, and bucket.
- Local vs managed training: Test locally before scaling into managed jobs.
- Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
- Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
- Monitoring: Track job logs and artifacts in the Vertex AI Console.
Content from Training Models in Vertex AI: PyTorch Example
Last updated on 2025-09-24
Overview
Questions
- When should you consider a GPU (or TPU) instance for PyTorch training in Vertex AI, and what are the trade‑offs for small vs. large workloads?
- How do you launch a script‑based training job and write all artifacts (model, metrics, logs) next to each other in GCS without deploying a managed model?
Objectives
- Prepare the Titanic dataset and save train/val arrays to compressed .npz files in GCS.
- Submit a CustomTrainingJob that runs a PyTorch script and explicitly writes outputs to a chosen gs://…/artifacts/.../ folder.
- Co‑locate artifacts: model.pt (or .joblib), metrics.json, eval_history.csv, and training.log for reproducibility.
- Choose CPU vs. GPU instances sensibly; understand when distributed training is (not) worth it.
Initial setup (controller notebook)
Open a fresh Jupyter notebook in Vertex AI Workbench (Instances tab) and initialize:
PYTHON
from google.cloud import aiplatform, storage
import datetime as dt
PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"
BUCKET_NAME = "your-bucket" # same region as REGION
# Only used for the SDK's small packaging tarball.
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")
Select the PyTorch environment (kernel)
- In JupyterLab, click the kernel name (top‑right) and switch to a PyTorch‑ready kernel. On Workbench Instances this is usually available out‑of‑the‑box; if import torch fails, install it locally.
- Quick check that your kernel can see PyTorch (and optionally CUDA if your VM has a GPU) — see the sketch below.
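A quick check to run in a notebook cell (uncomment the install line only if the import fails):
PYTHON
# !pip install torch   # only if the import below fails

import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())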
Note: local PyTorch is only needed for local tests. Your Vertex AI job uses the container specified by container_uri (e.g., pytorch-cpu.2-1 or pytorch-gpu.2-1), so it brings its own framework at run time.
Notes:
- The staging bucket only stores the SDK’s temporary tar.gz of your training code.
- We will not use base_output_dir; your script will write everything under a single gs://…/artifacts/.../ path.
Prepare data as .npz
PYTHON
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Load Titanic CSV (from local or GCS you've already downloaded to the notebook)
df = pd.read_csv("titanic_train.csv")
# Minimal preprocessing to numeric arrays
sex_enc = LabelEncoder().fit(df["Sex"])
df["Sex"] = sex_enc.transform(df["Sex"])
df["Embarked"] = df["Embarked"].fillna("S")
emb_enc = LabelEncoder().fit(df["Embarked"])
df["Embarked"] = emb_enc.transform(df["Embarked"])
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
X = df[["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]].values
y = df["Survived"].values
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
np.savez("train_data.npz", X_train=X_train, y_train=y_train)
np.savez("val_data.npz", X_val=X_val, y_val=y_val)
# Upload to GCS
client = storage.Client()
bucket = client.bucket(BUCKET_NAME)
bucket.blob("data/train_data.npz").upload_from_filename("train_data.npz")
bucket.blob("data/val_data.npz").upload_from_filename("val_data.npz")
print("Uploaded: gs://%s/data/train_data.npz and val_data.npz" % BUCKET_NAME)
Why .npz?
- Smaller, faster I/O than CSV for arrays.
- Natural fit for torch.utils.data.Dataset / DataLoader.
- One file can hold multiple arrays (X_train, y_train).
Minimal PyTorch training script (train_nn.py)
Place this file in your repo (e.g., GCP_helpers/train_nn.py). It does three things: 1) loads .npz from local or GCS, 2) trains a tiny MLP, 3) writes all outputs side‑by‑side (model + metrics + eval history + training.log) to the same --model_out folder.
PYTHON
# GCP_helpers/train_nn.py
import argparse, io, json, os, sys
import numpy as np
import torch, torch.nn as nn
from time import time
# --- small helpers for GCS/local I/O ---
def _parent_dir(p):
return p.rsplit("/", 1)[0] if p.startswith("gs://") else (os.path.dirname(p) or ".")
def _write_bytes(path: str, data: bytes):
if path.startswith("gs://"):
try:
import fsspec
with fsspec.open(path, "wb") as f:
f.write(data)
except Exception:
from google.cloud import storage
b, k = path[5:].split("/", 1)
storage.Client().bucket(b).blob(k).upload_from_string(data)
else:
os.makedirs(_parent_dir(path), exist_ok=True)
with open(path, "wb") as f:
f.write(data)
def _write_text(path: str, text: str):
_write_bytes(path, text.encode("utf-8"))
# --- tiny MLP ---
class MLP(nn.Module):
def __init__(self, d_in):
super().__init__()
self.net = nn.Sequential(
nn.Linear(d_in, 32), nn.ReLU(),
nn.Linear(32, 16), nn.ReLU(),
nn.Linear(16, 1), nn.Sigmoid(),
)
def forward(self, x):
return self.net(x)
class _Tee:
def __init__(self, *s): self.s = s
def write(self, d):
for x in self.s: x.write(d); x.flush()
def flush(self):
for x in self.s: x.flush()
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--train", required=True)
ap.add_argument("--val", required=True)
ap.add_argument("--epochs", type=int, default=100)
ap.add_argument("--learning_rate", type=float, default=1e-3)
ap.add_argument("--model_out", required=True, help="gs://…/artifacts/.../model.pt")
args = ap.parse_args()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# All artifacts will sit next to model_out
model_path = args.model_out
art_dir = _parent_dir(model_path)
# capture stdout/stderr
buf = io.StringIO()
orig_out, orig_err = sys.stdout, sys.stderr
sys.stdout = _Tee(sys.stdout, buf)
sys.stderr = _Tee(sys.stderr, buf)
log_path = f"{art_dir}/training.log"
try:
# Load npz (supports gs:// via fsspec)
def _npz_load(p):
if p.startswith("gs://"):
import fsspec
with fsspec.open(p, "rb") as f:
by = f.read()
return np.load(io.BytesIO(by))
else:
return np.load(p)
train = _npz_load(args.train)
val = _npz_load(args.val)
Xtr, ytr = train["X_train"].astype("float32"), train["y_train"].astype("float32")
Xva, yva = val["X_val"].astype("float32"), val["y_val"].astype("float32")
Xtr_t = torch.from_numpy(Xtr).to(device)
ytr_t = torch.from_numpy(ytr).view(-1,1).to(device)
Xva_t = torch.from_numpy(Xva).to(device)
yva_t = torch.from_numpy(yva).view(-1,1).to(device)
model = MLP(Xtr.shape[1]).to(device)
opt = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
loss_fn = nn.BCELoss()
hist = []
t0 = time()
for ep in range(1, args.epochs+1):
model.train()
opt.zero_grad()
pred = model(Xtr_t)
loss = loss_fn(pred, ytr_t)
loss.backward(); opt.step()
model.eval()
with torch.no_grad():
val_loss = loss_fn(model(Xva_t), yva_t).item()
hist.append(val_loss)
if ep % 10 == 0 or ep == 1:
print(f"epoch={ep} val_loss={val_loss:.4f}")
print(f"Training time: {time()-t0:.2f}s on {device}")
# save model
torch.save(model.state_dict(), model_path)
print(f"[INFO] Saved model: {model_path}")
# metrics.json and eval_history.csv
metrics = {
"final_val_loss": float(hist[-1]) if hist else None,
"epochs": int(args.epochs),
"learning_rate": float(args.learning_rate),
"train_rows": int(Xtr.shape[0]),
"val_rows": int(Xva.shape[0]),
"features": list(range(Xtr.shape[1])),
"model_uri": model_path,
"device": str(device),
}
_write_text(f"{art_dir}/metrics.json", json.dumps(metrics, indent=2))
csv = "iter,val_loss\n" + "\n".join(f"{i+1},{v}" for i, v in enumerate(hist))
_write_text(f"{art_dir}/eval_history.csv", csv)
finally:
# persist log and restore streams
try:
_write_text(log_path, buf.getvalue())
except Exception as e:
print(f"[WARN] could not write log: {e}")
sys.stdout, sys.stderr = orig_out, orig_err
if __name__ == "__main__":
main()
Launch the training job (no base_output_dir)
PYTHON
RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/pytorch/{RUN_ID}"
MODEL_URI = f"{ARTIFACT_DIR}/model.pt" # model + metrics + logs will live here together
job = aiplatform.CustomTrainingJob(
display_name=f"pytorch_nn_{RUN_ID}",
script_path="GCP_helpers/train_nn.py",
container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-cpu.2-1:latest", # or pytorch-gpu.2-1
requirements=["torch", "numpy", "fsspec", "gcsfs"],
)
job.run(
args=[
f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
f"--epochs=200",
f"--learning_rate=0.001",
f"--model_out={MODEL_URI}", # drives where *all* artifacts go
],
replica_count=1,
machine_type="n1-standard-4", # CPU fine for small datasets
sync=True,
)
print("Artifacts folder:", ARTIFACT_DIR)
What you’ll see in gs://…/artifacts/pytorch/<RUN_ID>/:
- model.pt — PyTorch weights (state_dict).
- metrics.json — final val loss, hyperparameters, dataset sizes, device, model URI.
- eval_history.csv — per‑epoch validation loss (for plots/regression checks).
- training.log — complete stdout/stderr for reproducibility and debugging.
Optional: GPU training
For larger models or heavier data:
PYTHON
job = aiplatform.CustomTrainingJob(
display_name=f"pytorch_nn_gpu_{RUN_ID}",
script_path="GCP_helpers/train_nn.py",
container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-1:latest",
requirements=["torch", "numpy", "fsspec", "gcsfs"],
)
job.run(
args=[
f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
f"--epochs=200",
f"--learning_rate=0.001",
f"--model_out={MODEL_URI}",
],
replica_count=1,
machine_type="n1-standard-8",
accelerator_type="NVIDIA_TESLA_T4",
accelerator_count=1,
sync=True,
)
GPU tips:
- On small problems, GPU startup/transfer overhead can erase speedups—benchmark before you scale.
- Stick to a single replica unless your batch sizes and dataset really warrant data parallelism.
Distributed training (when to consider)
- Data parallelism (DDP) helps when a single GPU is saturated by batch size/throughput. For most workshop‑scale models, a single machine/GPU is simpler and cheaper.
- Model parallelism is for very large networks that don’t fit on one device—overkill for this lesson.
Monitoring jobs & finding outputs
- Console → Vertex AI → Training → Custom Jobs → your run → “Output directory” shows the container logs and the environment’s AIP_MODEL_DIR.
- Your script writes model + metrics + eval history + training.log next to --model_out, e.g., gs://<bucket>/artifacts/pytorch/<RUN_ID>/.
- Use CustomTrainingJob with a prebuilt PyTorch container; let your script control outputs via --model_out.
- Keep artifacts together (model, metrics, history, log) in one folder for reproducibility.
- .npz speeds up loading and plays nicely with PyTorch.
- Start on CPU for small datasets; use GPU only when profiling shows a clear win.
- Skip base_output_dir unless you specifically want Vertex’s default run directory; the staging bucket is just for the SDK packaging tarball.
Content from Hyperparameter Tuning in Vertex AI: Neural Network Example
Last updated on 2025-08-27
Overview
Questions
- How can we efficiently manage hyperparameter tuning in Vertex
AI?
- How can we parallelize tuning jobs to optimize time without increasing costs?
Objectives
- Set up and run a hyperparameter tuning job in Vertex AI.
- Define search spaces for continuous and categorical hyperparameters (e.g., DoubleParameterSpec, CategoricalParameterSpec).
- Log and capture objective metrics for evaluating tuning
success.
- Optimize tuning setup to balance cost and efficiency, including parallelization.
To conduct efficient hyperparameter tuning with neural networks (or any model) in Vertex AI, we’ll use Vertex AI’s Hyperparameter Tuning Jobs. The key is defining a clear search space, ensuring metrics are properly logged, and keeping costs manageable by controlling the number of trials and level of parallelization.
Key steps for hyperparameter tuning
The overall process involves these steps:
- Prepare training script and ensure metrics are logged.
- Define hyperparameter search space.
- Configure a hyperparameter tuning job in Vertex AI.
- Set data paths and launch the tuning job.
- Monitor progress in the Vertex AI Console.
- Extract best model and evaluate.
1. Prepare training script with metric logging
Your training script (train_nn.py) should periodically report validation accuracy so that Vertex AI can compare trials.
PYTHON
if (epoch + 1) % 100 == 0 or epoch == epochs - 1:
print(f"validation_accuracy: {val_accuracy:.4f}", flush=True)
Printing the metric is useful for logs, but Vertex AI hyperparameter tuning compares trials using values reported through the cloudml-hypertune package; the metric name must match the key you use in metric_spec below.
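A sketch of the reporting call, assuming the cloudml-hypertune package is installed in the training container (the epoch and val_accuracy names come from the snippet above):
PYTHON
import hypertune

hpt_metric = hypertune.HyperTune()

# Inside the training loop, alongside the print above
hpt_metric.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag="validation_accuracy",
    metric_value=val_accuracy,
    global_step=epoch + 1,
)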
2. Define hyperparameter search space
In Vertex AI, you specify hyperparameter ranges when configuring the tuning job. You can define both discrete and continuous ranges.
PYTHON
from google.cloud.aiplatform import hyperparameter_tuning as hpt

parameter_spec = {
    "epochs": hpt.IntegerParameterSpec(min=100, max=1000, scale="linear"),
    "learning_rate": hpt.DoubleParameterSpec(min=0.001, max=0.1, scale="log"),
}
- IntegerParameterSpec: Defines integer ranges.
- DoubleParameterSpec: Defines continuous ranges, with optional scaling.
3. Configure hyperparameter tuning job
PYTHON
from google.cloud import aiplatform

# Define the per-trial workload (script, container, machine, accelerators, fixed args).
# Vertex AI appends the sampled hyperparameters (e.g., --epochs, --learning_rate)
# to these args for every trial.
custom_job = aiplatform.CustomJob.from_local_script(
    display_name="pytorch-train-hpt",
    script_path="GCP_helpers/train_nn.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",
    requirements=["torch", "pandas", "numpy", "scikit-learn", "cloudml-hypertune"],
    args=[
        f"--train=gs://{BUCKET_NAME}/train_data.npz",
        f"--val=gs://{BUCKET_NAME}/val_data.npz",
    ],
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)

hpt_job = aiplatform.HyperparameterTuningJob(
    display_name="pytorch-hpt-job",
    custom_job=custom_job,
    metric_spec={"validation_accuracy": "maximize"},
    parameter_spec=parameter_spec,
    max_trial_count=4,
    parallel_trial_count=2,
)
4. Launch the hyperparameter tuning job
PYTHON
hpt_job.run()
The machine type, accelerators, and data paths were already defined on the CustomJob above; Vertex AI passes each trial’s sampled hyperparameter values to your script as command-line arguments.
- max_trial_count: Total number of configurations tested.
- parallel_trial_count: Number of trials run at once (recommend ≤4 to let adaptive search improve).
5. Monitor tuning job in Vertex AI Console
- Navigate to Vertex AI > Training > Hyperparameter
tuning jobs.
- View trial progress, logs, and metrics.
- Cancel jobs from the console if needed.
6. Extract and evaluate the best model
PYTHON
# Trials are not sorted by quality; pick the one with the best objective value
best_trial = max(
    hpt_job.trials,
    key=lambda t: t.final_measurement.metrics[0].value,
)
print("Best hyperparameters:", {p.parameter_id: p.value for p in best_trial.parameters})
print("Best objective value:", best_trial.final_measurement.metrics[0].value)
You can then load the best model artifact from the associated GCS path and evaluate on test data.
What is the effect of parallelism in tuning?
- How might running 10 trials in parallel differ from running 2 at a
time in terms of cost, time, and quality of results?
- When would you want to prioritize speed over adaptive search benefits?
- Vertex AI Hyperparameter Tuning Jobs let you efficiently explore
parameter spaces using adaptive strategies.
- Always test with max_trial_count=1 first to confirm your setup works.
- Limit parallel_trial_count to a small number (2–4) to benefit from adaptive search.
- Use GCS for input/output and monitor jobs through the Vertex AI Console.
Content from Resource Management & Monitoring on Vertex AI (GCP)
Last updated on 2025-08-27
Overview
Questions
- How do I monitor and control Vertex AI, Workbench, and GCS costs day‑to‑day?
- What specifically should I stop, delete, or schedule to avoid surprise charges?
- How can I automate cleanup and set alerting so leaks get caught quickly?
Objectives
- Identify all major cost drivers across Vertex AI (training jobs, endpoints, Workbench notebooks, batch prediction) and GCS.
- Practice safe cleanup for Managed and User‑Managed Workbench notebooks, training/tuning jobs, batch predictions, models, endpoints, and artifacts.
- Configure budgets, labels, and basic lifecycle policies to keep costs predictable.
- Use gcloud/gsutil commands for auditing and rapid cleanup; understand when to prefer the Console.
- Draft simple automation patterns (Cloud Scheduler + gcloud) to enforce idle shutdown.
What costs you money on GCP (quick map)
- Vertex AI training jobs (Custom Jobs, Hyperparameter Tuning Jobs) — billed per VM/GPU hour while running.
- Vertex AI endpoints (online prediction) — billed per node‑hour 24/7 while deployed, even if idle.
- Vertex AI batch prediction jobs — billed for the job’s compute while running.
- Vertex AI Workbench notebooks — the backing VM and disk bill while running (and disks bill even when stopped).
- GCS buckets — storage class, object count/size, versioning, egress, and request ops.
- Artifact Registry (containers, models) — storage for images and large artifacts.
- Network egress — downloading data out of GCP (e.g., to your laptop) incurs cost.
- Logging/Monitoring — high‑volume logs/metrics can add up (rare in small workshops, real in prod).
Rule of thumb: Endpoints left deployed and notebooks left running are the most common surprise bills in education/research settings.
A daily “shutdown checklist” (use now, automate later)
- Workbench notebooks — stop the runtime/instance when you’re done.
- Custom/HPT jobs — confirm no jobs stuck in RUNNING.
- Endpoints — undeploy models and delete unused endpoints.
- Batch predictions — ensure no jobs queued or running.
- Artifacts — delete large intermediate artifacts you won’t reuse.
- GCS — keep only one “source of truth”; avoid duplicate datasets in multiple buckets/regions.
Shutting down Vertex AI Workbench notebooks
Vertex AI has two notebook flavors; follow the matching steps:
Managed Notebooks (recommended for workshops)
- Console: Vertex AI → Workbench → Managed notebooks → select runtime → Stop.
- Idle shutdown: Edit runtime → enable Idle shutdown (e.g., 60–120 min).
- CLI: see the sketch below.
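A sketch of the CLI route (runtime/instance names, region, and zone are placeholders):
BASH
# Stop a Managed Notebooks runtime
gcloud notebooks runtimes stop RUNTIME_NAME --location=us-central1

# Stop a Workbench / User-Managed notebook instance
gcloud notebooks instances stop INSTANCE_NAME --location=us-central1-a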
Cleaning up training, tuning, and batch jobs
Stop/delete as needed
BASH
# Example: cancel a custom job
gcloud ai custom-jobs cancel JOB_ID --region=us-central1
# Delete a completed job you no longer need to retain
gcloud ai custom-jobs delete JOB_ID --region=us-central1
Tip: Keep one “golden” successful job per experiment, then remove the rest to reduce console clutter and artifact storage.
Undeploy models and delete endpoints (major cost pitfall)
Undeploy and delete
BASH
# Undeploy the model from the endpoint (stops node-hour charges)
gcloud ai endpoints undeploy-model ENDPOINT_ID --deployed-model-id=DEPLOYED_MODEL_ID --region=us-central1 --quiet
# Delete the endpoint if you no longer need it
gcloud ai endpoints delete ENDPOINT_ID --region=us-central1 --quiet
Model Registry: If you keep models registered but don’t serve them, you won’t pay endpoint node‑hours. Periodically prune stale model versions to reduce storage.
GCS housekeeping (lifecycle policies, versioning, egress)
Lifecycle policy example
Keep workshop artifacts tidy by auto‑deleting temporary outputs and capping old versions.
- Save as lifecycle.json:
JSON
{
"rule": [
{
"action": {"type": "Delete"},
"condition": {"age": 7, "matchesPrefix": ["tmp/"]}
},
{
"action": {"type": "Delete"},
"condition": {"numNewerVersions": 3}
}
]
}
- Apply to bucket:
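For example (the same command appears in the challenge hints later in this episode; replace the bucket name):
BASH
gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET

# Confirm the policy took effect
gsutil lifecycle get gs://YOUR_BUCKET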
Labels, budgets, and cost visibility
Standardize labels on all resources
Use the same labels everywhere (notebooks, jobs, buckets) so billing exports can attribute costs.
Examples: owner=yourname, team=ml-workshop, purpose=titanic-demo, env=dev
- CLI examples: see the sketch below.
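A sketch for labeling a GCS bucket from the CLI, reusing the example labels above (the bucket name is a placeholder; gsutil label ch adds or updates individual labels):
BASH
gsutil label ch -l owner:yourname -l team:ml-workshop -l purpose:titanic-demo -l env:dev gs://YOUR_BUCKET

# Review the labels currently set on the bucket
gsutil label get gs://YOUR_BUCKET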
Monitoring and alerts (catch leaks quickly)
- Cloud Monitoring dashboards: Track notebook VM uptime, endpoint deployment counts, and job error rates.
- Alerting policies: Trigger notifications when:
  - A Workbench runtime has been running > N hours outside workshop hours.
  - An endpoint node count > 0 for > 60 minutes after a workshop ends.
  - Spend forecast exceeds budget threshold.
Keep alerts few and actionable. Route to email or Slack (via webhook) where your team will see them.
Quotas and guardrails
- Quotas (IAM & Admin → Quotas): cap GPU count, custom job limits, and endpoint nodes to protect budgets.
- IAM: least privilege for service accounts used by notebooks and jobs; avoid wide Editor grants.
- Org policies (if available): disallow costly regions/accelerators you don’t plan to use.
Automating the boring parts
Nightly auto‑stop for idle notebooks
Use Cloud Scheduler to run a daily command that stops notebooks after hours.
BASH
# Cloud Scheduler job (runs daily 22:00) to stop a specific managed runtime
gcloud scheduler jobs create http stop-runtime-job --schedule="0 22 * * *" --uri="https://notebooks.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/runtimes/RUNTIME_NAME:stop" --http-method=POST --oidc-service-account-email=SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com
Alternative: call gcloud notebooks runtimes list in a small Cloud Run job, filter by last_active_time, and stop any runtime idle > 2h.
Common pitfalls and quick fixes
- Forgotten endpoints → Undeploy models; delete endpoints you don’t need.
- Notebook left running all weekend → Enable Idle shutdown; schedule nightly stop.
- Duplicate datasets across buckets/regions → consolidate; set lifecycle to purge tmp/.
- Too many parallel HPT trials → cap parallel_trial_count (2–4) and increase max_trial_count gradually.
- Orphaned artifacts in Artifact Registry/GCS → prune old images/artifacts after promoting a single “golden” run.
Challenge 1 — Find and stop idle notebooks
List your notebooks and identify any runtime/instance that has likely been idle for >2 hours. Stop it via CLI.
Hints: gcloud notebooks runtimes list, gcloud notebooks instances list, ... stop
Use gcloud notebooks runtimes list --location=REGION (Managed) or gcloud notebooks instances list --location=ZONE (User‑Managed) to find candidates, then stop them with the corresponding ... stop command.
Challenge 2 — Write a lifecycle policy
Create and apply a lifecycle rule that (a) deletes objects under
tmp/
after 7 days, and (b) retains only 3 versions of any
object.
Hint: gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET
Use the JSON policy shown above, then run gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET and verify with gsutil lifecycle get ....
Challenge 3 — Endpoint sweep
List deployed endpoints in your region, undeploy any model you don’t need, and delete the endpoint if it’s no longer required.
Hints: gcloud ai endpoints list, ... describe, ... undeploy-model, ... delete
gcloud ai endpoints list --region=REGION → pick ENDPOINT_ID → gcloud ai endpoints undeploy-model ENDPOINT_ID --deployed-model-id=DEPLOYED_MODEL_ID --region=REGION --quiet → if not needed, gcloud ai endpoints delete ENDPOINT_ID --region=REGION --quiet.
- Endpoints and running notebooks are the most common cost leaks; undeploy/stop first.
- Prefer Managed Notebooks with Idle shutdown; schedule nightly auto‑stop.
- Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
- Standardize labels, set budgets, and enable billing export for visibility.
- Use gcloud/gsutil to audit and clean quickly; automate with Scheduler + Cloud Run/Functions.