Content from Overview of Google Cloud Vertex AI
Google Cloud Vertex AI is a unified machine learning (ML) platform that enables users to build, train, tune, and deploy models at scale without needing to manage underlying infrastructure. By integrating data storage, training, tuning, and deployment workflows into one managed environment, Vertex AI supports researchers and practitioners in focusing on their ML models while leveraging Google Cloud’s compute and storage resources.
Overview
Questions
- What problem does Google Cloud Vertex AI aim to solve?
- How does Vertex AI simplify machine learning workflows compared to running them on your own?
Objectives
- Understand the basic purpose of Vertex AI in the ML lifecycle.
- Recognize how Vertex AI reduces infrastructure and orchestration overhead.
Why use Vertex AI for machine learning?
Vertex AI provides several advantages that make it an attractive option for research and applied ML:
- Streamlined ML/AI Pipelines: Traditional HPC/HTC environments often require researchers to split workflows into many batch jobs, manually handling dependencies and orchestration. Vertex AI reduces this overhead by managing the end-to-end pipeline (data prep, training, evaluation, tuning, and deployment) within a single environment, making it easier to iterate and scale ML experiments.
- Flexible compute options: Vertex AI lets you select the right hardware for your workload:
  - CPU (e.g., n1-standard-4, e2-standard-8): Good for small datasets, feature engineering, and inference tasks.
  - GPU (e.g., NVIDIA T4, V100, A100): Optimized for deep learning training and large-scale experimentation.
  - Memory-optimized machine types (e.g., m1-ultramem): Useful for workloads requiring large in-memory datasets, such as transformer models.
- Parallelized training and tuning: Vertex AI supports distributed training across multiple nodes and automated hyperparameter tuning (Bayesian or grid search). This makes it easier to explore many configurations with minimal custom code while leveraging scalable infrastructure.
- Custom training support: Vertex AI includes built-in algorithms and frameworks (e.g., scikit-learn, XGBoost, TensorFlow, PyTorch), but it also supports custom containers. Researchers can bring their own scripts or Docker images to run specialized workflows with full control.
- Cost management and monitoring: Google Cloud provides detailed cost tracking and monitoring via the Billing console and Vertex AI dashboard. Vertex AI also integrates with Cloud Monitoring to help track resource usage. With careful configuration, training 100 small-to-medium models (logistic regression, random forests, or lightweight neural networks on datasets under 10GB) can cost under $20, similar to AWS.
In summary, Vertex AI is Google Cloud’s managed machine learning platform that simplifies the end-to-end ML lifecycle. It eliminates the need for manual orchestration in research computing environments by offering integrated workflows, scalable compute, and built-in monitoring. With flexible options for CPUs, GPUs, and memory-optimized hardware, plus strong support for both built-in and custom training, Vertex AI enables researchers to move quickly from experimentation to production while keeping costs predictable and manageable.
Infrastructure Choices for ML
At your institution (or in your own work), what infrastructure
options are currently available for running ML experiments?
- Do you typically use a laptop/desktop, HPC cluster, or cloud?
- What are the advantages and drawbacks of your current setup compared
to a managed service like Vertex AI?
- If you could offload one infrastructure challenge (e.g., provisioning
GPUs, handling dependencies, monitoring costs), what would it be and
why?
Take 3–5 minutes to discuss with a partner or share in the workshop chat.
- Vertex AI simplifies ML workflows by integrating data, training,
tuning, and deployment in one managed platform.
- It reduces the need for manual orchestration compared to traditional
research computing environments.
- Cost monitoring and resource tracking help keep cloud usage affordable for research projects.
Content from Data Storage: Setting up GCS
Overview
Questions
- How can I store and manage data effectively in GCP for Vertex AI
workflows?
- What are the advantages of Google Cloud Storage (GCS) compared to local or VM storage for machine learning projects?
Objectives
- Explain data storage options in GCP for machine learning
projects.
- Describe the advantages of GCS for large datasets and collaborative
workflows.
- Outline steps to set up a GCS bucket and manage data within Vertex AI.
Storing data on GCP
Machine learning and AI projects rely on data, making efficient storage and management essential. Google Cloud offers several storage options, but the most common for ML workflows are persistent disks (attached to Compute Engine VMs or Vertex AI Workbench) and Google Cloud Storage (GCS) buckets.
Consult your institution’s IT before handling sensitive data in GCP
As with AWS, do not upload restricted or sensitive data to GCP services unless explicitly approved by your institution’s IT or cloud security team. For regulated datasets (HIPAA, FERPA, proprietary), work with your institution to ensure encryption, restricted access, and compliance with policies.
Options for storage: VM Disks or GCS
What is a VM persistent disk?
A persistent disk is the storage volume attached to a Compute Engine VM or a Vertex AI Workbench notebook. It can store datasets and intermediate results, but it is tied to the lifecycle of the VM.
When to store data directly on a persistent disk
- Useful for small, temporary datasets processed interactively.
- Data persists if the VM is stopped, but storage costs continue as
long as the disk exists.
- Not ideal for collaboration, scaling, or long-term dataset storage.
Limitations of persistent disk storage
- Scalability: Limited by disk size quota.
- Sharing: Harder to share across projects or team members.
- Cost: More expensive per GB compared to GCS for long-term storage.
What is a GCS bucket?
For most ML workflows in Vertex AI, Google Cloud Storage (GCS) buckets are recommended. A GCS bucket is a container in Google's object storage service where you can store an essentially unlimited number of files. Data in GCS can be accessed from Vertex AI training jobs, Workbench notebooks, and other GCP services using a GCS URI (e.g., gs://your-bucket-name/your-file.csv).
Benefits of using GCS (recommended for ML workflows)
- Separation of storage and compute: Data remains available even if VMs or notebooks are deleted.
- Easy sharing: Buckets can be accessed by collaborators with the right IAM roles.
- Integration with Vertex AI and BigQuery: Read and write data directly in pipelines.
- Scalability: Handles datasets of any size without disk limits.
- Cost efficiency: Lower cost than persistent disks for long-term storage.
- Data persistence: Durable and highly available across regions.
Recommended approach: GCS buckets
To upload our Titanic dataset to a GCS bucket, we’ll follow these steps:
- Log in to the Google Cloud Console.
- Create a new bucket (or use an existing one).
- Upload your dataset files.
- Use the GCS URI to reference your data in Vertex AI workflows.
Detailed procedure
1. Sign in to Google Cloud Console
- Go to console.cloud.google.com and log in with your credentials.
2. Create a new bucket
- Click Create bucket.
- Enter a globally unique name (e.g., yourname-titanic-gcs).
- Choose a location type:
  - Region (cheapest, good default).
  - Multi-region (higher redundancy, more expensive).
- Access Control: Recommended: Uniform access with IAM.
- Public Access: Block public access unless explicitly needed.
- Versioning: Disable unless you want to keep multiple versions of files.
- Labels (tags): Add labels to track project usage (e.g., purpose=titanic-dataset, owner=yourname).
3. Set bucket permissions
- By default, only project members can access.
- To grant Vertex AI service accounts access, assign the Storage Object Admin or Storage Object Viewer role at the bucket level.
GCS bucket costs
GCS costs are based on storage class, data transfer, and operations (requests).
Storage costs
- Standard storage (us-central1): ~$0.02 per GB per month.
- Other classes (Nearline, Coldline, Archive) are cheaper but with retrieval costs.
Data transfer costs
- Uploading data into GCS is free.
- Downloading data out of GCP costs ~$0.12 per GB.
- Accessing data within the same region is free.
Request costs
- GET (read) requests: ~$0.004 per 10,000 requests.
- PUT (write) requests: ~$0.05 per 10,000 requests.
For detailed pricing, see GCS Pricing Information.
Challenge: Estimating Storage Costs
1. Estimate the total cost of storing 1 GB in GCS Standard
storage (us-central1) for one month assuming:
- Storage duration: 1 month
- Dataset retrieved 100 times for model training and tuning
- Data is downloaded once out of GCP at the end of the project
Hints
- Storage cost: $0.02 per GB per month
- Egress (download out of GCP): $0.12 per GB
- GET requests: $0.004 per 10,000 requests (100 requests ≈ free for our purposes)
2. Repeat the above calculation for datasets of 10 GB, 100 GB, and 1 TB (1024 GB).
- 1 GB:
  - Storage: 1 GB × $0.02 = $0.02
  - Egress: 1 GB × $0.12 = $0.12
  - Requests: ~0 (100 reads well below pricing tier)
  - Total: $0.14
- 10 GB:
  - Storage: 10 GB × $0.02 = $0.20
  - Egress: 10 GB × $0.12 = $1.20
  - Requests: ~0
  - Total: $1.40
- 100 GB:
  - Storage: 100 GB × $0.02 = $2.00
  - Egress: 100 GB × $0.12 = $12.00
  - Requests: ~0
  - Total: $14.00
- 1 TB (1024 GB):
  - Storage: 1024 GB × $0.02 = $20.48
  - Egress: 1024 GB × $0.12 = $122.88
  - Requests: ~0
  - Total: $143.36
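If you'd rather compute these estimates than do the arithmetic by hand, a small Python sketch (using the example rates from the hints above) reproduces the same numbers:
PYTHON
# Rough GCS cost estimator using the workshop's example rates
STORAGE_PER_GB_MONTH = 0.02  # Standard storage, us-central1
EGRESS_PER_GB = 0.12         # one full download out of GCP

def estimate_cost(size_gb, downloads=1):
    """Return (storage, egress, total) cost in dollars for one month."""
    storage = size_gb * STORAGE_PER_GB_MONTH
    egress = size_gb * EGRESS_PER_GB * downloads
    return storage, egress, storage + egress

for size_gb in [1, 10, 100, 1024]:
    storage, egress, total = estimate_cost(size_gb)
    print(f"{size_gb:>5} GB: storage ${storage:.2f} + egress ${egress:.2f} = ${total:.2f}")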
Removing unused data (complete after the workshop)
After you are done using your data, remove unused files/buckets to stop costs:
- Option 1: Delete files only – if you plan to reuse the bucket.
- Option 2: Delete the bucket entirely – if you no longer need it.
- Use GCS for scalable, cost-effective, and persistent storage in
GCP.
- Persistent disks are suitable only for small, temporary
datasets.
- Track your storage, transfer, and request costs to manage
expenses.
- Regularly delete unused data or buckets to avoid ongoing costs.
Content from Notebooks as Controllers
Overview
Questions
- How do you set up and use Vertex AI Workbench notebooks for machine
learning tasks?
- How can you manage compute resources efficiently using a “controller” notebook approach in GCP?
Objectives
- Describe how to use Vertex AI Workbench notebooks for ML
workflows.
- Set up a Jupyter-based Workbench instance as a controller to manage
compute tasks.
- Use the Vertex AI SDK to launch training and tuning jobs on scalable instances.
Setting up our notebook environment
Google Cloud Vertex AI provides a managed environment for building, training, and deploying machine learning models. In this episode, we’ll set up a Vertex AI Workbench notebook instance—a Jupyter-based environment hosted on GCP that integrates seamlessly with other Vertex AI services.
Using the notebook as a controller
The notebook instance functions as a controller to manage more resource-intensive tasks. By selecting a modest machine type (e.g., n1-standard-4), you can perform lightweight operations locally in the notebook while using the Vertex AI Python SDK to launch compute-heavy jobs on larger machines (e.g., GPU-accelerated) when needed.
This approach minimizes costs while giving you access to scalable infrastructure for demanding tasks like model training, batch prediction, and hyperparameter tuning.
We’ll follow these steps to create our first Vertex AI Workbench notebook:
1. Navigate to Vertex AI Workbench
- In the Google Cloud Console, search for Vertex AI
Workbench.
- Pin it to your navigation bar for quick access.
2. Create a new notebook instance
- Click New Notebook.
- Choose Managed Notebooks (recommended for workshops and shared environments).
- Notebook name: Use a naming convention like yourname-explore-vertexai.
- Machine type: Select a small machine (e.g., n1-standard-4) to act as the controller.
  - This keeps costs low while you delegate heavy lifting to Vertex AI training jobs.
  - For guidance on common machine types for ML procedures, refer to our supplemental Instances for ML on GCP.
- GPUs: Leave disabled for now (training jobs will request them separately).
- Permissions: The project's default service account is usually sufficient. It must have access to GCS and Vertex AI.
- Networking and encryption: Leave default unless required by your institution.
- Labels: Add labels for cost tracking (e.g., purpose=workshop, owner=yourname).
Once created, your notebook instance will start in a few minutes. When its status is Running, you can open JupyterLab and begin working.
Managing training and tuning with the controller notebook
In the following episodes, we'll use the Vertex AI Python SDK (google-cloud-aiplatform) from this notebook to submit compute-heavy tasks on more powerful machines. Examples include:
- Training a model: Submit a training job to Vertex AI with a higher-powered instance (e.g., n1-highmem-32 or GPU-backed machines).
- Hyperparameter tuning: Configure and submit a tuning job, allowing Vertex AI to manage multiple parallel trials automatically.
This pattern keeps costs low by running your notebook on a modest VM while only incurring charges for larger resources when they’re actively in use.
Challenge: Notebook Roles
Your university provides different compute options: laptops, on-prem HPC, and GCP.
- What role does a Vertex AI Workbench notebook play
compared to an HPC login node or a laptop-based JupyterLab?
- Which tasks should stay in the notebook (lightweight control, visualization) versus being launched to larger cloud resources?
The notebook serves as a lightweight control plane.
- Like an HPC login node, it’s not meant for heavy computation.
- Suitable for small preprocessing, visualization, and orchestrating
jobs.
- Resource-intensive tasks (training, tuning, batch jobs) should be
submitted to scalable cloud resources (GPU/large VM instances) via the
Vertex AI SDK.
- Use a small Vertex AI Workbench notebook instance as a controller to
manage larger, resource-intensive tasks.
- Submit training and tuning jobs to scalable instances using the
Vertex AI SDK.
- Labels help track costs effectively, especially in shared or
multi-project environments.
- Vertex AI Workbench integrates directly with GCS and Vertex AI services, making it a hub for ML workflows.
Content from Accessing and Managing Data in GCS with Vertex AI Notebooks
Overview
Questions
- How can I load data from GCS into a Vertex AI Workbench
notebook?
- How do I monitor storage usage and costs for my GCS bucket?
- What steps are involved in pushing new data back to GCS from a notebook?
Objectives
- Read data directly from a GCS bucket into memory in a Vertex AI
notebook.
- Check storage usage and estimate costs for data in a GCS
bucket.
- Upload new files from the Vertex AI environment back to the GCS bucket.
Initial setup
Open JupyterLab notebook
Once your Vertex AI Workbench notebook instance shows as Running, open it in JupyterLab. Create a new Python 3 notebook and rename it to: Interacting-with-GCS.ipynb.
Set up GCP environment
Before interacting with GCS, we need to authenticate and initialize the client libraries. This ensures our notebook can talk to GCP securely.
PYTHON
from google.cloud import storage
import pandas as pd

# In Vertex AI Workbench, the notebook authenticates automatically using the
# instance's service account, so no explicit login step is required.
# (In Colab you would instead run: from google.colab import auth; auth.authenticate_user())

# Step 1: Initialize a GCS client
client = storage.Client()

# Step 2: List buckets in your current project to confirm access
buckets = list(client.list_buckets())
print("Buckets in project:")
for b in buckets:
    print("-", b.name)
Explanation of the pieces:
- Authentication: In Workbench, the notebook runs as the instance's service account, so the client libraries pick up credentials automatically; an explicit auth.authenticate_user() call is only needed in environments like Colab.
- storage.Client(): Creates a connection to Google Cloud Storage. All read/write actions will use this client.
- list_buckets(): Confirms which storage buckets your account can see in the current project.
This setup block prepares the notebook environment to efficiently interact with GCS resources.
Reading data from GCS
As with S3, you can either (A) read data directly from GCS into memory, or (B) download a copy into your notebook VM. Since we’re using notebooks as controllers rather than training environments, the recommended approach is reading directly from GCS.
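For example, you can stream the Titanic CSV from your bucket straight into a pandas DataFrame without keeping a local copy. This sketch reuses the client created above; the bucket name and object path (titanic_train.csv) are placeholders to replace with your own:
PYTHON
import io

bucket_name = "yourname-titanic-gcs"  # replace with your bucket name
bucket = client.bucket(bucket_name)

# Read the CSV object directly into memory (no local file is written)
blob = bucket.blob("titanic_train.csv")
df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))
print(df.shape)
df.head()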
Checking storage usage of a bucket
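The cost estimate below relies on a total_size_bytes value. One way to compute it (a sketch reusing the client and bucket_name from above) is to sum the sizes of all objects in the bucket:
PYTHON
# Sum object sizes across the bucket to get total storage used
total_size_bytes = sum(blob.size for blob in client.list_blobs(bucket_name))
print(f"Total size: {total_size_bytes / (1024**2):.2f} MB")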
Estimating storage costs
PYTHON
storage_price_per_gb = 0.02 # $/GB/month for Standard storage
total_size_gb = total_size_bytes / (1024**3)
monthly_cost = total_size_gb * storage_price_per_gb
print(f"Estimated monthly cost: ${monthly_cost:.4f}")
print(f"Estimated annual cost: ${monthly_cost*12:.4f}")
For updated prices, see GCS Pricing.
Writing output files to GCS
PYTHON
# Create a sample file
with open("Notes.txt", "w") as f:
f.write("This is a test note for GCS.")
# Upload to bucket/docs/
bucket = client.bucket(bucket_name)
blob = bucket.blob("docs/Notes.txt")
blob.upload_from_filename("Notes.txt")
print("File uploaded successfully.")
List bucket contents:
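A quick listing (reusing the same client and bucket_name) confirms the upload landed where expected:
PYTHON
# List all objects currently stored in the bucket
for blob in client.list_blobs(bucket_name):
    print(blob.name)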
Challenge: Estimating GCS Costs
Suppose you store 50 GB of data in Standard storage
(us-central1) for one month.
- Estimate the monthly storage cost.
- Then estimate the cost if you download (egress) the entire dataset
once at the end of the month.
Hints
- Storage: $0.02 per GB-month
- Egress: $0.12 per GB
- Storage cost: 50 GB × $0.02 = $1.00
- Egress cost: 50 GB × $0.12 = $6.00
- Total cost: $7.00 for one month including one full download
- Load data from GCS into memory to avoid managing local copies when
possible.
- Periodically check storage usage and costs to manage your GCS
budget.
- Use Vertex AI Workbench notebooks to upload analysis results back to GCS, keeping workflows organized and reproducible.
Content from Using a GitHub Personal Access Token (PAT) to Push/Pull from a Vertex AI Notebook
Overview
Questions
- How can I securely push/pull code to and from GitHub within a Vertex
AI Workbench notebook?
- What steps are necessary to set up a GitHub PAT for authentication
in GCP?
- How can I convert notebooks to .py files and ignore .ipynb files in version control?
Objectives
- Configure Git in a Vertex AI Workbench notebook to use a GitHub
Personal Access Token (PAT) for HTTPS-based authentication.
- Securely handle credentials in a notebook environment using getpass.
- Convert .ipynb files to .py files for better version control practices in collaborative projects.
Step 0: Initial setup
In the previous episode, we cloned our forked repository as part of the workshop setup. In this episode, we’ll see how to push our code to this fork. Complete these three setup steps before moving forward.
1. Clone the fork if you haven't already. See previous episode.
2. Start a new Jupyter notebook, and name it something like Interacting-with-git.ipynb. We can use the default Python 3 kernel in Vertex AI Workbench.
3. Change directory to the workspace where your repository is located. In Vertex AI Workbench, notebooks usually live under /home/jupyter/.
Step 1: Using a GitHub personal access token (PAT) to push/pull from a Vertex AI notebook
When working in Vertex AI Workbench notebooks, you may often need to push code updates to GitHub repositories. Since Workbench VMs may be stopped and restarted, configurations like SSH keys may not persist. HTTPS-based authentication with a GitHub Personal Access Token (PAT) is a practical solution. PATs provide flexibility for authentication and enable seamless interaction with both public and private repositories directly from your notebook.
Important Note: Personal access tokens are powerful credentials. Select the minimum necessary permissions and handle the token carefully.
Generate a personal access token (PAT) on GitHub
- Go to Settings in GitHub.
- Click Developer settings at the bottom of the left
sidebar.
- Select Personal access tokens, then click
Tokens (classic).
- Click Generate new token (classic).
- Give your token a descriptive name and set an expiration date if
desired.
- Select minimum permissions:
  - Public repos: public_repo
  - Private repos: repo
- Click Generate token and copy it immediately—you won’t be able to see it again.
Caution: Treat your PAT like a password. Don’t share it or expose it in your code. Use a password manager to store it.
Step 2: Configure Git settings
PYTHON
!git config --global user.name "Your Name"
!git config --global user.email your_email@wisc.edu
- user.name: Will appear in the commit history.
- user.email: Must match your GitHub account so commits are linked to your profile.
Step 3: Convert .ipynb notebooks to .py
Tracking .py files instead of .ipynb helps with cleaner version control. Notebooks store outputs and metadata, which makes diffs noisy. .py files are lighter and easier to review.
- Install Jupytext.
- Convert a notebook to .py.
- Convert all notebooks in the current directory.
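In a Workbench notebook, these steps can be run as shell commands. A sketch (the notebook name comes from Step 0; adjust to your own files):
PYTHON
# Install Jupytext
!pip install jupytext

# Convert a single notebook to a .py script
!jupytext --to py Interacting-with-git.ipynb

# Convert all notebooks in the current directory
!jupytext --to py *.ipynb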
Step 4: Add and commit .py files
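With the scripts generated, stage and commit them. A minimal sketch (the commit message is just an example):
PYTHON
!git add *.py
!git commit -m "Add .py versions of notebooks"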
Step 5: Add .ipynb to .gitignore
PYTHON
!touch .gitignore
with open(".gitignore", "a") as gitignore:
gitignore.write("\n# Ignore Jupyter notebooks\n*.ipynb\n")
!cat .gitignore
Add other temporary files too:
PYTHON
with open(".gitignore", "a") as gitignore:
gitignore.write("\n# Ignore cache and temp files\n__pycache__/\n*.tmp\n*.log\n")
Commit the .gitignore:
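For example, staged and committed like any other file:
PYTHON
!git add .gitignore
!git commit -m "Add .gitignore for notebooks and temp files"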
Step 6: Syncing with GitHub
First, pull the latest changes:
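A sketch of the pull step, assuming your fork's default branch is main (substitute your actual branch name):
PYTHON
!git pull origin main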
If conflicts occur, resolve manually before committing.
Then push with your PAT credentials:
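One approach is to prompt for the token with getpass so it never appears in the notebook, then pass it in the HTTPS URL for a single push. The username, repository name, and branch below are placeholders; substitute your own:
PYTHON
import getpass

# Prompt for credentials without echoing the token to the screen
github_username = input("GitHub username: ")
token = getpass.getpass("GitHub PAT: ")

# Push over HTTPS using the PAT (credentials are not stored on disk)
!git push https://{github_username}:{token}@github.com/{github_username}/YOUR_REPO.git main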
Step 7: Convert .py back to notebooks (optional)
To convert .py files back to .ipynb after pulling updates:
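A sketch using Jupytext again (the file name is just an example):
PYTHON
# Recreate a notebook from its .py counterpart
!jupytext --to notebook Interacting-with-git.py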
Challenge: GitHub PAT Workflow
- Why might you prefer using a PAT with HTTPS instead of SSH keys in
Vertex AI Workbench?
- What are the benefits of converting .ipynb files to .py before committing to a shared repo?
- PATs with HTTPS are easier to set up in temporary environments where SSH configs don't persist.
- Converting notebooks to .py results in cleaner diffs, easier code review, and smaller repos without stored outputs/metadata.
- Use a GitHub PAT for HTTPS-based authentication in Vertex AI Workbench notebooks.
- Securely enter sensitive information in notebooks using getpass.
- Converting .ipynb files to .py files helps with cleaner version control.
- Adding .ipynb files to .gitignore keeps your repository organized.
Content from Training Models in Vertex AI: Intro
Overview
Questions
- What are the differences between training locally in a Vertex AI
notebook and using Vertex AI-managed training jobs?
- How do custom training jobs in Vertex AI streamline the training
process for various frameworks?
- How does Vertex AI handle scaling across CPUs, GPUs, and TPUs?
Objectives
- Understand the difference between local training in a Vertex AI
Workbench notebook and submitting managed training jobs.
- Learn to configure and use Vertex AI custom training jobs for
different frameworks (e.g., XGBoost, PyTorch, SKLearn).
- Understand scaling options in Vertex AI, including when to use CPUs,
GPUs, or TPUs.
- Compare performance, cost, and setup between custom scripts and
pre-built containers in Vertex AI.
- Conduct training with data stored in GCS and monitor training job status using the Google Cloud Console.
Initial setup
1. Open a new .ipynb notebook
Open a fresh Jupyter notebook inside your Vertex AI Workbench instance. You can name it something like Training-models.ipynb.
2. CD to instance home directory
So we all can reference helper functions consistently, change directory to your Jupyter home directory.
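For example (the /home/jupyter path is the usual Workbench home directory; adjust if yours differs):
PYTHON
%cd /home/jupyter
!pwd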
3. Initialize Vertex AI environment
This code initializes the Vertex AI environment by importing the Python SDK, setting the project, region, and defining a GCS bucket for input/output data.
PYTHON
from google.cloud import aiplatform
import pandas as pd
# Set your project and region (replace with your values)
PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"
BUCKET_NAME = "your-gcs-bucket"
# Initialize Vertex AI client
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")
- aiplatform.init(): Sets defaults for project, region, and staging bucket.
- PROJECT_ID: Identifies your GCP project.
- REGION: Determines where training jobs run (choose a region close to your data).
- staging_bucket: A GCS bucket for storing datasets, model artifacts, and job outputs.
Testing train.py locally in the notebook
Before scaling training jobs onto managed resources, it’s essential to test your training script locally. This prevents wasting GPU/TPU time on bugs or misconfigured code.
Guidelines for testing ML pipelines before scaling
- Run tests locally first with small datasets.
- Use a subset of your dataset (1–5%) for fast checks.
- Start with minimal compute before moving to larger accelerators.
- Log key metrics such as loss curves and runtimes.
- Verify correctness first before scaling up.
What tests should we do before scaling?
Before scaling to multiple or more powerful instances (e.g., GPUs or TPUs), it’s important to run a few sanity checks. In your group, discuss:
- Which checks do you think are most critical before scaling up?
- What potential issues might we miss if we skip this step?
- Data loads correctly – dataset loads without errors, expected columns exist, missing values handled.
- Overfitting check – train on a tiny dataset (e.g., 100 rows). If it doesn't overfit, something is off.
- Loss behavior – verify training loss decreases and doesn't diverge.
- Runtime estimate – get a rough sense of training time on small data.
- Memory estimate – check approximate memory use.
- Save & reload – ensure model saves, reloads, and infers without errors.
Skipping these can lead to: silent data bugs, runtime blowups at scale, inefficient experiments, or broken model artifacts.
Download data into notebook environment
Sometimes it’s helpful to keep a copy of data in your notebook VM for quick iteration, even though GCS is the preferred storage location.
PYTHON
from google.cloud import storage
client = storage.Client()
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob("titanic_train.csv")
blob.download_to_filename("titanic_train.csv")
print("Downloaded titanic_train.csv")
Repeat for the test dataset as needed.
Logging runtime & instance info
When comparing runtimes later, it’s useful to know what instance type you ran on. For Workbench:
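One way to check is to query the Compute Engine metadata server that backs the Workbench VM, for example:
PYTHON
# Ask the metadata server which machine type this notebook VM runs on
!curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/machine-type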
This prints the machine type backing your notebook.
Local test run of train.py
PYTHON
import time as t
start = t.time()
# Example: run your custom training script with args
!python GCP_helpers/train_xgboost.py --max_depth 3 --eta 0.1 --subsample 0.8 --colsample_bytree 0.8 --num_round 100 --train titanic_train.csv
print(f"Total local runtime: {t.time() - start:.2f} seconds")
Training on this small dataset should take <1 minute. Log runtime as a baseline.
Training via Vertex AI custom training job
Unlike “local” training, this launches a managed training job that runs on scalable compute. Vertex AI handles provisioning, scaling, logging, and saving outputs to GCS.
Which machine type to start with?
Start with a small CPU machine like n1-standard-4. Only scale up to GPUs/TPUs once you've verified your script. See Instances for ML on GCP for guidance.
Creating a custom training job with the SDK
PYTHON
from google.cloud import aiplatform

# CustomTrainingJob packages a local script into a managed training job
job = aiplatform.CustomTrainingJob(
    display_name="xgboost-train",
    script_path="GCP_helpers/train_xgboost.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.1-5:latest",
    requirements=["pandas", "scikit-learn", "joblib"],
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-5:latest",
)

# Run the training job; script arguments are passed to run()
model = job.run(
    args=[
        "--max_depth=3",
        "--eta=0.1",
        "--subsample=0.8",
        "--colsample_bytree=0.8",
        "--num_round=100",
        "--train=gs://{}/titanic_train.csv".format(BUCKET_NAME),
    ],
    replica_count=1,
    machine_type="n1-standard-4",
)
This launches a managed training job with Vertex AI. Logs and trained models are automatically stored in your GCS bucket.
Monitoring training jobs in the Console
- Go to the Google Cloud Console.
- Navigate to Vertex AI > Training > Custom
Jobs.
- Click on your job name to see status, logs, and output model
artifacts.
- Cancel jobs from the console if needed (be careful not to stop jobs you don’t own in shared projects).
When training takes too long
Two main options in Vertex AI:
- Option 1: Upgrade to more powerful machine types (e.g., add GPUs like T4, V100, A100).
- Option 2: Use distributed training with multiple replicas.
Option 1: Upgrade machine type (preferred first step)
- Works best for small/medium datasets (<10 GB).
- Avoids the coordination overhead of distributed training.
- GPUs/TPUs accelerate deep learning tasks significantly.
Option 2: Distributed training with multiple replicas
- Supported in Vertex AI for many frameworks.
- Split data across replicas, each trains a portion, gradients
synchronized.
- More beneficial for very large datasets and long-running jobs.
When distributed training makes sense
- Dataset >10–50 GB.
- Training time >10 hours on single machine.
- Deep learning workloads that naturally parallelize across GPUs/TPUs.
- Environment initialization: Use aiplatform.init() to set defaults for project, region, and bucket.
- Local vs managed training: Test locally before scaling into managed jobs.
- Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
- Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
- Monitoring: Track job logs and artifacts in the Vertex AI Console.
Content from Training Models in Vertex AI: PyTorch Example
Overview
Questions
- When should you consider using a GPU or TPU instance for training
neural networks in Vertex AI, and what are the benefits and
limitations?
- How does Vertex AI handle distributed training, and which approaches are suitable for typical neural network training?
Objectives
- Preprocess the Titanic dataset for efficient training using PyTorch.
- Save and upload training and validation data in .npz format to GCS.
- Understand the trade-offs between CPU, GPU, and TPU training for smaller datasets.
- Deploy a PyTorch model to Vertex AI and evaluate instance types for training performance.
- Differentiate between data parallelism and model parallelism, and determine when each is appropriate in Vertex AI.
Initial setup
Open a fresh Jupyter notebook in your Vertex AI Workbench environment (e.g., Training-part2.ipynb). Then initialize your environment:
PYTHON
from google.cloud import aiplatform, storage
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"
BUCKET_NAME = "your-gcs-bucket"
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")
- aiplatform.init(): Initializes Vertex AI with project, region, and staging bucket.
- storage.Client(): Used to upload training data to GCS.
Preparing the data (compressed npz files)
We'll prepare the Titanic dataset and save it as .npz files for efficient PyTorch loading.
PYTHON
# Load and preprocess Titanic dataset
df = pd.read_csv("titanic_train.csv")
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
df['Embarked'] = df['Embarked'].fillna('S')
df['Embarked'] = LabelEncoder().fit_transform(df['Embarked'])
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].values
y = df['Survived'].values
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
np.savez_compressed('train_data.npz', X_train=X_train, y_train=y_train)
np.savez_compressed('val_data.npz', X_val=X_val, y_val=y_val)
Upload data to GCS
PYTHON
client = storage.Client()
bucket = client.bucket(BUCKET_NAME)
bucket.blob("train_data.npz").upload_from_filename("train_data.npz")
bucket.blob("val_data.npz").upload_from_filename("val_data.npz")
print("Files uploaded to GCS.")
Why use .npz?
- Optimized data loading: Compressed binary format reduces I/O overhead.
- Batch compatibility: Works seamlessly with PyTorch DataLoader.
- Consistency: Keeps train/validation arrays structured and organized.
- Multiple arrays: Stores multiple arrays (X_train, y_train) in one file.
Testing locally in notebook
Before scaling up, test your script locally with fewer epochs:
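A sketch of such a local run, assuming train_nn.py accepts the same arguments we pass to the managed job below:
PYTHON
import time as t

start = t.time()
# Quick sanity check with a reduced number of epochs
!python GCP_helpers/train_nn.py --train train_data.npz --val val_data.npz --epochs 5 --learning_rate 0.001
print(f"Total local runtime: {t.time() - start:.2f} seconds")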
Training via Vertex AI with PyTorch
Vertex AI supports custom training jobs with PyTorch containers.
PYTHON
# CustomTrainingJob wraps the PyTorch script as a managed training job
job = aiplatform.CustomTrainingJob(
    display_name="pytorch-train",
    script_path="GCP_helpers/train_nn.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",
    requirements=["torch", "pandas", "numpy", "scikit-learn"],
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.1-13:latest",
)

# Script arguments are passed to run()
model = job.run(
    args=[
        "--train=gs://{}/train_data.npz".format(BUCKET_NAME),
        "--val=gs://{}/val_data.npz".format(BUCKET_NAME),
        "--epochs=1000",
        "--learning_rate=0.001",
    ],
    replica_count=1,
    machine_type="n1-standard-4",
)
GPU Training in Vertex AI
For small datasets, GPUs may not help. But for larger models/datasets, GPUs (e.g., T4, V100, A100) can reduce training time.
In your training script (train_nn.py), make sure it can use a GPU when one is available, then move models and tensors to the device, as sketched below.
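A minimal device-selection sketch (the model and batch here are placeholders to keep the example self-contained):
PYTHON
import torch
import torch.nn as nn

# Use a GPU if the job was given one; otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

# Move the model and each batch of tensors onto the chosen device
model = nn.Linear(7, 1).to(device)     # placeholder model with 7 Titanic features
batch = torch.randn(32, 7).to(device)  # placeholder input batch
output = model(batch)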
Distributed training in Vertex AI
Vertex AI supports data and model parallelism.
- Data parallelism: Common for neural nets; dataset split across replicas; gradients synced.
- Model parallelism: Splits model across devices, used for very large models.
Monitoring jobs
- In the Console: Vertex AI > Training > Custom
Jobs.
- Check logs, runtime, and outputs.
- Cancel jobs as needed.
- .npz files streamline PyTorch data handling and reduce I/O overhead.
- GPUs may not speed up small models/datasets due to overhead.
- Vertex AI supports both CPU and GPU training, with scaling via multiple replicas.
- Data parallelism splits data, model parallelism splits layers — choose based on model size.
- Test locally first before launching expensive training jobs.
Content from Hyperparameter Tuning in Vertex AI: Neural Network Example
Overview
Questions
- How can we efficiently manage hyperparameter tuning in Vertex
AI?
- How can we parallelize tuning jobs to optimize time without increasing costs?
Objectives
- Set up and run a hyperparameter tuning job in Vertex AI.
- Define hyperparameter search spaces using Vertex AI parameter specs (e.g., DoubleParameterSpec and CategoricalParameterSpec).
- Log and capture objective metrics for evaluating tuning
success.
- Optimize tuning setup to balance cost and efficiency, including parallelization.
To conduct efficient hyperparameter tuning with neural networks (or any model) in Vertex AI, we’ll use Vertex AI’s Hyperparameter Tuning Jobs. The key is defining a clear search space, ensuring metrics are properly logged, and keeping costs manageable by controlling the number of trials and level of parallelization.
Key steps for hyperparameter tuning
The overall process involves these steps:
- Prepare training script and ensure metrics are logged.
- Define hyperparameter search space.
- Configure a hyperparameter tuning job in Vertex AI.
- Set data paths and launch the tuning job.
- Monitor progress in the Vertex AI Console.
- Extract best model and evaluate.
1. Prepare training script with metric logging
Your training script (train_nn.py) should report validation accuracy in a way Vertex AI can read. Vertex AI hyperparameter tuning collects metrics through the cloudml-hypertune helper package rather than by parsing printed output, so the script reports the objective metric explicitly:
PYTHON
import hypertune

hypertune_client = hypertune.HyperTune()

if (epoch + 1) % 100 == 0 or epoch == epochs - 1:
    # The metric tag must match the key used in metric_spec below
    hypertune_client.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="validation_accuracy",
        metric_value=val_accuracy,
        global_step=epoch,
    )
Include cloudml-hypertune in the job's requirements so the package is installed in the training container.
2. Define hyperparameter search space
In Vertex AI, you specify hyperparameter ranges when configuring the tuning job. You can define both discrete and continuous ranges.
PYTHON
from google.cloud.aiplatform import hyperparameter_tuning as hpt

parameter_spec = {
    "epochs": hpt.IntegerParameterSpec(min=100, max=1000, scale="linear"),
    "learning_rate": hpt.DoubleParameterSpec(min=0.001, max=0.1, scale="log"),
}
- IntegerParameterSpec: Defines integer ranges.
- DoubleParameterSpec: Defines continuous ranges, with optional scaling.
3. Configure hyperparameter tuning job
PYTHON
from google.cloud import aiplatform

# Define the trial job: machine type, accelerators, and fixed arguments live here.
# Tuned hyperparameters (epochs, learning_rate) are injected by Vertex AI as
# command-line arguments for each trial.
custom_job = aiplatform.CustomJob.from_local_script(
    display_name="pytorch-train-hpt",
    script_path="GCP_helpers/train_nn.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",
    requirements=["torch", "pandas", "numpy", "scikit-learn", "cloudml-hypertune"],
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    args=[
        "--train=gs://{}/train_data.npz".format(BUCKET_NAME),
        "--val=gs://{}/val_data.npz".format(BUCKET_NAME),
    ],
)

hpt_job = aiplatform.HyperparameterTuningJob(
    display_name="pytorch-hpt-job",
    custom_job=custom_job,
    metric_spec={"validation_accuracy": "maximize"},
    parameter_spec=parameter_spec,
    max_trial_count=4,
    parallel_trial_count=2,
)
4. Launch the hyperparameter tuning job
PYTHON
# Machine type, accelerators, and data arguments were set on the custom job
# above, so launching the tuning job is a single call.
hpt_job.run()
- max_trial_count: Total number of configurations tested.
- parallel_trial_count: Number of trials run at once (recommend ≤4 to let adaptive search improve).
5. Monitor tuning job in Vertex AI Console
- Navigate to Vertex AI > Training > Hyperparameter
tuning jobs.
- View trial progress, logs, and metrics.
- Cancel jobs from the console if needed.
6. Extract and evaluate the best model
PYTHON
# Trials are not guaranteed to be ordered, so pick the one with the best metric
best_trial = max(
    hpt_job.trials,
    key=lambda t: float(t.final_measurement.metrics[0].value),
)
print("Best hyperparameters:", best_trial.parameters)
print("Best objective value:", best_trial.final_measurement.metrics[0].value)
You can then load the best model artifact from the associated GCS path and evaluate on test data.
What is the effect of parallelism in tuning?
- How might running 10 trials in parallel differ from running 2 at a
time in terms of cost, time, and quality of results?
- When would you want to prioritize speed over adaptive search benefits?
- Vertex AI Hyperparameter Tuning Jobs let you efficiently explore
parameter spaces using adaptive strategies.
- Always test with max_trial_count=1 first to confirm your setup works.
- Limit parallel_trial_count to a small number (2–4) to benefit from adaptive search.
- Use GCS for input/output and monitor jobs through the Vertex AI Console.
Content from Resource Management & Monitoring on Vertex AI (GCP)
Overview
Questions
- How do I monitor and control Vertex AI, Workbench, and GCS costs day‑to‑day?
- What specifically should I stop, delete, or schedule to avoid surprise charges?
- How can I automate cleanup and set alerting so leaks get caught quickly?
Objectives
- Identify all major cost drivers across Vertex AI (training jobs, endpoints, Workbench notebooks, batch prediction) and GCS.
- Practice safe cleanup for Managed and User‑Managed Workbench notebooks, training/tuning jobs, batch predictions, models, endpoints, and artifacts.
- Configure budgets, labels, and basic lifecycle policies to keep costs predictable.
- Use gcloud/gsutil commands for auditing and rapid cleanup; understand when to prefer the Console.
- Draft simple automation patterns (Cloud Scheduler + gcloud) to enforce idle shutdown.
What costs you money on GCP (quick map)
- Vertex AI training jobs (Custom Jobs, Hyperparameter Tuning Jobs) — billed per VM/GPU hour while running.
- Vertex AI endpoints (online prediction) — billed per node‑hour 24/7 while deployed, even if idle.
- Vertex AI batch prediction jobs — billed for the job’s compute while running.
- Vertex AI Workbench notebooks — the backing VM and disk bill while running (and disks bill even when stopped).
- GCS buckets — storage class, object count/size, versioning, egress, and request ops.
- Artifact Registry (containers, models) — storage for images and large artifacts.
- Network egress — downloading data out of GCP (e.g., to your laptop) incurs cost.
- Logging/Monitoring — high‑volume logs/metrics can add up (rare in small workshops, real in prod).
Rule of thumb: Endpoints left deployed and notebooks left running are the most common surprise bills in education/research settings.
A daily “shutdown checklist” (use now, automate later)
- Workbench notebooks — stop the runtime/instance when you're done.
- Custom/HPT jobs — confirm no jobs stuck in RUNNING.
- Endpoints — undeploy models and delete unused endpoints.
- Batch predictions — ensure no jobs queued or running.
- Artifacts — delete large intermediate artifacts you won't reuse.
- GCS — keep only one "source of truth"; avoid duplicate datasets in multiple buckets/regions.
Shutting down Vertex AI Workbench notebooks
Vertex AI has two notebook flavors; follow the matching steps:
Managed Notebooks (recommended for workshops)
Console: Vertex AI → Workbench → Managed notebooks → select runtime → Stop.
Idle shutdown: Edit runtime → enable Idle shutdown (e.g., 60–120 min).
- CLI: stop the runtime from the command line (see the sketch below).
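A sketch with gcloud (runtime name and region are placeholders):
BASH
# Stop a managed notebook runtime when you're done for the day
gcloud notebooks runtimes stop RUNTIME_NAME --location=us-central1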
Cleaning up training, tuning, and batch jobs
Stop/delete as needed
BASH
# Example: cancel a custom job
gcloud ai custom-jobs cancel JOB_ID --region=us-central1
# Delete a completed job you no longer need to retain
gcloud ai custom-jobs delete JOB_ID --region=us-central1
Tip: Keep one “golden” successful job per experiment, then remove the rest to reduce console clutter and artifact storage.
Undeploy models and delete endpoints (major cost pitfall)
Undeploy and delete
BASH
# Undeploy the model from the endpoint (stops node-hour charges)
gcloud ai endpoints undeploy-model ENDPOINT_ID --deployed-model-id=DEPLOYED_MODEL_ID --region=us-central1 --quiet
# Delete the endpoint if you no longer need it
gcloud ai endpoints delete ENDPOINT_ID --region=us-central1 --quiet
Model Registry: If you keep models registered but don’t serve them, you won’t pay endpoint node‑hours. Periodically prune stale model versions to reduce storage.
GCS housekeeping (lifecycle policies, versioning, egress)
Lifecycle policy example
Keep workshop artifacts tidy by auto‑deleting temporary outputs and capping old versions.
- Save as
lifecycle.json
:
JSON
{
"rule": [
{
"action": {"type": "Delete"},
"condition": {"age": 7, "matchesPrefix": ["tmp/"]}
},
{
"action": {"type": "Delete"},
"condition": {"numNewerVersions": 3}
}
]
}
- Apply to bucket:
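Matching the command used in the challenge below (bucket name is a placeholder):
BASH
# Apply the lifecycle policy to the bucket and verify it took effect
gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET
gsutil lifecycle get gs://YOUR_BUCKET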
Labels, budgets, and cost visibility
Standardize labels on all resources
Use the same labels everywhere (notebooks, jobs, buckets) so billing exports can attribute costs.
Examples: owner=yourname, team=ml-workshop, purpose=titanic-demo, env=dev
- CLI examples:
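For GCS buckets, labels can be managed from the command line; a sketch (bucket name and label values are placeholders):
BASH
# Add or update labels on a bucket
gsutil label ch -l owner:yourname -l purpose:titanic-demo gs://YOUR_BUCKET

# Confirm the labels currently applied
gsutil label get gs://YOUR_BUCKET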
Monitoring and alerts (catch leaks quickly)
- Cloud Monitoring dashboards: Track notebook VM uptime, endpoint deployment counts, and job error rates.
- Alerting policies: Trigger notifications when:
  - A Workbench runtime has been running > N hours outside workshop hours.
  - An endpoint node count > 0 for > 60 minutes after a workshop ends.
  - Spend forecast exceeds budget threshold.
Keep alerts few and actionable. Route to email or Slack (via webhook) where your team will see them.
Quotas and guardrails
- Quotas (IAM & Admin → Quotas): cap GPU count, custom job limits, and endpoint nodes to protect budgets.
- IAM: least privilege for service accounts used by notebooks and jobs; avoid wide Editor grants.
- Org policies (if available): disallow costly regions/accelerators you don't plan to use.
Automating the boring parts
Nightly auto‑stop for idle notebooks
Use Cloud Scheduler to run a daily command that stops notebooks after hours.
BASH
# Cloud Scheduler job (runs daily 22:00) to stop a specific managed runtime
gcloud scheduler jobs create http stop-runtime-job --schedule="0 22 * * *" --uri="https://notebooks.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/runtimes/RUNTIME_NAME:stop" --http-method=POST --oidc-service-account-email=SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com
Alternative: call gcloud notebooks runtimes list in a small Cloud Run job, filter by last_active_time, and stop any runtime idle > 2h.
Common pitfalls and quick fixes
- Forgotten endpoints → Undeploy models; delete endpoints you don't need.
- Notebook left running all weekend → Enable Idle shutdown; schedule nightly stop.
- Duplicate datasets across buckets/regions → consolidate; set lifecycle to purge tmp/.
- Too many parallel HPT trials → cap parallel_trial_count (2–4) and increase max_trial_count gradually.
- Orphaned artifacts in Artifact Registry/GCS → prune old images/artifacts after promoting a single "golden" run.
Challenge 1 — Find and stop idle notebooks
List your notebooks and identify any runtime/instance that has likely been idle for >2 hours. Stop it via CLI.
Hints: gcloud notebooks runtimes list, gcloud notebooks instances list, ... stop
Use gcloud notebooks runtimes list --location=REGION (Managed) or gcloud notebooks instances list --location=ZONE (User-Managed) to find candidates, then stop them with the corresponding ... stop command.
Challenge 2 — Write a lifecycle policy
Create and apply a lifecycle rule that (a) deletes objects under
tmp/
after 7 days, and (b) retains only 3 versions of any
object.
Hint: gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET
Use the JSON policy shown above, then run gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET and verify with gsutil lifecycle get ....
Challenge 3 — Endpoint sweep
List deployed endpoints in your region, undeploy any model you don’t need, and delete the endpoint if it’s no longer required.
Hints: gcloud ai endpoints list, ... describe, ... undeploy-model, ... delete
gcloud ai endpoints list --region=REGION → pick ENDPOINT_ID → gcloud ai endpoints undeploy-model ENDPOINT_ID --deployed-model-id=DEPLOYED_MODEL_ID --region=REGION --quiet → if not needed, gcloud ai endpoints delete ENDPOINT_ID --region=REGION --quiet.
- Endpoints and running notebooks are the most common cost leaks; undeploy/stop first.
- Prefer Managed Notebooks with Idle shutdown; schedule nightly auto‑stop.
- Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
- Standardize labels, set budgets, and enable billing export for visibility.
- Use gcloud/gsutil to audit and clean quickly; automate with Scheduler + Cloud Run/Functions.