Training Models in Vertex AI: Intro
Last updated on 2025-08-27
Overview
Questions
- What are the differences between training locally in a Vertex AI
notebook and using Vertex AI-managed training jobs?
- How do custom training jobs in Vertex AI streamline the training
process for various frameworks?
- How does Vertex AI handle scaling across CPUs, GPUs, and TPUs?
Objectives
- Understand the difference between local training in a Vertex AI
Workbench notebook and submitting managed training jobs.
- Learn to configure and use Vertex AI custom training jobs for
different frameworks (e.g., XGBoost, PyTorch, SKLearn).
- Understand scaling options in Vertex AI, including when to use CPUs,
GPUs, or TPUs.
- Compare performance, cost, and setup between custom scripts and
pre-built containers in Vertex AI.
- Conduct training with data stored in GCS and monitor training job status using the Google Cloud Console.
Initial setup
1. Open a new .ipynb notebook
Open a fresh Jupyter notebook inside your Vertex AI Workbench instance. You can name it something like Training-models.ipynb.
2. CD to the instance home directory
So that we can all reference helper functions consistently, change directory to your Jupyter home directory (see the sketch below).
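A minimal way to do this from a notebook cell (assuming the default home directory on your Workbench instance; adjust the path if yours differs):
PYTHON
import os

# Change to the notebook's home directory so relative paths (e.g., helper scripts)
# resolve the same way for everyone
os.chdir(os.path.expanduser("~"))
print(os.getcwd())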
3. Initialize Vertex AI environment
This code initializes the Vertex AI environment by importing the Python SDK, setting the project, region, and defining a GCS bucket for input/output data.
PYTHON
from google.cloud import aiplatform
import pandas as pd
# Set your project and region (replace with your values)
PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"
BUCKET_NAME = "your-gcs-bucket"
# Initialize Vertex AI client
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")
- aiplatform.init(): Sets defaults for the project, region, and staging bucket.
- PROJECT_ID: Identifies your GCP project.
- REGION: Determines where training jobs run (choose a region close to your data).
- staging_bucket: A GCS bucket for storing datasets, model artifacts, and job outputs.
Testing train.py locally in the notebook
Before scaling training jobs onto managed resources, it’s essential to test your training script locally. This prevents wasting GPU/TPU time on bugs or misconfigured code.
Guidelines for testing ML pipelines before scaling
- Run tests locally first with small datasets.
- Use a subset of your dataset (1–5%) for fast checks (see the sketch after this list).
- Start with minimal compute before moving to larger accelerators.
- Log key metrics such as loss curves and runtimes.
- Verify correctness first before scaling up.
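For example, a quick way to carve out a small subset with pandas (file names here are illustrative):
PYTHON
import pandas as pd

# Draw a reproducible ~5% sample for fast local checks
df = pd.read_csv("titanic_train.csv")
subset = df.sample(frac=0.05, random_state=42)
subset.to_csv("titanic_train_small.csv", index=False)
print(f"Subset size: {len(subset)} rows")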
What tests should we do before scaling?
Before scaling to multiple or more powerful instances (e.g., GPUs or TPUs), it’s important to run a few sanity checks. In your group, discuss:
- Which checks do you think are most critical before scaling up?
- What potential issues might we miss if we skip this step?
- Data loads correctly – dataset loads without errors, expected columns exist, missing values handled.
- Overfitting check – train on a tiny dataset (e.g., 100 rows). If it doesn't overfit, something is off.
- Loss behavior – verify training loss decreases and doesn't diverge.
- Runtime estimate – get a rough sense of training time on small data.
- Memory estimate – check approximate memory use.
- Save & reload – ensure the model saves, reloads, and infers without errors (see the sketch below).
Skipping these can lead to: silent data bugs, runtime blowups at scale, inefficient experiments, or broken model artifacts.
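As one concrete example, the save-and-reload check might look like the following sketch. It uses a tiny synthetic dataset and joblib purely for illustration; your train_xgboost.py may save models differently.
PYTHON
import joblib
import numpy as np
from xgboost import XGBClassifier

# Train a throwaway model on tiny synthetic data
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
model = XGBClassifier(n_estimators=10, max_depth=3)
model.fit(X, y)

# Save, reload, and confirm predictions match
joblib.dump(model, "model_check.joblib")
reloaded = joblib.load("model_check.joblib")
assert (reloaded.predict(X) == model.predict(X)).all()
print("Save & reload check passed")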
Download data into notebook environment
Sometimes it’s helpful to keep a copy of data in your notebook VM for quick iteration, even though GCS is the preferred storage location.
PYTHON
from google.cloud import storage
client = storage.Client()
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob("titanic_train.csv")
blob.download_to_filename("titanic_train.csv")
print("Downloaded titanic_train.csv")
Repeat for the test dataset as needed.
Logging runtime & instance info
When comparing runtimes later, it’s useful to know what instance type you ran on. For Workbench:
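One way to capture this (a minimal sketch; Workbench VMs can query the GCE metadata server):
PYTHON
import requests

# Ask the metadata server which machine type backs this notebook VM
resp = requests.get(
    "http://metadata.google.internal/computeMetadata/v1/instance/machine-type",
    headers={"Metadata-Flavor": "Google"},
)
print(resp.text.split("/")[-1])  # e.g., n1-standard-4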
This prints the machine type backing your notebook.
Local test run of train.py
PYTHON
import time as t
start = t.time()
# Example: run your custom training script with args
!python GCP_helpers/train_xgboost.py --max_depth 3 --eta 0.1 --subsample 0.8 --colsample_bytree 0.8 --num_round 100 --train titanic_train.csv
print(f"Total local runtime: {t.time() - start:.2f} seconds")
Training on this small dataset should take <1 minute. Log runtime as a baseline.
Training via Vertex AI custom training job
Unlike “local” training, this launches a managed training job that runs on scalable compute. Vertex AI handles provisioning, scaling, logging, and saving outputs to GCS.
Which machine type to start with?
Start with a small CPU machine like n1-standard-4. Only scale up to GPUs/TPUs once you've verified your script. See Instances for ML on GCP for guidance.
Creating a custom training job with the SDK
PYTHON
from google.cloud import aiplatform
# Define a custom training job from a local script and a pre-built training container
job = aiplatform.CustomTrainingJob(
    display_name="xgboost-train",
    script_path="GCP_helpers/train_xgboost.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.1-5:latest",
    requirements=["pandas", "scikit-learn", "joblib"],
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-5:latest",
)

# Run the training job (script arguments are passed to run(), not the constructor)
model = job.run(
    args=[
        "--max_depth=3",
        "--eta=0.1",
        "--subsample=0.8",
        "--colsample_bytree=0.8",
        "--num_round=100",
        f"--train=gs://{BUCKET_NAME}/titanic_train.csv",
    ],
    replica_count=1,
    machine_type="n1-standard-4",
)
This launches a managed training job with Vertex AI. Logs and trained models are automatically stored in your GCS bucket.
Monitoring training jobs in the Console
- Go to the Google Cloud Console.
- Navigate to Vertex AI > Training > Custom Jobs.
- Click on your job name to see status, logs, and output model artifacts.
- Cancel jobs from the console if needed (be careful not to stop jobs you don’t own in shared projects).
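If you prefer to check status from the notebook instead of the console, here is a sketch using the SDK (it assumes the job object and aiplatform import defined earlier):
PYTHON
from google.cloud import aiplatform

# Inspect the job we launched
print(job.resource_name)
print(job.state)

# Or list recent custom training jobs in the project/region set via aiplatform.init()
for j in aiplatform.CustomTrainingJob.list():
    print(j.display_name, j.state)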
When training takes too long
Two main options in Vertex AI:
- Option 1: Upgrade to more powerful machine types (e.g., add GPUs like T4, V100, A100).
- Option 2: Use distributed training with multiple replicas.
Option 1: Upgrade machine type (preferred first step)
- Works best for small/medium datasets (<10 GB).
- Avoids the coordination overhead of distributed training.
- GPUs/TPUs accelerate deep learning tasks significantly.
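As a sketch of Option 1 (illustrative only: the earlier example uses a CPU XGBoost container, so requesting accelerators is only useful with a GPU-capable framework and training container, and the accelerator type must be available in your region):
PYTHON
# Re-run the same job definition on a larger, GPU-backed worker
model = job.run(
    args=[f"--train=gs://{BUCKET_NAME}/titanic_train.csv"],  # same script args as before (abbreviated)
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)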
Option 2: Distributed training with multiple replicas
- Supported in Vertex AI for many frameworks.
- Data is split across replicas; each replica trains on a portion and gradients are synchronized.
- More beneficial for very large datasets and long-running jobs.
When distributed training makes sense
- Dataset >10–50 GB.
- Training time >10 hours on a single machine.
- Deep learning workloads that naturally parallelize across GPUs/TPUs.
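A sketch of Option 2 (assumes the job object from earlier and a training script written to coordinate work across replicas; Vertex AI describes the cluster layout to each replica via the CLUSTER_SPEC environment variable):
PYTHON
# Request four identical workers for distributed training
model = job.run(
    args=[f"--train=gs://{BUCKET_NAME}/titanic_train.csv"],  # same script args as before (abbreviated)
    replica_count=4,
    machine_type="n1-standard-8",
)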
Key Points
- Environment initialization: Use aiplatform.init() to set defaults for project, region, and staging bucket.
- Local vs managed training: Test locally before scaling into managed jobs.
- Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
- Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
- Monitoring: Track job logs and artifacts in the Vertex AI Console.