Training Models in Vertex AI: Intro

Last updated on 2025-08-27

Estimated time: 30 minutes

Overview

Questions

  • What are the differences between training locally in a Vertex AI notebook and using Vertex AI-managed training jobs?
  • How do custom training jobs in Vertex AI streamline the training process for various frameworks?
  • How does Vertex AI handle scaling across CPUs, GPUs, and TPUs?

Objectives

  • Understand the difference between local training in a Vertex AI Workbench notebook and submitting managed training jobs.
  • Learn to configure and use Vertex AI custom training jobs for different frameworks (e.g., XGBoost, PyTorch, scikit-learn).
  • Understand scaling options in Vertex AI, including when to use CPUs, GPUs, or TPUs.
  • Compare performance, cost, and setup between custom scripts and pre-built containers in Vertex AI.
  • Conduct training with data stored in GCS and monitor training job status using the Google Cloud Console.

Initial setup


1. Open a new .ipynb notebook

Open a fresh Jupyter notebook inside your Vertex AI Workbench instance. You can name it something like Training-models.ipynb.

2. CD to instance home directory

So that we can all reference helper functions consistently, change to your Jupyter home directory.

PYTHON

%cd /home/jupyter/

3. Initialize Vertex AI environment

This code initializes the Vertex AI environment by importing the Python SDK, setting the project, region, and defining a GCS bucket for input/output data.

PYTHON

from google.cloud import aiplatform
import pandas as pd

# Set your project and region (replace with your values)
PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"
BUCKET_NAME = "your-gcs-bucket"

# Initialize Vertex AI client
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")
  • aiplatform.init(): Sets defaults for project, region, and staging bucket.
  • PROJECT_ID: Identifies your GCP project.
  • REGION: Determines where training jobs run (choose a region close to your data).
  • staging_bucket: A GCS bucket for storing datasets, model artifacts, and job outputs.

4. Get code from GitHub repo (skip if already completed)

If you didn’t complete the earlier episodes, clone our code repo before moving forward. First, check that you’re in your Jupyter home folder.

PYTHON

%cd /home/jupyter/

PYTHON

# Uncomment below line only if you still need to download the code repo (replace username with your GitHub username)
#!git clone https://github.com/username/GCP_helpers.git

Testing train.py locally in the notebook


Before scaling training jobs onto managed resources, it’s essential to test your training script locally. This prevents wasting GPU/TPU time on bugs or misconfigured code.

Guidelines for testing ML pipelines before scaling

  • Run tests locally first with small datasets.
  • Use a subset of your dataset (1–5%) for fast checks.
  • Start with minimal compute before moving to larger accelerators.
  • Log key metrics such as loss curves and runtimes.
  • Verify correctness first before scaling up.
Discussion

What tests should we do before scaling?

Before scaling to multiple or more powerful instances (e.g., GPUs or TPUs), it’s important to run a few sanity checks. In your group, discuss:

  • Which checks do you think are most critical before scaling up?
  • What potential issues might we miss if we skip this step?

Useful checks to run before scaling up (a sketch of the first two follows this callout):

  • Data loads correctly – dataset loads without errors, expected columns exist, missing values handled.
  • Overfitting check – train on a tiny dataset (e.g., 100 rows). If it doesn’t overfit, something is off.
  • Loss behavior – verify training loss decreases and doesn’t diverge.
  • Runtime estimate – get a rough sense of training time on small data.
  • Memory estimate – check approximate memory use.
  • Save & reload – ensure model saves, reloads, and infers without errors.

Skipping these can lead to: silent data bugs, runtime blowups at scale, inefficient experiments, or broken model artifacts.
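
A minimal sketch of the first two checks (data loading and a tiny overfit run), assuming titanic_train.csv has been downloaded into the notebook (see the next section), that the xgboost package is available in your notebook environment, and that the file uses the standard Titanic column names:

PYTHON

import pandas as pd
from xgboost import XGBClassifier  # assumes xgboost is installed in the notebook environment

# 1. Data loads correctly: check shape, expected columns, and missing values
df = pd.read_csv("titanic_train.csv")
print(df.shape)
print(df.isna().sum())

# 2. Overfitting check: a tiny subset (~100 rows) should reach near-perfect training accuracy
tiny = df.sample(n=min(100, len(df)), random_state=0)
features = ["Pclass", "Age", "Fare"]  # placeholder feature columns; adjust to your preprocessing
X = tiny[features].fillna(0)
y = tiny["Survived"]

model = XGBClassifier(n_estimators=200, max_depth=3)
model.fit(X, y)
print("Training accuracy on tiny subset:", model.score(X, y))  # expect a value close to 1.0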

Download data into notebook environment


Sometimes it’s helpful to keep a copy of data in your notebook VM for quick iteration, even though GCS is the preferred storage location.

PYTHON

from google.cloud import storage

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

blob = bucket.blob("titanic_train.csv")
blob.download_to_filename("titanic_train.csv")

print("Downloaded titanic_train.csv")

Repeat for the test dataset as needed.
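
For example, if the test split lives in the same bucket (the object name titanic_test.csv is an assumption; adjust it to match your bucket):

PYTHON

# Download the test split alongside the training data (assumed filename)
blob = bucket.blob("titanic_test.csv")
blob.download_to_filename("titanic_test.csv")

print("Downloaded titanic_test.csv")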

Logging runtime & instance info


When comparing runtimes later, it’s useful to know what instance type you ran on. For Workbench:

PYTHON

!cat /sys/class/dmi/id/product_name

This prints the product name of the VM backing your notebook; on Compute Engine VMs it is typically just "Google Compute Engine" rather than the specific machine type.
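
To record the exact machine type, one option (a sketch using the Compute Engine metadata server, which is reachable from inside the Workbench VM) is:

PYTHON

# Ask the metadata server for this VM's machine type.
# The response ends with the machine type name, e.g. .../machineTypes/n1-standard-4
!curl -s -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/machine-type"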

Local test run of train.py


PYTHON

import time as t

start = t.time()

# Example: run your custom training script with args
!python GCP_helpers/train_xgboost.py --max_depth 3 --eta 0.1 --subsample 0.8 --colsample_bytree 0.8 --num_round 100 --train titanic_train.csv

print(f"Total local runtime: {t.time() - start:.2f} seconds")

Training on this small dataset should take <1 minute. Log runtime as a baseline.
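
If you want a simple record of these baselines for later comparison, a minimal sketch might look like the following (the runtime_log.csv filename, labels, and helper function are just examples):

PYTHON

import csv
import os

def log_runtime(log_path, label, machine_type, seconds):
    """Append one row per run so local and managed runtimes can be compared later."""
    write_header = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["label", "machine_type", "runtime_seconds"])
        writer.writerow([label, machine_type, round(seconds, 2)])

# Example usage: replace the values with the runtime and machine type you observed above
log_runtime("runtime_log.csv", "local-xgboost", "n1-standard-4", 45.0)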

Training via Vertex AI custom training job


Unlike “local” training, this launches a managed training job that runs on scalable compute. Vertex AI handles provisioning, scaling, logging, and saving outputs to GCS.

Which machine type to start with?

Start with a small CPU machine like n1-standard-4. Only scale up to GPUs/TPUs once you’ve verified your script. See Instances for ML on GCP for guidance.

Creating a custom training job with the SDK

PYTHON

from google.cloud import aiplatform

job = aiplatform.CustomTrainingJob(
    display_name="xgboost-train",
    script_path="GCP_helpers/train_xgboost.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.1-5:latest",
    requirements=["pandas", "scikit-learn", "joblib"],
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-5:latest",
)

# Run the training job; script arguments are passed to .run(), not the constructor
model = job.run(
    args=[
        "--max_depth=3",
        "--eta=0.1",
        "--subsample=0.8",
        "--colsample_bytree=0.8",
        "--num_round=100",
        f"--train=gs://{BUCKET_NAME}/titanic_train.csv",
    ],
    replica_count=1,
    machine_type="n1-standard-4",
)

This launches a managed training job with Vertex AI. Logs and trained models are automatically stored in your GCS bucket.

Monitoring training jobs in the Console


  1. Go to the Google Cloud Console.
  2. Navigate to Vertex AI > Training > Custom Jobs.
  3. Click on your job name to see status, logs, and output model artifacts.
  4. Cancel jobs from the console if needed (be careful not to stop jobs you don’t own in shared projects).
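
You can also check status from the notebook with the Python SDK. A small sketch, assuming job is the CustomTrainingJob object created above:

PYTHON

# Inspect the job object created earlier in this episode
print(job.resource_name)  # full resource name of the training pipeline
print(job.state)          # e.g., PipelineState.PIPELINE_STATE_SUCCEEDED once training finishes

# List recent custom training jobs in this project and region
for j in aiplatform.CustomTrainingJob.list():
    print(j.display_name, j.state)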

When training takes too long


Two main options in Vertex AI:

  • Option 1: Upgrade to more powerful machine types (e.g., add GPUs like T4, V100, A100).
  • Option 2: Use distributed training with multiple replicas.

Option 1: Upgrade machine type (preferred first step)

  • Works best for small/medium datasets (<10 GB).
  • Avoids the coordination overhead of distributed training.
  • GPUs/TPUs accelerate deep learning tasks significantly (see the sketch below for requesting a GPU).
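
A rough sketch of requesting a GPU with the SDK. This is not a drop-in change: the CPU XGBoost container used earlier will not use a GPU, so the container URI below is a placeholder for a GPU-capable training image or custom container appropriate for your framework, and the accelerator type must be supported in your region.

PYTHON

gpu_job = aiplatform.CustomTrainingJob(
    display_name="train-gpu",
    script_path="GCP_helpers/train_xgboost.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/your-gpu-image:latest",  # placeholder: use a GPU-capable image
    requirements=["pandas", "scikit-learn", "joblib"],
)

# Request a GPU by adding accelerator settings to .run()
gpu_job.run(
    args=[f"--train=gs://{BUCKET_NAME}/titanic_train.csv"],
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)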

Option 2: Distributed training with multiple replicas

  • Supported in Vertex AI for many frameworks.
  • Data is split across replicas; each trains on a portion, and gradients are synchronized.
  • More beneficial for very large datasets and long-running jobs (see the sketch below).
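
With the SDK, this mostly amounts to raising replica_count on .run(). A sketch, assuming dist_job is a CustomTrainingJob defined as in the earlier example and that your training script is written to coordinate across replicas:

PYTHON

# replica_count > 1 provisions one chief plus (replica_count - 1) workers.
# Vertex AI handles provisioning and teardown, but the training code itself
# must implement the distributed strategy for your framework.
dist_job.run(
    args=[f"--train=gs://{BUCKET_NAME}/titanic_train.csv"],
    replica_count=4,
    machine_type="n1-standard-8",
)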

When distributed training makes sense

  • Dataset >10–50 GB.
  • Training time >10 hours on a single machine.
  • Deep learning workloads that naturally parallelize across GPUs/TPUs.
Key Points
  • Environment initialization: Use aiplatform.init() to set defaults for project, region, and bucket.
  • Local vs managed training: Test locally before scaling into managed jobs.
  • Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
  • Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
  • Monitoring: Track job logs and artifacts in the Vertex AI Console.