Training Models in Vertex AI: PyTorch Example
Last updated on 2026-03-05 | Edit this page
Overview
Questions
- When should you consider a GPU (or TPU) instance for PyTorch training in Vertex AI, and what are the trade‑offs for small vs. large workloads?
- How do you launch a script‑based training job and write all artifacts (model, metrics, logs) next to each other in GCS without deploying a managed model?
Objectives
- Prepare the Titanic dataset and save train/val arrays to compressed
.npzfiles in GCS. - Submit a CustomTrainingJob that runs a PyTorch script and
explicitly writes outputs to a chosen
gs://…/artifacts/.../folder. - Co‑locate artifacts:
model.pt(or.joblib),metrics.json,eval_history.csv, andtraining.logfor reproducibility. - Choose CPU vs. GPU instances sensibly; understand when distributed training is (not) worth it.
Initial setup
1. Open pre-filled notebook
Navigate to
/Intro_GCP_for_ML/notebooks/05-Training-models-in-VertexAI-GPUs.ipynb
to begin this notebook. Select the PyTorch environment
(kernel). Local PyTorch is only needed for local tests. Your
Vertex AI job uses the container specified by
container_uri (e.g., pytorch-xla.2-4.py310 for
CPU or pytorch-gpu.2-4.py310 for GPU), so it brings its own
framework at run time.
2. CD to instance home directory
To ensure we’re all in the same starting spot, change directory to your Jupyter home directory.
3. Set environment variables
This code initializes the Vertex AI environment by importing the Python SDK, setting the project, region, and defining a GCS bucket for input/output data.
PYTHON
from google.cloud import aiplatform, storage
client = storage.Client()
PROJECT_ID = client.project
REGION = "us-central1"
BUCKET_NAME = "doe-titanic" # ADJUST to your bucket's name
LAST_NAME = 'DOE' # ADJUST to your last name. Since we're in a shared account environment, this will help us track down jobs in the Console
print(f"project = {PROJECT_ID}\nregion = {REGION}\nbucket = {BUCKET_NAME}")
# initializes the Vertex AI environment with the correct project and location. Staging bucket is used for storing the compressed software that's packaged for training/tuning jobs.
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}/.vertex_staging") # store tar balls in staging folder
Prepare data as .npz
Unlike the XGBoost script from Episode 4 (which handles preprocessing
internally from raw CSV), our PyTorch script expects pre-processed NumPy
arrays. We’ll prepare those here and save them as .npz
files.
Why .npz? NumPy’s .npz files are compressed
binary containers that can store multiple arrays (e.g., features and
labels) together in a single file:
-
Compact & fast: smaller than CSV, and one file
can hold multiple arrays (
X_train,y_train). -
Cloud-friendly: each
.npzis a single GCS object — one network call to read instead of streaming many small files, reducing latency and egress costs. -
Vertex AI integration: when you launch a training
job, GCS objects are automatically staged to the job VM’s local scratch
disk, so
np.load(...)reads from local storage at runtime. -
Reproducible: unlike CSV,
.npzpreserves exact dtypes and shapes across environments.
PYTHON
import pandas as pd
import io
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Load Titanic CSV (from local or GCS you've already downloaded to the notebook)
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob("titanic_train.csv")
df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))
# Minimal preprocessing to numeric arrays
sex_enc = LabelEncoder().fit(df["Sex"])
df["Sex"] = sex_enc.transform(df["Sex"])
df["Embarked"] = df["Embarked"].fillna("S")
emb_enc = LabelEncoder().fit(df["Embarked"])
df["Embarked"] = emb_enc.transform(df["Embarked"])
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
X = df[["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]].values
y = df["Survived"].values
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.2, random_state=42)
np.savez("/home/jupyter/train_data.npz", X_train=X_train, y_train=y_train)
np.savez("/home/jupyter/val_data.npz", X_val=X_val, y_val=y_val)
We can then upload the files to our GCS bucket.
PYTHON
# Upload to GCS
bucket.blob("data/train_data.npz").upload_from_filename("/home/jupyter/train_data.npz")
bucket.blob("data/val_data.npz").upload_from_filename("/home/jupyter/val_data.npz")
print("Uploaded: gs://%s/data/train_data.npz and val_data.npz" % BUCKET_NAME)
Verify the upload by listing your bucket contents (same pattern as Episode 3):
Minimal PyTorch training script (train_nn.py) - local
test
Running a quick test on the Workbench notebook VM is cheap — it’s a lightweight machine that costs only ~$0.19/hr. The real cost comes later when you launch managed training jobs with larger machines or GPUs. Think of your notebook as a low-cost controller: use it to catch bugs and verify logic before spending on cloud compute.
As you gain confidence, you can skip the notebook VM entirely and run
these tests on your own laptop or lab machine — then submit jobs to
Vertex AI via the gcloud CLI or Python SDK from anywhere
(see Episode 8). That eliminates the
VM cost altogether.
- For large datasets, use a small representative sample of the total dataset when testing locally (i.e., just to verify that code is working and model overfits nearly perfectly after training enough epochs)
- For larger models, use smaller model equivalents (e.g., 100M vs 7B params) when testing locally
Find this file in our repo:
Intro_GCP_for_ML/scripts/train_nn.py. It does three things:
1) loads .npz from local or GCS paths (transparently
handles both) 2) trains a small neural network (a 3-layer MLP) with
early stopping 3) writes all outputs side‑by‑side (model + metrics +
eval history + training.log) to the folder specified by the
AIP_MODEL_DIR environment variable (set automatically by
Vertex AI via base_output_dir), falling back to the current
directory for local runs.
What’s inside train_nn.py? (Quick
reference)
You don’t need to understand every line of the PyTorch code for this workshop — the focus is on how to package and run any training script on Vertex AI. That said, here’s a quick orientation:
-
GCS helpers (top of file):
read_npz_any()andsave_*_any()functions detectgs://paths and use the GCS Python client automatically. This is the key pattern that makes the same script work locally and in the cloud. -
AIP_MODEL_DIR: Vertex AI sets this environment variable to tell your script where to write artifacts. The script reads it at the top ofmain(). -
Model: A small feedforward network
(
TitanicNet) — the architecture details aren’t important for this lesson. -
Early stopping: Training halts when validation loss
stops improving (controlled by
--patience). This saves compute time and cost on cloud jobs.
To test this code, we can run the following:
PYTHON
# configure training hyperparameters to use in all model training runs downstream
MAX_EPOCHS = 500
LR = 0.001
PATIENCE = 50
# local training run
import time as t
start = t.time()
# Example: run your custom training script with args
!python /home/jupyter/Intro_GCP_for_ML/scripts/train_nn.py \
--train /home/jupyter/train_data.npz \
--val /home/jupyter/val_data.npz \
--epochs $MAX_EPOCHS \
--learning_rate $LR \
--patience $PATIENCE
print(f"Total local runtime: {t.time() - start:.2f} seconds")
NumPy version mismatch?
Reproducibility test
Without reproducibility, it’s impossible to gain reliable insights into the efficacy of our methods. An essential component of applied ML/AI is ensuring our experiments are reproducible. Let’s first rerun the same code we did above to verify we get the same result.
- Take a look near the top of
Intro_GCP_for_ML/scripts/train_nn.pywhere we are setting multiple numpy and torch seeds to ensure reproducibility.
PYTHON
import time as t
start = t.time()
# Example: run your custom training script with args
!python /home/jupyter/Intro_GCP_for_ML/scripts/train_nn.py \
--train /home/jupyter/train_data.npz \
--val /home/jupyter/val_data.npz \
--epochs $MAX_EPOCHS \
--learning_rate $LR \
--patience $PATIENCE
print(f"Total local runtime: {t.time() - start:.2f} seconds")
Please don’t use cloud resources for code that is not reproducible!
Evaluate the locally trained model on the validation data
Let’s load the model we just trained and run it against the validation set. This confirms the saved weights produce the expected accuracy before we move to cloud training.
PYTHON
import sys, torch, numpy as np
sys.path.append("/home/jupyter/Intro_GCP_for_ML/scripts")
from train_nn import TitanicNet
# load validation data
d = np.load("/home/jupyter/val_data.npz")
X_val, y_val = d["X_val"], d["y_val"]
# Convert to PyTorch tensors with the dtypes the model expects:
# - Features → float32: neural-network layers (Linear, BatchNorm) operate on floats.
# - Labels → long (int64): nn.BCEWithLogitsLoss (and most classification losses)
# expect integer class labels, not floats.
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val, dtype=torch.long)
# rebuild model and load weights
m = TitanicNet()
state = torch.load("/home/jupyter/model.pt", map_location="cpu", weights_only=True)
m.load_state_dict(state)
m.eval()
with torch.no_grad():
probs = m(X_val_t).squeeze(1) # [N], sigmoid outputs in (0,1)
preds_t = (probs >= 0.5).long() # [N] int64
correct = (preds_t == y_val_t).sum().item()
acc = correct / y_val_t.shape[0]
print(f"Local model val accuracy: {acc:.4f}")
We should see an accuracy that matches our best epoch in the local training run. Note that in our setup, early stopping is based on validation loss; not accuracy.
Launch the training job
In the previous episode, we trained an XGBoost model using Vertex AI’s CustomTrainingJob interface. Here, we’ll do the same for a PyTorch neural network. The structure is nearly identical — we define a training script, select a prebuilt container (CPU or GPU), and specify where to write all outputs in Google Cloud Storage (GCS). The main difference is that PyTorch requires us to save our own model weights and metrics inside the script rather than relying on Vertex to package a model automatically.
Set training job configuration vars
Check supported container versions
Container URI format matters. The container must be
registered for python package training (used by
CustomTrainingJob). Use the pytorch-xla
variant with a Python-version suffix — e.g.,
pytorch-xla.2-4.py310:latest. The pytorch-cpu
and pytorch-gpu variants may not be registered for python
package training.
Google periodically retires older versions. If you see an
INVALID_ARGUMENT error about an unsupported image, check
the current list at Prebuilt
containers for training and update the version number.
PYTHON
import datetime as dt
RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/pytorch/{RUN_ID}"
IMAGE = 'us-docker.pkg.dev/vertex-ai/training/pytorch-xla.2-4.py310:latest'
MACHINE = "n1-standard-4" # CPU fine for small datasets
print(f"RUN_ID = {RUN_ID}\nARTIFACT_DIR = {ARTIFACT_DIR}\nMACHINE = {MACHINE}")
Init the training job with configurations
PYTHON
# init job (this does not consume any resources)
DISPLAY_NAME = f"{LAST_NAME}_pytorch_nn_{RUN_ID}"
print(DISPLAY_NAME)
# init the job. This does not consume resources until we run job.run()
job = aiplatform.CustomTrainingJob(
display_name=DISPLAY_NAME,
script_path="Intro_GCP_for_ML/scripts/train_nn.py",
container_uri=IMAGE)
Run the job, paying for our MACHINE on-demand.
PYTHON
job.run(
args=[
f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
f"--epochs={MAX_EPOCHS}",
f"--learning_rate={LR}",
f"--patience={PATIENCE}",
],
replica_count=1,
machine_type=MACHINE,
base_output_dir=ARTIFACT_DIR, # sets AIP_MODEL_DIR used by your script
sync=True,
)
print("Artifacts folder:", ARTIFACT_DIR)
Monitoring training jobs in the Console
Why do I see both a Training Pipeline and a Custom Job? Under the hood,
CustomTrainingJob.run()creates a TrainingPipeline resource, which in turn launches a CustomJob to do the actual compute work. This is normal — the pipeline is a thin wrapper that manages job lifecycle. You can monitor progress from either view, but Custom Jobs shows the most useful details (logs, machine type, status).
- Go to the Google Cloud Console.
- Navigate to Vertex AI > Training > Custom Jobs.
- Click on your job name to see status, logs, and output model artifacts.
- Cancel jobs from the console if needed (be careful not to stop jobs you don’t own in shared projects).
After the job completes, your training script writes several output
files to the GCS artifact directory. Here’s what you’ll find in
gs://…/artifacts/pytorch/<RUN_ID>/:
-
model.pt— PyTorch weights (state_dict). -
metrics.json— final val loss, hyperparameters, dataset sizes, device, model URI. -
eval_history.csv— per‑epoch validation loss (for plots/regression checks). -
training.log— complete stdout/stderr for reproducibility and debugging.
Evaluate the Vertex-trained model on the validation data
We can check our work to see if this model gives the same result as our “locally” trained model above.
To follow best practices, we will simply load this model into memory from GCS.
PYTHON
import sys, torch, numpy as np
sys.path.append("/home/jupyter/Intro_GCP_for_ML/scripts")
from train_nn import TitanicNet
# -----------------
# download model.pt straight into memory and load weights
# -----------------
ARTIFACT_PREFIX = f"artifacts/pytorch/{RUN_ID}/model"
MODEL_PATH = f"{ARTIFACT_PREFIX}/model.pt"
model_blob = bucket.blob(MODEL_PATH)
model_bytes = model_blob.download_as_bytes()
# load from bytes
model_pt = io.BytesIO(model_bytes)
# rebuild model and load weights
state = torch.load(model_pt, map_location="cpu", weights_only=True)
m = TitanicNet()
m.load_state_dict(state)
m.eval();
Evaluate using the same pattern from the CPU evaluation section above — load validation data from GCS, run predictions, and check accuracy. The results should match the CPU job since we set random seeds.
PYTHON
# Read validation data from GCS (reuses val data from local eval above)
VAL_PATH = "data/val_data.npz"
val_blob = bucket.blob(VAL_PATH)
val_bytes = val_blob.download_as_bytes()
d = np.load(io.BytesIO(val_bytes))
X_val, y_val = d["X_val"], d["y_val"]
X_val_t = torch.tensor(X_val, dtype=torch.float32) # features → float for network layers
y_val_t = torch.tensor(y_val, dtype=torch.long) # labels → int64 for loss function
with torch.no_grad():
probs = m(X_val_t).squeeze(1)
preds_t = (probs >= 0.5).long()
correct = (preds_t == y_val_t).sum().item()
acc = correct / y_val_t.shape[0]
print(f"Vertex model val accuracy: {acc:.4f}")
GPU-Accelerated Training on Vertex AI
Our CPU job above worked fine for this small dataset. In practice, you’d switch to a GPU when training takes too long on CPU — typically with larger models (millions of parameters) or larger datasets (hundreds of thousands of rows). For the Titanic dataset, the GPU will likely be slower end-to-end due to provisioning overhead, but we’ll run it here to learn the workflow.
The changes from CPU to GPU are minimal — this is one of the advantages of Vertex AI’s container-based approach:
- The container image switches to the GPU-enabled version
(
pytorch-gpu.2-4.py310:latest), which includes CUDA and cuDNN. - The machine type (
n1-standard-8) defines CPU and memory resources, while we add a GPU accelerator (NVIDIA_TESLA_T4,NVIDIA_L4, etc.). For guidance on selecting a machine type and accelerator, visit the Compute for ML resource. - The training script, arguments, and artifact handling all stay the same.
PYTHON
from google.cloud import aiplatform
RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
# GCS folder where ALL artifacts (model.pt, metrics.json, eval_history.csv, training.log) will be saved.
# Your train_nn.py writes to AIP_MODEL_DIR, and base_output_dir (below) sets that variable for the job.
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/pytorch/{RUN_ID}"
# ---- Container image ----
# Use a prebuilt TRAINING image that has PyTorch + CUDA. This enables GPU at runtime.
IMAGE = "us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-4.py310:latest"
# ---- Machine vs Accelerator (important!) ----
# machine_type = the VM's CPU/RAM shape. It is NOT a GPU by itself.
# We often pick n1-standard-8 as a balanced baseline for single-GPU jobs.
MACHINE = "n1-standard-8"
# To actually get a GPU, you *attach* one via accelerator_type + accelerator_count.
# Common choices:
# "NVIDIA_TESLA_T4" (cost-effective, widely available)
# "NVIDIA_L4" (newer, CUDA 12.x, good perf/$)
# "NVIDIA_TESLA_V100" / "NVIDIA_A100_40GB" (high-end, pricey)
ACCELERATOR_TYPE = "NVIDIA_TESLA_T4"
ACCELERATOR_COUNT = 1 # Increase (2,4) only if your code supports multi-GPU (e.g., DDP)
# Alternative (GPU-bundled) machines:
# If you pick an A2 type like "a2-highgpu-1g", it already includes 1 A100 GPU.
# In that case, you can omit accelerator_type/accelerator_count entirely.
# Example:
# MACHINE = "a2-highgpu-1g"
# (and then remove the accelerator_* kwargs in job.run)
print(
"RUN_ID =", RUN_ID,
"\nARTIFACT_DIR =", ARTIFACT_DIR,
"\nIMAGE =", IMAGE,
"\nMACHINE =", MACHINE,
"\nACCELERATOR_TYPE =", ACCELERATOR_TYPE,
"\nACCELERATOR_COUNT =", ACCELERATOR_COUNT,
)
DISPLAY_NAME = f"{LAST_NAME}_pytorch_nn_{RUN_ID}"
job = aiplatform.CustomTrainingJob(
display_name=DISPLAY_NAME,
script_path="Intro_GCP_for_ML/scripts/train_nn.py", # Your PyTorch trainer
container_uri=IMAGE, # Must be a *training* image (not prediction)
)
job.run(
args=[
f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
f"--epochs={MAX_EPOCHS}",
f"--learning_rate={LR}",
f"--patience={PATIENCE}",
],
replica_count=1, # One worker (simple, cheaper)
machine_type=MACHINE, # CPU/RAM shape of the VM (no GPU implied)
accelerator_type=ACCELERATOR_TYPE, # Attaches the selected GPU model
accelerator_count=ACCELERATOR_COUNT, # Number of GPUs to attach
base_output_dir=ARTIFACT_DIR, # Sets AIP_MODEL_DIR used by your script for all artifacts
sync=True, # Waits for job to finish so you can inspect outputs immediately
)
print("Artifacts folder:", ARTIFACT_DIR)
Just as we did for the CPU job, let’s evaluate the GPU-trained model to confirm it produces the same accuracy. We load the model weights directly from GCS into memory.
PYTHON
import sys, torch, numpy as np
sys.path.append("/home/jupyter/Intro_GCP_for_ML/scripts")
from train_nn import TitanicNet
# -----------------
# download model.pt straight into memory and load weights
# -----------------
ARTIFACT_PREFIX = f"artifacts/pytorch/{RUN_ID}/model"
MODEL_PATH = f"{ARTIFACT_PREFIX}/model.pt"
model_blob = bucket.blob(MODEL_PATH)
model_bytes = model_blob.download_as_bytes()
# load from bytes
model_pt = io.BytesIO(model_bytes)
# rebuild model and load weights
state = torch.load(model_pt, map_location="cpu", weights_only=True)
m = TitanicNet()
m.load_state_dict(state)
m.eval();
Evaluate the GPU model using the same pattern — results should match
because we set random seeds in train_nn.py.
PYTHON
with torch.no_grad():
probs = m(X_val_t).squeeze(1)
preds_t = (probs >= 0.5).long()
correct = (preds_t == y_val_t).sum().item()
acc = correct / y_val_t.shape[0]
print(f"GPU model val accuracy: {acc:.4f}")
Cloud workflow review
Now that you’ve run both a CPU and GPU training job, answer the following:
-
Artifact location: Where did Vertex AI write your
model artifacts? How does
base_output_dirinjob.run()relate to theAIP_MODEL_DIRenvironment variable inside the container? - CPU vs. GPU job time: Compare the wall-clock times of your CPU and GPU jobs (visible in the Console under Vertex AI > Training > Custom Jobs). Which was faster? Why might the GPU job be slower for this dataset?
-
Container choice: We used
pytorch-xla.2-4.py310for the CPU job andpytorch-gpu.2-4.py310for the GPU job. What would happen if you used the CPU container but still passedaccelerator_typeandaccelerator_count? -
Cost awareness: You used
n1-standard-4for CPU andn1-standard-8+ T4 for GPU. Using the Compute for ML resource, estimate the relative hourly cost difference between these configurations.
-
base_output_dirtells the Vertex AI SDK to set theAIP_MODEL_DIRenvironment variable inside the training container. Your script readsos.environ.get("AIP_MODEL_DIR", ".")and writes all artifacts there. The result is everything lands undergs://<bucket>/artifacts/pytorch/<RUN_ID>/model/. - For the small Titanic dataset (~700 training rows), the CPU job is typically faster end-to-end. GPU jobs incur extra overhead: provisioning the accelerator, loading CUDA libraries, and transferring data to the GPU. GPU acceleration pays off when training itself is the bottleneck (larger models, larger batches).
- The job would either fail or ignore the GPU. The CPU container doesn’t include CUDA/cuDNN, so even if a GPU is attached to the VM, PyTorch can’t use it. Always match your container image to your hardware configuration.
- Approximate on-demand rates (us-central1):
n1-standard-4is ~$0.19/hr;n1-standard-8+ 1x T4 is ~$0.54/hr (VM) + ~$0.35/hr (T4) = ~$0.89/hr total. The GPU configuration is roughly 4–5x more expensive per hour — worth it only when training speedup exceeds that cost ratio.
GPU and scaling considerations
- On small problems, GPU startup/transfer overhead can erase speedups — benchmark before you scale.
- Stick to a single GPU unless your workload genuinely saturates it. Multi-GPU (data parallelism / DDP) and model parallelism exist for large-scale training but add significant complexity and cost — well beyond this workshop’s scope.
Clean up staging files
As in Episode 4, each job.run() call leaves a tarball
under .vertex_staging/. Delete them to keep your bucket
tidy:
Additional resources
To learn more about PyTorch and Vertex AI integrations, visit the docs: docs.cloud.google.com/vertex-ai/docs/start/pytorch
- Use CustomTrainingJob with a prebuilt PyTorch
container; your script reads
AIP_MODEL_DIR(set automatically bybase_output_dir) to know where to write artifacts. - Keep artifacts together (model, metrics, history, log) in one GCS folder for reproducibility.
-
.npzis a compact, cloud-friendly format — one GCS read per file, preserves exact dtypes. - Start on CPU for small datasets; add a GPU only when training time justifies the extra provisioning overhead and cost.
-
staging_bucketis just for the SDK’s packaging tarball —base_output_diris where your script’s actual artifacts go.