Hyperparameter Tuning in Vertex AI: Neural Network Example

Last updated on 2026-03-05

Overview

Questions

  • How can we efficiently manage hyperparameter tuning in Vertex AI?
  • How can we parallelize tuning jobs to optimize time without increasing costs?

Objectives

  • Set up and run a hyperparameter tuning job in Vertex AI.
  • Define search spaces using DoubleParameterSpec and IntegerParameterSpec.
  • Log and capture objective metrics for evaluating tuning success.
  • Optimize tuning setup to balance cost and efficiency, including parallelization.

In the previous episode (Episode 5) you submitted a single PyTorch training job to Vertex AI and inspected its artifacts. That gave you one model trained with one set of hyperparameters. In practice, choices like learning rate, early-stopping patience, and regularization thresholds can dramatically affect model quality — and the best combination is rarely obvious up front.

In this episode we’ll use Vertex AI’s Hyperparameter Tuning Jobs to systematically search for better settings. The key is defining a clear search space, ensuring metrics are properly logged, and keeping costs manageable by controlling the number of trials and level of parallelization.

Key steps for hyperparameter tuning

The overall process involves these steps:

  1. Prepare the training script and ensure metrics are logged.
  2. Define the hyperparameter search space.
  3. Configure a hyperparameter tuning job in Vertex AI.
  4. Set data paths and launch the tuning job.
  5. Monitor progress in the Vertex AI Console.
  6. Extract the best model and inspect recorded metrics.

Initial setup


1. Open pre-filled notebook

Navigate to /Intro_GCP_for_ML/notebooks/06-Hyperparameter-tuning.ipynb to begin this notebook. Select the PyTorch environment (kernel). Local PyTorch is only needed for local tests — your Vertex AI job uses the container specified by container_uri (e.g., pytorch-xla.2-4.py310), so it brings its own framework at run time.

2. CD to instance home directory

Change to your Jupyter home folder to keep paths consistent.

PYTHON

%cd /home/jupyter/

Prepare and configure the tuning job


3. Understand how the training script reports metrics

Your training script (train_nn.py) already includes hyperparameter tuning metric reporting — you don’t need to modify it. Here’s how it works:

The script uses the cloudml-hypertune library (pre-installed on Vertex AI training workers) to report metrics so the tuner can compare trials. A try/except block lets the same script run locally without crashing:

PYTHON

# Already in train_nn.py — initialization near the top:
try:
    from hypertune import HyperTune
    _hpt = HyperTune()
    _hpt_enabled = True
except Exception:
    _hpt = None
    _hpt_enabled = False

Inside the training loop, after computing validation metrics each epoch:

PYTHON

# Already in train_nn.py — inside the epoch loop:
if _hpt_enabled:
    _hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="validation_accuracy",
        metric_value=val_acc,
        global_step=ep,
    )

The critical detail: the hyperparameter_metric_tag string must exactly match the key you use in metric_spec when configuring the tuning job (e.g., "validation_accuracy"). If they don’t match, trials will show as INFEASIBLE.
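Because a mismatch only shows up after trials have already run, it can be worth comparing the two strings before launching. The sketch below assumes a hypothetical helper, check_metric_tag, which is not part of the Vertex AI SDK, and assumes you copy the tag string from train_nn.py by hand:

```python
def check_metric_tag(reported_tag: str, metric_spec: dict) -> None:
    """Raise early if the tag reported by the script is not a key in metric_spec."""
    if reported_tag not in metric_spec:
        raise ValueError(
            f"Script reports {reported_tag!r}, but metric_spec expects one of "
            f"{sorted(metric_spec)}. Trials would likely be marked INFEASIBLE."
        )


# Hypothetical usage: the tag is copied from train_nn.py, the dict from the notebook.
metric_spec = {"validation_accuracy": "maximize"}
check_metric_tag("validation_accuracy", metric_spec)  # matches, so nothing is raised
```

The comparison is exact on purpose: "val_accuracy" or "Validation_Accuracy" would both raise.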

4. Define hyperparameter search space

This step defines which parameters Vertex AI will vary across trials and their allowed ranges. The number of total settings tested is determined later using max_trial_count.

Vertex AI uses Bayesian optimization by default (internally listed as "ALGORITHM_UNSPECIFIED" in the API). That means if you don’t explicitly specify a search algorithm, Vertex AI automatically applies an adaptive Bayesian strategy to balance exploration (trying new areas of the parameter space) and exploitation (focusing near the best results so far). Each completed trial helps the tuner model how your objective metric (for example, validation_accuracy) changes across parameter values. Subsequent trials then sample new parameter combinations that are statistically more likely to improve performance, which usually yields better results than random or grid search—especially when max_trial_count is limited.

Vertex AI supports four parameter spec types. This episode uses the first two:

| Spec type | Use case | Example |
|---|---|---|
| DoubleParameterSpec | Continuous floats | Learning rate 1e-4 to 1e-2 |
| IntegerParameterSpec | Whole numbers | Patience 5 to 20 |
| DiscreteParameterSpec | Specific numeric values | Batch size [32, 64, 128] |
| CategoricalParameterSpec | Named options (strings) | Optimizer ["adam", "sgd"] |

Include early-stopping parameters so the tuner can learn good stopping behavior for your dataset:

PYTHON

from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

parameter_spec = {
    "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-2, scale="log"),
    "patience": hpt.IntegerParameterSpec(min=5, max=20, scale="linear"),
    "min_delta": hpt.DoubleParameterSpec(min=1e-6, max=1e-3, scale="log"),
}

5. Initialize Vertex AI, project, and bucket

Initialize the Vertex AI SDK and set your staging and artifact locations in GCS.

PYTHON

from google.cloud import aiplatform, storage
import datetime as dt

client = storage.Client()
PROJECT_ID = client.project
REGION = "us-central1"
LAST_NAME = "DOE"  # change to your name or unique ID
BUCKET_NAME = "doe-titanic"  # replace with your bucket name

aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=f"gs://{BUCKET_NAME}/.vertex_staging",
)

6. Define runtime configuration

Create a unique run ID and set the container, machine type, and base output directory for artifacts. Each variable controls a different aspect of the training environment:

  • RUN_ID — a timestamp that uniquely identifies this tuning session, used to organize artifacts in GCS.
  • ARTIFACT_DIR — the GCS folder where all trial outputs (models, metrics, logs) will be written.
  • IMAGE — the prebuilt Docker container that includes PyTorch and its dependencies.
  • MACHINE — the VM shape (CPU/RAM) for each trial. Start small for testing.
  • ACCELERATOR_TYPE / ACCELERATOR_COUNT — set to unspecified/0 for CPU-only runs. As we saw in Episode 5, GPU overhead isn’t worth it for a dataset this small, and HP tuning launches multiple trials, so unnecessary GPUs multiply cost quickly. Change these to attach a GPU when your model or data genuinely benefits from one.

PYTHON

RUN_ID = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
ARTIFACT_DIR = f"gs://{BUCKET_NAME}/artifacts/pytorch_hpt/{RUN_ID}"

IMAGE = "us-docker.pkg.dev/vertex-ai/training/pytorch-xla.2-4.py310:latest"  # XLA container includes cloudml-hypertune
MACHINE = "n1-standard-4"
ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"
ACCELERATOR_COUNT = 0

7. Configure hyperparameter tuning job

When you use Vertex AI Hyperparameter Tuning Jobs, each trial needs a complete, runnable training configuration: the script, its arguments, the container image, and the compute environment.
Rather than defining these pieces inline each time, we create a CustomJob to hold that configuration.

The CustomJob acts as the blueprint for running a single training task — specifying exactly what to run and on what resources. The tuner then reuses that job definition across all trials, automatically substituting in new hyperparameter values for each run.

This approach has a few practical advantages:

  • You only define the environment once — machine type, accelerators, and output directories are all reused across trials.
  • The tuner can safely inject trial-specific parameters (those declared in parameter_spec) while leaving other arguments unchanged.
  • It provides a clean separation between what a single job does (CustomJob) and how many times to repeat it with new settings (HyperparameterTuningJob).
  • It avoids the extra abstraction layers of higher-level wrappers like CustomTrainingJob, which automatically package code and environments. Using CustomJob.from_local_script keeps the workflow predictable and explicit.

In short:
CustomJob defines how to run one training run.
HyperparameterTuningJob defines how to repeat it with different parameter sets and track results.
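To make the substitution concrete, here is a minimal sketch of how a sampled value could override a default on the command line. It assumes the tuner passes each sampled hyperparameter as an extra --name=value argument after the ones listed in args, and that the script parses its flags with argparse. The parser and the trial values below are hypothetical stand-ins, not the real train_nn.py or real tuner output:

```python
import argparse


def parse_args(argv):
    """Hypothetical stand-in for the argument parser inside a training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    parser.add_argument("--patience", type=int, default=10)
    parser.add_argument("--min_delta", type=float, default=1e-3)
    return parser.parse_args(argv)


# Defaults defined once in the CustomJob.
base_args = ["--learning_rate=0.001", "--patience=10", "--min_delta=0.001"]
# Values a tuner might sample for one trial (made up for illustration).
trial_args = ["--learning_rate=0.0042", "--patience=7"]

# When argparse sees the same flag twice, the last occurrence wins.
args = parse_args(base_args + trial_args)
print(args.learning_rate, args.patience, args.min_delta)
```

Under these assumptions, learning_rate and patience take the trial's values while min_delta keeps the default from the CustomJob.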

The number of total runs is set by max_trial_count, and the number of simultaneous runs is controlled by parallel_trial_count. Each trial’s output and metrics are logged under the GCS base_output_dir.

For a first pass, we’ll run 3 trials fully in parallel. With only 3 trials the adaptive optimizer has almost nothing to learn from, so running them simultaneously costs no search quality. This still validates that the full pipeline works end-to-end (metrics are reported, artifacts land in GCS, the tuner picks a best trial) while giving you a quick look at how results vary across different parameter combinations.

PYTHON

# metric_spec = {"validation_loss": "minimize"} - also stored by train_nn.py
metric_spec = {"validation_accuracy": "maximize"}

custom_job = aiplatform.CustomJob.from_local_script(
    display_name=f"{LAST_NAME}_pytorch_hpt-trial_{RUN_ID}",
    script_path="Intro_GCP_for_ML/scripts/train_nn.py",
    container_uri=IMAGE,
    requirements=["python-json-logger>=2.0.7"],  # resolves a dependency conflict in the prebuilt container
    args=[
        f"--train=gs://{BUCKET_NAME}/data/train_data.npz",
        f"--val=gs://{BUCKET_NAME}/data/val_data.npz",
        "--learning_rate=0.001",        # HPT will override when sampling
        "--patience=10",                # HPT will override when sampling
        "--min_delta=0.001",            # HPT will override when sampling
    ],
    base_output_dir=ARTIFACT_DIR,
    machine_type=MACHINE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
)

DISPLAY_NAME = f"{LAST_NAME}_pytorch_hpt_{RUN_ID}"

# Start with a small batch of 3 trials, all in parallel.
# With so few trials the adaptive optimizer has nothing to learn from,
# so full parallelism costs no search quality — and finishes faster.
tuning_job = aiplatform.HyperparameterTuningJob(
    display_name=DISPLAY_NAME,
    custom_job=custom_job,                 # must be a CustomJob (not CustomTrainingJob)
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=3,                     # small initial sweep
    parallel_trial_count=3,                # all at once — adaptive search needs more data to help
    # search_algorithm=None,      # default = adaptive search (Bayesian); ALGORITHM_UNSPECIFIED in the API
    # search_algorithm="random",  # optional override: random search
    # search_algorithm="grid",    # optional override: grid search (integer/discrete/categorical parameters only)
)

tuning_job.run(sync=True)
print("HPT artifacts base:", ARTIFACT_DIR)

Run and analyze results


8. Monitor tuning job

Open Vertex AI → Training → Hyperparameter tuning jobs in the Cloud Console to track trials, parameters, and metrics. You can also stop jobs from the console if needed.

Note: Replace the project ID in the URL below with your own if you are not using the shared workshop project.

For the MLM25 workshop: Hyperparameter tuning jobs.

Callout

Troubleshooting common HPT issues

  • All trials show INFEASIBLE: The hyperparameter_metric_tag in your training script doesn’t match the key in metric_spec. Double-check spelling and case — "validation_accuracy" is not "val_accuracy".
  • Quota errors on launch: Your project may not have enough VM or GPU quota in the selected region. Check IAM & Admin → Quotas and request an increase or switch to a smaller MACHINE type.
  • Trial succeeds but metrics are empty: Make sure cloudml-hypertune is importable inside the container. The prebuilt PyTorch containers include it. If using a custom container, add cloudml-hypertune to your requirements.
  • Job stuck in PENDING: Another tuning or training job may be consuming your quota. Check Vertex AI → Training for running jobs.

9. Inspect best trial results

After completion, look up the best configuration and objective value from the SDK:

PYTHON

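# The trials list is not guaranteed to be sorted by score, so select the best one explicitly.
scored = [t for t in tuning_job.trials if t.final_measurement.metrics]  # skip trials with no metric
best_trial = max(scored, key=lambda t: t.final_measurement.metrics[0].value)
print("Best hyperparameters:", best_trial.parameters)
print("Best validation_accuracy:", best_trial.final_measurement.metrics[0].value)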

10. Review recorded metrics in GCS

Your script writes a metrics.json (with keys such as final_val_accuracy, final_val_loss) to each trial’s output directory (under ARTIFACT_DIR). The snippet below aggregates those into a dataframe for side-by-side comparison.

PYTHON

from google.cloud import storage
import json, pandas as pd

def list_metrics_from_gcs(artifact_dir: str) -> pd.DataFrame:
    """Collect every metrics.json found under artifact_dir into one dataframe."""
    client = storage.Client()
    # Split "gs://bucket/path/to/run" into the bucket name and the object prefix.
    bucket_name, _, prefix = artifact_dir.replace("gs://", "", 1).partition("/")

    records = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("metrics.json"):
            # Path: …/{RUN_ID}/{trial_number}/model/metrics.json → [-3] = trial number
            data = json.loads(blob.download_as_text())
            data["trial_id"] = blob.name.split("/")[-3]
            records.append(data)
    return pd.DataFrame(records)

df = list_metrics_from_gcs(ARTIFACT_DIR)
cols = ["trial_id","final_val_accuracy","final_val_loss","best_val_loss",
        "best_epoch","patience","min_delta","learning_rate"]
df_sorted = df[cols].sort_values("final_val_accuracy", ascending=False)
print(df_sorted)
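The trial number comes from the position of the folder in the object path, so it is worth checking that logic without touching GCS. A minimal sketch, assuming the layout described in the comment above ({RUN_ID}/{trial_number}/model/metrics.json); the helper name and the example path are hypothetical:

```python
def trial_id_from_blob_name(blob_name: str) -> str:
    """Return the trial folder name, assumed to sit two levels above metrics.json."""
    parts = blob_name.split("/")
    if len(parts) < 3 or parts[-1] != "metrics.json":
        raise ValueError(f"Unexpected metrics path: {blob_name!r}")
    return parts[-3]


# Made-up object name that follows the assumed layout.
example = "artifacts/pytorch_hpt/20260305-101500/2/model/metrics.json"
print(trial_id_from_blob_name(example))  # → 2
```

If your script writes metrics.json at a different depth, the index has to change accordingly.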

11. Visualize trial comparison

A quick chart makes it easier to see which trials performed best and how learning rate relates to accuracy:

PYTHON

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart: accuracy per trial
axes[0].barh(df_sorted["trial_id"].astype(str), df_sorted["final_val_accuracy"])
axes[0].set_xlabel("Validation Accuracy")
axes[0].set_ylabel("Trial")
axes[0].set_title("Accuracy by Trial")

# Scatter: learning rate vs accuracy (color = patience)
sc = axes[1].scatter(
    df_sorted["learning_rate"], df_sorted["final_val_accuracy"],
    c=df_sorted["patience"], cmap="viridis", edgecolors="k", s=80,
)
axes[1].set_xscale("log")
axes[1].set_xlabel("Learning Rate (log scale)")
axes[1].set_ylabel("Validation Accuracy")
axes[1].set_title("LR vs. Accuracy (color = patience)")
plt.colorbar(sc, ax=axes[1], label="patience")

plt.tight_layout()
plt.show()

Challenge

Exercise 1: Widen the learning-rate search space

The current search space uses min=1e-4, max=1e-2 for learning rate. Suppose you suspect that slightly larger learning rates (up to 0.1) might converge faster with early stopping enabled.

  1. Update parameter_spec to widen the learning_rate range to max=0.1.
  2. Thinking question: Why does scale="log" make sense for learning rate but scale="linear" makes sense for patience?
  3. Do not run the job yet — just update the configuration.

PYTHON

parameter_spec = {
    "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
    "patience": hpt.IntegerParameterSpec(min=5, max=20, scale="linear"),
    "min_delta": hpt.DoubleParameterSpec(min=1e-6, max=1e-3, scale="log"),
}

Why log vs. linear? Learning rate values span several orders of magnitude (0.0001 to 0.1), so scale="log" ensures the tuner samples evenly across those orders rather than clustering near the high end. Patience is an integer (5–20) where each step is equally meaningful, so scale="linear" is appropriate.
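The difference is easy to see with plain random sampling. This sketch draws uniformly on a linear axis and on a log axis over the widened range, then counts how many draws fall below 0.01. It only illustrates the two scales and does not reproduce Vertex AI's adaptive sampler; the function names are made up for illustration:

```python
import math
import random


def sample_linear(lo, hi, rng):
    """Draw uniformly between lo and hi."""
    return rng.uniform(lo, hi)


def sample_log(lo, hi, rng):
    """Draw uniformly in log10 space, then map back to the original scale."""
    return 10 ** rng.uniform(math.log10(lo), math.log10(hi))


def share_below(samples, threshold):
    """Fraction of samples strictly below the threshold."""
    return sum(s < threshold for s in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
lo, hi, n = 1e-4, 1e-1, 10_000
linear = [sample_linear(lo, hi, rng) for _ in range(n)]
log = [sample_log(lo, hi, rng) for _ in range(n)]

print(f"linear: {share_below(linear, 1e-2):.0%} of samples below 1e-2")
print(f"log:    {share_below(log, 1e-2):.0%} of samples below 1e-2")
```

On a linear axis only about a tenth of the draws land below 0.01, because that interval is roughly a tenth of the range. On a log axis about two thirds do, because the interval covers two of the three decades.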

PYTHON

tuning_job = aiplatform.HyperparameterTuningJob(
    display_name=DISPLAY_NAME,
    custom_job=custom_job,
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=12,
    parallel_trial_count=3,
)

Cost estimate: assuming each trial takes about 5 minutes, 12 trials × 5 min = 60 minutes of compute. At roughly $0.19/hr for n1-standard-4, that comes to about $0.19 in total. With parallel_trial_count=3, wall-clock time would be approximately 20 minutes (4 batches of 3 trials).
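The same arithmetic can be wrapped in a small function so you can try other settings. This is a rough sketch: the helper name is hypothetical, and it ignores provisioning and startup overhead and assumes every trial takes the same time and that trials run in lockstep batches:

```python
import math


def estimate_tuning_cost(max_trials, parallel_trials, minutes_per_trial, hourly_rate):
    """Return (total cost in dollars, wall-clock minutes) under simple assumptions."""
    compute_hours = max_trials * minutes_per_trial / 60
    cost = compute_hours * hourly_rate
    batches = math.ceil(max_trials / parallel_trials)  # trials assumed to run in lockstep
    wall_clock_minutes = batches * minutes_per_trial
    return cost, wall_clock_minutes


cost, wall = estimate_tuning_cost(
    max_trials=12, parallel_trials=3, minutes_per_trial=5, hourly_rate=0.19
)
print(f"~${cost:.2f} total, ~{wall:.0f} min wall-clock")  # → ~$0.19 total, ~20 min wall-clock
```

Changing parallel_trials alters only the wall-clock figure; in this model the total cost depends on the number of trials and their duration.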

Why not run all 12 in parallel? With 12 trials we have enough data for the adaptive optimizer to learn: after each batch of 3 completes, the tuner updates its model of which regions of the search space are promising and steers the next batch toward them. Running all 12 at once would turn the search into an expensive random sweep — every trial would be launched “blind” before any results come back.

Discussion

What is the effect of parallelism in tuning?

  • How might running 10 trials in parallel differ from running 2 at a time in terms of cost, time, and result quality?
  • When would you want to prioritize speed over adaptive search benefits?

| Factor | High parallelism (e.g., 10) | Low parallelism (e.g., 2) |
|---|---|---|
| Wall-clock time | Shorter | Longer |
| Total cost | ~Same (slightly more overhead) | ~Same |
| Adaptive search quality | Worse (tuner explores "blind") | Better (tuner learns between batches) |
| Best for | Cheap/short trials, deadlines | Expensive trials, small budgets |

Why does parallelism hurt result quality? Vertex AI’s adaptive search learns from completed trials to choose better parameter combinations. With many trials in flight simultaneously, the tuner can’t incorporate results quickly — it explores “blind” for longer, often yielding slightly worse results for a fixed max_trial_count. With modest parallelism (2–4), the tuner can update beliefs and exploit promising regions between batches.
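One rough way to quantify "blind" is to count how many trials start only after some earlier trial has finished. The sketch below uses a deliberately simplified model in which trials run in lockstep batches, so only the first batch is launched with no results available. The helper name is hypothetical, and the real scheduler may start new trials as soon as any single trial completes:

```python
def informed_trial_count(max_trials: int, parallel_trials: int) -> int:
    """Trials launched after at least one earlier batch has completed."""
    first_batch = min(parallel_trials, max_trials)  # these start with no results to learn from
    return max_trials - first_batch


for parallel in (2, 3, 12):
    informed = informed_trial_count(12, parallel)
    print(f"parallel={parallel}: {informed} of 12 trials can use earlier results")
```

With 12 trials, parallelism of 2 or 3 leaves 10 or 9 trials that can benefit from earlier results, while running all 12 at once leaves none.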

Guidelines:

  • Keep parallel_trial_count to ≤ 25–33% of max_trial_count when you care about adaptive quality.
  • Increase parallelism when trials are long and the search space is well-bounded.

Callout

When to prioritize speed vs. adaptive quality

Favor higher parallelism when you have strict deadlines, very cheap/short trials where startup time dominates, a non-adaptive search, or unused quota/credits.

Favor lower parallelism when trials are expensive or noisy, max_trial_count is small (≤ 10–20), early stopping is enabled, or you’re exploring many dimensions at once.

Practical recipe:

  • First run: max_trial_count=3, parallel_trial_count=3 (pipeline sanity check; too few trials for adaptive search to help, so run them all at once).
  • Main run: max_trial_count=10–20, parallel_trial_count=2–4 (enough trials for the optimizer to learn between batches).
  • Scale up parallelism only after the above completes cleanly and you confirm adaptive performance is acceptable.

Clean up staging files


HP tuning launches multiple trials, so staging tarballs accumulate even faster. Delete them when you’re done:

PYTHON

!gsutil -m rm -r gs://{BUCKET_NAME}/.vertex_staging/

What’s next: using your tuned model


After tuning, your best model’s weights sit in GCS under the best trial’s artifact directory. The most common next steps are:

  • Batch prediction (most common): Load the best model from GCS and run inference on a dataset — this is what we did in the evaluation sections of Episodes 4–5 when we loaded models from GCS into memory. For larger-scale batch prediction, Vertex AI offers Batch Prediction Jobs that handle provisioning and scaling automatically.
  • Experiment tracking: Vertex AI Experiments can log metrics, parameters, and artifacts across runs for systematic comparison. Consider integrating this into your workflow as your projects grow.
  • Online deployment: If you need real-time predictions via an API, Vertex AI Endpoints let you deploy your model. Endpoints bill continuously (about $4.50/day for an n1-standard-4), so only deploy when you genuinely need a live API.

Key Points

  • Vertex AI Hyperparameter Tuning Jobs efficiently explore parameter spaces using adaptive strategies.
  • Define parameter ranges in parameter_spec; the number of settings tried is controlled later by max_trial_count.
  • The hyperparameter_metric_tag reported by cloudml-hypertune must exactly match the key in metric_spec.
  • Limit parallel_trial_count (2–4) to help adaptive search.
  • Use GCS for input/output and aggregate metrics.json across trials for detailed analysis.