Overview of Google Cloud Vertex AI
- Vertex AI simplifies ML workflows by integrating data, training,
tuning, and deployment in one managed platform.
- It reduces the need for manual orchestration compared to traditional
research computing environments.
- Cost monitoring and resource tracking help keep cloud usage affordable for research projects.
Data Storage: Setting up GCS
- Use GCS for scalable, cost-effective, and persistent storage in
GCP.
- Persistent disks are suitable only for small, temporary
datasets.
- Track your storage, transfer, and request costs to manage
expenses.
- Regularly delete unused data or buckets to avoid ongoing costs.
Notebooks as Controllers
- Use a small Vertex AI Workbench notebook instance as a controller to
manage larger, resource-intensive tasks.
- Submit training and tuning jobs to scalable instances using the
Vertex AI SDK.
- Labels help track costs effectively, especially in shared or
multi-project environments.
- Vertex AI Workbench integrates directly with GCS and Vertex AI services, making it a hub for ML workflows.
Accessing and Managing Data in GCS with Vertex AI Notebooks
- Load data from GCS into memory to avoid managing local copies when
possible.
- Periodically check storage usage and costs to manage your GCS
budget.
- Use Vertex AI Workbench notebooks to upload analysis results back to GCS, keeping workflows organized and reproducible.
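The load-into-memory pattern above can be sketched with only the standard library. In a Workbench notebook the bytes would come from GCS (e.g. via `blob.download_as_bytes()` from the `google-cloud-storage` client); here a literal byte string stands in, and the bucket/object names in the comment are placeholders.

```python
import csv
import io

# In a Vertex AI Workbench notebook, the bytes would come from GCS, e.g.:
#   from google.cloud import storage
#   data = storage.Client().bucket("my-bucket").blob("data.csv").download_as_bytes()
# A literal byte string stands in here so the pattern runs anywhere.
data = b"sample_id,value\ns1,0.42\ns2,0.87\n"

# Wrap the in-memory bytes in a file-like object -- no local copy on disk.
rows = list(csv.DictReader(io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")))
print(rows[0]["value"])
```

Because the object never touches local disk, there is no stray copy to clean up when the notebook instance is stopped or deleted.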
Using a GitHub Personal Access Token (PAT) to Push/Pull from a Vertex AI Notebook
- Use a GitHub PAT for HTTPS-based authentication in Vertex AI
Workbench notebooks.
- Securely enter sensitive information in notebooks using `getpass`.
- Converting `.ipynb` files to `.py` files helps with cleaner version control.
- Adding `.ipynb` files to `.gitignore` keeps your repository organized.
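A minimal sketch of the PAT workflow, assuming a repository named `your-org/your-repo` (a placeholder): the token is prompted with `getpass` so it never appears in notebook output or the saved `.ipynb`. The interactive prompt is shown in a comment so the sketch runs non-interactively.

```python
import getpass

def https_remote_url(token: str, repo: str) -> str:
    """Build an HTTPS remote URL that embeds a GitHub PAT.

    `repo` is "owner/name"; the token is never written to disk here.
    """
    return f"https://{token}@github.com/{repo}.git"

# In a notebook you would prompt interactively so the token is never
# stored in the file:
#   token = getpass.getpass("GitHub PAT: ")
# A placeholder stands in here so the sketch runs non-interactively.
token = "ghp_example"  # hypothetical value, not a real token
url = https_remote_url(token, "your-org/your-repo")
# Then, inside the notebook: !git remote set-url origin <url>
```

Embedding the token in the remote URL works for both `git push` and `git pull`; just avoid printing the URL, since it contains the credential.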
Training Models in Vertex AI: Intro
- Environment initialization: Use `aiplatform.init()` to set defaults for project, region, and bucket.
- Local vs. managed training: Test locally before scaling into managed jobs.
- Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
- Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
- Monitoring: Track job logs and artifacts in the Vertex AI Console.
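One way to keep a training script portable between local testing and a managed job is to read the `AIP_MODEL_DIR` environment variable, which Vertex AI injects into custom jobs to say where artifacts should be written. A minimal sketch, with the local fallback path and artifact name chosen for illustration:

```python
import os

# Vertex AI custom jobs inject AIP_MODEL_DIR (a gs:// URI) telling the
# training script where to write artifacts; outside the platform we
# fall back to a local path so the same script runs in both places.
model_dir = os.environ.get("AIP_MODEL_DIR", "/tmp/model")
os.makedirs(model_dir, exist_ok=True)  # works for the local fallback

artifact_path = os.path.join(model_dir, "weights.txt")
with open(artifact_path, "w") as f:
    f.write("placeholder weights")  # stand-in for a real model file
print(artifact_path)
```

This is the "test locally before scaling" pattern from the bullets above: the script needs no code changes when it moves from a notebook smoke test into a managed training job.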
Training Models in Vertex AI: PyTorch Example
- `.npz` files streamline PyTorch data handling and reduce I/O overhead.
- GPUs may not speed up small models/datasets due to overhead.
- Vertex AI supports both CPU and GPU training, with scaling via
multiple replicas.
- Data parallelism splits the data across replicas, while model parallelism splits the model's layers across devices; choose based on model size.
- Test locally first before launching expensive training jobs.
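The `.npz` workflow can be sketched with NumPy alone (file name and array shapes are illustrative): pack features and labels into one compressed archive, so a single object moves to and from GCS instead of many small files.

```python
import numpy as np

# Write features and labels into a single compressed .npz archive --
# one file to copy to/from GCS instead of many small ones.
X = np.random.rand(100, 8).astype(np.float32)
y = (np.random.rand(100) > 0.5).astype(np.int64)
np.savez_compressed("train.npz", X=X, y=y)

# Loading returns a lazy, dict-like object keyed by the names above.
data = np.load("train.npz")
X_loaded, y_loaded = data["X"], data["y"]
# In a PyTorch training script these arrays would be wrapped, e.g. with
# torch.from_numpy(X_loaded), before building a Dataset/DataLoader.
print(X_loaded.shape, y_loaded.shape)
```

Fewer, larger objects also cut GCS request costs, which bill per operation rather than per byte.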
Hyperparameter Tuning in Vertex AI: Neural Network Example
- Vertex AI Hyperparameter Tuning Jobs let you efficiently explore
parameter spaces using adaptive strategies.
- Always test with `max_trial_count=1` first to confirm your setup works.
- Limit `parallel_trial_count` to a small number (2–4) to benefit from adaptive search.
- Use GCS for input/output and monitor jobs through the Vertex AI Console.
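Inside each trial, the tuning service launches the same training script and passes that trial's sampled values as command-line flags. A sketch of the script side, with illustrative parameter names (they must match whatever the tuning job spec declares); the metric-reporting call shown in the comment uses the `cloudml-hypertune` package:

```python
import argparse

# Each trial runs the same training script; Vertex AI passes that
# trial's sampled values as CLI flags matching the parameter names
# declared in the tuning job spec (names here are illustrative).
parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.01)
parser.add_argument("--hidden_units", type=int, default=64)
# Simulated flags stand in for what a real trial would pass:
args = parser.parse_args(["--learning_rate", "0.001", "--hidden_units", "128"])

# ... train with args.learning_rate / args.hidden_units, then report
# the objective metric back, e.g. with the cloudml-hypertune package:
#   import hypertune
#   hypertune.HyperTune().report_hyperparameter_tuning_metric(
#       hyperparameter_metric_tag="val_accuracy",
#       metric_value=val_accuracy,
#       global_step=epoch)
print(args.learning_rate, args.hidden_units)
```

Since trials only communicate through flags in and a reported metric out, the same script works unchanged for a `max_trial_count=1` smoke test and a full sweep.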
Resource Management & Monitoring on Vertex AI (GCP)
- Endpoints and running notebooks are the most common cost leaks; undeploy models and stop idle notebooks first.
- Prefer Managed Notebooks with idle shutdown; schedule nightly auto-stop.
- Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
- Standardize labels, set budgets, and enable billing export for visibility.
- Use `gcloud`/`gsutil` to audit and clean up quickly; automate with Cloud Scheduler + Cloud Run/Functions.