Overview of Google Cloud for Machine Learning and AI
- Cloud platforms let you rent hardware on demand instead of buying or waiting for shared resources.
- GCP organizes its ML/AI services under Vertex AI — notebooks, training jobs, tuning, and model hosting.
- The notebook-as-controller pattern keeps your notebook cheap while offloading heavy training to dedicated Vertex AI jobs.
- Everything in this workshop can also be done from the `gcloud` CLI (Episode 8).
Notebooks as Controllers
- Use a small Workbench Instance as a controller — delegate heavy training to Vertex AI jobs.
- Workbench VMs inherit service account permissions automatically, simplifying authentication.
- Choose the same region for your Workbench Instance and storage bucket to avoid extra transfer costs.
- Apply labels to all resources for cost tracking, and enable idle auto-stop to avoid surprise charges.
Data Storage and Access
- Use GCS for scalable, cost-effective, and persistent storage in GCP.
- Persistent disks are suitable only for small, temporary datasets.
- Load data from GCS into memory with `storage.Client` or directly via `pd.read_csv("gs://...")`.
- Periodically check storage usage; track storage, transfer, and request costs to manage your GCS budget.
- Regularly delete unused data or buckets to avoid ongoing costs.
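The two loading paths above can be sketched as follows. This is a minimal sketch; the bucket and object names are placeholders, and `load_csv_via_client` assumes the `google-cloud-storage` package is installed (`load_csv_direct` additionally needs `gcsfs` for pandas to resolve `gs://` URIs):

```python
def split_gcs_uri(uri):
    """Split "gs://bucket/path/to/file" into (bucket, object path)."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a GCS URI: {uri}")
    bucket, _, blob = uri[len("gs://"):].partition("/")
    return bucket, blob

def load_csv_direct(gcs_uri):
    # pandas reads gs:// URIs directly when the gcsfs package is installed
    import pandas as pd
    return pd.read_csv(gcs_uri)

def load_csv_via_client(gcs_uri):
    # Explicit client route: download the object bytes, then parse them.
    import io
    import pandas as pd
    from google.cloud import storage  # needs google-cloud-storage
    bucket, blob = split_gcs_uri(gcs_uri)
    data = storage.Client().bucket(bucket).blob(blob).download_as_bytes()
    return pd.read_csv(io.BytesIO(data))
```

The direct route is less code; the client route gives you control over authentication, retries, and listing.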
Training Models in Vertex AI: Intro
- Environment initialization: Use `aiplatform.init()` to set defaults for project, region, and bucket.
- Local vs managed training: Test locally before scaling into managed jobs.
- Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
- Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
- Monitoring: Track job logs and artifacts in the Vertex AI Console.
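Put together, the flow looks roughly like this. The project, bucket, script, and machine type are placeholders — a sketch, not a definitive recipe:

```python
def submit_training_job():
    # Sketch only: project, bucket, and script names are placeholders.
    from google.cloud import aiplatform  # needs google-cloud-aiplatform

    aiplatform.init(
        project="my-project",           # placeholder project ID
        location="us-central1",         # keep region consistent with your bucket
        staging_bucket="gs://my-bucket",
    )
    job = aiplatform.CustomTrainingJob(
        display_name="sklearn-train",
        script_path="train.py",         # the script you tested locally first
        container_uri=(  # a pre-built training container
            "us-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest"
        ),
    )
    # base_output_dir is where the script's artifacts land (via AIP_MODEL_DIR)
    job.run(
        base_output_dir="gs://my-bucket/runs/run-001",
        replica_count=1,
        machine_type="n1-standard-4",   # start small; scale up later
    )
    return job
```

Once submitted, the job's logs and artifacts appear in the Vertex AI Console.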
Training Models in Vertex AI: PyTorch Example
- Use `CustomTrainingJob` with a prebuilt PyTorch container; your script reads `AIP_MODEL_DIR` (set automatically by `base_output_dir`) to know where to write artifacts.
- Keep artifacts together (model, metrics, history, log) in one GCS folder for reproducibility.
- `.npz` is a compact, cloud-friendly format — one GCS read per file, preserves exact dtypes.
- Start on CPU for small datasets; add a GPU only when training time justifies the extra provisioning overhead and cost.
- `staging_bucket` is just for the SDK’s packaging tarball — `base_output_dir` is where your script’s actual artifacts go.
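The script side of this pattern can be sketched as below. In a managed job `AIP_MODEL_DIR` is a `gs://` path set from `base_output_dir`; here we fall back to a local directory purely for illustration, and the artifact names are placeholders:

```python
import json
import os

import numpy as np

def save_artifacts(history, metrics, model_dir=None):
    """Write all run artifacts (history, metrics) into one output folder.

    In a Vertex AI job, AIP_MODEL_DIR is set automatically from
    base_output_dir; locally we fall back to ./artifacts for illustration.
    """
    out = model_dir or os.environ.get("AIP_MODEL_DIR", "./artifacts")
    os.makedirs(out, exist_ok=True)
    # .npz keeps exact dtypes and loads with a single read per file
    np.savez(os.path.join(out, "history.npz"),
             **{k: np.asarray(v) for k, v in history.items()})
    with open(os.path.join(out, "metrics.json"), "w") as f:
        json.dump(metrics, f)
    return out
```

Keeping everything under one `base_output_dir` run folder makes each run self-describing and easy to compare.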
Hyperparameter Tuning in Vertex AI: Neural Network Example
- Vertex AI Hyperparameter Tuning Jobs efficiently explore parameter spaces using adaptive strategies.
- Define parameter ranges in `parameter_spec`; the number of settings tried is controlled later by `max_trial_count`.
- The `hyperparameter_metric_tag` reported by `cloudml-hypertune` must exactly match the key in `metric_spec`.
- Limit `parallel_trial_count` (2–4) to help adaptive search.
- Use GCS for input/output and aggregate `metrics.json` across trials for detailed analysis.
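The metric-tag matching rule above is the usual failure point, so this sketch keeps the tag in one constant shared by both sides. The display name, parameter names, and ranges are placeholders; the script side assumes the `cloudml-hypertune` package:

```python
# The metric name must be identical in the script and the job spec.
METRIC_TAG = "val_accuracy"  # placeholder metric name

def report_metric(value, step):
    # Inside the training script: report the tuning metric each epoch.
    import hypertune  # the cloudml-hypertune package
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag=METRIC_TAG,
        metric_value=value,
        global_step=step,
    )

def make_tuning_job(base_job):
    # Job side: the same tag appears as the key of metric_spec.
    from google.cloud import aiplatform
    from google.cloud.aiplatform import hyperparameter_tuning as hpt
    return aiplatform.HyperparameterTuningJob(
        display_name="nn-tuning",
        custom_job=base_job,
        metric_spec={METRIC_TAG: "maximize"},
        parameter_spec={
            "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
            "hidden_units": hpt.IntegerParameterSpec(min=32, max=256, scale="linear"),
        },
        max_trial_count=20,      # total settings tried
        parallel_trial_count=4,  # keep small so adaptive search can learn
    )
```

If the tag and the `metric_spec` key drift apart, trials run but the tuner never sees your metric.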
Retrieval-Augmented Generation (RAG) with Vertex AI
- RAG grounds LLM answers in your own data — retrieve first, then generate.
- Vertex AI provides managed embedding and generation APIs that require minimal infrastructure.
- Chunk size, retrieval depth (`top_k`), and prompt design are the primary tuning levers.
- Always cite retrieved chunks for reproducibility and transparency.
- Embeddings are computed once and reused; generation cost scales with query volume.
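The retrieve-first step can be sketched with plain cosine similarity over precomputed embeddings. The `embed` helper is an assumption (it uses the Vertex AI text-embedding API via the `vertexai` package, with a placeholder model name); the retrieval function itself is runnable locally:

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose embeddings are most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(scores)[::-1][:k]
    # Keep indices and scores so generated answers can cite their sources
    return [(int(i), chunks[i], float(scores[i])) for i in order]

def embed(texts):
    # Managed embedding API: computed once per chunk, then reused.
    # Assumes the vertexai package; model name is a placeholder.
    from vertexai.language_models import TextEmbeddingModel
    model = TextEmbeddingModel.from_pretrained("text-embedding-004")
    return np.array([e.values for e in model.get_embeddings(texts)])
```

The retrieved chunks are then pasted into the generation prompt; `k` here is the `top_k` tuning lever from the bullet above.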
Bonus: CLI Workflows Without Notebooks
- Every Vertex AI operation available in the Python SDK has an equivalent `gcloud` CLI command.
- `gcloud ai custom-jobs create` submits training jobs from any terminal — no notebook required.
- Use `gcloud auth login` and `gcloud auth application-default login` to authenticate outside of Workbench VMs.
- Cloud Shell provides free, pre-authenticated CLI access directly in the browser.
- Shell scripts checked into version control are more reproducible than notebooks with hidden state.
- CLI workflows give no visual reminder of running resources — always check for active jobs, endpoints, and VMs before walking away.
- Notebooks and CLI workflows are complementary — use each where it fits best.
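As a command fragment, the CLI path looks like this (region, names, and container are placeholders):

```bash
# Authenticate once per machine (not needed inside Workbench or Cloud Shell)
gcloud auth login
gcloud auth application-default login

# Submit a training job from any terminal; values are placeholders
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=sklearn-train \
  --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,container-image-uri=us-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest

# Before walking away: check for running jobs and live endpoints
gcloud ai custom-jobs list --region=us-central1
gcloud ai endpoints list --region=us-central1
```

Checked into version control, a script like this reruns identically, with no hidden notebook state.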
Resource Management & Monitoring on Vertex AI (GCP)
- Check Billing → Reports regularly — know what you’re spending before it surprises you.
- Endpoints and running notebooks are the most common cost leaks; undeploy and stop first.
- Set a budget alert — it’s the single most protective action you can take.
- Configure idle shutdown on Workbench Instances so forgotten notebooks auto‑stop.
- Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
- Use labels on all resources so you can trace costs in billing reports.
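A lifecycle policy is a small JSON file; this minimal sketch deletes objects older than 90 days (the age threshold is an arbitrary example):

```json
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 90}}
  ]
}
```

Apply it to a bucket with `gcloud storage buckets update gs://BUCKET --lifecycle-file=policy.json`, after which stale objects are cleaned up automatically instead of accruing charges.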