Overview of Google Cloud for Machine Learning and AI


  • Cloud platforms let you rent hardware on demand instead of buying or waiting for shared resources.
  • GCP organizes its ML/AI services under Vertex AI — notebooks, training jobs, tuning, and model hosting.
  • The notebook-as-controller pattern keeps your notebook cheap while offloading heavy training to dedicated Vertex AI jobs.
  • Everything in this workshop can also be done from the gcloud CLI (Episode 8).

Notebooks as Controllers


  • Use a small Workbench Instance as a controller — delegate heavy training to Vertex AI jobs.
  • Workbench VMs inherit service account permissions automatically, simplifying authentication.
  • Choose the same region for your Workbench Instance and storage bucket to avoid extra transfer costs.
  • Apply labels to all resources for cost tracking, and enable idle auto-stop to avoid surprise charges.

Data Storage and Access


  • Use Google Cloud Storage (GCS) for scalable, cost-effective, and persistent storage in GCP.
  • Persistent disks are suitable only for small, temporary datasets.
  • Load data from GCS into memory with storage.Client or directly via pd.read_csv("gs://...") (the latter requires the gcsfs package).
  • Periodically review storage usage and track storage, transfer, and request costs to stay within your GCS budget.
  • Regularly delete unused data or buckets to avoid ongoing costs.
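The cost bullets above reduce to simple arithmetic. A minimal sketch; the function name and per-GB prices are illustrative placeholders, not current GCP rates, so check the pricing page and Billing → Reports for real numbers.

```python
# Rough GCS cost estimate. Prices below are ILLUSTRATIVE placeholders,
# not actual GCP rates; look up current pricing before relying on this.
STORAGE_PRICE_PER_GB_MONTH = 0.020   # assumed standard-storage rate (USD)
EGRESS_PRICE_PER_GB = 0.12           # assumed internet egress rate (USD)

def estimate_monthly_cost(stored_gb: float, egress_gb: float = 0.0) -> float:
    """Return an approximate monthly bill in USD for storage plus egress."""
    return stored_gb * STORAGE_PRICE_PER_GB_MONTH + egress_gb * EGRESS_PRICE_PER_GB

print(f"100 GB stored, 10 GB egress: ${estimate_monthly_cost(100, 10):.2f}")
```

Even a rough number like this tells you whether deleting an unused bucket is worth the effort.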

Training Models in Vertex AI: Intro


  • Environment initialization: Use aiplatform.init() to set defaults for project, region, and bucket.
  • Local vs managed training: Test locally before scaling into managed jobs.
  • Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
  • Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
  • Monitoring: Track job logs and artifacts in the Vertex AI Console.
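The local-vs-managed advice can be built into the training script itself. A minimal sketch using only the standard library: the argument names are my own, while AIP_MODEL_DIR is the environment variable Vertex AI sets when base_output_dir is supplied.

```python
# Training-script skeleton that runs unchanged locally and as a Vertex AI
# custom job: the job simply executes your script with the args you pass it.
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--lr", type=float, default=1e-3)
    # Vertex AI sets AIP_MODEL_DIR when base_output_dir is given;
    # fall back to a local directory so the same script runs on a laptop.
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("AIP_MODEL_DIR", "local_output"),
    )
    return parser.parse_args(argv)

if __name__ == "__main__":
    # Args hard-coded here for the example; drop the list to read real CLI args.
    args = parse_args(["--epochs", "3"])
    print(f"training for {args.epochs} epochs, writing to {args.model_dir}")
```

Because nothing here depends on the cloud, you can debug the full argument flow locally before submitting a managed job.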

Training Models in Vertex AI: PyTorch Example


  • Use CustomTrainingJob with a prebuilt PyTorch container; your script reads AIP_MODEL_DIR (set automatically by base_output_dir) to know where to write artifacts.
  • Keep artifacts together (model, metrics, history, log) in one GCS folder for reproducibility.
  • .npz is a compact, cloud-friendly format — one GCS read per file, preserves exact dtypes.
  • Start on CPU for small datasets; add a GPU only when training time justifies the extra provisioning overhead and cost.
  • staging_bucket is just for the SDK’s packaging tarball — base_output_dir is where your script’s actual artifacts go.
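A minimal sketch of the keep-artifacts-together pattern, using a local fallback directory so it runs anywhere. In a real managed job AIP_MODEL_DIR is a gs:// URI (prebuilt training containers typically also expose buckets under a /gcs FUSE mount), so the plain filesystem calls below cover the local case; file names and values are illustrative.

```python
# Keep all artifacts (arrays, metrics, history) in one output directory.
import json
import os
import numpy as np

model_dir = os.environ.get("AIP_MODEL_DIR", "local_output")
os.makedirs(model_dir, exist_ok=True)

# .npz keeps exact dtypes and loads with a single read per file.
X = np.arange(6, dtype=np.float32).reshape(2, 3)
np.savez(os.path.join(model_dir, "data.npz"), X=X)

with open(os.path.join(model_dir, "metrics.json"), "w") as f:
    json.dump({"val_loss": 0.42}, f)   # illustrative value

loaded = np.load(os.path.join(model_dir, "data.npz"))["X"]
assert loaded.dtype == np.float32      # dtype preserved exactly
```

With everything under one folder, a single GCS prefix reproduces the whole run.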

Hyperparameter Tuning in Vertex AI: Neural Network Example


  • Vertex AI Hyperparameter Tuning Jobs explore parameter spaces efficiently using adaptive strategies such as Bayesian optimization.
  • Define parameter ranges in parameter_spec; the number of settings tried is controlled later by max_trial_count.
  • The hyperparameter_metric_tag reported by cloudml-hypertune must exactly match the key in metric_spec.
  • Keep parallel_trial_count low (2–4): adaptive search learns from completed trials, so high parallelism reduces that feedback.
  • Use GCS for input/output and aggregate metrics.json across trials for detailed analysis.
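The aggregate-metrics step can be sketched without touching the cloud: here a local directory stands in for the GCS output folder, and the trial_*/metrics.json layout and val_loss key are assumptions of this example, not Vertex AI conventions.

```python
# Aggregate per-trial metrics.json files and pick the best trial.
import json
import pathlib

root = pathlib.Path("trials_demo")
for i, loss in enumerate([0.9, 0.5, 0.7], start=1):
    d = root / f"trial_{i}"
    d.mkdir(parents=True, exist_ok=True)
    (d / "metrics.json").write_text(json.dumps({"val_loss": loss}))

# Sort (loss, trial-name) pairs ascending; the first entry is the winner.
results = sorted(
    (json.loads(p.read_text())["val_loss"], p.parent.name)
    for p in root.glob("trial_*/metrics.json")
)
best_loss, best_trial = results[0]
print(best_trial, best_loss)  # trial_2 0.5
```

The same loop over gs:// paths (via the storage client) gives you a cross-trial summary that the console does not show in one view.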

Retrieval-Augmented Generation (RAG) with Vertex AI


  • RAG grounds LLM answers in your own data — retrieve first, then generate.
  • Vertex AI provides managed embedding and generation APIs that require minimal infrastructure.
  • Chunk size, retrieval depth (top_k), and prompt design are the primary tuning levers.
  • Always cite retrieved chunks for reproducibility and transparency.
  • Embeddings are computed once and reused; generation cost scales with query volume.
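The retrieve-then-generate loop can be demonstrated end to end with a toy embedding. The hashing "embedding" below is a deterministic stand-in for a real embedding API (such as Vertex AI's text embedding models); only the chunking and top_k mechanics are meant to carry over.

```python
import numpy as np

def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into fixed-size character chunks (real systems often
    chunk by tokens or sentences)."""
    return [text[i : i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    """Deterministic bag-of-characters vectors -- NOT a real embedding."""
    vecs = np.zeros((len(texts), 256))
    for row, t in enumerate(texts):
        for ch in t.lower():
            vecs[row, ord(ch) % 256] += 1
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)

docs = chunk("Vertex AI hosts models. GCS stores data. Budgets guard cost.", size=25)
doc_vecs = embed(docs)                 # computed once, reused per query
query_vec = embed(["where is data stored"])[0]
top_k = 2
scores = doc_vecs @ query_vec          # cosine similarity (unit vectors)
best = np.argsort(scores)[::-1][:top_k]
retrieved = [docs[i] for i in best]    # pass these to the generator, with citations
```

Swapping in real embeddings changes only embed(); chunk size and top_k remain the tuning levers named above.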

Bonus: CLI Workflows Without Notebooks


  • Nearly every Vertex AI operation available in the Python SDK has an equivalent gcloud CLI command.
  • gcloud ai custom-jobs create submits training jobs from any terminal — no notebook required.
  • Use gcloud auth login and gcloud auth application-default login to authenticate outside of Workbench VMs.
  • Cloud Shell provides free, pre-authenticated CLI access directly in the browser.
  • Shell scripts checked into version control are more reproducible than notebooks with hidden state.
  • CLI workflows give no visual reminder of running resources — always check for active jobs, endpoints, and VMs before walking away.
  • Notebooks and CLI workflows are complementary — use each where it fits best.

Resource Management & Monitoring on Vertex AI (GCP)


  • Check Billing → Reports regularly — know what you’re spending before it surprises you.
  • Endpoints and running notebooks are the most common cost leaks; undeploy and stop first.
  • Set a budget alert — it’s the single most protective action you can take.
  • Configure idle shutdown on Workbench Instances so forgotten notebooks auto-stop.
  • Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
  • Use labels on all resources so you can trace costs in billing reports.
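The lifecycle-policy bullet maps to a small JSON document. A sketch that deletes objects older than 90 days (the age threshold is an example, not a recommendation):

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 90}
    }
  ]
}
```

Save it as lifecycle.json and apply it with gsutil lifecycle set lifecycle.json gs://YOUR_BUCKET, or through the bucket's Lifecycle tab in the console.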