Overview of Google Cloud for Machine Learning


  • GCP and AWS both provide the essential components for running ML workloads at scale.
  • GCP emphasizes simplicity, open frameworks, and TPU access; AWS offers broader hardware and automation options.
  • TPUs are efficient for TensorFlow and JAX, but GPU-based workflows (common on AWS) remain more flexible across frameworks.
  • Both platforms now provide strong cost tracking and sustainability tools, with only minor differences in interface and ecosystem integration.
  • Using a notebook as a controller provides reproducibility and helps manage compute and storage resources consistently across clouds.

Data Storage: Setting up GCS


  • Use GCS for scalable, cost-effective, and persistent storage in GCP.
  • Persistent disks are suitable only for small, temporary datasets.
  • Track your storage, transfer, and request costs to manage expenses.
  • Regularly delete unused data or buckets to avoid ongoing costs.
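The bucket-setup steps above can be sketched with the `google-cloud-storage` client. The project ID, bucket name, and region below are placeholders, and the cloud call is kept inside a function so the sketch reads without credentials configured:

```python
def bucket_uri(bucket_name: str) -> str:
    """Build the gs:// URI that gsutil and the Vertex AI SDK expect."""
    return f"gs://{bucket_name}"


def create_bucket(project_id: str, bucket_name: str, region: str = "us-central1"):
    """Create a regional GCS bucket and return the Bucket object."""
    # Imported inside the function so the sketch can be read without
    # the google-cloud-storage package installed.
    from google.cloud import storage

    client = storage.Client(project=project_id)
    # Keeping the bucket in the same region as your compute avoids
    # cross-region transfer charges.
    return client.create_bucket(bucket_name, location=region)


# Example (requires configured credentials):
# bucket = create_bucket("my-project", "my-ml-bucket")
# print(bucket_uri("my-ml-bucket"))
```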

Notebooks as Controllers


  • Use a small Workbench Instance notebook as a controller to manage larger, resource-intensive tasks.
  • Always navigate to the “Instances” tab in Workbench, since older notebook types are deprecated.
  • Choose the same region for your Workbench Instance and storage bucket to avoid extra transfer costs.
  • Submit training and tuning jobs to scalable instances using the Vertex AI SDK.
  • Labels help track costs effectively, especially in shared or multi-project environments.
  • Workbench Instances come with JupyterLab 3 and GPU frameworks preinstalled, making them an easy entry point for ML workflows.
  • Enable idle auto-stop to avoid unexpected charges when notebooks are left running.
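A minimal sketch of the controller pattern: a small notebook submits a larger job through the Vertex AI SDK, with labels attached for cost tracking. The display name, owner, and machine type are illustrative choices, not fixed conventions:

```python
def normalize_label(value: str) -> str:
    """GCP label values allow lowercase letters, digits, hyphens, and
    underscores, up to 63 characters; coerce anything else."""
    cleaned = "".join(c if c.isalnum() or c in "-_" else "-" for c in value.lower())
    return cleaned[:63]


def submit_controlled_job(project, region, bucket, script_path, container_uri, owner):
    """Submit a training job from a small controller notebook (sketch)."""
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=region, staging_bucket=f"gs://{bucket}")
    job = aiplatform.CustomTrainingJob(
        display_name="controller-submitted-job",
        script_path=script_path,
        container_uri=container_uri,
        # Labels make this job filterable in billing reports.
        labels={"owner": normalize_label(owner), "purpose": "training"},
    )
    # The heavy lifting runs on a separate scalable instance,
    # not on the small controller notebook.
    job.run(machine_type="n1-standard-4", replica_count=1)
```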

Accessing and Managing Data in GCS with Vertex AI Notebooks


  • When possible, load data from GCS directly into memory to avoid managing local copies.
  • Periodically check storage usage and costs to manage your GCS budget.
  • Use Vertex AI Workbench notebooks to upload analysis results back to GCS, keeping workflows organized and reproducible.
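The in-memory workflow above can be sketched as follows; pandas resolves `gs://` paths through the `gcsfs` package, which Workbench Instances typically include. Bucket and object names are placeholders:

```python
def gcs_path(bucket: str, *parts: str) -> str:
    """Join object-name parts into a gs:// path."""
    return "gs://" + "/".join([bucket, *parts])


def load_csv_from_gcs(bucket: str, blob_name: str):
    """Read a CSV from GCS straight into memory -- no local copy to clean up."""
    import pandas as pd  # pandas delegates gs:// I/O to gcsfs

    return pd.read_csv(gcs_path(bucket, blob_name))


def save_results_to_gcs(df, bucket: str, blob_name: str) -> str:
    """Write analysis results back to the bucket and return the object path."""
    out = gcs_path(bucket, "results", blob_name)
    df.to_csv(out, index=False)
    return out
```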

Using a GitHub Personal Access Token (PAT) to Push/Pull from a Vertex AI Notebook


  • Use a GitHub PAT for HTTPS-based authentication in Vertex AI Workbench notebooks.
  • Securely enter sensitive information in notebooks using getpass.
  • Converting .ipynb files to .py files helps with cleaner version control.
  • Adding .ipynb files to .gitignore keeps your repository organized.
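The PAT workflow can be sketched as below: `getpass` hides the token at the prompt, and the credentialed URL is set as the `origin` remote so pushes and pulls authenticate over HTTPS. The function names are this sketch's own:

```python
from getpass import getpass


def tokenized_remote(repo_url: str, username: str, token: str) -> str:
    """Embed PAT credentials into an HTTPS remote URL."""
    assert repo_url.startswith("https://"), "PAT auth works over HTTPS remotes"
    return repo_url.replace("https://", f"https://{username}:{token}@", 1)


def set_authenticated_remote(repo_url: str, username: str) -> None:
    """Prompt for the PAT without echoing it, then point `origin` at the
    credentialed URL. The token never appears in notebook output."""
    import subprocess

    token = getpass("GitHub PAT: ")  # hidden prompt; not stored on disk
    subprocess.run(
        ["git", "remote", "set-url", "origin",
         tokenized_remote(repo_url, username, token)],
        check=True,
    )
```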

Training Models in Vertex AI: Intro


  • Environment initialization: Use aiplatform.init() to set defaults for project, region, and bucket.
  • Local vs managed training: Test locally before scaling into managed jobs.
  • Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
  • Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
  • Monitoring: Track job logs and artifacts in the Vertex AI Console.
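The initialization step can be sketched as below; once `aiplatform.init()` has run, later job constructors inherit the project, region, and staging bucket. Values shown are placeholders:

```python
def init_kwargs(project: str, region: str, bucket: str) -> dict:
    """Collect the defaults that aiplatform.init() will apply SDK-wide."""
    return {
        "project": project,
        "location": region,
        "staging_bucket": f"gs://{bucket}",
    }


def init_vertex(project: str, region: str, bucket: str) -> None:
    """Set SDK-wide defaults once; subsequent jobs inherit them."""
    from google.cloud import aiplatform

    # After this call, CustomTrainingJob and friends no longer need
    # project/location arguments repeated on every invocation.
    aiplatform.init(**init_kwargs(project, region, bucket))
```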

Training Models in Vertex AI: PyTorch Example


  • Use CustomTrainingJob with a prebuilt PyTorch container; let your script control outputs via --model_out.
  • Keep artifacts together (model, metrics, history, log) in one folder for reproducibility.
  • The .npz format speeds up data loading and integrates cleanly with PyTorch pipelines.
  • Start on CPU for small datasets; use GPU only when profiling shows a clear win.
  • Skip base_output_dir unless you specifically want Vertex’s default run directory; staging bucket is just for the SDK packaging tarball.
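A sketch of the pattern above, assuming a training script that accepts `--model_out` (the display name, machine type, and epoch flag are illustrative):

```python
def training_args(model_out: str, epochs: int = 10) -> list:
    """CLI args handed to the training script; the script itself decides
    where artifacts land via --model_out."""
    return ["--model_out", model_out, "--epochs", str(epochs)]


def run_pytorch_job(script_path: str, container_uri: str, model_out: str):
    """Launch the script as a managed job in a prebuilt PyTorch container."""
    from google.cloud import aiplatform

    job = aiplatform.CustomTrainingJob(
        display_name="pytorch-train",
        script_path=script_path,
        container_uri=container_uri,  # a prebuilt PyTorch CPU or GPU image
    )
    # No base_output_dir: the script keeps model, metrics, history, and log
    # together under model_out for reproducibility. CPU machine type first;
    # switch to a GPU spec only after profiling shows a clear win.
    job.run(args=training_args(model_out), machine_type="n1-standard-4")
```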

Hyperparameter Tuning in Vertex AI: Neural Network Example


  • Vertex AI Hyperparameter Tuning Jobs efficiently explore parameter spaces using adaptive strategies.
  • Define parameter ranges in parameter_spec; the number of settings tried is controlled later by max_trial_count.
  • Keep the printed metric name consistent with metric_spec (here: validation_accuracy).
  • Limit parallel_trial_count (e.g., 2–4) so the adaptive search can learn from completed trials.
  • Use GCS for input/output and aggregate metrics.json across trials for detailed analysis.
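The tuning setup can be sketched as below. The parameter names and ranges are illustrative; the metric id in `metric_spec` must match what the training script reports (`validation_accuracy` here, as in the notes above):

```python
import math


def trial_rounds(max_trial_count: int, parallel_trial_count: int) -> int:
    """Rounds of trials the service runs. Fewer parallel trials means more
    rounds, which gives the adaptive search more feedback to learn from."""
    return math.ceil(max_trial_count / parallel_trial_count)


def run_tuning_job(custom_job, max_trials: int = 20, parallel: int = 3):
    """Wrap an existing CustomJob in a hyperparameter tuning job (sketch)."""
    from google.cloud import aiplatform
    from google.cloud.aiplatform import hyperparameter_tuning as hpt

    job = aiplatform.HyperparameterTuningJob(
        display_name="nn-tuning",
        custom_job=custom_job,
        # Must match the metric name the training script prints/reports.
        metric_spec={"validation_accuracy": "maximize"},
        parameter_spec={
            "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
            "batch_size": hpt.DiscreteParameterSpec(values=[16, 32, 64], scale="linear"),
        },
        max_trial_count=max_trials,
        parallel_trial_count=parallel,  # keep small so search stays adaptive
    )
    job.run()
```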

Resource Management & Monitoring on Vertex AI (GCP)


  • Endpoints and running notebooks are the most common cost leaks; undeploy/stop first.
  • Prefer Workbench Instances with idle auto-stop enabled; schedule nightly auto-stop.
  • Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
  • Standardize labels, set budgets, and enable billing export for visibility.
  • Use gcloud/gsutil to audit and clean quickly; automate with Scheduler + Cloud Run/Functions.
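The audit-and-clean step can be sketched as a short shell pass over the usual leaks. Region and bucket are placeholders, and each CLI call is guarded so the script degrades gracefully where the tool is absent:

```shell
#!/usr/bin/env bash
# Quick audit of common Vertex AI cost leaks (sketch; values are placeholders).
REGION="us-central1"
BUCKET="gs://my-ml-bucket"

if command -v gcloud >/dev/null 2>&1; then
  # Endpoints with models still deployed keep billing until undeployed.
  gcloud ai endpoints list --region="$REGION"
  # Running notebook instances are the other common leak.
  gcloud workbench instances list --location="${REGION}-a"
fi

if command -v gsutil >/dev/null 2>&1; then
  gsutil du -sh "$BUCKET"   # total bucket size, to spot duplicate datasets
fi
```

A script like this is a natural target for Cloud Scheduler plus Cloud Run/Functions, as noted above.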

Retrieval-Augmented Generation (RAG) with Vertex AI


  • Vertex AI’s managed RAG stack is low-operations with predictable costs.
  • A self-hosted Hugging Face stack offers more control but higher GPU costs.
  • Keep data local or in GCS to manage egress and compliance.
  • Always cite retrieved chunks for reproducibility and transparency.