Overview of Google Cloud for Machine Learning
- GCP and AWS both provide the essential components for running ML
workloads at scale.
- GCP emphasizes simplicity, open frameworks, and TPU access; AWS
offers broader hardware and automation options.
- TPUs are efficient for TensorFlow and JAX, but GPU-based workflows
(common on AWS) remain more flexible across frameworks.
- Both platforms now provide strong cost tracking and sustainability
tools, with only minor differences in interface and ecosystem
integration.
- Using a notebook as a controller provides reproducibility and helps manage compute and storage resources consistently across clouds.
Data Storage: Setting up GCS
- Use GCS for scalable, cost-effective, and persistent storage in
GCP.
- Persistent disks are suitable only for small, temporary
datasets.
- Track your storage, transfer, and request costs to manage
expenses.
- Regularly delete unused data or buckets to avoid ongoing costs.
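The audit-and-cleanup habit above can be sketched in a few lines. This is a minimal sketch, assuming the `google-cloud-storage` package and an authenticated environment; the bucket name and the $0.02/GiB-month standard-storage rate are placeholder assumptions (actual pricing varies by region and storage class):

```python
# Sketch: audit a GCS bucket's footprint from a notebook.
# Assumes google-cloud-storage is installed and credentials are set up.

def bucket_size_bytes(bucket_name: str) -> int:
    """Total size of all objects in a bucket (e.g. 'my-ml-bucket')."""
    from google.cloud import storage  # deferred import: GCP-only dependency
    client = storage.Client()
    return sum(blob.size or 0 for blob in client.list_blobs(bucket_name))

def estimate_monthly_storage_cost(total_bytes: int, price_per_gib: float = 0.02) -> float:
    """Rough monthly storage cost; price_per_gib is an assumed rate."""
    return round(total_bytes / 1024**3 * price_per_gib, 2)
```

Running the audit periodically makes it obvious which buckets are worth cleaning up.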
Notebooks as Controllers
- Use a small Workbench Instance notebook as a controller to manage
larger, resource-intensive tasks.
- Always navigate to the “Instances” tab in Workbench, since older
notebook types are deprecated.
- Choose the same region for your Workbench Instance and storage
bucket to avoid extra transfer costs.
- Submit training and tuning jobs to scalable instances using the
Vertex AI SDK.
- Labels help track costs effectively, especially in shared or
multi-project environments.
- Workbench Instances come with JupyterLab 3 and GPU frameworks
preinstalled, making them an easy entry point for ML workflows.
- Enable idle auto-stop to avoid unexpected charges when notebooks are left running.
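The controller pattern above can be sketched as a single submission call. This is illustrative only: the project, region, bucket, display name, and container image URI are placeholders (check the current prebuilt container list for a valid tag), and it assumes the `google-cloud-aiplatform` package:

```python
# Sketch: submit a training job from a lightweight controller notebook.
# All identifiers below are placeholders, not real resources.

def submit_training_job(script_path: str, labels: dict) -> None:
    from google.cloud import aiplatform  # deferred import: GCP-only dependency

    aiplatform.init(
        project="my-project",              # placeholder project ID
        location="us-central1",            # same region as the GCS bucket
        staging_bucket="gs://my-ml-bucket",
    )
    job = aiplatform.CustomTrainingJob(
        display_name="controller-submitted-train",
        script_path=script_path,
        # Illustrative prebuilt-container URI; verify the current tag.
        container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-1.py310:latest",
        labels=labels,  # labels surface in billing reports for cost tracking
    )
    job.run(replica_count=1, machine_type="n1-standard-4")
```

The notebook itself stays small; the heavy lifting runs on the machine type requested in `run()`.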
Accessing and Managing Data in GCS with Vertex AI Notebooks
- Load data from GCS into memory to avoid managing local copies when
possible.
- Periodically check storage usage and costs to manage your GCS
budget.
- Use Vertex AI Workbench notebooks to upload analysis results back to GCS, keeping workflows organized and reproducible.
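Loading directly into memory can be sketched as below. The bucket/object names are placeholders, and the split into a pure parsing helper plus a GCS fetch is a design choice for testability, assuming `google-cloud-storage` and `pandas`:

```python
# Sketch: read a CSV object from GCS straight into memory,
# avoiding a local copy on the notebook's disk.
import io
import pandas as pd

def csv_bytes_to_df(data: bytes) -> pd.DataFrame:
    """Parse raw CSV bytes without touching the filesystem."""
    return pd.read_csv(io.BytesIO(data))

def load_csv_from_gcs(bucket_name: str, blob_name: str) -> pd.DataFrame:
    from google.cloud import storage  # deferred import: GCP-only dependency
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    return csv_bytes_to_df(blob.download_as_bytes())
```

Uploading results back is the mirror image: `blob.upload_from_string(df.to_csv(index=False))`.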
Using a GitHub Personal Access Token (PAT) to Push/Pull from a Vertex AI Notebook
- Use a GitHub PAT for HTTPS-based authentication in Vertex AI
Workbench notebooks.
- Securely enter sensitive information in notebooks using
`getpass`.

- Converting `.ipynb` files to `.py` files helps with cleaner version control.
- Adding `.ipynb` files to `.gitignore` keeps your repository organized.
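The PAT flow above can be sketched as follows. The username and repository name are placeholders, and the URL-building helper is illustrative, not part of any library:

```python
# Sketch: authenticate to GitHub over HTTPS from a notebook
# without hard-coding the token anywhere in the repository.
from getpass import getpass

def build_https_remote(user: str, token: str, repo: str) -> str:
    """Embed a PAT in the remote URL (never commit this string)."""
    return f"https://{user}:{token}@github.com/{user}/{repo}.git"

# In an interactive notebook cell:
# token = getpass("GitHub PAT: ")   # prompt hides the input as you type
# remote = build_https_remote("octocat", token, "my-repo")  # placeholders
# !git remote set-url origin {remote}
```

Because `getpass` never echoes the token, it stays out of the notebook's saved output as well.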
Training Models in Vertex AI: Intro
- Environment initialization: Use `aiplatform.init()` to set defaults for project, region, and bucket.
- Local vs. managed training: Test locally before scaling into managed jobs.
- Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
- Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
- Monitoring: Track job logs and artifacts in the Vertex AI Console.
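The initialization step above is one call; everything that follows in a session inherits its defaults. A minimal sketch, assuming `google-cloud-aiplatform` is installed and the three values below are placeholders for your own project, region, and bucket:

```python
# Sketch: set SDK-wide defaults once so later job, model, and endpoint
# calls do not need to repeat project/region/bucket arguments.

def init_vertex() -> None:
    from google.cloud import aiplatform  # deferred import: GCP-only dependency
    aiplatform.init(
        project="my-project",               # placeholder project ID
        location="us-central1",             # placeholder region
        staging_bucket="gs://my-ml-bucket", # used for SDK packaging artifacts
    )
```

Calling this at the top of a controller notebook keeps every subsequent SDK call short and consistent.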
Training Models in Vertex AI: PyTorch Example
- Use `CustomTrainingJob` with a prebuilt PyTorch container; let your script control outputs via `--model_out`.
- Keep artifacts together (model, metrics, history, log) in one folder for reproducibility.
- `.npz` speeds up loading and plays nicely with PyTorch.
- Start on CPU for small datasets; use GPU only when profiling shows a clear win.
- Skip `base_output_dir` unless you specifically want Vertex's default run directory; the staging bucket is just for the SDK packaging tarball.
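The `.npz` point above can be sketched concretely; the file path and array key names here are arbitrary examples:

```python
# Sketch: bundle arrays into one compressed .npz so the training
# script loads a single object instead of many loose files.
import numpy as np

def save_dataset(path: str, X: np.ndarray, y: np.ndarray) -> None:
    np.savez_compressed(path, X=X, y=y)

def load_dataset(path: str):
    with np.load(path) as data:
        return data["X"], data["y"]
```

Inside the training script, `torch.from_numpy(X)` then converts the loaded arrays to tensors without copying.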
Hyperparameter Tuning in Vertex AI: Neural Network Example
- Vertex AI Hyperparameter Tuning Jobs efficiently explore parameter
spaces using adaptive strategies.
- Define parameter ranges in `parameter_spec`; the number of settings tried is controlled later by `max_trial_count`.
- Keep the printed metric name consistent with `metric_spec` (here: `validation_accuracy`).
- Limit `parallel_trial_count` (2–4) to help adaptive search.
- Use GCS for input/output and aggregate `metrics.json` across trials for detailed analysis.
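The aggregation step above can be sketched with the standard library. The `trial_*/metrics.json` directory layout is an assumption about how your training script writes its outputs (for results in GCS, copy them down first or mount via gcsfs):

```python
# Sketch: collect per-trial metrics.json files into one list of rows
# for comparison across hyperparameter settings.
import json
from pathlib import Path

def aggregate_metrics(results_dir: str) -> list[dict]:
    rows = []
    for f in sorted(Path(results_dir).glob("trial_*/metrics.json")):
        row = json.loads(f.read_text())
        row["trial"] = f.parent.name  # keep the trial ID with its metrics
        rows.append(row)
    return rows
```

Feeding the rows into a DataFrame makes it easy to sort trials by `validation_accuracy`.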
Resource Management & Monitoring on Vertex AI (GCP)
- Endpoints and running notebooks are the most common cost leaks; undeploy/stop first.
- Prefer Workbench Instances with idle shutdown enabled; schedule nightly auto-stop.
- Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
- Standardize labels, set budgets, and enable billing export for visibility.
- Use `gcloud`/`gsutil` to audit and clean quickly; automate with Cloud Scheduler + Cloud Run/Functions.
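Since endpoints are the most common leak, a periodic audit helps. A minimal sketch, assuming `google-cloud-aiplatform` and authenticated credentials; project and region are placeholders:

```python
# Sketch: list deployed endpoints so stray deployments can be
# found and undeployed before they accrue charges.

def list_active_endpoints(project: str, location: str):
    from google.cloud import aiplatform  # deferred import: GCP-only dependency
    aiplatform.init(project=project, location=location)
    return [(e.display_name, e.resource_name) for e in aiplatform.Endpoint.list()]
```

The same pattern works for notebooks and custom jobs; a scheduled Cloud Run job can run it nightly and alert on anything unexpected.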
Retrieval-Augmented Generation (RAG) with Vertex AI
- Vertex AI’s RAG stack: low operational overhead, predictable costs.
- Hugging Face: maximum control, but high GPU cost.
- Keep data local or in GCS to manage egress and compliance.
- Always cite retrieved chunks for reproducibility and transparency.