Overview of Google Cloud for Machine Learning
- GCP and AWS both provide the essential components for running ML
workloads at scale.
- GCP emphasizes simplicity, open frameworks, and TPU access; AWS
offers broader hardware and automation options.
- TPUs are efficient for TensorFlow and JAX, but GPU-based workflows
(common on AWS) remain more flexible across frameworks.
- Both platforms now provide strong cost tracking and sustainability
tools, with only minor differences in interface and ecosystem
integration.
- Using a notebook as a controller provides reproducibility and helps manage compute and storage resources consistently across clouds.
Data Storage: Setting up GCS
- Use GCS for scalable, cost-effective, and persistent storage in
GCP.
- Persistent disks are suitable only for small, temporary
datasets.
- Track your storage, transfer, and request costs to manage
expenses.
- Regularly delete unused data or buckets to avoid ongoing costs.
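The audit-and-cleanup habit above can be sketched in a few lines. This is a minimal sketch, assuming the `google-cloud-storage` package and an authenticated environment; the bucket name and the $0.02/GiB-month standard-storage rate are placeholder assumptions (actual pricing varies by region and storage class):

```python
# Sketch: audit a GCS bucket's footprint from a notebook.
# Assumes google-cloud-storage is installed and credentials are set up.

def bucket_size_bytes(bucket_name: str) -> int:
    """Total size of all objects in a bucket (e.g. 'my-ml-bucket')."""
    from google.cloud import storage  # deferred import: GCP-only dependency
    client = storage.Client()
    return sum(blob.size or 0 for blob in client.list_blobs(bucket_name))

def estimate_monthly_storage_cost(total_bytes: int, price_per_gib: float = 0.02) -> float:
    """Rough monthly storage cost; price_per_gib is an assumed rate."""
    return round(total_bytes / 1024**3 * price_per_gib, 2)
```

Running the audit periodically makes it obvious which buckets are worth cleaning up.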
Notebooks as Controllers
- Use a small Workbench Instance notebook as a controller to manage
larger, resource-intensive tasks.
- Always navigate to the “Instances” tab in Workbench, since older
notebook types are deprecated.
- Choose the same region for your Workbench Instance and storage
bucket to avoid extra transfer costs.
- Submit training and tuning jobs to scalable instances using the
Vertex AI SDK.
- Labels help track costs effectively, especially in shared or
multi-project environments.
- Workbench Instances come with JupyterLab 3 and GPU frameworks
preinstalled, making them an easy entry point for ML workflows.
- Enable idle auto-stop to avoid unexpected charges when notebooks are left running.
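The controller pattern above can be sketched as a single submission call. This is illustrative only: the project, region, bucket, display name, and container image URI are placeholders (check the current prebuilt container list for a valid tag), and it assumes the `google-cloud-aiplatform` package:

```python
# Sketch: submit a training job from a lightweight controller notebook.
# All identifiers below are placeholders, not real resources.

def submit_training_job(script_path: str, labels: dict) -> None:
    from google.cloud import aiplatform  # deferred import: GCP-only dependency

    aiplatform.init(
        project="my-project",              # placeholder project ID
        location="us-central1",            # same region as the GCS bucket
        staging_bucket="gs://my-ml-bucket",
    )
    job = aiplatform.CustomTrainingJob(
        display_name="controller-submitted-train",
        script_path=script_path,
        # Illustrative prebuilt-container URI; verify the current tag.
        container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-1.py310:latest",
        labels=labels,  # labels surface in billing reports for cost tracking
    )
    job.run(replica_count=1, machine_type="n1-standard-4")
```

The notebook itself stays small; the heavy lifting runs on the machine type requested in `run()`.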
Accessing and Managing Data in GCS with Vertex AI Notebooks
- Load data from GCS into memory to avoid managing local copies when
possible.
- Periodically check storage usage and costs to manage your GCS
budget.
- Use Vertex AI Workbench notebooks to upload analysis results back to GCS, keeping workflows organized and reproducible.
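Loading directly into memory can be sketched as below. The bucket/object names are placeholders, and the split into a pure parsing helper plus a GCS fetch is a design choice for testability, assuming `google-cloud-storage` and `pandas`:

```python
# Sketch: read a CSV object from GCS straight into memory,
# avoiding a local copy on the notebook's disk.
import io
import pandas as pd

def csv_bytes_to_df(data: bytes) -> pd.DataFrame:
    """Parse raw CSV bytes without touching the filesystem."""
    return pd.read_csv(io.BytesIO(data))

def load_csv_from_gcs(bucket_name: str, blob_name: str) -> pd.DataFrame:
    from google.cloud import storage  # deferred import: GCP-only dependency
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    return csv_bytes_to_df(blob.download_as_bytes())
```

Uploading results back is the mirror image: `blob.upload_from_string(df.to_csv(index=False))`.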
Using a GitHub Personal Access Token (PAT) to Push/Pull from a Vertex AI Notebook
- Use a GitHub PAT for HTTPS-based authentication in Vertex AI
Workbench notebooks.
- Securely enter sensitive information in notebooks using
`getpass`.

- Converting `.ipynb` files to `.py` files helps with cleaner version control.
- Adding `.ipynb` files to `.gitignore` keeps your repository organized.
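The PAT flow above can be sketched as follows. The username and repository name are placeholders, and the URL-building helper is illustrative, not part of any library:

```python
# Sketch: authenticate to GitHub over HTTPS from a notebook
# without hard-coding the token anywhere in the repository.
from getpass import getpass

def build_https_remote(user: str, token: str, repo: str) -> str:
    """Embed a PAT in the remote URL (never commit this string)."""
    return f"https://{user}:{token}@github.com/{user}/{repo}.git"

# In an interactive notebook cell:
# token = getpass("GitHub PAT: ")   # prompt hides the input as you type
# remote = build_https_remote("octocat", token, "my-repo")  # placeholders
# !git remote set-url origin {remote}
```

Because `getpass` never echoes the token, it stays out of the notebook's saved output as well.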
Training Models in Vertex AI: Intro
- Environment initialization: Use `aiplatform.init()` to set defaults for project, region, and bucket.
- Local vs. managed training: Test locally before scaling into managed jobs.
- Custom jobs: Vertex AI lets you run scripts as managed training jobs using pre-built or custom containers.
- Scaling: Start small, then scale up to GPUs or distributed jobs as dataset/model size grows.
- Monitoring: Track job logs and artifacts in the Vertex AI Console.
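The initialization step above is one call; everything that follows in a session inherits its defaults. A minimal sketch, assuming `google-cloud-aiplatform` is installed and the three values below are placeholders for your own project, region, and bucket:

```python
# Sketch: set SDK-wide defaults once so later job, model, and endpoint
# calls do not need to repeat project/region/bucket arguments.

def init_vertex() -> None:
    from google.cloud import aiplatform  # deferred import: GCP-only dependency
    aiplatform.init(
        project="my-project",               # placeholder project ID
        location="us-central1",             # placeholder region
        staging_bucket="gs://my-ml-bucket", # used for SDK packaging artifacts
    )
```

Calling this at the top of a controller notebook keeps every subsequent SDK call short and consistent.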
Training Models in Vertex AI: PyTorch Example
- Use `CustomTrainingJob` with a prebuilt PyTorch container; let your script control outputs via `--model_out`.
- Keep artifacts together (model, metrics, history, log) in one folder for reproducibility.
- `.npz` speeds up loading and plays nicely with PyTorch.
- Start on CPU for small datasets; use GPU only when profiling shows a clear win.
- Skip `base_output_dir` unless you specifically want Vertex's default run directory; the staging bucket is just for the SDK packaging tarball.
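The `.npz` point above can be sketched concretely; the file path and array key names here are arbitrary examples:

```python
# Sketch: bundle arrays into one compressed .npz so the training
# script loads a single object instead of many loose files.
import numpy as np

def save_dataset(path: str, X: np.ndarray, y: np.ndarray) -> None:
    np.savez_compressed(path, X=X, y=y)

def load_dataset(path: str):
    with np.load(path) as data:
        return data["X"], data["y"]
```

Inside the training script, `torch.from_numpy(X)` then converts the loaded arrays to tensors without copying.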
Hyperparameter Tuning in Vertex AI: Neural Network Example
- Vertex AI Hyperparameter Tuning Jobs efficiently explore parameter
spaces using adaptive strategies.
- Define parameter ranges in `parameter_spec`; the number of settings tried is controlled later by `max_trial_count`.
- Keep the printed metric name consistent with `metric_spec` (here: `validation_accuracy`).
- Limit `parallel_trial_count` (2–4) to help adaptive search.
- Use GCS for input/output and aggregate `metrics.json` across trials for detailed analysis.
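The aggregation step above can be sketched with the standard library. The `trial_*/metrics.json` directory layout is an assumption about how your training script writes its outputs (for results in GCS, copy them down first or mount via gcsfs):

```python
# Sketch: collect per-trial metrics.json files into one list of rows
# for comparison across hyperparameter settings.
import json
from pathlib import Path

def aggregate_metrics(results_dir: str) -> list[dict]:
    rows = []
    for f in sorted(Path(results_dir).glob("trial_*/metrics.json")):
        row = json.loads(f.read_text())
        row["trial"] = f.parent.name  # keep the trial ID with its metrics
        rows.append(row)
    return rows
```

Feeding the rows into a DataFrame makes it easy to sort trials by `validation_accuracy`.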
Resource Management & Monitoring on Vertex AI (GCP)
- Endpoints and running notebooks are the most common cost leaks; undeploy/stop first.
- Prefer Workbench Instances with idle shutdown enabled; schedule nightly auto-stop.
- Keep storage tidy with GCS lifecycle policies and avoid duplicate datasets.
- Standardize labels, set budgets, and enable billing export for visibility.
- Use `gcloud`/`gsutil` to audit and clean quickly; automate with Cloud Scheduler + Cloud Run/Functions.
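Since endpoints are the most common leak, a periodic audit helps. A minimal sketch, assuming `google-cloud-aiplatform` and authenticated credentials; project and region are placeholders:

```python
# Sketch: list deployed endpoints so stray deployments can be
# found and undeployed before they accrue charges.

def list_active_endpoints(project: str, location: str):
    from google.cloud import aiplatform  # deferred import: GCP-only dependency
    aiplatform.init(project=project, location=location)
    return [(e.display_name, e.resource_name) for e in aiplatform.Endpoint.list()]
```

The same pattern works for notebooks and custom jobs; a scheduled Cloud Run job can run it nightly and alert on anything unexpected.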
Retrieval-Augmented Generation (RAG) with Vertex AI
- Vertex AI’s RAG stack: low operational overhead, predictable costs.
- Hugging Face: maximum control, but high GPU cost.
- Keep data local or in GCS to manage egress and compliance.
- Always cite retrieved chunks for reproducibility and transparency.