Data Storage and Access

Last updated on 2026-03-05

Estimated time: 50 minutes

Overview

Questions

  • How can I store and manage data effectively in GCP for Vertex AI workflows?
  • What are the advantages of Google Cloud Storage (GCS) compared to local or VM storage for machine learning projects?
  • How can I load data from GCS into a Vertex AI Workbench notebook?

Objectives

  • Explain data storage options in GCP for machine learning projects.
  • Set up a GCS bucket and upload data.
  • Read data directly from a GCS bucket into memory in a Vertex AI notebook.
  • Monitor storage usage and estimate costs.
  • Upload new files from the Vertex AI environment back to the GCS bucket.

ML/AI projects rely on data, making efficient storage and management essential. Google Cloud offers several storage options, but the most common for ML/AI workflows are Virtual Machine (VM) disks and Google Cloud Storage (GCS) buckets.

Consult your institution’s IT before handling sensitive data in GCP

As with AWS, do not upload restricted or sensitive data to GCP services unless explicitly approved by your institution’s IT or cloud security team. For regulated datasets (HIPAA, FERPA, proprietary), work with your institution to ensure encryption, restricted access, and compliance with policies.

Options for storage: VM Disks or GCS


What is a VM disk?

A VM disk is the storage volume attached to a Compute Engine VM or a Vertex AI Workbench notebook. It can store datasets and intermediate results, but it is tied to the lifecycle of the VM.

When to store data directly on a VM disk

  • Useful for small, temporary datasets processed interactively.
  • Data persists if the VM is stopped, but storage costs continue as long as the disk exists.
  • Not ideal for collaboration, scaling, or long-term dataset storage.

Callout

Limitations of VM disk storage

  • Scalability: Limited by disk size quota.
  • Sharing: Harder to share across projects or team members.
  • Cost: More expensive per GB compared to GCS for long-term storage.

What is a GCS bucket?

For most ML/AI workflows in GCP, Google Cloud Storage (GCS) buckets are recommended. A GCS bucket is a container in Google’s object storage service where you can store an essentially unlimited number of files. Data in GCS can be accessed from Vertex AI training jobs, Workbench notebooks, and other GCP services using a GCS URI (e.g., gs://your-bucket-name/your-file.csv). Think of GCS URIs as cloud file paths — you’ll use them throughout the workshop to reference data in training scripts, notebooks, and SDK calls.
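A GCS URI is simply gs:// followed by the bucket name and the object path inside it. As a quick illustration (this helper is hypothetical, not part of the google-cloud-storage API), you can split a URI into those two parts with plain string handling:

```python
def parse_gcs_uri(uri: str) -> tuple[str, str]:
    """Split a gs://bucket/path URI into (bucket, object path)."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a GCS URI: {uri}")
    # Everything up to the first "/" after the scheme is the bucket;
    # the remainder is the object path within that bucket.
    bucket, _, blob_path = uri[len("gs://"):].partition("/")
    return bucket, blob_path

print(parse_gcs_uri("gs://your-bucket-name/your-file.csv"))
# → ('your-bucket-name', 'your-file.csv')
```

This is the same split you will do implicitly later when you pass a bucket name to client.bucket() and an object path to bucket.blob().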

Creating a GCS bucket


1. Sign in to Google Cloud Console

  • Go to console.cloud.google.com and log in with your credentials.
  • Select your project from the project dropdown at the top of the page. If you’re using the shared workshop project, the instructor will provide the project name.

2. Navigate to Cloud Storage

  • In the search bar, type Storage.
  • Click Cloud Storage > Buckets.

3. Create a new bucket

  • Click Create bucket and configure the following settings:

  • Bucket name: Enter a globally unique name using the convention lastname-dataname (e.g., doe-titanic).

  • Labels: Add cost-tracking labels (same keys you used for the Workbench Instance in Episode 2, plus a dataset tag):

    • name = firstname-lastname
    • purpose = workshop
    • dataset = titanic

    In shared accounts, labels are mandatory.

  • Location: Choose Region and select us-central1 (same region as your compute to avoid egress charges).

  • Storage class: Standard (best for active ML/AI workflows).

  • Access control: Uniform (simpler IAM-based permissions).

  • Protection: Leave default soft delete enabled; skip versioning and retention policies.

Click Create if everything looks good.

4. Upload files to the bucket

  • If you haven’t yet, download the data for this workshop (Right-click → Save as): data.zip
    • Extract the zip folder contents (Right-click → Extract all on Windows; double-click on macOS).
    • The zip contains the Titanic dataset — passenger information (age, class, fare, etc.) with a survival label. This is a classic binary classification task we’ll use for training in later episodes.
  • In the bucket dashboard, click Upload Files.
  • Select your Titanic CSVs (titanic_train.csv and titanic_test.csv) and upload.

Note the GCS URI for your data

After uploading, click on a file and find its gs:// URI (e.g., gs://doe-titanic/titanic_test.csv). This URI will be used to access the data in your notebook.

Adjust bucket permissions


Your bucket exists, but your notebooks and training jobs don’t automatically have permission to use it. GCP follows the principle of least privilege — services only get the access you explicitly grant. In this section we’ll find the service account that Vertex AI uses and give it the right roles on your bucket.

Check your project ID

First, confirm which project your notebook is connected to. Run this cell in your Workbench notebook:

PYTHON

from google.cloud import storage
client = storage.Client()
print(client.project)

Copy the output — you’ll paste it into Cloud Shell commands below.

Callout

These commands run in Cloud Shell, not in a notebook

Open Cloud Shell — a browser-based terminal built into the Google Cloud Console (click the >_ icon in the top-right toolbar). Copy the commands below and paste them into that terminal.

Set your project

If Cloud Shell doesn’t already know your project, set it first:

SH

gcloud config set project YOUR_PROJECT_ID

Replace YOUR_PROJECT_ID with the project ID you copied above. For the shared MLM25 workshop the project ID is doit-rci-mlm25-4626.

Find your service account

When you create a GCP project, Google automatically provisions a Compute Engine default service account. This is the identity that Vertex AI Workbench notebooks and training jobs use when they call other GCP services (like Cloud Storage). By default this account may not have access to your bucket, so we need to grant it the right IAM roles explicitly.

First, look up the service account email:

SH

gcloud iam service-accounts list --filter="displayName:Compute Engine default service account" --format="value(email)"

This will return an email like 123456789-compute@developer.gserviceaccount.com. Copy it — you’ll paste it into the commands below.

Grant permissions

Now we give that service account the ability to read from and write to your bucket. Without these roles, your notebooks would get “Access Denied” errors when trying to load training data or save model artifacts.

Replace YOUR_BUCKET_NAME and YOUR_SERVICE_ACCOUNT, then run:

SH

# objectViewer — lets notebooks READ data (e.g., load CSVs for training)
gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET_NAME \
  --member="serviceAccount:YOUR_SERVICE_ACCOUNT" \
  --role="roles/storage.objectViewer"

# objectCreator — lets training jobs WRITE outputs (e.g., saved models, logs)
gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET_NAME \
  --member="serviceAccount:YOUR_SERVICE_ACCOUNT" \
  --role="roles/storage.objectCreator"

# objectAdmin — adds OVERWRITE and DELETE (only needed if you want to
# re-run jobs that replace existing files or clean up old artifacts)
gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET_NAME \
  --member="serviceAccount:YOUR_SERVICE_ACCOUNT" \
  --role="roles/storage.objectAdmin"

Callout

gcloud storage vs. gsutil

Older tutorials often reference gsutil for Cloud Storage operations. Google now recommends gcloud storage as the primary CLI. Both work, but gcloud storage is actively maintained and consistent with the rest of the gcloud CLI.

Data transfer & storage costs


GCS costs are based on three things: storage class (how you store data), data transfer (moving data in or out of GCP), and operations (API requests). Operations are the individual actions your code performs against Cloud Storage — every time a notebook reads a file or a training job writes a model, that’s an API request.

  • Standard storage: ~ $0.02 per GB per month in us-central1.
  • Uploading data (ingress): Free.
  • Downloading data out of GCP (egress): ~ $0.12 per GB.
  • Cross-region access: ~ $0.01–$0.02 per GB within North America.
  • GET requests (reading/downloading objects): ~ $0.004 per 10,000 requests.
  • PUT/POST requests (creating/uploading objects): ~ $0.05 per 10,000 requests.
  • Deleting data: Free (but Nearline/Coldline/Archive early-deletion fees apply).
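To get a feel for how small the request charges are at workshop scale, here is a back-of-envelope sketch using the per-10,000-request prices quoted above (the workload numbers are illustrative, not from the lesson):

```python
# Rough request-cost estimator using the quoted GCS request prices.
GET_PER_10K = 0.004  # $ per 10,000 GET (read) requests
PUT_PER_10K = 0.05   # $ per 10,000 PUT/POST (write) requests

def request_cost(n_get: int, n_put: int) -> float:
    """Estimated cost of a mix of read and write requests."""
    return (n_get / 10_000) * GET_PER_10K + (n_put / 10_000) * PUT_PER_10K

# e.g. 100 training runs, each reading 2 CSVs and writing 1 artifact
print(f"${request_cost(200, 100):.5f}")
```

Even a few hundred requests cost fractions of a cent, which is why the challenge below treats GET requests as negligible.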

For detailed pricing, see GCS Pricing Information.

Challenge

Challenge 1: Estimating Storage Costs

1. Estimate the total cost of storing 1 GB in GCS Standard storage (us-central1) for one month, assuming:

  • The dataset is read from the bucket 100 times within GCP (e.g., each training or tuning run fetches the data via a GET request — this stays inside Google’s network, so no egress charge).
  • The data is downloaded once out of GCP to your laptop at the end of the project (this does incur an egress charge).

2. Repeat the above calculation for datasets of 10 GB, 100 GB, and 1 TB (1024 GB).

Hints: Storage $0.02/GB/month, Egress $0.12/GB, GET requests negligible at this scale.

  1. 1 GB: Storage $0.02 + Egress $0.12 = $0.14
  2. 10 GB: $0.20 + $1.20 = $1.40
  3. 100 GB: $2.00 + $12.00 = $14.00
  4. 1 TB: $20.48 + $122.88 = $143.36
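The solution figures above can be reproduced with a few lines of arithmetic (same per-GB rates as the hints):

```python
STORAGE_PER_GB = 0.02  # $/GB/month, Standard storage in us-central1
EGRESS_PER_GB = 0.12   # $/GB, one-time download out of GCP

for gb in (1, 10, 100, 1024):  # 1 TB = 1024 GB
    storage = gb * STORAGE_PER_GB
    egress = gb * EGRESS_PER_GB
    print(f"{gb:>4} GB: ${storage:.2f} + ${egress:.2f} = ${storage + egress:.2f}")
```

Note that egress dominates: a single full download costs six months of storage at these rates.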

Accessing data from your notebook


Now that our bucket is set up, let’s use it from the Workbench notebook you created in the previous episode.

If you haven’t already cloned the repository, open JupyterLab from your Workbench Instance and run !git clone https://github.com/qualiaMachine/Intro_GCP_for_ML.git. Then navigate to /Intro_GCP_for_ML/notebooks/03-Data-storage-and-access.ipynb.

Set up GCP environment

If you haven’t already, initialize the storage client (same code from the permissions section earlier). The storage.Client() call creates a connection using the credentials already attached to your Workbench VM.

PYTHON

from google.cloud import storage
client = storage.Client()
print(client.project)

Reading data directly into memory

The code below downloads a CSV from your bucket and loads it into a pandas DataFrame. The blob.download_as_bytes() call pulls the file contents as raw bytes, and io.BytesIO wraps those bytes in a file-like object that pd.read_csv can read — no temporary file on disk needed.

PYTHON

import pandas as pd
import io

bucket_name = "doe-titanic" # ADJUST to your bucket's name

bucket = client.bucket(bucket_name)
blob = bucket.blob("titanic_train.csv")
train_data = pd.read_csv(io.BytesIO(blob.download_as_bytes()))
print(train_data.shape)
train_data.head()

The Titanic dataset contains passenger information (age, class, fare, etc.) and a binary survival label — we’ll train a classifier on this data in Episode 4.

PYTHON

train_data.info()
train_data.describe()

Callout

Alternative: reading directly with pandas

Vertex AI Workbench comes with gcsfs pre-installed, which lets pandas read GCS URIs directly — no BytesIO conversion needed:

PYTHON

train_data = pd.read_csv("gs://doe-titanic/titanic_train.csv")  # ADJUST bucket name

This is convenient for quick exploration. We use the storage.Client approach above because it gives you more control (listing blobs, checking sizes, uploading), which you’ll need in the sections that follow.

Callout

Common errors

  • Forbidden (403) — Your service account lacks permission. Revisit the Adjust bucket permissions section above.
  • NotFound (404) — The bucket name or file path is wrong. Double-check bucket_name and the blob path with client.list_blobs(bucket_name).
  • DefaultCredentialsError — The notebook cannot find credentials. Make sure you are running on a Vertex AI Workbench Instance (not a local machine).

Monitoring storage usage and costs


It’s good practice to periodically check how much storage your bucket is using. The code below sums up all object sizes.

PYTHON

total_size_bytes = 0
bucket = client.bucket(bucket_name)

for blob in client.list_blobs(bucket_name):
    total_size_bytes += blob.size

total_size_mb = total_size_bytes / (1024**2)
print(f"Total size of bucket '{bucket_name}': {total_size_mb:.2f} MB")
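The same powers-of-1024 conversion generalizes to any unit. As an aside (this helper is not part of the lesson code), you can pick the right unit automatically:

```python
def human_size(num_bytes: float) -> str:
    """Format a byte count using binary prefixes (KiB, MiB, ...)."""
    # Walk up the binary-prefix ladder until the value drops below 1024.
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if num_bytes < 1024:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.2f} PiB"

print(human_size(912_261))    # → "890.88 KiB"
print(human_size(1024 ** 3))  # → "1.00 GiB"
```

A helper like this is handy when a bucket mixes small CSVs with multi-gigabyte model artifacts.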

PYTHON

storage_price_per_gb = 0.02   # $/GB/month for Standard storage
egress_price_per_gb = 0.12    # $/GB for internet egress (same-region transfers are free)
total_size_gb = total_size_bytes / (1024**3)

monthly_storage = total_size_gb * storage_price_per_gb
egress_cost = total_size_gb * egress_price_per_gb

print(f"Bucket size: {total_size_gb:.4f} GB")
print(f"Estimated monthly storage cost: ${monthly_storage:.4f}")
print(f"Estimated annual storage cost:  ${monthly_storage*12:.4f}")
print(f"One-time full download (egress) cost: ${egress_cost:.4f}")

Writing output files to GCS


PYTHON

# Create a sample file locally on the notebook VM
file_path = "/home/jupyter/Notes.txt"
with open(file_path, "w") as f:
    f.write("This is a test note for GCS.")

PYTHON

bucket = client.bucket(bucket_name)
blob = bucket.blob("docs/Notes.txt")
blob.upload_from_filename(file_path)
print("File uploaded successfully.")

List bucket contents:

PYTHON

for blob in client.list_blobs(bucket_name):
    print(blob.name)

Challenge

Challenge 2: Read and explore the test dataset

Read titanic_test.csv from your GCS bucket and display its shape. How does the test set compare to the training set in size and columns?

PYTHON

blob = client.bucket(bucket_name).blob("titanic_test.csv")
test_data = pd.read_csv(io.BytesIO(blob.download_as_bytes()))
print("Test shape:", test_data.shape)
print("Train shape:", train_data.shape)
print("Same columns?", list(test_data.columns) == list(train_data.columns))
test_data.head()

Both datasets share the same 12 columns (including Survived). The test set is a smaller held-out subset (179 rows vs 712 in training) — roughly an 80/20 split used for final evaluation after the model is trained.
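The "roughly 80/20" claim is a one-line check (row counts taken from the solution above):

```python
train_rows, test_rows = 712, 179
total = train_rows + test_rows
# 179 of 891 rows held out — just over 20%
print(f"{test_rows / total:.1%} of {total} rows held out for testing")
```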

Challenge

Challenge 3: Upload a summary CSV to GCS

Using train_data, compute the survival rate by passenger class (Pclass) and upload the result as results/survival_by_class.csv to your bucket.

PYTHON

summary = train_data.groupby("Pclass")["Survived"].mean().reset_index()
summary.columns = ["Pclass", "SurvivalRate"]
print(summary)

# Save locally then upload
summary.to_csv("/home/jupyter/survival_by_class.csv", index=False)
blob = client.bucket(bucket_name).blob("results/survival_by_class.csv")
blob.upload_from_filename("/home/jupyter/survival_by_class.csv")
print("Summary uploaded to GCS.")

Removing unused data (complete after the workshop)


After you are done using your data, remove unused files and buckets so you stop incurring storage charges.

You can delete files programmatically. Let’s clean up the notes file we uploaded earlier:

PYTHON

blob = client.bucket(bucket_name).blob("docs/Notes.txt")
blob.delete()
print("docs/Notes.txt deleted.")

Verify it’s gone:

PYTHON

for blob in client.list_blobs(bucket_name):
    print(blob.name)

For larger clean-up tasks, use the Cloud Console:

  • Delete files only – In your bucket, select the files you want to remove and click Delete.
  • Delete the bucket entirely – In Cloud Storage > Buckets, select your bucket and click Delete.

For a detailed walkthrough of cleaning up all workshop resources, see Episode 9: Resource Management and Cleanup.

Key Points
  • Use GCS for scalable, cost-effective, and persistent storage in GCP.
  • Persistent disks are suitable only for small, temporary datasets.
  • Load data from GCS into memory with storage.Client or directly via pd.read_csv("gs://...").
  • Periodically check storage usage and estimate costs to manage your GCS budget.
  • Track your storage, transfer, and request costs to manage expenses.
  • Regularly delete unused data or buckets to avoid ongoing costs.