Accessing and Managing Data in GCS with Vertex AI Notebooks

Last updated on 2025-08-27

Estimated time: 30 minutes

Overview

Questions

  • How can I load data from GCS into a Vertex AI Workbench notebook?
  • How do I monitor storage usage and costs for my GCS bucket?
  • What steps are involved in pushing new data back to GCS from a notebook?

Objectives

  • Read data directly from a GCS bucket into memory in a Vertex AI notebook.
  • Check storage usage and estimate costs for data in a GCS bucket.
  • Upload new files from the Vertex AI environment back to the GCS bucket.

Initial setup


Open JupyterLab notebook

Once your Vertex AI Workbench notebook instance shows as Running, open it in JupyterLab. Create a new Python 3 notebook and rename it to: Interacting-with-GCS.ipynb.

Set up GCP environment

Before interacting with GCS, we need to initialize the client library. Vertex AI Workbench instances authenticate automatically through their attached service account, so no interactive login is required; the client picks up those credentials and lets the notebook talk to GCP securely.

PYTHON

from google.cloud import storage
import pandas as pd

# Step 1: Initialize a GCS client. Workbench instances authenticate
# automatically through their attached service account (Application
# Default Credentials), so no interactive login step is needed.
client = storage.Client()

# Step 2: List buckets in your current project to confirm access
buckets = list(client.list_buckets())
print("Buckets in project:")
for b in buckets:
    print("-", b.name)

Explanation of the pieces:
- storage.Client(): Creates a connection to Google Cloud Storage using Application Default Credentials, which in Workbench come from the instance's service account. All read/write actions will use this client.
- list_buckets(): Confirms which storage buckets your account can see in the current project.

This setup block prepares the notebook environment to efficiently interact with GCS resources.
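
If you ever need to confirm which identity and project the client is using, the google.auth library can report the active Application Default Credentials (a quick sketch; the printed project should match your current GCP project):

PYTHON

import google.auth

# Inspect the credentials and project picked up by default
credentials, project_id = google.auth.default()
print("Active project:", project_id)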

Reading data from GCS


As with S3, you can either (A) read data directly from GCS into memory, or (B) download a copy into your notebook VM. Since we’re using notebooks as controllers rather than training environments, the recommended approach is reading directly from GCS.

A) Reading data directly into memory

PYTHON

import io

bucket_name = "yourname-titanic-gcs"
blob_name = "titanic_train.csv"

bucket = client.bucket(bucket_name)
blob = bucket.blob(blob_name)

# Download the object's bytes and parse them with pandas, all in memory
data_bytes = blob.download_as_bytes()
train_data = pd.read_csv(io.BytesIO(data_bytes))

print(train_data.shape)
train_data.head()
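
Alternatively, if the optional gcsfs package is installed (it usually is on Workbench images, but this is worth verifying), pandas can read gs:// paths directly:

PYTHON

# One-line alternative; requires the gcsfs package under the hood
train_data = pd.read_csv(f"gs://{bucket_name}/{blob_name}")
print(train_data.shape)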

B) Downloading a local copy

PYTHON

bucket_name = "yourname-titanic-gcs"
blob_name = "titanic_train.csv"
local_path = "/home/jupyter/titanic_train.csv"

bucket = client.bucket(bucket_name)
blob = bucket.blob(blob_name)
blob.download_to_filename(local_path)

!ls -lh /home/jupyter/
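
The gsutil CLI, typically preinstalled on Workbench instances, offers a shell equivalent that is handy for copying many files at once:

PYTHON

# Shell equivalent; gsutil cp also accepts wildcards for bulk copies
!gsutil cp gs://{bucket_name}/{blob_name} /home/jupyter/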

Checking storage usage of a bucket

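To see how much data a bucket holds, iterate over its objects (blobs) and sum their sizes: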

PYTHON

total_size_bytes = 0

# Iterate over every object in the bucket and accumulate sizes (bytes)
for blob in client.list_blobs(bucket_name):
    total_size_bytes += blob.size

total_size_mb = total_size_bytes / (1024**2)
print(f"Total size of bucket '{bucket_name}': {total_size_mb:.2f} MB")

Estimating storage costs

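With the total size in hand, a rough monthly cost is simply the size in GB multiplied by the per-GB rate for the bucket's storage class: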

PYTHON

storage_price_per_gb = 0.02  # $/GB/month for Standard storage
total_size_gb = total_size_bytes / (1024**3)
monthly_cost = total_size_gb * storage_price_per_gb

print(f"Estimated monthly cost: ${monthly_cost:.4f}")
print(f"Estimated annual cost: ${monthly_cost*12:.4f}")

For updated prices, see GCS Pricing.
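
Rates differ by storage class and region. As a sketch, here is the same estimate across classes using illustrative rates (verify against the pricing page before relying on them):

PYTHON

# Illustrative per-GB monthly rates; check the GCS pricing page for
# current, region-specific values before budgeting
example_rates = {
    "Standard": 0.020,
    "Nearline": 0.010,
    "Coldline": 0.004,
    "Archive": 0.0012,
}
for storage_class, rate in example_rates.items():
    print(f"{storage_class:>9}: ${total_size_gb * rate:.4f}/month")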

Writing output files to GCS

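To save results back to the bucket, write a file locally in the notebook and then upload it as a blob: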

PYTHON

# Create a sample file
with open("Notes.txt", "w") as f:
    f.write("This is a test note for GCS.")

# Upload to bucket/docs/
bucket = client.bucket(bucket_name)
blob = bucket.blob("docs/Notes.txt")
blob.upload_from_filename("Notes.txt")

print("File uploaded successfully.")

List bucket contents:

PYTHON

for blob in client.list_blobs(bucket_name):
    print(blob.name)
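
To list only a folder-like subset, pass a prefix:

PYTHON

# Restrict the listing to objects under docs/
for blob in client.list_blobs(bucket_name, prefix="docs/"):
    print(blob.name)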

Challenge

Challenge: Estimating GCS Costs

Suppose you store 50 GB of data in Standard storage (us-central1) for one month.
- Estimate the monthly storage cost.
- Then estimate the cost if you download (egress) the entire dataset once at the end of the month.

Hints
- Storage: $0.02 per GB-month
- Egress: $0.12 per GB

Solution

  • Storage cost: 50 GB × $0.02 = $1.00
  • Egress cost: 50 GB × $0.12 = $6.00
  • Total cost: $7.00 for one month including one full download
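
The same arithmetic in code, if you want to check it in the notebook:

PYTHON

# Worked check of the challenge numbers
data_gb = 50
storage_cost = data_gb * 0.02  # $/GB-month, Standard storage
egress_cost = data_gb * 0.12   # $/GB, one full download
print(f"Storage: ${storage_cost:.2f}, Egress: ${egress_cost:.2f}, "
      f"Total: ${storage_cost + egress_cost:.2f}")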

Key Points
  • Load data from GCS into memory to avoid managing local copies when possible.
  • Periodically check storage usage and costs to manage your GCS budget.
  • Use Vertex AI Workbench notebooks to upload analysis results back to GCS, keeping workflows organized and reproducible.