Retrieval-Augmented Generation (RAG) with Vertex AI

Last updated on 2025-10-30

Estimated time: 30 minutes

Overview

Questions

  • How do we go from “a pile of PDFs” to “ask a question and get a cited answer” using Google Cloud tools?
  • What are the key parts of a RAG system (chunking, embedding, retrieval, generation), and how do they map onto Vertex AI services?
  • How much does each part of this pipeline cost (VM time, embeddings, LLM calls), and where can we keep it cheap?
  • Can we use open models / Hugging Face instead of Google models, and what does that change?

Objectives

  • Unpack the core RAG pipeline: ingest → chunk → embed → retrieve → answer.
  • Run a minimal, fully programmatic RAG loop on a Vertex AI Workbench VM using Google’s own foundation models (for embeddings + generation).
  • Understand how to substitute open-source / Hugging Face models if you want to avoid managed API costs.
  • Answer questions using content from provided papers and return citations instead of vibes.

Overview: What we’re building


Retrieval-Augmented Generation (RAG) is a pattern:

  1. You ask a question.
  2. The system retrieves relevant passages from your PDFs or data.
  3. An LLM answers using those passages only, with citations.

This approach powers sustainability-related projects like WattBot, which extracts AI water and energy metrics from research papers.
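
In code terms, the loop we build below has this shape (retrieve() and ask() are defined later in this episode, so don't run this cell until then):

PYTHON

question = "How much water does training a large model consume?"  # example query
hits = retrieve(question)    # 2. fetch the most relevant chunks
print(ask(question))         # 3. Gemini answers using only those chunks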

Cost mindset:
- VM cost: pay for Workbench instance uptime. Stop when not in use.
- Embedding cost: pay for the volume of text you embed, and only once per document if you cache the vectors.
- Generation cost: pay per token for input + output. Shorter prompts = cheaper.
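
For a rough sense of scale: 500 chunks × 1,200 characters ≈ 600K characters, or about 150K tokens assuming ~4 characters per token; at the ~$0.10 per 1M tokens embedding rate in the cost table below, that is roughly $0.015, paid once.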

Hugging Face alternatives:
You can replace Google-managed APIs with open models such as:
- Embeddings: sentence-transformers/all-MiniLM-L6-v2, BAAI/bge-large-en-v1.5
- Generators: google/gemma-2b-it, mistralai/Mistral-7B-Instruct, or tiiuae/falcon-7b-instruct
However, this requires a GPU or a large CPU VM (e.g., n1-standard-8 + T4) and manual model management. Rather than keeping an expensive GPU machine running in Workbench, you can launch Vertex AI custom training jobs that perform the embedding and generation steps and then shut down when finished. Start with a prebuilt PyTorch image and add the Hugging Face libraries as requirements; a sketch follows.
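
A minimal sketch of such a job using the google-cloud-aiplatform SDK. The script name (embed_corpus.py) and the container tag are illustrative; check the current list of prebuilt training containers for an exact URI.

PYTHON

from google.cloud import aiplatform

# Package a local script as a Vertex AI custom job on a T4 GPU machine.
# embed_corpus.py is a hypothetical script that loads the PDFs, embeds the
# chunks with a Hugging Face model, and writes the vectors to GCS.
job = aiplatform.CustomJob.from_local_script(
    display_name="hf-embed-job",
    script_path="embed_corpus.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-2.py310:latest",
    requirements=["transformers", "sentence-transformers", "accelerate"],
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
job.run()  # you pay only while the job runs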

Step 1: Setup environment


PYTHON

!pip install --quiet --upgrade pypdf google-genai scikit-learn

Cost note: Installing packages is free; you’re only billed for VM runtime.

Initialize project

PYTHON

from google.cloud import aiplatform
from vertexai import init as vertexai_init
import os

PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "<YOUR_PROJECT_ID>")
REGION = "us-central1"

aiplatform.init(project=PROJECT_ID, location=REGION)
vertexai_init(project=PROJECT_ID, location=REGION)
print("Initialized:", PROJECT_ID, REGION)

Step 2: Extract and chunk PDFs


PYTHON

import zipfile, pathlib, re, pandas as pd
from pypdf import PdfReader

ZIP_PATH = pathlib.Path("/home/jupyter/Intro_GCP_for_ML/data/pdfs_bundle.zip")
DOC_DIR = pathlib.Path("/home/jupyter/docs")
DOC_DIR.mkdir(exist_ok=True)

# unzip
with zipfile.ZipFile(ZIP_PATH, "r") as zf:
    zf.extractall(DOC_DIR)

def chunk_text(text, max_chars=1200, overlap=150):
    """Yield overlapping character windows so no context is lost at chunk boundaries."""
    step = max_chars - overlap
    for i in range(0, len(text), step):
        yield text[i:i + max_chars]

rows = []
for pdf in DOC_DIR.glob("*.pdf"):
    txt = ""
    for page in PdfReader(str(pdf)).pages:
        txt += page.extract_text() or ""
    for i, chunk in enumerate(chunk_text(re.sub(r"\s+", " ", txt))):
        rows.append({"doc": pdf.name, "chunk_id": i, "text": chunk})

corpus_df = pd.DataFrame(rows)
print(len(corpus_df), "chunks created")
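
A quick peek at the first chunk confirms text extraction worked:

PYTHON

print(corpus_df.iloc[0]["doc"], "->", corpus_df.iloc[0]["text"][:200])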

Cost note: Only VM runtime applies. Chunk size affects future embedding cost.

Step 3: Embed text using Vertex AI


Choosing an embedding and generator model

Vertex AI currently offers several managed embedding models under the Text Embeddings API family.
For this exercise we use gemini-embedding-001, Google’s current general-purpose embedding model, optimized for semantic similarity, retrieval, and clustering tasks.

Why this model?
- Produces dense vectors at a configurable dimensionality (768, 1536, or 3072) suitable for cosine or dot-product similarity.
- Handles long passages and multilingual content.
- Tuned for retrieval tasks like RAG, document search, and clustering.
- Cost-efficient for classroom-scale workloads (fractions of a cent per document).

If you’d like to explore other options, open the Vertex AI Model Garden → Text Embeddings in your GCP console. You’ll find alternatives such as:
- text-embedding-004 – the previous general-purpose text model (fixed 768-dimensional output).
- text-embedding-005 (experimental) – larger model, higher precision on longer documents.
- multimodal-embedding-001 – supports image + text embeddings for richer use cases.
- Third-party embeddings (via Model Garden) – e.g., bge-large-en, cohere-embed-v3, all-MiniLM.

PYTHON

#############################################
# 1. Imports and client setup
#############################################

from google import genai
from google.genai.types import HttpOptions, EmbedContentConfig, GenerateContentConfig
import numpy as np
from sklearn.neighbors import NearestNeighbors

# We'll assume you already have:
#   corpus_df  -> pandas DataFrame with columns: 'text', 'doc', 'chunk_id'
# If not, you'll need to define/load that before running this cell.


#############################################
# 2. Initialize the Gen AI client
#############################################

# vertexai=True routes calls through your GCP project (your billing, IAM,
# and quotas) rather than the public Gemini API endpoint.
client = genai.Client(
    http_options=HttpOptions(api_version="v1"),
    vertexai=True,
    project=PROJECT_ID,   # defined in Step 1
    location=REGION,
)

# Generation model for answering questions
GENERATION_MODEL_ID = "gemini-2.5-pro"        # or "gemini-2.5-flash" for cheaper/faster

# Embedding model for retrieval
EMBED_MODEL_ID = "gemini-embedding-001"

# Pick an embedding dimensionality and stick to it across corpus + queries.
EMBED_DIM = 1536  # valid typical choices: 768, 1536, 3072

PYTHON

#############################################
# 3. Helper: get embeddings for a list of texts
#############################################

def embed_texts(text_list, batch_size=32, dims=EMBED_DIM, task_type="RETRIEVAL_DOCUMENT"):
    """
    Convert a list of text strings into embedding vectors using gemini-embedding-001.
    Use task_type="RETRIEVAL_DOCUMENT" for corpus chunks and "RETRIEVAL_QUERY"
    for user queries. Returns a NumPy array of shape (len(text_list), dims).
    """
    vectors = []

    # batch to avoid huge single requests
    for start in range(0, len(text_list), batch_size):
        batch = text_list[start:start+batch_size]

        resp = client.models.embed_content(
            model=EMBED_MODEL_ID,
            contents=batch,
            config=EmbedContentConfig(
                task_type=task_type,              # "RETRIEVAL_DOCUMENT" for chunks, "RETRIEVAL_QUERY" for queries
                output_dimensionality=dims,       # must match EMBED_DIM everywhere
            ),
        )

        # resp.embeddings is aligned with 'batch'
        for emb in resp.embeddings:
            vectors.append(emb.values)

    return np.array(vectors, dtype="float32")
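
A quick shape check on the helper (any short strings work; note this makes one small billable API call):

PYTHON

test_vecs = embed_texts(["hello", "embeddings"])
print(test_vecs.shape)   # expected: (2, 1536) with EMBED_DIM = 1536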

PYTHON

#############################################
# 4. Embed the corpus and build the NN index
#############################################

# Create embeddings for every text chunk in the corpus
emb_matrix = embed_texts(corpus_df["text"].tolist(), dims=EMBED_DIM)
print("emb_matrix shape:", emb_matrix.shape)   # (num_chunks, EMBED_DIM)

# Fit NearestNeighbors on those embeddings once
nn = NearestNeighbors(
    metric="cosine",   # cosine distance is standard for semantic similarity
    n_neighbors=5,     # default neighborhood size; can override at query time
)
nn.fit(emb_matrix)


#############################################
# 5. Retrieval: given a query string, get top-k relevant chunks
#############################################

def retrieve(query, k=5):
    """
    Embed the user query with the SAME embedding model/dim,
    then find the top-k most similar corpus chunks.
    Returns a DataFrame of the top matches with a 'similarity' column.
    """

    # Embed the query into the same vector space as emb_matrix.
    # Queries use task_type="RETRIEVAL_QUERY"; documents used "RETRIEVAL_DOCUMENT".
    query_vec = embed_texts([query], dims=EMBED_DIM, task_type="RETRIEVAL_QUERY")[0]   # shape (EMBED_DIM,)

    # Find nearest neighbors using cosine distance
    distances, indices = nn.kneighbors([query_vec], n_neighbors=k, return_distance=True)

    # Grab those rows from the original corpus
    result_df = corpus_df.iloc[indices[0]].copy()

    # Convert cosine distance -> cosine similarity (1 - distance)
    result_df["similarity"] = 1 - distances[0]

    # Sort by similarity descending (highest similarity first)
    result_df = result_df.sort_values("similarity", ascending=False)

    return result_df
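
Before wiring in generation, a quick retrieval sanity check (the query string is just an example):

PYTHON

print(retrieve("data center water consumption", k=3)[["doc", "chunk_id", "similarity"]])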

PYTHON

#############################################
# 6. ask(): build grounded prompt + call Gemini to answer
#############################################

def ask(query, top_k=5, temperature=0.2):
    """
    Retrieval-Augmented Generation:
    - retrieve context chunks relevant to `query`
    - stuff those chunks into a prompt
    - ask Gemini to answer ONLY using that context
    """

    # Get top_k most relevant text chunks
    hits = retrieve(query, k=top_k)

    # Build a context block with provenance tags like [doc#chunk-id]
    context_lines = [
        f"[{row.doc}#chunk-{row.chunk_id}] {row.text}"
        for _, row in hits.iterrows()
    ]
    context_block = "\n\n".join(context_lines)

    # Instruction prompt for the model
    prompt = (
        "You are a sustainability analyst. "
        "Use only the following context to answer the question.\n\n"
        f"{context_block}\n\n"
        f"Q: {query}\n"
        "A:"
    )

    # Call the generative model
    response = client.models.generate_content(
        model=GENERATION_MODEL_ID,
        contents=prompt,
        config=GenerateContentConfig(
            temperature=temperature,  # lower = more deterministic, factual
        ),
    )

    # Return the model's answer text
    return response.text

Step 4: Generate answers using Gemini


PYTHON


#############################################
# 7. Test the pipeline end-to-end
#############################################

print(
    ask(
        "What is the name of the benchmark suite presented in a recent paper "
        "for measuring inference energy consumption?"
    )
)
# Expected answer: "ML.ENERGY Benchmark"

Step 5: Cost summary


| Step | Resource | Example component | Cost driver | Typical range |
| --- | --- | --- | --- | --- |
| VM runtime | Vertex AI Workbench | n1-standard-4 | Uptime (hourly) | ~$0.20/hr |
| Embeddings | gemini-embedding-001 | Managed API | Tokens embedded | ~$0.10 / 1M tokens |
| Retrieval | Local NN (scikit-learn) | CPU only | None | Free |
| Generation | gemini-2.5-flash | Managed API | Input/output tokens | ~$0.25 / 1M tokens |
| Hugging Face alt | T4 GPU VM | Local model inference | GPU uptime | ~$0.35/hr + egress |

(Optional) Hugging Face local substitution


To avoid managed API costs, you can instead use Hugging Face models locally.

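A minimal sketch, assuming sentence-transformers, transformers, and accelerate are installed and you have accepted the relevant model licenses on the Hugging Face Hub (the model names below are examples from the list above; any instruct model works):

PYTHON

from sentence_transformers import SentenceTransformer
from transformers import pipeline
import numpy as np

# Local embedding model (384-dim) stands in for gemini-embedding-001.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb_matrix = embedder.encode(corpus_df["text"].tolist(), normalize_embeddings=True)

# Local instruction-tuned generator stands in for Gemini.
# A GPU is strongly recommended; on CPU this will be slow.
generator = pipeline("text-generation", model="google/gemma-2b-it", device_map="auto")

def ask_local(query, k=5):
    # Normalized embeddings: dot product equals cosine similarity.
    q = embedder.encode([query], normalize_embeddings=True)
    sims = emb_matrix @ q.T
    hits = corpus_df.iloc[np.argsort(-sims[:, 0])[:k]]
    context = "\n\n".join(
        f"[{r.doc}#chunk-{r.chunk_id}] {r.text}" for _, r in hits.iterrows()
    )
    prompt = f"Use only the following context to answer.\n\n{context}\n\nQ: {query}\nA:"
    out = generator(prompt, max_new_tokens=256, do_sample=False)
    # The pipeline returns prompt + completion; strip the prompt.
    return out[0]["generated_text"][len(prompt):].strip()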

Key takeaways


  • Use Vertex AI managed embeddings and Gemini Flash for lightweight, cost-controlled RAG.
  • Cache embeddings; reusing them saves most cost.
  • For open alternatives, use Hugging Face models on GPU VMs (higher cost, more control).
  • This workflow generalizes to any retrieval task — not just sustainability papers.
  • GCP’s managed tools lower the barrier to experimentation while keeping enterprise security and IAM intact.

Key Points
  • Vertex AI’s RAG stack = low-op, cost-predictable.
  • Hugging Face = high control, high GPU cost.
  • Keep data local or in GCS to manage egress and compliance.
  • Always cite retrieved chunks for reproducibility and transparency.