BYU Strategy - Marriott School of Business

Retrieval Augmented Generation

Introduction

Retrieval-Augmented Generation (RAG) is a method for improving the accuracy of LLMs by retrieving relevant information from external sources before generating a response.

As we’ve discussed, even the best LLMs can give users less-than-accurate information, due in part to gaps in their training data. When you build LLM apps for business use, relying on default model behavior is therefore not enough. RAG is an approach designed to close those gaps by supplying specific, up-to-date information to the model at the moment it generates a response.

Closing gaps in data naturally requires more data. In the context of RAG, this typically takes the form of an internal knowledge base, proprietary research, communication logs, or similar sources. A RAG system receives the user’s prompt, retrieves the most relevant passages from that additional documentation, and then generates a response grounded in the retrieved context.

Use cases for this kind of tool are plentiful and include:

  • A customer support bot that pulls from company policies
  • A student-facing assistant that pulls from course syllabi
  • An internal help desk that references HR manuals

Below are several links to real-life examples of RAG use cases at organizations like LinkedIn, Harvard, and the Royal Bank of Canada:


How It Works (the RAG pipeline)

flowchart TD
  subgraph A[Off-line Ingestion]
    A1[Sources\n• PDFs/Docs/Wiki\n• Tickets/Email/Slack\n• DB/CSV/API\n• Web pages]
    A2[Preprocess & Clean\nOCR • HTML->Text • De-dup]
    A3[Chunking\nsemantic/sentence/windowed]
    A4["Embed Chunks\n(text -> vectors)"]
    A5[("Vector DB\nFAISS/Pinecone/Weaviate")]
    A6[("Metadata Store\nIDs • titles • perms • timestamps • tags")]
    A1 --> A2 --> A3 --> A4 --> A5
    A3 --> A6
  end

  subgraph B[Online Query Path]
    B1[User Query]
    B2[Query Understanding\nrewrite • expand • detect intent]
    B3[Retriever\nHybrid: BM25 + Vector kNN\n+ filters from metadata/perms]
    B4["Reranker (cross-encoder)\nrelevance scoring"]
    B5[Context Builder\nselect top-N • trim to tokens]
    B6[Prompt Template\nsystem + instructions + context + query]
    B7[LLM Generate\nwith grounded context]
    B8[Post-process\ncitations • tool-calls • safety]
    B9[Answer + Citations]
  end

  subgraph C[Ops & Feedback]
    C1[Telemetry/Logs\nlatency • hit rate • token use]
    C2[Eval Harness\ngroundedness • faithfulness • relevance]
    C3[User Feedback\n👍/👎 • corrections]
    C4[Retraining/Index Refresh\nrechunk • reembed • recalc filters]
  end

  A5 -->|vectors| B3
  A6 -->|metadata/perms| B3
  B1 --> B2 --> B3 --> B4 --> B5 --> B6 --> B7 --> B8 --> B9
  B2 --> C1
  B3 --> C1
  B7 --> C1
  B8 --> C2
  B9 --> C3
  C3 --> C4
  C2 --> C4
  C4 --> A3
  C4 --> A4

Key Components of RAG: Retriever and Generator

The retriever is responsible for fetching relevant information from a large corpus of documents or a knowledge base. Typically, the retriever uses search algorithms to identify and rank the most relevant documents for a given query. These algorithms can be based on traditional information retrieval techniques, such as TF-IDF or BM25 (which find word matches), or on more advanced methods like dense retrieval using neural embeddings (which find semantic matches). Dense retrieval encodes both the query and the documents into a high-dimensional vector space, allowing for more nuanced similarity comparisons. Code examples at the end of this chapter show how to perform dense retrieval.

Once the retriever has identified relevant documents, the generator takes over. The generator is typically a large language model, such as GPT (Generative Pre-trained Transformer), that is prompted (or fine-tuned) to integrate the retrieved information into coherent and contextually appropriate responses. The generator uses the retrieved documents as additional context, effectively ‘grounding’ its outputs in specific, relevant, and often factual data. Code examples at the close of the chapter show how to link the retriever and the generator.

Evaluating the Performance of RAG Systems

Evaluating the performance of RAG systems is crucial to ensuring their effectiveness and reliability in real-world applications. Because these systems combine information retrieval techniques with generative models to produce answers that are both relevant and contextually appropriate, their evaluation involves assessing both the retrieval and the generation components.

The retrieval component of a RAG system is typically evaluated using metrics common in information retrieval, such as Precision, Recall, and F1-score. Precision measures the proportion of relevant documents retrieved among all retrieved documents, while Recall measures the proportion of relevant documents retrieved out of all relevant documents available. F1-score provides a balance between Precision and Recall. These metrics help determine how well the system is retrieving useful information to support the generative model.

For the generative component, evaluation often involves metrics used in natural language processing (NLP), such as BLEU, ROUGE, and METEOR scores. These metrics compare the generated text against a set of reference texts to assess the quality and relevance of the output. BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between the generated text and reference texts, while ROUGE (Recall-Oriented Understudy for Gisting Evaluation) focuses on recall-based overlap. METEOR (Metric for Evaluation of Translation with Explicit ORdering) considers synonyms and stemming, providing a more nuanced evaluation.

Beyond these traditional metrics, user satisfaction and task success rates are also vital in evaluating RAG systems, especially in interactive applications. User studies can provide insights into how well the system meets user needs and expectations. Additionally, task-specific metrics can be designed to assess how effectively a RAG system contributes to specific goals, such as answering customer queries or providing technical support. In practice, user satisfaction is often the single most important metric to track.

The Future of RAG

The scalability of RAG systems will be a critical area of focus. As the volume of data grows, retrieving and generating information efficiently at scale will become the primary challenge. Techniques such as distributed computing, parallel processing, and the use of specialized hardware like GPUs and TPUs will be essential to handle the computational demands of large-scale RAG applications. Additionally, innovations in model compression and optimization will help RAG systems remain accessible and efficient even on resource-constrained devices.

Ethical considerations and bias mitigation will also play a significant role in the future of RAG systems. As these systems become more prevalent in decision-making processes and information dissemination, ensuring that they operate fairly and without bias is crucial. Future research will likely focus on developing techniques to detect and mitigate biases in both the retrieval and generation phases, ensuring that RAG systems provide equitable and accurate information to all users.

Building and Evaluating a RAG Chatbot

Step 1: Define Scope and Sources

Before building the system, clarify what kinds of questions the chatbot should be able to answer.

  • Identify target users (customers, employees, students)
  • Decide what sources will provide the answers:
    • Internal documents (PDFs, wikis, policies)
    • Customer support tickets or FAQs
    • Structured databases or APIs
    • Public web data or research papers

Step 2: Collect and Normalize Content

Gather the documents you want your chatbot to draw from.

  • Convert all sources into plain text (PDF → text, HTML → markdown)
  • Clean the text: remove boilerplate, fix encodings, strip ads/navigation
  • Store metadata such as:
    • Document title
    • File path or URL
    • Author and timestamps
    • Permissions or tags
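
A minimal sketch of this step, assuming the pypdf package and a hypothetical file name; it extracts plain text from a PDF and records basic metadata alongside it.

from datetime import datetime, timezone
from pypdf import PdfReader  # pip install pypdf

def load_pdf(path: str) -> dict:
    # Extract plain text from every page and attach simple metadata
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return {
        "text": text,
        "metadata": {
            "title": path,          # replace with a friendlier display title if available
            "source": path,         # file path or URL
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "tags": ["policy"],     # example tag; adjust per document
        },
    }

doc = load_pdf("policies.pdf")  # hypothetical file name
print(doc["metadata"])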

Step 3: Chunk the Documents

Large documents must be broken into smaller, retrievable pieces.

  • Typical chunk size: 300–600 tokens (roughly a few paragraphs)
  • Use overlap (10–20%) so context is not lost between chunks
  • Specialized strategies:
    • Sentence or semantic chunking for narrative text
    • Table-aware chunking for structured data
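
A minimal chunking sketch, continuing from the Step 2 example, that uses whitespace-separated words as a rough stand-in for tokens; real systems typically count true tokens (for example with tiktoken) or split on sentence boundaries.

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 60) -> list[str]:
    # Slide a window of chunk_size words forward, keeping `overlap` words of shared context
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_text(doc["text"])  # `doc` comes from the Step 2 sketch
print(f"{len(chunks)} chunks, ~{len(chunks[0].split())} words each")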

Step 4: Create Embeddings

Each chunk is transformed into a numerical vector (embedding).

  • Choose an embedding model (for example: text-embedding-3-large)
  • Store both the vector and the metadata
  • Ensure multilingual support if needed
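
A minimal embedding sketch using the sentence-transformers library (the same model reappears in the dense-retrieval example at the end of this chapter); the OpenAI embedding model named above would work the same way through its own API.

from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local model, 384-dimensional vectors

# Keep each vector next to its text and metadata so nothing gets separated later
records = []
for i, chunk in enumerate(chunks):  # `chunks` comes from the Step 3 sketch
    records.append({
        "id": i,
        "text": chunk,
        "vector": embedder.encode(chunk),
        "metadata": {"source": doc["metadata"]["source"], "chunk_index": i},
    })

print(len(records), records[0]["vector"].shape)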

Step 5: Store Chunks in a Vector Database

Embeddings must be indexed for fast retrieval.

  • Local development: FAISS
  • Production-ready: Pinecone, Weaviate, Qdrant, or Elasticsearch
  • Index metadata alongside vectors so you can filter by:
    • Document type
    • Date ranges
    • User or team permissions
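
A minimal local-development sketch with FAISS, continuing from the Step 4 records; metadata sits in a parallel Python list so each result position maps back to its source chunk. Managed vector databases expose the same idea through their own APIs.

import numpy as np
import faiss  # pip install faiss-cpu

vectors = np.array([r["vector"] for r in records], dtype="float32")
faiss.normalize_L2(vectors)                   # normalized vectors -> inner product equals cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])   # exact (brute-force) inner-product index
index.add(vectors)

# FAISS stores only vectors; keep metadata in a parallel structure keyed by position
metadata_store = [{**r["metadata"], "text": r["text"]} for r in records]

def vector_search(query: str, k: int = 5):
    q = embedder.encode([query]).astype("float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [(float(s), metadata_store[i]) for s, i in zip(scores[0], ids[0]) if i != -1]

print(vector_search("How many vacation days carry over?")[:2])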

Step 6: Build the Retriever

The retriever locates relevant chunks in response to a query.

  • Hybrid retrieval (BM25 + vector search) often performs best
  • Retrieve a broad set of candidates (for example: top 20)
  • Apply filters (for example: only HR policies for HR-related questions)
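
A minimal hybrid-retrieval sketch that blends BM25 keyword scores (via the rank_bm25 package, an assumed choice) with the vector_search helper from Step 5; production systems usually normalize and weight the two score scales more carefully and apply metadata filters at this stage.

from rank_bm25 import BM25Okapi  # pip install rank-bm25

bm25 = BM25Okapi([r["text"].lower().split() for r in records])

def hybrid_search(query: str, k: int = 20, alpha: float = 0.5):
    # Keyword scores for every chunk, scaled to 0..1
    keyword_scores = bm25.get_scores(query.lower().split())
    keyword_scores = keyword_scores / max(keyword_scores.max(), 1e-9)

    # Vector scores for every chunk (search the whole index for simplicity)
    vec_scores = {meta["chunk_index"]: score
                  for score, meta in vector_search(query, k=len(records))}

    combined = []
    for i, r in enumerate(records):
        score = alpha * keyword_scores[i] + (1 - alpha) * vec_scores.get(i, 0.0)
        combined.append((score, r))
    combined.sort(key=lambda pair: pair[0], reverse=True)
    return combined[:k]

candidates = hybrid_search("vacation carryover policy")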

Step 7: Rerank the Candidates

Improve accuracy by reranking the retrieved chunks.

  • Use a cross-encoder reranker (for example: BGE-reranker, Cohere Rerank)
  • Narrow down to the top 5–7 passages that will fit in your token budget
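
A minimal reranking sketch with a cross-encoder from the sentence-transformers library; the checkpoint name is one common public option, not a requirement.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # small public reranker

def rerank(query: str, candidates, top_n: int = 5):
    # Score each (query, passage) pair jointly -- slower than retrieval, but more accurate
    pairs = [(query, record["text"]) for _, record in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [record for _, (_, record) in ranked[:top_n]]

top_passages = rerank("vacation carryover policy", candidates)  # `candidates` from Step 6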

Step 8: Build the Prompt and Context

Assemble the information to send to the language model.

  • Template structure:
    • System message: role and constraints
    • Human message: original query
    • Context block: top-ranked passages with citations
  • Example instruction:
    Answer using only the provided context. Cite your sources. If the information is missing, respond with “I don’t know.”
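
A minimal prompt-assembly sketch that numbers each passage so the model can cite it; the instruction text mirrors the example above.

SYSTEM_MESSAGE = (
    "You are a helpful assistant. Answer using only the provided context. "
    "Cite your sources by number. If the information is missing, respond with 'I don't know.'"
)

def build_messages(question: str, passages: list[dict]) -> list[dict]:
    # Number each passage so the answer can cite [1], [2], ...
    context = "\n\n".join(
        f"[{i + 1}] ({p['metadata']['source']}) {p['text']}"
        for i, p in enumerate(passages)
    )
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": f"Question: {question}\n\nContext:\n{context}"},
    ]

messages = build_messages("What is the vacation carryover policy?", top_passages)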

Step 9: Generate the Answer

Send the prompt and context to the language model.

  • Use a reliable instruct-tuned LLM
  • Recommended parameters:
    • Temperature: 0.1–0.3 (for factual tasks)
    • Max tokens: 512–800 (for concise answers)
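
A minimal generation sketch using the OpenAI Python SDK with the parameters suggested above; any instruct-tuned model behind a comparable chat API would work the same way.

from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",   # example model choice, not a requirement
    messages=messages,     # system + user messages built in Step 8
    temperature=0.2,       # low temperature for factual tasks
    max_tokens=600,        # enough for a concise, cited answer
)

answer = response.choices[0].message.content
print(answer)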

Step 10: Post-Process the Output

Enhance the raw model response before showing it to the user.

  • Insert citations or clickable source links
  • Apply filters to redact sensitive information
  • Optionally route to external tools (calculator, SQL query executor)
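
A minimal post-processing sketch: a regular-expression pass that redacts email addresses and an appended source list matching the citation numbers from Step 8. Real deployments use stronger PII detection and richer citation handling.

import re

def post_process(answer: str, passages: list[dict]) -> str:
    # Redact email addresses before the answer reaches the user (a simple PII example)
    redacted = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[redacted email]", answer)

    # Append a numbered source list that matches the [1], [2], ... citations in the prompt
    sources = "\n".join(
        f"[{i + 1}] {p['metadata']['source']}" for i, p in enumerate(passages)
    )
    return f"{redacted}\n\nSources:\n{sources}"

print(post_process(answer, top_passages))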

Step 11: Evaluate and Monitor

Continuously measure the chatbot’s performance.

  • Track metrics:
    • Retrieval hit rate
    • Latency
    • Faithfulness (answers supported by context)
  • Build a small gold-standard test set of Q&A for benchmarking
  • Gather user feedback (thumbs up/down, corrections)
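
A minimal sketch of a retrieval hit-rate check against a small, hand-built gold set; the questions and expected chunk indexes below are placeholders.

# Each gold item pairs a question with the chunk index that should answer it (placeholder values)
gold_set = [
    {"question": "What is the vacation carryover policy?", "expected_chunk": 3},
    {"question": "How do I submit an expense report?", "expected_chunk": 7},
]

def retrieval_hit_rate(gold: list[dict], k: int = 5) -> float:
    hits = 0
    for item in gold:
        retrieved = {r["metadata"]["chunk_index"] for _, r in hybrid_search(item["question"], k=k)}
        hits += int(item["expected_chunk"] in retrieved)
    return hits / len(gold)

print(f"Hit rate @5: {retrieval_hit_rate(gold_set):.2f}")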

Step 12: Deploy and Maintain

Deploy the chatbot into production with monitoring.

  • Wrap the pipeline in an API (FastAPI, Flask)
  • Create a user-facing interface (web app, chat widget)
  • Schedule index updates when new documents are added
  • Maintain access control by applying metadata filters during retrieval
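
A minimal FastAPI wrapper around a rag_chat-style helper (such as the one defined in the Code Examples section below); run it locally with uvicorn.

from fastapi import FastAPI      # pip install fastapi uvicorn
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    question: str

@app.post("/ask")
def ask(payload: Question):
    # rag_chat wraps retrieve -> rerank -> prompt -> generate, as in the example below
    return {"answer": rag_chat(payload.question)}

# Run locally with:  uvicorn app:app --reload   (assuming this file is saved as app.py)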

Code Examples

Example of an OpenAI-powered RAG chatbot built with LangChain:

# Install required packages if not already installed:
# pip install langchain-community langchain-openai langchain-text-splitters faiss-cpu pypdf

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Step 1: Load documents
loader = PyPDFLoader("policies.pdf")  # replace with your own file
docs = loader.load()

# Step 2: Chunk the documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)
chunks = splitter.split_documents(docs)

# Step 3: Create embeddings and store in FAISS
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
db = FAISS.from_documents(chunks, embeddings)

# Step 4: Set up the retriever
retriever = db.as_retriever(search_kwargs={"k": 5})

# Step 5: Build a prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. "
               "Answer strictly from the context. "
               "If unsure, say 'I don't know.' Cite sources when possible."),
    ("human", "Question: {question}\n\nContext:\n{context}")
])

# Step 6: Connect to an LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2, max_tokens=500)

# Step 7: Define a helper to ask questions
def rag_chat(question: str) -> str:
    # Retrieve the most relevant chunks for the question
    results = retriever.invoke(question)
    context = "\n\n".join(d.page_content for d in results)

    # Fill the prompt template with the question and retrieved context
    messages = prompt.format_messages(question=question, context=context)

    # Generate a grounded answer
    return llm.invoke(messages).content

# Example usage
print(rag_chat("What is the vacation carryover policy?"))

Example of dense retrieval:

from sentence_transformers import SentenceTransformer, util

# Load a pre-trained model for dense retrieval
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example documents
documents = [
    "The capital of France is Paris.",
    "Artificial Intelligence is transforming industries.",
    "Python is a popular programming language."
]

# Encode documents to vectors
document_embeddings = model.encode(documents, convert_to_tensor=True)

# Query
query = "What is the capital of France?"

# Encode query to vector
query_embedding = model.encode(query, convert_to_tensor=True)

# Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)

# Find the highest scoring document
most_relevant_index = cosine_scores.argmax()
print(f"Most relevant document: {documents[most_relevant_index]}")

Example of generation:

from transformers import pipeline

# Initialize a text generation pipeline
generator = pipeline('text-generation', model='gpt2')

# Retrieved context (from the previous retrieval step)
context = "The capital of France is Paris."

# Input prompt for the generator
prompt = f"Based on the information that {context}, what is the capital of France?"

# Generate a response
response = generator(prompt, max_length=50, num_return_sequences=1)

print(f"Generated Response: {response[0]['generated_text']}")

Example use case of RAG as a Customer Support Bot:

from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# Initialize the tokenizer, retriever, and model (the model needs a reference to the retriever)
model_name = "facebook/rag-token-base"
tokenizer = RagTokenizer.from_pretrained(model_name)
retriever = RagRetriever.from_pretrained(
    model_name, index_name="exact", use_dummy_dataset=True  # dummy index avoids downloading the full wiki index
)
model = RagTokenForGeneration.from_pretrained(model_name, retriever=retriever)

# Example query from a customer
query = "Can you tell me more about the battery life of the new XYZ smartphone?"

# Tokenize the input query
input_ids = tokenizer(query, return_tensors="pt").input_ids

# Retrieve relevant documents and generate a response
outputs = model.generate(input_ids, num_return_sequences=1)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print("Generated Response:", response[0])

Example of RAG as an educational tool:

# Example: Using RAG for educational purposes
# (reuses the tokenizer, retriever, and model loaded in the customer support example above)

# Query about a complex topic
topic_query = "Explain the concept of quantum entanglement in simple terms."

# Tokenize the input query
input_ids = tokenizer(topic_query, return_tensors="pt").input_ids

# Retrieve relevant documents and generate an educational response
outputs = model.generate(input_ids, num_return_sequences=1)
educational_response = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print("Educational Response:", educational_response[0])

Example of Python code to compute precision, recall, and F1-score:

from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_retrieval(true_labels, predicted_labels):
    precision = precision_score(true_labels, predicted_labels, average='binary')
    recall = recall_score(true_labels, predicted_labels, average='binary')
    f1 = f1_score(true_labels, predicted_labels, average='binary')
    return precision, recall, f1

# Example usage
true_labels = [1, 0, 1, 1, 0, 1, 0]
retrieved_labels = [1, 0, 0, 1, 0, 1, 1]
precision, recall, f1 = evaluate_retrieval(true_labels, retrieved_labels)
print(f"Precision: {precision}, Recall: {recall}, F1-score: {f1}")

Example of Python code to compute BLEU and ROUGE scores:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge  # pip install rouge

def evaluate_generation(reference_texts, generated_text):
    # BLEU score (smoothing avoids zero scores on short sentences)
    bleu_score = sentence_bleu(
        [ref.split() for ref in reference_texts],
        generated_text.split(),
        smoothing_function=SmoothingFunction().method1,
    )

    # ROUGE expects hypothesis and reference lists of equal length;
    # score the generated text against each reference and average
    rouge = Rouge()
    rouge_scores = rouge.get_scores(
        [generated_text] * len(reference_texts), reference_texts, avg=True
    )

    return bleu_score, rouge_scores

# Example usage
reference_texts = ["The cat sat on the mat.", "A cat was sitting on the mat."]
generated_text = "The cat is sitting on the mat."
bleu, rouge = evaluate_generation(reference_texts, generated_text)
print(f"BLEU score: {bleu}\nROUGE scores: {rouge}")

Link to Homework