RAG Architecture and Pipeline

Aug 11, 2025

Overview

This lecture introduces Retrieval-Augmented Generation (RAG) for generative AI, focusing on the architecture, pipeline stages, and key components like chunking, embedding, and database selection.

Introduction to RAG

  • RAG stands for Retrieval-Augmented Generation, an architecture used to enhance large language models (LLMs) by incorporating external knowledge sources.
  • It connects an LLM (e.g., GPT-4, Gemini, Llama) with external databases to improve accuracy and relevance.
  • RAG is preferred when models need current, domain-specific, or task-specific information not present in their training data.
  • RAG systems reduce hallucination and enhance response quality by fetching relevant external information.

RAG Architecture & Pipeline

  • The RAG pipeline consists of three main stages: Ingestion, Retrieval, and Generation.
  • Ingestion: Data is collected (documents, PDFs, etc.), split into manageable chunks, converted into embeddings (numerical vectors), indexed, and stored in a database.
  • Retrieval: The user query is converted into an embedding, and a semantic search retrieves the most relevant chunks from the knowledge base.
  • Generation: The retrieved chunks and the user prompt are combined and passed to the LLM, which generates the final response (a minimal end-to-end sketch follows this list).
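
To make the flow concrete, here is a minimal, self-contained Python sketch of the three stages. The bag-of-words embedding and in-memory index are toy stand-ins for a real embedding model and vector database, and the final prompt is printed rather than sent to an LLM:

```python
import math
import re
from collections import Counter

# Toy embedding: bag-of-words term counts. A real pipeline would use a
# neural embedding model (e.g., Sentence Transformers) instead.
def embed(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Ingestion: split each document into chunks, embed, and index in memory.
docs = [
    "RAG retrieves external knowledge. It reduces hallucination.",
    "Vector databases are optimized for similarity search.",
]
index = [(embed(chunk), chunk) for doc in docs for chunk in doc.split(". ")]

# Retrieval: embed the query and rank the indexed chunks by similarity.
query = "How does RAG reduce hallucination?"
best = max(index, key=lambda entry: cosine(embed(query), entry[0]))

# Generation: combine the retrieved chunk and the query into a prompt.
prompt = f"Context: {best[1]}\n\nQuestion: {query}"
print(prompt)  # in a real system this prompt is sent to the LLM
```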

Ingestion Process Components

  • Document: Source files can be in any format and come from any location (local, web, cloud).
  • Chunking: Documents are split into chunks (e.g., sentences or paragraphs) to fit LLM token limits and improve retrieval efficiency (a chunking sketch follows this list).
    • Chunks should not be too large (they dilute relevance and risk exceeding context limits) or too small (they lose surrounding context).
    • Chunk size depends on data structure, retrieval constraints, system resources, and task requirements.
    • Overlapping chunks can improve context coverage across chunk boundaries.
  • Embedding: Chunks are transformed into vector embeddings that capture semantic meaning (an embedding example follows this list).
    • Modern embeddings come from neural models (e.g., OpenAI embedding models, BERT, Sentence Transformers).
    • Sentence-level embeddings are now preferred over word-level embeddings because they preserve context.
  • Indexing: Embeddings are indexed to allow fast similarity search during retrieval.
  • Database: Common choices include vector databases (e.g., Pinecone, ChromaDB), graph databases (e.g., Neo4j), and SQL/NoSQL databases (a ChromaDB example follows this list).
    • Vector databases are optimized for similarity search.
    • Graph databases store relationships between entities, which is useful for advanced queries.
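
The chunking sketch referenced above: a minimal character-based chunker with overlap. The 500-character size and 50-character overlap are illustrative defaults, not values prescribed in the lecture; production pipelines more often split on tokens or sentence boundaries (LangChain's RecursiveCharacterTextSplitter follows a similar idea).

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks. Each chunk repeats
    the last `overlap` characters of the previous one, so content that
    straddles a boundary survives intact in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Small sizes chosen here only to make the overlap visible.
for chunk in chunk_text("Retrieval-Augmented Generation grounds LLM answers "
                        "in external documents.", chunk_size=40, overlap=10):
    print(repr(chunk))
```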
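
The embedding step, sketched with the Sentence Transformers library mentioned above. The all-MiniLM-L6-v2 checkpoint is one commonly used small model chosen here for illustration; any other model name works the same way:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "RAG combines retrieval with generation.",
    "Chunk size affects retrieval quality.",
]
embeddings = model.encode(chunks)  # one vector per chunk
print(embeddings.shape)            # (2, 384): two 384-dimensional vectors
```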
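
And the ChromaDB example referenced above: a minimal index-and-query round trip. The collection name and documents are made up for illustration, and Chroma applies its default embedding function since none is supplied:

```python
# pip install chromadb
import chromadb

client = chromadb.Client()  # in-memory; chromadb.PersistentClient persists to disk
collection = client.create_collection("lecture_notes")

# Ingestion side: add chunks; Chroma embeds and indexes them automatically.
collection.add(
    ids=["c1", "c2"],
    documents=[
        "Vector databases are optimized for similarity search.",
        "Graph databases store entities as nodes and relationships as edges.",
    ],
)

# Retrieval side: semantic search for the chunk nearest the query.
results = collection.query(
    query_texts=["Which database type suits similarity search?"],
    n_results=1,
)
print(results["documents"][0])  # best-matching chunk(s) for the first query
```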

Key Terms & Definitions

  • RAG (Retrieval-Augmented Generation) — Method to enhance LLMs using external document retrieval and integration.
  • LLM (Large Language Model) — AI model trained on large text corpora (e.g., GPT, Gemini, Llama).
  • Embedding — Numerical vector representation of text that captures semantics.
  • Chunking — Dividing large documents into smaller, manageable pieces for efficient processing.
  • Vector Database — Storage optimized for similarity search using vector embeddings.
  • Graph Database — Database storing entities as nodes and relationships as edges.

Action Items / Next Steps

  • Review the recommended articles on chunking strategies and vector database comparisons.
  • Explore documentation for Sentence Transformers and embedding models.
  • Prepare for the next session on the retrieval and generation phases of RAG.