RAG Architecture and Pipeline

Aug 11, 2025

Overview

This lecture introduces Retrieval-Augmented Generation (RAG) for generative AI, focusing on the architecture, pipeline stages, and key components like chunking, embedding, and database selection.

Introduction to RAG

  • RAG stands for Retrieval-Augmented Generation, an architecture used to enhance large language models (LLMs) by incorporating external knowledge sources.
  • It connects an LLM (e.g., GPT-4, Gemini, Llama) with external databases to improve accuracy and relevance.
  • RAG is preferred when models need current, domain-specific, or task-specific information not present in their training data.
  • RAG systems reduce hallucination and enhance response quality by fetching relevant external information.

RAG Architecture & Pipeline

  • The RAG pipeline consists of three main stages: Ingestion, Retrieval, and Generation.
  • Ingestion: Data is collected (documents, PDFs, etc.), split into manageable chunks, converted into embeddings (numerical vectors), indexed, and stored in a database.
  • Retrieval: The user query is converted into an embedding, and a semantic search retrieves the most relevant chunks from the knowledge base.
  • Generation: The retrieved chunks and the user prompt are combined and passed to the LLM, which generates the final response (a minimal end-to-end sketch follows this list).
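
To make the flow concrete, here is a minimal, self-contained Python sketch of the three stages. The bag-of-words embedding and in-memory index are toy stand-ins for a real embedding model and vector database, and the final prompt is printed rather than sent to an LLM:

```python
import math
import re
from collections import Counter

# Toy embedding: bag-of-words term counts. A real pipeline would use a
# neural embedding model (e.g., Sentence Transformers) instead.
def embed(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Ingestion: split each document into chunks, embed, and index in memory.
docs = [
    "RAG retrieves external knowledge. It reduces hallucination.",
    "Vector databases are optimized for similarity search.",
]
index = [(embed(chunk), chunk) for doc in docs for chunk in doc.split(". ")]

# Retrieval: embed the query and rank the indexed chunks by similarity.
query = "How does RAG reduce hallucination?"
best = max(index, key=lambda entry: cosine(embed(query), entry[0]))

# Generation: combine the retrieved chunk and the query into a prompt.
prompt = f"Context: {best[1]}\n\nQuestion: {query}"
print(prompt)  # in a real system this prompt is sent to the LLM
```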

Ingestion Process Components

  • Document: Source files can be in any format and come from any location (local, web, cloud).
  • Chunking: Documents are split into chunks (e.g., sentences or paragraphs) to fit LLM token limits and improve retrieval efficiency (a chunking sketch follows this list).
    • Chunks should not be too large (they dilute relevance and risk exceeding context limits) or too small (they lose surrounding context).
    • Chunk size depends on data structure, retrieval constraints, system resources, and task requirements.
    • Overlapping chunks can improve context coverage across chunk boundaries.
  • Embedding: Chunks are transformed into vector embeddings that capture semantic meaning (an embedding example follows this list).
    • Modern embeddings come from neural models (e.g., OpenAI embedding models, BERT, Sentence Transformers).
    • Sentence-level embeddings are now preferred over word-level embeddings because they preserve context.
  • Indexing: Embeddings are indexed to allow fast similarity search during retrieval.
  • Database: Common choices include vector databases (e.g., Pinecone, ChromaDB), graph databases (e.g., Neo4j), and SQL/NoSQL databases (a ChromaDB example follows this list).
    • Vector databases are optimized for similarity search.
    • Graph databases store relationships between entities, which is useful for advanced queries.
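
The chunking sketch referenced above: a minimal character-based chunker with overlap. The 500-character size and 50-character overlap are illustrative defaults, not values prescribed in the lecture; production pipelines more often split on tokens or sentence boundaries (LangChain's RecursiveCharacterTextSplitter follows a similar idea).

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks. Each chunk repeats
    the last `overlap` characters of the previous one, so content that
    straddles a boundary survives intact in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Small sizes chosen here only to make the overlap visible.
for chunk in chunk_text("Retrieval-Augmented Generation grounds LLM answers "
                        "in external documents.", chunk_size=40, overlap=10):
    print(repr(chunk))
```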
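
The embedding step, sketched with the Sentence Transformers library mentioned above. The all-MiniLM-L6-v2 checkpoint is one commonly used small model chosen here for illustration; any other model name works the same way:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "RAG combines retrieval with generation.",
    "Chunk size affects retrieval quality.",
]
embeddings = model.encode(chunks)  # one vector per chunk
print(embeddings.shape)            # (2, 384): two 384-dimensional vectors
```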
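
And the ChromaDB example referenced above: a minimal index-and-query round trip. The collection name and documents are made up for illustration, and Chroma applies its default embedding function since none is supplied:

```python
# pip install chromadb
import chromadb

client = chromadb.Client()  # in-memory; chromadb.PersistentClient persists to disk
collection = client.create_collection("lecture_notes")

# Ingestion side: add chunks; Chroma embeds and indexes them automatically.
collection.add(
    ids=["c1", "c2"],
    documents=[
        "Vector databases are optimized for similarity search.",
        "Graph databases store entities as nodes and relationships as edges.",
    ],
)

# Retrieval side: semantic search for the chunk nearest the query.
results = collection.query(
    query_texts=["Which database type suits similarity search?"],
    n_results=1,
)
print(results["documents"][0])  # best-matching chunk(s) for the first query
```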

Key Terms & Definitions

  • RAG (Retrieval-Augmented Generation) — Method to enhance LLMs using external document retrieval and integration.
  • LLM (Large Language Model) — AI model trained on large text corpora (e.g., GPT, Gemini, Llama).
  • Embedding — Numerical vector representation of text that captures semantics.
  • Chunking — Dividing large documents into smaller, manageable pieces for efficient processing.
  • Vector Database — Storage optimized for similarity search using vector embeddings.
  • Graph Database — Database storing entities as nodes and relationships as edges.

Action Items / Next Steps

  • Review the recommended articles on chunking strategies and vector database comparisons.
  • Explore documentation for Sentence Transformers and embedding models.
  • Prepare for the next session on the retrieval and generation phases of RAG.