Overview
This lecture introduces Retrieval-Augmented Generation (RAG) for generative AI, focusing on the architecture, pipeline stages, and key components like chunking, embedding, and database selection.
Introduction to RAG
- RAG stands for Retrieval-Augmented Generation, an architecture used to enhance large language models (LLMs) by incorporating external knowledge sources.
- It connects an LLM (e.g., GPT-4, Gemini, Llama) with external databases to improve accuracy and relevance.
- RAG is preferred when models need current, domain-specific, or task-specific information not present in their training data.
- RAG systems reduce hallucination and enhance response quality by fetching relevant external information.
RAG Architecture & Pipeline
- The RAG pipeline consists of three main stages: Ingestion, Retrieval, and Generation.
- Ingestion: Data is collected (documents, PDFs, etc.), split into manageable chunks, converted into embeddings (numerical vectors), indexed, and stored in a database.
- Retrieval: User queries are converted into embeddings, and a semantic search retrieves relevant chunks from the knowledge base.
- Generation: Retrieved information and the user prompt are combined and passed to the LLM, which generates a final response.
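To make the three stages concrete, here is a minimal end-to-end sketch. It assumes the sentence-transformers library; the model name, example chunks, and the call_llm() helper are illustrative assumptions, not part of the lecture material.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Ingestion: embed pre-chunked documents and keep the vectors in memory.
chunks = [
    "RAG combines retrieval with text generation.",
    "Vector databases support fast similarity search.",
]
chunk_vectors = embedder.encode(chunks, convert_to_tensor=True)

# Retrieval: embed the user query and fetch the most similar chunk.
query = "How does RAG find relevant information?"
query_vector = embedder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_vector, chunk_vectors)[0]
best_chunk = chunks[int(scores.argmax())]

# Generation: combine retrieved context with the prompt and call the LLM.
prompt = f"Context:\n{best_chunk}\n\nQuestion: {query}\nAnswer:"
# response = call_llm(prompt)  # hypothetical client for GPT-4, Gemini, Llama, etc.
```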
Ingestion Process Components
- Document: Source files can come in any format and from any location (local disk, web, cloud storage).
- Chunking: Documents are split into chunks (sentences or paragraphs) to match LLM token limitations and improve retrieval efficiency.
- Chunks should not be too large (they dilute relevance and risk exceeding token limits) or too small (they lose surrounding context).
- Chunk size depends on data structure, retrieval constraints, system resources, and task requirements.
- Overlapping chunks can improve context coverage by letting adjacent chunks share boundary text (see the chunking sketch after this list).
- Embedding: Chunks are transformed into vector embeddings that capture semantic meaning.
- Modern embeddings come from neural network models (e.g., OpenAI's embedding models, BERT, Sentence Transformers).
- Sentence-level embeddings are now preferred over word-level embeddings because they preserve context (see the embedding sketch after this list).
- Indexing: Embeddings are indexed to allow fast similarity search during retrieval.
- Database: Common types include vector databases (e.g., Pinecone, ChromaDB), graph databases (e.g., Neo4j), and SQL/NoSQL databases.
- Vector databases are optimized for similarity search (see the ChromaDB sketch after this list).
- Graph databases store entities and relationships, which is useful for advanced, relationship-centric queries.
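The sketches below illustrate the components above. First, a minimal fixed-size chunker with overlap; the default sizes are arbitrary assumptions, and production systems often split on sentence or paragraph boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap.

    chunk_size and overlap are illustrative defaults; real values depend
    on the embedding model's token limit and the retrieval task.
    """
    step = chunk_size - overlap  # each chunk re-reads `overlap` chars of the previous one
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```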
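Next, a small embedding demonstration with the Sentence Transformers library mentioned above; the model name is one common choice rather than a requirement, and the example sentences are invented.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
vectors = model.encode([
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Quarterly revenue grew by 8%.",
])
print(vectors.shape)                         # (3, 384): one vector per sentence
print(util.cos_sim(vectors[0], vectors[1]))  # high score: paraphrases land close together
print(util.cos_sim(vectors[0], vectors[2]))  # low score: unrelated topics land far apart
```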
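Finally, a minimal indexing-and-retrieval sketch with ChromaDB, one of the vector databases named above; the collection name and document contents are illustrative assumptions.

```python
import chromadb

client = chromadb.Client()  # in-memory instance; persistent clients also exist
collection = client.create_collection("lecture_notes")  # assumed collection name

# add() embeds the documents (via the collection's default embedding
# function) and indexes the resulting vectors in one step.
collection.add(
    documents=[
        "RAG reduces hallucination by grounding answers in retrieved text.",
        "Graph databases store entities as nodes and relationships as edges.",
    ],
    ids=["chunk-1", "chunk-2"],
)

# The query text is embedded the same way, then matched by vector similarity.
results = collection.query(query_texts=["How does RAG improve accuracy?"], n_results=1)
print(results["documents"][0][0])  # the most similar stored chunk
```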
Key Terms & Definitions
- RAG (Retrieval-Augmented Generation) — Method to enhance LLMs using external document retrieval and integration.
- LLM (Large Language Model) — AI model trained on large text corpora (e.g., GPT, Gemini, Llama).
- Embedding — Numerical vector representation of text that captures semantics.
- Chunking — Dividing large documents into smaller, manageable pieces for efficient processing.
- Vector Database — Storage optimized for similarity search using vector embeddings.
- Graph Database — Database storing entities as nodes and relationships as edges.
Action Items / Next Steps
- Review the recommended articles on chunking strategies and vector database comparisons.
- Explore documentation for Sentence Transformers and embedding models.
- Prepare for the next session on the retrieval and generation phases of RAG.