Overview
The lecture covers the "lost context" problem in Retrieval-Augmented Generation (RAG) systems and introduces two advanced techniques—late chunking and contextual retrieval—to improve retrieval accuracy and reduce hallucinations.
The Lost Context Problem in RAG
- RAG agents often provide incomplete or inaccurate answers due to loss of context when chunking documents.
- Chunks are typically processed in isolation, so references to other parts of the document (for example, a pronoun whose antecedent sits in an earlier chunk) are misunderstood or missed.
- This can lead to poor retrieval results and hallucinations in LLM-generated answers.
Chunking in RAG Systems
- Chunking divides documents into smaller segments for embedding and retrieval.
- Documents can be chunked by sentence, by paragraph, or by a fixed number of characters, with or without overlap between neighboring chunks (two of these strategies are sketched after this list).
- No universal chunking strategy; best method depends on document type.
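As a rough illustration, the sketch below implements two of these strategies in plain Python: fixed-size character chunking with overlap, and paragraph chunking. The chunk size and overlap values are illustrative defaults, not numbers prescribed in the lecture.

```python
# Minimal sketch of two common chunking strategies; sizes and overlap are
# illustrative defaults, not values from the lecture.

def chunk_by_characters(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size character chunks with some overlap between neighbors."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def chunk_by_paragraph(text: str) -> list[str]:
    """One chunk per paragraph, splitting on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```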
Late Chunking Technique
- Late chunking reverses the usual order: the whole document is embedded first with a long-context embedding model, and chunk boundaries are applied afterwards.
- Every chunk's vector is therefore created with full-document context, preserving references between chunks.
- Pooling/aggregation then averages the token vectors within each chunk to produce one vector per chunk for the vector database (see the sketch below).
- Jina AI's long-context embedding models support roughly 8,000 tokens; other models support context windows of up to 32,000 tokens.
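A minimal late-chunking sketch is shown below. It assumes a long-context Hugging Face embedding model that exposes per-token hidden states; the model name, character-offset chunk boundaries, and truncation settings are illustrative assumptions rather than the lecture's exact setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"  # long-context (~8k tokens); assumed choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

def late_chunk(document: str, chunk_spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Embed the whole document once, then mean-pool token vectors per chunk.

    chunk_spans holds (start, end) character offsets of each chunk in `document`.
    """
    inputs = tokenizer(document, return_tensors="pt",
                       return_offsets_mapping=True, truncation=True)
    offsets = inputs.pop("offset_mapping")[0]            # (num_tokens, 2) character spans
    with torch.no_grad():
        token_vectors = model(**inputs).last_hidden_state[0]   # (num_tokens, dim)

    chunk_vectors = []
    for start, end in chunk_spans:
        # Keep tokens whose character span lies inside this chunk;
        # special tokens have empty (0, 0) spans and are skipped.
        mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
        chunk_vectors.append(token_vectors[mask].mean(dim=0))  # pooling step
    return chunk_vectors
```

Each pooled vector is then stored in the vector database exactly like a normal chunk embedding.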
Contextual Retrieval with Prompt Caching
- Proposed by Anthropic, this method generates a one-sentence description of each chunk's context using an LLM.
- The chunk and its descriptive blurb are embedded together, anchoring content in broader context.
- Enables more accurate answers and reduces hallucinations, especially for longer documents.
- Ingestion can be slow and costly due to repeated LLM calls and large context windows.
- Prompt caching reduces costs by caching the document context and referencing it for each chunk.
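A sketch of this pattern with the Anthropic SDK is shown below; the model name, prompt wording, and the embedding step are assumptions for illustration, not the lecture's exact implementation. The full document goes into a cached system block so the repeated per-chunk calls reuse it instead of reprocessing it every time.

```python
# Sketch of contextual retrieval with prompt caching; model name and prompt
# wording are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize_chunk(document: str, chunk: str) -> str:
    """Ask the LLM for a short blurb situating `chunk` within `document`."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",   # assumed model; any capable model works
        max_tokens=150,
        system=[{
            "type": "text",
            "text": f"<document>\n{document}\n</document>",
            # Cache the full document so repeated per-chunk calls reuse it cheaply.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{
            "role": "user",
            "content": (
                "Write one short sentence situating the following chunk within "
                f"the document above, for search retrieval:\n\n{chunk}"
            ),
        }],
    )
    return response.content[0].text.strip()

# Each chunk is then embedded together with its blurb, e.g.:
#   vector = embed(contextualize_chunk(doc, chunk) + "\n" + chunk)
# where embed() is whatever embedding model your pipeline already uses (hypothetical helper).
```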
Practical Implementation & Evaluation
- Integrating these advanced chunking strategies into tools like n8n requires custom code and workflows.
- Both late chunking and contextual retrieval showed improved retrieval quality over standard RAG (a simple way to measure this on your own data is sketched below).
- Rate limits, token costs, and speed are key challenges when working with large documents.
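One simple, commonly used evaluation measure is recall@k over a hand-built question set. The sketch below assumes a hypothetical retrieve(question, k) function wrapping your vector-store query and an assumed benchmark format; it is not part of the lecture's code.

```python
from typing import Callable

def recall_at_k(
    retrieve: Callable[[str, int], list[str]],   # hypothetical retrieval function
    benchmark: list[tuple[str, str]],            # (question, id of the chunk that answers it)
    k: int = 5,
) -> float:
    """Fraction of questions whose relevant chunk appears in the top-k results."""
    hits = sum(
        1 for question, relevant_id in benchmark
        if relevant_id in retrieve(question, k)
    )
    return hits / len(benchmark)

# Example comparison across pipelines (both retrievers are assumed names):
#   recall_at_k(standard_rag_retrieve, benchmark)
#   recall_at_k(late_chunking_retrieve, benchmark)
```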
Key Terms & Definitions
- RAG (Retrieval-Augmented Generation) — AI approach using external knowledge retrieval for LLMs to improve answer accuracy.
- Chunking — Splitting documents into smaller parts for processing and embedding.
- Embedding — Converting text chunks into vector representations for similarity search.
- Vector Store/Database — Storage for embedding vectors to enable fast retrieval.
- Late Chunking — Embedding the whole document first and pooling per-chunk vectors afterwards, so each chunk vector carries full-document context.
- Contextual Retrieval — Adding LLM-generated context descriptions to each chunk before embedding.
- Prompt Caching — Caching a large prompt prefix (such as the full document) on the LLM provider's side so it is not re-sent and re-processed on every call.
Action Items / Next Steps
- Experiment with late chunking using long-context embedding models for your RAG pipeline.
- Try contextual retrieval by generating context blurbs with an LLM and prepending them to each chunk before embedding.
- Be mindful of rate limits and token costs when processing large documents.
- Evaluate retrieval quality using your own benchmarks and adjust chunking strategies as needed.