Overview
The lecture covers the "lost context" problem in Retrieval-Augmented Generation (RAG) systems and introduces two advanced techniques—late chunking and contextual retrieval—to improve retrieval accuracy and reduce hallucinations.
The Lost Context Problem in RAG
- RAG agents often provide incomplete or inaccurate answers due to loss of context when chunking documents.
- Chunks are typically processed in isolation, so references to other parts of the document (for example, a pronoun whose antecedent sits in an earlier chunk) are misunderstood or missed.
- This can lead to poor retrieval results and hallucinations in LLM-generated answers.
Chunking in RAG Systems
- Chunking divides documents into smaller segments for embedding and retrieval.
- Documents can be chunked by sentence, by paragraph, or by a fixed number of characters, with or without overlap between neighboring chunks (two of these strategies are sketched after this list).
- No universal chunking strategy; best method depends on document type.
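As a rough illustration, the sketch below implements two of these strategies in plain Python: fixed-size character chunking with overlap, and paragraph chunking. The chunk size and overlap values are illustrative defaults, not numbers prescribed in the lecture.

```python
# Minimal sketch of two common chunking strategies; sizes and overlap are
# illustrative defaults, not values from the lecture.

def chunk_by_characters(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size character chunks with some overlap between neighbors."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def chunk_by_paragraph(text: str) -> list[str]:
    """One chunk per paragraph, splitting on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```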
Late Chunking Technique
- Late chunking reverses the usual order: the whole document is embedded first with a long-context embedding model, and chunk boundaries are applied afterwards.
- Every chunk's vector is therefore created with full-document context, preserving references between chunks.
- Pooling/aggregation then averages the token vectors within each chunk to produce one vector per chunk for the vector database (see the sketch below).
- Jina AI's long-context embedding models support roughly 8,000 tokens; other models support context windows of up to 32,000 tokens.
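A minimal late-chunking sketch is shown below. It assumes a long-context Hugging Face embedding model that exposes per-token hidden states; the model name, character-offset chunk boundaries, and truncation settings are illustrative assumptions rather than the lecture's exact setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"  # long-context (~8k tokens); assumed choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

def late_chunk(document: str, chunk_spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Embed the whole document once, then mean-pool token vectors per chunk.

    chunk_spans holds (start, end) character offsets of each chunk in `document`.
    """
    inputs = tokenizer(document, return_tensors="pt",
                       return_offsets_mapping=True, truncation=True)
    offsets = inputs.pop("offset_mapping")[0]            # (num_tokens, 2) character spans
    with torch.no_grad():
        token_vectors = model(**inputs).last_hidden_state[0]   # (num_tokens, dim)

    chunk_vectors = []
    for start, end in chunk_spans:
        # Keep tokens whose character span lies inside this chunk;
        # special tokens have empty (0, 0) spans and are skipped.
        mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
        chunk_vectors.append(token_vectors[mask].mean(dim=0))  # pooling step
    return chunk_vectors
```

Each pooled vector is then stored in the vector database exactly like a normal chunk embedding.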
Contextual Retrieval with Prompt Caching
- Proposed by Anthropic, this method generates a one-sentence description of each chunk's context using an LLM.
- The chunk and its descriptive blurb are embedded together, anchoring content in broader context.
- Enables more accurate answers and reduces hallucinations, especially for longer documents.
- Ingestion can be slow and costly due to repeated LLM calls and large context windows.
- Prompt caching reduces costs by caching the document context and referencing it for each chunk.
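A sketch of this pattern with the Anthropic SDK is shown below; the model name, prompt wording, and the embedding step are assumptions for illustration, not the lecture's exact implementation. The full document goes into a cached system block so the repeated per-chunk calls reuse it instead of reprocessing it every time.

```python
# Sketch of contextual retrieval with prompt caching; model name and prompt
# wording are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize_chunk(document: str, chunk: str) -> str:
    """Ask the LLM for a short blurb situating `chunk` within `document`."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",   # assumed model; any capable model works
        max_tokens=150,
        system=[{
            "type": "text",
            "text": f"<document>\n{document}\n</document>",
            # Cache the full document so repeated per-chunk calls reuse it cheaply.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{
            "role": "user",
            "content": (
                "Write one short sentence situating the following chunk within "
                f"the document above, for search retrieval:\n\n{chunk}"
            ),
        }],
    )
    return response.content[0].text.strip()

# Each chunk is then embedded together with its blurb, e.g.:
#   vector = embed(contextualize_chunk(doc, chunk) + "\n" + chunk)
# where embed() is whatever embedding model your pipeline already uses (hypothetical helper).
```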
Practical Implementation & Evaluation
- Integrating these advanced chunking strategies into tools like n8n requires custom code and workflows.
- Both late chunking and contextual retrieval showed improved retrieval quality over standard RAG (a simple way to measure this on your own data is sketched below).
- Rate limits, token costs, and speed are key challenges when working with large documents.
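One simple, commonly used evaluation measure is recall@k over a hand-built question set. The sketch below assumes a hypothetical retrieve(question, k) function wrapping your vector-store query and an assumed benchmark format; it is not part of the lecture's code.

```python
from typing import Callable

def recall_at_k(
    retrieve: Callable[[str, int], list[str]],   # hypothetical retrieval function
    benchmark: list[tuple[str, str]],            # (question, id of the chunk that answers it)
    k: int = 5,
) -> float:
    """Fraction of questions whose relevant chunk appears in the top-k results."""
    hits = sum(
        1 for question, relevant_id in benchmark
        if relevant_id in retrieve(question, k)
    )
    return hits / len(benchmark)

# Example comparison across pipelines (both retrievers are assumed names):
#   recall_at_k(standard_rag_retrieve, benchmark)
#   recall_at_k(late_chunking_retrieve, benchmark)
```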
Key Terms & Definitions
- RAG (Retrieval-Augmented Generation) — AI approach using external knowledge retrieval for LLMs to improve answer accuracy.
- Chunking — Splitting documents into smaller parts for processing and embedding.
- Embedding — Converting text chunks into vector representations for similarity search.
- Vector Store/Database — Storage for embedding vectors to enable fast retrieval.
- Late Chunking — Embedding the whole document first and pooling per-chunk vectors afterwards, so each chunk vector carries full-document context.
- Contextual Retrieval — Adding LLM-generated context descriptions to each chunk before embedding.
- Prompt Caching — Caching a large prompt prefix (such as the full document) on the LLM provider's side so it is not re-sent and re-processed on every call.
Action Items / Next Steps
- Experiment with late chunking using long-context embedding models for your RAG pipeline.
- Try contextual retrieval by generating context blurbs with an LLM and prepending them to each chunk before embedding.
- Be mindful of rate limits and token costs when processing large documents.
- Evaluate retrieval quality using your own benchmarks and adjust chunking strategies as needed.