🧠

Advanced RAG Techniques

Jun 18, 2025

Overview

The lecture covers the "lost context" problem in Retrieval-Augmented Generation (RAG) systems and introduces two advanced techniques—late chunking and contextual retrieval—to improve retrieval accuracy and reduce hallucinations.

The Lost Context Problem in RAG

  • RAG agents often provide incomplete or inaccurate answers due to loss of context when chunking documents.
  • Chunks are typically embedded in isolation, so references to content outside the chunk (pronouns, acronyms, earlier definitions) are misunderstood or missed.
  • This can lead to poor retrieval results and hallucinations in LLM-generated answers.

Chunking in RAG Systems

  • Chunking divides documents into smaller segments for embedding and retrieval.
  • There are different chunking strategies: by sentence, by paragraph, or by fixed character count, with or without overlap.
  • There is no universal chunking strategy; the best method depends on the document type (a naive baseline is sketched below).
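
As a baseline, a fixed-size character chunker with overlap might look like the following minimal sketch (the size and overlap values are arbitrary assumptions, not recommendations):

```python
# Naive fixed-size character chunking with overlap.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks; adjacent chunks share `overlap` chars."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "RAG systems split documents into chunks before embedding. " * 100
print(len(chunk_text(doc)), "chunks")
```

Overlap is the usual low-effort mitigation for lost context: neighboring chunks share a margin, so a sentence cut at one boundary still appears whole in the adjacent chunk.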

Late Chunking Technique

  • Late chunking reverses the usual order: a long-context embedding model embeds the whole document first, and chunk boundaries are applied afterwards.
  • All chunk vectors are therefore created with full-document context, preserving references between chunks.
  • Pooling/aggregation averages the token vectors within each chunk to produce the single vector stored in the vector database.
  • Jina AI's embedding models support roughly 8,000 tokens of context; other long-context embedding models go up to 32,000 tokens (see the sketch below).
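
A minimal late-chunking sketch, assuming a Hugging Face long-context embedding model (the model name, the trust_remote_code flag, and the character spans are illustrative assumptions):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: a long-context (~8K-token) embedding model with a fast tokenizer.
MODEL = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk(document: str, spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Embed the whole document once, then mean-pool token vectors per chunk span."""
    enc = tokenizer(document, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0]  # per-token (start, end) char offsets
    with torch.no_grad():
        token_vecs = model(**enc).last_hidden_state[0]  # contextualized token vectors
    chunk_vecs = []
    for start, end in spans:
        # keep tokens whose character span overlaps this chunk's span
        mask = (offsets[:, 0] < end) & (offsets[:, 1] > start)
        chunk_vecs.append(token_vecs[mask].mean(dim=0))
    return chunk_vecs

# Usage: character spans would normally come from your chunker.
vecs = late_chunk("Berlin is the capital of Germany. It has 3.8M residents.",
                  [(0, 33), (34, 57)])
```

Because every token vector is computed with attention over the full document, the pronoun "It" in the second chunk still carries information about Berlin.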

Contextual Retrieval with Prompt Caching

  • Proposed by Anthropic, this method uses an LLM to generate a short description situating each chunk within the full document.
  • The chunk and its descriptive blurb are embedded together, anchoring content in broader context.
  • Enables more accurate answers and reduces hallucinations, especially for longer documents.
  • Ingestion can be slow and costly because every chunk's LLM call includes the full document in its context window.
  • Prompt caching reduces this cost: the document is cached once and referenced on each chunk's call (see the sketch below).
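
A sketch of the ingestion step with Anthropic's prompt caching (the model alias and prompt wording are assumptions; the cache_control block is what marks the document for reuse):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize(document: str, chunk: str) -> str:
    """Ask the LLM for a one-sentence blurb situating `chunk` within `document`."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumption: any capable model works
        max_tokens=100,
        system=[{
            "type": "text",
            "text": f"<document>\n{document}\n</document>",
            "cache_control": {"type": "ephemeral"},  # cache the document across calls
        }],
        messages=[{
            "role": "user",
            "content": ("Here is a chunk from the document above:\n"
                        f"<chunk>\n{chunk}\n</chunk>\n"
                        "Give one short sentence situating this chunk within "
                        "the overall document. Answer with only that sentence."),
        }],
    )
    return response.content[0].text

# At ingestion time, embed the blurb together with the chunk:
#   embedding_input = contextualize(doc, chunk) + "\n" + chunk
```

Because the system block carrying the document is identical across chunks, only the first call pays the full input cost; subsequent calls hit the cache.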

Practical Implementation & Evaluation

  • Custom code and workflows are required to integrate these advanced chunking strategies into tools like n8n.
  • Both late chunking and contextual retrieval show improved retrieval quality over standard RAG; verify this on your own data (a benchmark sketch follows this list).
  • Rate limits, token costs, and speed are key challenges when working with large documents.
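
For the evaluation itself, a simple hit-rate benchmark is enough to compare chunking strategies (a minimal sketch; `embed` and the labeled queries are stand-ins for your own pipeline and data):

```python
import numpy as np

def hit_rate_at_k(query_vecs: np.ndarray, chunk_vecs: np.ndarray,
                  gold: list[int], k: int = 5) -> float:
    """Fraction of queries whose gold chunk index appears in the top-k results."""
    # cosine similarity via normalized dot products
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    topk = np.argsort(-(q @ c.T), axis=1)[:, :k]
    return float(np.mean([g in row for g, row in zip(gold, topk)]))

# Run the same queries against stores built with each chunking strategy:
#   hit_rate_at_k(embed(queries), embed(naive_chunks), gold_ids)
#   hit_rate_at_k(embed(queries), embed(late_chunks), gold_ids)
```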

Key Terms & Definitions

  • RAG (Retrieval-Augmented Generation) — An approach that retrieves external knowledge for an LLM to ground and improve its answers.
  • Chunking — Splitting documents into smaller parts for processing and embedding.
  • Embedding — Converting text chunks into vector representations for similarity search.
  • Vector Store/Database — Storage for embedding vectors to enable fast retrieval.
  • Late Chunking — Embedding the whole document first, then deriving chunk vectors so each one carries full-document context.
  • Contextual Retrieval — Adding LLM-generated context descriptions to each chunk before embedding.
  • Prompt Caching — Caching a large, repeated prompt prefix (such as a full document) on the provider side so it is not re-processed at full cost on every call.

Action Items / Next Steps

  • Experiment with late chunking using long-context embedding models for your RAG pipeline.
  • Try contextual retrieval by generating context blurbs with an LLM and adding them to chunk embeddings.
  • Be mindful of rate limits and token costs when processing large documents.
  • Evaluate retrieval quality using your own benchmarks and adjust chunking strategies as needed.