Overview
This lecture explains how to build a multimodal RAG (Retrieval-Augmented Generation) agent that can extract, analyze, and index text, images, and tables from complex PDFs at scale, enabling advanced chat interactions over the data.
Multimodal RAG Agent Workflow
- Use OCR (Optical Character Recognition) to extract text and embedded media from PDFs, covering both scanned and machine-readable documents.
- Mistral's OCR API returns markdown plus arrays of extracted images/charts, and can apply an AI vision model to analyze and annotate each image.
- Store the OCR output (text, images, annotations) on a backend, using Supabase for storage.
- Chunk the text (with image annotations inlined), then convert each chunk to a vector using an embedding model.
- Store vectors in a vector database for efficient retrieval during queries.
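The OCR-with-annotations step above can be sketched as a request-body builder. The field names (`model`, `document`, `bbox_annotation_format`) follow the shape of Mistral's OCR API as described here, but treat them as assumptions and verify against the current API reference before use:

```javascript
// Sketch of the request body sent to Mistral's OCR endpoint.
// Field names are assumptions based on the workflow described above;
// check Mistral's API reference for the authoritative schema.
function buildOcrRequest(signedUrl) {
  return {
    model: "mistral-ocr-latest",
    document: { type: "document_url", document_url: signedUrl },
    include_image_base64: true,
    // Ask the vision model to annotate each extracted image/chart,
    // returning a structured description per bounding box.
    bbox_annotation_format: {
      type: "json_schema",
      json_schema: {
        name: "image_annotation",
        schema: {
          type: "object",
          properties: { description: { type: "string" } },
          required: ["description"],
        },
      },
    },
  };
}
```

Requesting annotations up front means each image arrives with a text description, which is what later makes images searchable alongside the document text.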
Data Ingestion & Processing
- Retrieve PDF via HTTP request and store its binary data.
- Set up an account with Mistral for OCR and annotation API access.
- Upload PDFs with API key authentication and obtain a signed URL for secure access.
- Fetch OCR results in JSON format, ensuring image annotations are included in the request schema.
- Use JavaScript code to insert image annotations directly into markdown for improved context.
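The annotation-insertion step can be sketched as a small pure function. It assumes OCR output where each page's markdown references images by id (e.g. `![img-0.jpeg](img-0.jpeg)`) and each image object carries an AI-generated annotation; the property names here are illustrative, not Mistral's exact schema:

```javascript
// Minimal sketch: append each image's AI-generated annotation directly
// after its markdown reference, so chunks embedded later keep the
// visual context. `id`/`annotation` property names are illustrative.
function annotateMarkdown(markdown, images) {
  let out = markdown;
  for (const img of images) {
    out = out.replaceAll(
      `![${img.id}](${img.id})`,
      `![${img.id}](${img.id})\n\n*Image annotation: ${img.annotation}*`
    );
  }
  return out;
}
```

Keeping the annotation adjacent to the image reference means a text splitter will usually keep the two in the same chunk, so retrieval can surface both the image URL and its description together.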
Storing & Embedding Data
- Split OCR output into pages, then process images individually.
- Upload extracted images to Supabase storage, making images accessible via public URLs.
- Replace inline image references in markdown with Supabase URLs and their corresponding AI-generated annotations.
- Use a text splitter to chunk the markdown, then embed each chunk with an embedding model (e.g., OpenAI's text-embedding-3-small).
- Upload text/image embeddings as vectors to the Supabase vector database.
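The splitter step can be sketched as a simple character-based chunker with overlap; real pipelines often split on markdown structure instead, and the `chunkSize`/`overlap` values here are illustrative, not the workflow's actual settings:

```javascript
// Minimal character-based text splitter with overlap. Overlap keeps
// sentences that straddle a chunk boundary retrievable from either side.
function splitText(text, chunkSize = 1000, overlap = 200) {
  if (chunkSize <= overlap) throw new Error("chunkSize must exceed overlap");
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // step forward, re-covering the overlap
  }
  return chunks;
}
```

Each resulting chunk is then sent to the embedding model, and the vector plus the original chunk text is upserted into the Supabase vector table.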
Querying the Data
- Integrate an AI agent (e.g., OpenAI GPT-4.1) that queries the vector store using embedded user queries.
- Retrieve top relevant vector matches, including image URLs and annotations, to compose responses.
- Configure the LLM with a system prompt: only answer using retrieved data; respond with "I don't know" if insufficient information is found.
- Enable chat interface for user interaction, retrieving data and rendering relevant images inline.
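The retrieval step above can be illustrated locally: embed the user query, rank stored vectors by cosine similarity, and take the top-k matches. In the actual workflow Supabase's vector search performs this server-side; this sketch just shows the math, and the `records` shape (text plus vector, optionally image URL and annotation) is an assumption:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored records against the query vector and return the top-k.
function topK(queryVec, records, k = 4) {
  return records
    .map((r) => ({ ...r, score: cosine(queryVec, r.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Because image URLs and annotations are stored alongside each chunk, the top-k results give the agent everything it needs to cite text and render the matching images inline.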
Advanced Features & Optimization
- Expand the workflow with more advanced data ingestion pipelines and hybrid RAG strategies.
- Use aggregation and merging logic to combine image files, annotations, and markdown for robust context.
- Allow public or private access to Supabase storage buckets based on project needs.
Key Terms & Definitions
- RAG (Retrieval-Augmented Generation) — A method where an LLM retrieves context from a database to improve answer relevance.
- OCR (Optical Character Recognition) — Technology for extracting text (and, in this workflow, embedded images) from scanned or digitally generated documents.
- Embedding — Conversion of data (text/images) into vector representations for similarity search in databases.
- Vector Database — A database optimized for storing and querying high-dimensional vectors.
- Supabase — An open-source backend platform providing storage and vector database features.
Action Items / Next Steps
- Set up Mistral and Supabase accounts and obtain necessary API keys.
- Configure data ingestion and annotation workflow as demonstrated.
- Explore additional advanced RAG pipelines and check related blueprints in the community resources.