Multimodal RAG Workflow

Jun 27, 2025

Overview

This lecture explains how to build a multimodal RAG (Retrieval-Augmented Generation) agent that can extract, analyze, and index text, images, and tables from complex PDFs at scale, enabling advanced chat interactions over the data.

Multimodal RAG Agent Workflow

  • Use OCR (Optical Character Recognition) to extract both text and media annotations from PDFs, including scanned and machine-readable documents.
  • Mistral's OCR API returns markdown plus arrays of extracted images and charts, and uses an AI vision model to analyze and annotate each image.
  • Store the OCR output (text, images, annotations) on a backend, using Supabase for storage.
  • Chunk the text and annotate the images, then convert each chunk to a vector with an embedding model.
  • Store vectors in a vector database for efficient retrieval during queries.

Data Ingestion & Processing

  • Retrieve PDF via HTTP request and store its binary data.
  • Set up an account with Mistral for OCR and annotation API access.
  • Upload PDFs with API key authentication and obtain a signed URL for secure access.
  • Fetch OCR results in JSON format, ensuring image annotations are included in the request schema.
  • Use JavaScript code to insert image annotations directly into markdown for improved context.
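The annotation-insertion step above can be sketched in JavaScript. The response shape assumed here (a per-page markdown string plus an images array with id and annotation fields, where the markdown references each image as ![id](id)) is an illustration, not the exact Mistral OCR schema:

```javascript
// Sketch: merge per-image AI annotations into OCR markdown so each
// image reference carries its description as surrounding context.
function annotateMarkdown(markdown, images) {
  let result = markdown;
  for (const img of images) {
    // Append the annotation directly after the image reference.
    const ref = `![${img.id}](${img.id})`;
    result = result.split(ref).join(`${ref}\n*${img.annotation}*`);
  }
  return result;
}

// Hypothetical page object mimicking one page of OCR output.
const page = {
  markdown: "Revenue grew steadily.\n\n![img-0.png](img-0.png)",
  images: [{ id: "img-0.png", annotation: "Bar chart of quarterly revenue." }],
};
console.log(annotateMarkdown(page.markdown, page.images));
```

Keeping the annotation adjacent to the image reference means later chunking is likely to keep the image and its description in the same chunk.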

Storing & Embedding Data

  • Split OCR output into pages, then process images individually.
  • Upload extracted images to Supabase storage, making images accessible via public URLs.
  • Replace inline image references in markdown with Supabase URLs and their corresponding AI-generated annotations.
  • Use a text splitter to chunk markdown and embed using an embedding model (e.g., OpenAI's text-embedding-3-small).
  • Upload text/image embeddings as vectors to the Supabase vector database.
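A minimal character-based splitter with overlap can stand in for the text splitter used before embedding; the chunkSize and overlap defaults here are illustrative values, not the lecture's settings:

```javascript
// Sketch: split markdown into overlapping chunks before embedding.
// overlap must be smaller than chunkSize, or the loop would not advance.
function splitText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    // Step forward, re-covering the last `overlap` characters so
    // context is shared between neighboring chunks.
    start += chunkSize - overlap;
  }
  return chunks;
}

const chunks = splitText("a".repeat(2500));
console.log(chunks.length); // 3 chunks of lengths 1000, 1000, 900
```

Each resulting chunk would then be sent to the embedding model (e.g., text-embedding-3-small) and stored with its vector.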

Querying the Data

  • Integrate an AI agent (e.g., OpenAI GPT-4.1) that embeds each user query and searches the vector store with the resulting vector.
  • Retrieve top relevant vector matches, including image URLs and annotations, to compose responses.
  • Configure the LLM with a system prompt: only answer using retrieved data; respond with "I don't know" if insufficient information is found.
  • Enable chat interface for user interaction, retrieving data and rendering relevant images inline.
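Retrieval itself reduces to nearest-neighbor search over the stored vectors. A toy sketch of cosine-similarity top-k matching (the vector database performs this internally; the 3-dimensional vectors below are stand-ins for real embeddings, which have on the order of a thousand dimensions):

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score every stored record against the query vector and keep the top k.
function topK(queryVec, records, k) {
  return records
    .map((r) => ({ ...r, score: cosine(queryVec, r.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hypothetical stored chunks (text plus embedding).
const records = [
  { text: "chart: quarterly revenue", vector: [0.9, 0.1, 0.0] },
  { text: "table: employee head count", vector: [0.0, 1.0, 0.2] },
  { text: "intro paragraph", vector: [0.2, 0.2, 0.9] },
];
const hits = topK([1, 0, 0], records, 2);
console.log(hits[0].text); // "chart: quarterly revenue"
```

The retrieved records (including any image URLs and annotations stored alongside the text) are then passed to the LLM as context for composing the answer.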

Advanced Features & Optimization

  • Expand the workflow with more advanced data ingestion pipelines and hybrid RAG strategies.
  • Use aggregation and merging logic to combine image files, annotations, and markdown for robust context.
  • Allow public or private access to Supabase storage buckets based on project needs.
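The merging logic above can be sketched as a single pass over the markdown that swaps each inline image reference for its Supabase public URL and appends the image's AI-generated annotation. The PUBLIC_BASE bucket path and the annotations map are hypothetical, for illustration only:

```javascript
// Hypothetical public bucket base URL (depends on project and bucket name).
const PUBLIC_BASE = "https://example.supabase.co/storage/v1/object/public/rag-images";

// Rewrite every markdown image reference ![alt](file) to point at the
// public URL, appending the annotation for that file when one exists.
function rewriteImages(markdown, annotations) {
  return markdown.replace(/!\[([^\]]*)\]\(([^)]+)\)/g, (match, alt, file) => {
    const url = `${PUBLIC_BASE}/${file}`;
    const note = annotations[file] ? `\n*${annotations[file]}*` : "";
    return `![${alt}](${url})${note}`;
  });
}

console.log(
  rewriteImages("![chart](img-0.png)", { "img-0.png": "Quarterly revenue by region." })
);
```

If the bucket is private instead, the same function could emit signed URLs; the trade-off is that signed URLs expire and must be regenerated at render time.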

Key Terms & Definitions

  • RAG (Retrieval-Augmented Generation) — A technique in which relevant context is retrieved from a database and supplied to an LLM to improve answer relevance.
  • OCR (Optical Character Recognition) — Technology that extracts machine-readable text from scanned or image-based documents.
  • Embedding — Conversion of data (text/images) into vector representations for similarity search in databases.
  • Vector Database — A database optimized for storing and querying high-dimensional vectors.
  • Supabase — An open-source backend platform providing storage and vector database features.

Action Items / Next Steps

  • Set up Mistral and Supabase accounts and obtain necessary API keys.
  • Configure data ingestion and annotation workflow as demonstrated.
  • Explore additional advanced RAG pipelines and check related blueprints in the community resources.