
Building Multimodal RAG Systems

Jul 11, 2024

Building Multimodal RAG Systems: Lecture Notes

Introduction

  • Problem: Traditional RAG systems focus primarily on text.
  • Example: Wikipedia entries contain text, images, and tables.
  • Goal: Develop systems that can retrieve text, images, and tables.

Methods Overview

  1. Single Vector Space Embeddings

    • Extract text and images separately.
    • Embed both into one unified vector space (e.g., with the CLIP model).
    • Retrieve documents based on the unified embeddings.
  2. Convert All Modalities to a Primary Modality (Text)

    • Convert non-text modalities (images, tables, etc.) into text.
    • Use vision-capable models like GPT-4 or Claude to generate text descriptions of images (a minimal sketch follows this list).
    • Embed and store the resulting text in a unified vector space and retrieve accordingly.
  3. Separate Vector Stores for Each Modality

    • Maintain individual vector stores for text and images.
    • Embed the query in both the text and image spaces and retrieve from each store.
    • Use a multimodal re-ranker for the final relevance check.
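As a concrete illustration of method 2, here is a minimal sketch using the OpenAI Python SDK; the model names, prompt, and file name are illustrative assumptions rather than code from the lecture:

```python
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_image(path: str) -> str:
    """Ask a vision-capable model for a short text description of an image."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in two sentences."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Embed the description like any other text chunk.
caption = describe_image("robocop_poster.jpg")  # hypothetical local file
embedding = client.embeddings.create(
    model="text-embedding-3-small", input=caption
).data[0].embedding
```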

Focus of the Lecture: Single Vector Space Embeddings

  • Approach: Use the CLIP model to generate a unified vector space.
  • Initial Target: Wikipedia pages; the approach can be adapted for PDFs and Word documents.
  • Process: Extract text and images separately, generate embeddings, and store them in a unified vector store.

Introduction to CLIP Model

  • Full Form: Contrastive Language-Image Pre-training
  • Developer: OpenAI (2021)
  • Functionality: Trained on image-text pairs; maps images and text into a shared embedding space, so related content lands close together regardless of modality (sketch below).
  • Advancement: OpenCLIP, an open-source variant trained on more extensive public data.
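A minimal sketch of the shared embedding space, assuming the Hugging Face transformers implementation of CLIP (the lecture itself accesses CLIP through LlamaIndex):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("poster.jpg")  # hypothetical local image
texts = ["a movie poster", "a political rally", "a rocket launch"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image
# and each text; softmax turns them into relative match probabilities.
print(outputs.logits_per_image.softmax(dim=-1))
```

The same geometry that ranks these captions against the image is what lets a text query retrieve images in a multimodal vector store.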

Technical Implementation

  • Data Source: Wikipedia articles.
  • Tools Used: LlamaIndex (LangChain in subsequent videos).
  • Data Processing:
    • Download text and images separately from Wikipedia articles (see the download sketch after this list).
    • Example articles: RoboCop, Labour Party (UK), SpaceX, OpenAI.
    • Use CLIP for image embeddings and an OpenAI embedding model for text embeddings.
    • Store both in a unified multimodal vector store backed by Qdrant.
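A minimal download sketch, assuming the `wikipedia` and `requests` packages; the article titles come from the lecture, while the directory layout and file naming are illustrative:

```python
from pathlib import Path

import requests
import wikipedia  # pip install wikipedia

titles = ["RoboCop", "Labour Party (UK)", "SpaceX", "OpenAI"]
data_dir = Path("wiki_data")
data_dir.mkdir(exist_ok=True)

for title in titles:
    page = wikipedia.page(title, auto_suggest=False)
    # Save the article text for the text-embedding pipeline.
    (data_dir / f"{title}.txt").write_text(page.content, encoding="utf-8")
    # Save the article's images for the CLIP image-embedding pipeline.
    for i, url in enumerate(page.images):
        if url.lower().endswith((".jpg", ".jpeg", ".png")):
            resp = requests.get(
                url, headers={"User-Agent": "mm-rag-demo/0.1"}, timeout=30
            )
            (data_dir / f"{title}_{i}{Path(url).suffix}").write_bytes(resp.content)
```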

Code Example Overview

  1. Initial Set-up

    • Install the necessary packages: LlamaIndex, the Qdrant client, and OpenAI's CLIP.
    • Download text and images from Wikipedia (an end-to-end sketch follows this list).
  2. Embeddings and Vector Store Creation

    • Generate separate embeddings for text and images.
    • Combine them into a single multimodal vector store.
    • Example: Wikipedia articles on RoboCop, Labour Party, SpaceX, OpenAI.
  3. Retrieval Process

    • Example Queries: "What is the Labour Party?", "Who created RoboCop?"
    • Retrieve top text chunks and images based on embeddings.
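An end-to-end sketch of the three steps above, assuming a recent llama-index with the llama-index-vector-stores-qdrant and llama-index-embeddings-clip integrations plus an OpenAI API key; collection names and paths are illustrative:

```python
# pip install llama-index llama-index-vector-stores-qdrant \
#             llama-index-embeddings-clip qdrant-client
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Local on-disk Qdrant instance with one collection per modality.
client = qdrant_client.QdrantClient(path="qdrant_mm_db")
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Load the downloaded Wikipedia text files and images from one folder.
documents = SimpleDirectoryReader("wiki_data").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

# Retrieve the top text chunks and images for a query.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
for result in retriever.retrieve("Who created RoboCop?"):
    print(result.score, result.node.metadata.get("file_path"))
```

By default this index embeds text with an OpenAI embedding model and images with CLIP, matching the setup described above.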

Examples and Results

  • Queries generate both text chunks and images as responses.
    • Labour Party: Retrieved descriptions of the party and images of political figures.
    • RoboCop: Retrieved movie information and related images.
    • OpenAI & SpaceX: Retrieved relevant text chunks and images, although some results were irrelevant due to embedding limitations.

Conclusion and Next Steps

  • Current Focus: The retrieval stage of the pipeline.
  • Future Work: Combine the retrieved chunks and images to generate a final response.
  • Upcoming Videos: Advanced solutions and end-to-end system development.
  • Recommendation: Subscribe to follow the upcoming videos for comprehensive coverage.