Vector Databases Lecture Notes

Jul 3, 2024

Lecture on Vector Databases

Introduction

  • Problem: Lack of resources explaining the internal workings of vector databases.
  • Purpose: Provide a simplified explanation of how vector databases function, especially in the context of AI applications.

Objectives

  1. Understand how vector databases work internally.
  2. Example using a vector database with Python code.
  3. Recommendations and sources for further reading.

Facebook AI Similarity Search (FAISS)

  • Library developed by Facebook AI for fast similarity search.
  • Resources: Read the article on Pinecone's website, "Introduction to Facebook AI Similarity Search."
  • Pinecone: A popular vector database for storing embeddings; it supports serverless indexes.
  • FAISS Paper: Technical details on the FAISS library, written by Facebook's AI research team.

Example Code Walkthrough

Dataset Preparation

  • Data: A CSV file containing 1,000 random sentences that stand in for book summaries.

  • Goal: Enable semantic search for book summaries beyond keyword search.

  • Embeddings: OpenAI Embeddings API is used to convert sentences into embeddings (vectors of 1,536 dimensions).

  • Loading Data: Using the pandas library to load and display the dataset (a loading sketch follows the embedding function below).

  • Generating Embeddings: A custom helper function calls the OpenAI API to generate an embedding for each sentence.

    def get_embedding(sentence):
        # Generate an embedding for one sentence with the OpenAI Embeddings API
        # (assumes client = OpenAI() is configured; the model here is one 1,536-dimension option)
        response = client.embeddings.create(input=sentence, model="text-embedding-3-small")
        return response.data[0].embedding

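  • Example sketch of the loading and embedding steps (the file name and column name are assumptions, not from the lecture):

    import numpy as np
    import pandas as pd

    # Load the 1,000 example sentences and inspect the first rows
    df = pd.read_csv('book_summaries.csv')
    print(df.head())

    # Build a (1000, 1536) float32 matrix of embeddings, one row per sentence
    # (float32 is the dtype FAISS expects later on)
    embeddings = np.array([get_embedding(s) for s in df['sentence']], dtype='float32')
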
Vector Store Indexing

  • Concept: Use embeddings to facilitate semantic search, comparing a query embedding with existing embeddings to find similar items.
  • Cosine Similarity: Measure used to compare vectors; vectors pointing in similar directions score closer to 1 (see the sketch below).
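  • Example sketch: scoring one query embedding against all stored embeddings with NumPy (the query text is illustrative, and the embeddings matrix comes from the earlier sketch):

    import numpy as np

    def cosine_similarity(a, b):
        # Dot product of the two vectors divided by the product of their norms
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # query_embedding: a single 1,536-dimension vector for the user's query
    query_embedding = np.array(get_embedding('a story about space exploration'), dtype='float32')
    scores = [cosine_similarity(query_embedding, e) for e in embeddings]
    top_3 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]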

Using FAISS Algorithms

Flat L2 Index

  • Index Type: IndexFlatL2

  • Characteristic: Stores all vectors and searches exhaustively, comparing the query vector to every stored vector (accurate but not scalable to large datasets); a search sketch follows the snippet below.

    index = faiss.IndexFlatL2(d)   # d = 1536, the embedding dimension
    index.add(embeddings)          # store the full float32 embedding matrix
    
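  • Search sketch (the query text is illustrative; df, np, and get_embedding come from the earlier dataset sketch):

    # Embed the query and retrieve its 4 nearest stored vectors
    xq = np.array([get_embedding('a story about space exploration')], dtype='float32')
    D, I = index.search(xq, 4)        # D: squared L2 distances, I: row indices into embeddings
    print(df['sentence'].iloc[I[0]])  # the 4 most similar book summaries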

IVF Flat Index

  • Improvement: Uses Inverted File (IVF) indexing to cluster vectors into cells (a Voronoi diagram), so a search only visits the cells closest to the query.

  • Algorithm: IndexIVFFlat

  • Training Required: The index must be trained on the embeddings (or a representative subset) to learn the cluster centroids.

  • nprobe: Parameter that extends the search to neighboring cells, trading speed for recall (see the sketch after the snippet below).

    quantizer = faiss.IndexFlatL2(d)                 # coarse quantizer that assigns vectors to cells
    index = faiss.IndexIVFFlat(quantizer, d, nlist)  # nlist: number of Voronoi cells
    index.train(embeddings)                          # learn the cell centroids
    index.add(embeddings)
    
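  • Search sketch with nprobe (the value 8 is an arbitrary illustration; xq is the query matrix from the Flat L2 example):

    index.nprobe = 8            # probe the 8 closest cells instead of only the nearest one
    D, I = index.search(xq, 4)  # larger nprobe: better recall, slower search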

IVF PQ Index

  • Compression: Uses Product Quantization (PQ) to compress each vector into a short code, trading a little accuracy for much lower memory use and faster search.

  • Algorithm: IndexIVFPQ

  • Training Required: Similar process to IVF Flat, with the PQ codebooks also learned during training (an illustrative parameter sketch follows the snippet below).

    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)  # m sub-vectors, nbits bits per code
    index.train(embeddings)
    index.add(embeddings)
    
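  • Parameter sketch (the values are illustrative, not from the lecture; d must be divisible by m, and FAISS may warn that 1,000 vectors is a small training set):

    d = 1536         # embedding dimension
    nlist = 10       # number of Voronoi cells (kept small for a 1,000-vector dataset)
    m = 8            # split each vector into 8 sub-vectors of 192 dimensions each
    nbits = 8        # encode each sub-vector with 8 bits, i.e. 256 centroids per sub-quantizer
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)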

Using Pinecone for Vector Storage

  • Setup: Requires a Pinecone API key and account setup.

  • Serverless Database: Lets you create and manage vector indexes without provisioning or managing infrastructure.

  • Index Creation: Define the index name, dimension, and similarity metric (e.g., cosine).

    # Current Pinecone Python client (v3+); the cloud/region values are illustrative
    from pinecone import Pinecone, ServerlessSpec
    pc = Pinecone(api_key='YOUR_API_KEY')
    pc.create_index(name='example_index', dimension=d, metric='cosine',
                    spec=ServerlessSpec(cloud='aws', region='us-east-1'))
    
  • Data Insertion & Query: Supports batch upsert operations for efficient data insertion, plus similarity search with metadata retrieval (a batching sketch follows the snippet below).

    index = pc.Index('example_index')
    index.upsert(vectors=vectors)  # vectors: a list of (id, values, metadata) tuples
    # query_vector: the embedding of the user's query sentence
    results = index.query(vector=query_vector, top_k=3, include_metadata=True)
    
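  • Batch upsert sketch (the batch size, ID scheme, and metadata field are illustrative assumptions; df and embeddings come from the earlier sketches):

    # Each record is (id, values, metadata); the metadata lets a query return the original sentence
    records = [(str(i), embeddings[i].tolist(), {'summary': df['sentence'][i]})
               for i in range(len(df))]
    for start in range(0, len(records), 100):
        index.upsert(vectors=records[start:start + 100])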

Conclusion

  • Vector Databases: Essential for AI applications involving semantic search and large-scale data retrieval.
  • FAISS & Pinecone: Powerful tools for implementing efficient similarity search and data storage.
  • Trade-offs: Between search speed and accuracy, depending on the indexing method used.

Additional Recommendations

  • FAISS Library: Read its detailed paper for a comprehensive understanding.
  • Pinecone Website: Check for more resources and tutorials.
  • Serverless Options: Utilize for flexibility and ease of use.

Final Note

  • Vector databases and their underlying algorithms provide a fast, efficient way to handle semantic search in AI applications. Understanding how they work internally leads to deeper insight and better implementation choices.