
Vector Database Lecture

Jul 14, 2024

Vector Database Lecture Notes

Introduction

  • Presenter: B Ahmed
  • Topic: Vector Database
  • Overview: Discussion on what vector databases are, why they are used, practical demos with ChromaDB, Pinecone, and Weaviate, and integration with large language models (LLMs).

Video Outline

  1. What is a Vector Database?
  2. Why do we need Vector Databases?
  3. How do Vector Databases work?
  4. Use Cases of Vector Databases
  5. Popular Vector Databases
  6. Practical Demos: Using ChromaDB, Pinecone, and Weaviate

Vector Database Overview

  • Definition: A database used for storing high-dimensional vectors like word embeddings or image embeddings.
  • Example Data Types: Documents, images, PDFs.
  • Purpose: Store unstructured data as vectors (produced by an embedding model) so it can be stored efficiently and retrieved by similarity.

Why Use Vector Databases?

  • Unstructured Data: An estimated 80-85% of data is unstructured (images, videos, text, audio), according to the figure cited in the lecture.
  • Relational Database Limitations: Traditional databases aren't efficient for unstructured data storage and querying.
  • Example: Storing and querying images in relational databases requires manual schema definitions and labels, which is inefficient.

Introducing Vector Embeddings

  • Unstructured Data Types: Images, text, audio, video.
  • Embedding Model: A neural-network-based model that converts data into numerical vector representations.
    • Examples: Word2Vec, Transformer models, OpenAI embeddings, Hugging Face embeddings, LLaMA embeddings, Google PaLM embeddings.
  • Vector Database: Stores vector embeddings and allows similarity search.
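
To make the store-and-search idea concrete, here is a minimal in-memory sketch. It is not how ChromaDB, Pinecone, or Weaviate are implemented. The names toy_embed and TinyVectorStore are hypothetical, and the hashed bag-of-words "embedding" is only a stand-in for a real embedding model: it captures word overlap, not meaning.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Hash each word into one of `dim` buckets, then L2-normalise the counts.

    Stand-in for a real embedding model; it reflects shared words only.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class TinyVectorStore:
    """Keeps (id, vector, text) records in memory and ranks them by cosine similarity."""

    def __init__(self) -> None:
        self._records: list[tuple[str, list[float], str]] = []

    def add(self, doc_id: str, text: str) -> None:
        self._records.append((doc_id, toy_embed(text), text))

    def query(self, text: str, k: int = 2) -> list[tuple[str, float]]:
        q = toy_embed(text)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (doc_id, sum(a * b for a, b in zip(q, vec)))
            for doc_id, vec, _ in self._records
        ]
        # Brute-force scan: every stored vector is compared with the query.
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = TinyVectorStore()
store.add("doc1", "vector databases store embeddings")
store.add("doc2", "relational databases store rows and columns")
store.add("doc3", "cats sleep most of the day")
results = store.query("how do vector databases store embeddings", k=2)
print(results)
```

The query scans every record, which is fine for a handful of documents. Real vector databases add an index so they do not have to compare against every stored vector.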

Practical Demo Plan

  • Tools: ChromaDB, Pinecone, Weaviate, Python, LangChain.
  • Steps: Create embeddings, store in vector databases, perform tasks using LLMs.

In-Depth Vector Database Explanation

  • Embedding Example: Converting words like King, Queen, Man, and Woman into vectors based on features like gender, wealth, power, weight, and speaking ability.
  • Vector Calculation: Similarity between vectors is measured with cosine similarity, Euclidean distance, or Manhattan distance.
  • Indexing: Data structures that speed up similarity search by avoiding a comparison of the query against every stored vector.
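
The three measures can be sketched on the King/Queen/Man/Woman example. The feature scores below are invented for illustration (only gender, wealth, and power are used), and the helper name nearest is hypothetical. A real embedding model learns its dimensions rather than having them assigned by hand.

```python
import math

# Hypothetical hand-made scores for (gender, wealth, power), illustrative only.
vectors = {
    "king":  [1.0, 1.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "man":   [1.0, 0.2, 0.1],
    "woman": [0.0, 0.2, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(word, distance=euclidean_distance):
    """Return the other word with the smallest distance to `word`."""
    others = (w for w in vectors if w != word)
    return min(others, key=lambda w: distance(vectors[word], vectors[w]))


print(nearest("king"), nearest("man", distance=manhattan_distance))
```

Cosine similarity is a similarity (higher is closer), while the other two are distances (lower is closer), so a search has to sort in the matching direction.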

Use Cases of Vector Databases

  1. Long-term memory for LLMs: Store embeddings of earlier conversations or documents so an LLM can retrieve relevant context later.
  2. Semantic Search: Search based on meanings rather than keywords.
  3. Similarity Search: Find similar vectors for text, images, videos, audio.
  4. Recommendation System: Recommend similar items based on vector similarity.

Popular Vector Databases

  1. ChromaDB: Local vector database.
  2. Pinecone: Cloud-based vector database (requires subscription for multiple clusters).
  3. Weaviate: Cloud-based vector database with extensive features. Also supports JavaScript integration.

ChromaDB Practical Demo

  • Installation & Setup: Using Google Colab to install the necessary packages (chromadb, openai, langchain, tiktoken).
  • Data Loading: Download data from Dropbox and unzip.
  • API Key Setup: Collect API key from OpenAI and set environment variables.
  • Data Processing: Load, split into chunks, and embed data with OpenAI embeddings.
  • Vector Storage: Store embeddings in ChromaDB and demonstrate querying.
  • Similarity Search: Retrieve the chunks most similar to the query, then use an LLM to refine them into a specific answer.
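
The "split into chunks" step can be sketched in plain Python. The notes do not say which splitter or settings the demo used, so the function name split_into_chunks, the character-based sizing, and the default values are all assumptions.

```python
def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbours.

    The overlap keeps a sentence that straddles a boundary visible in both chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_into_chunks("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk would then be embedded and stored as its own record, so a query retrieves only the passages relevant to it rather than a whole document.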

Pinecone Practical Demo

  • Setup: Account creation, API key collection, and cluster creation on Pinecone.
  • Data Loading: Loading PDF data, splitting into chunks, and creating embeddings with OpenAI embeddings.
  • Storage & Retrieval: Store vectors in Pinecone, demonstrate similarity search and querying with LLM.
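
Querying with an LLM usually means placing the retrieved chunks into the prompt. The sketch below shows only that assembly step. The function name build_prompt, the prompt wording, and the max_chars budget are assumptions, and neither the Pinecone query nor the LLM call is shown.

```python
def build_prompt(question: str, retrieved_chunks: list[str], max_chars: int = 2000) -> str:
    """Assemble a prompt that asks the LLM to answer from the retrieved chunks only."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(retrieved_chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # keep the context within a rough size budget
        context_parts.append(entry)
        used += len(entry)

    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# The chunks here are placeholders for results returned by a similarity search.
prompt = build_prompt(
    "What does a vector database store?",
    ["A vector database stores vector embeddings. ", "It allows similarity search."],
)
print(prompt)
```

The returned string would be sent to the LLM. Numbering the chunks makes it possible to ask the model which passage supports its answer.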

Weaviate Practical Demo

  • Setup: Account creation, API keys, and cluster setup on Weaviate.
  • Data Loading: Extract text from PDF, chunking, and embedding creation.
  • Vector Storage & Retrieval: Store in Weaviate, use similarity search and LLM for querying.

Conclusion

  • Key Takeaways: Vector databases are essential for handling, storing, and querying unstructured data. They offer significant advantages over traditional databases for tasks involving high-dimensional data.
  • Future Work: Plan to create end-to-end projects using vector databases and LLM integration.