Jul 3, 2024
Data: CSV file containing 1,000 sentences (random sentences standing in for book summaries).
Goal: Enable semantic search for book summaries beyond keyword search.
Embeddings: OpenAI Embeddings API is used to convert sentences into embeddings (vectors of 1,536 dimensions).
Loading Data: Using pandas library to load and display the dataset.
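A minimal loading step might look like the sketch below (the file and column names are assumptions, not from the source):

```python
import pandas as pd

df = pd.read_csv("book_summaries.csv")   # hypothetical file name
print(df.head())                          # quick look at the sentences
```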
Generating Embeddings: Custom function using OpenAI API to generate embeddings.
from openai import OpenAI
client = OpenAI()  # expects OPENAI_API_KEY in the environment

def get_embedding(sentence):
    # Embed one sentence via the OpenAI Embeddings API (model name assumed; returns a 1,536-dim vector)
    return client.embeddings.create(model="text-embedding-3-small", input=sentence).data[0].embedding
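To feed the FAISS examples below, the per-sentence embeddings can be stacked into a float32 NumPy matrix; a sketch assuming a `sentence` column:

```python
import numpy as np

df["embedding"] = df["sentence"].apply(get_embedding)             # one API call per sentence
embeddings = np.array(df["embedding"].tolist(), dtype="float32")  # shape (1000, 1536)
```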
Index Type: IndexFlatL2
Characteristic: Stores all vectors and answers a query by exhaustively comparing the query vector to every stored vector (exact results, but not scalable to large datasets).
import faiss
d = embeddings.shape[1]          # 1,536 dimensions
index = faiss.IndexFlatL2(d)     # exact (brute-force) L2 index
index.add(embeddings)            # embeddings: float32 array of shape (n, d)
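Querying the flat index is an exhaustive scan over all stored vectors; a minimal sketch (the query text is made up for illustration):

```python
query = np.array([get_embedding("a coming-of-age story set at sea")], dtype="float32")
distances, ids = index.search(query, 3)   # 3 nearest neighbours by L2 distance
print(df.iloc[ids[0]])                    # the three closest sentences
```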
Improvement: Uses Inverted File (IVF) indexing to cluster vectors into cells (a Voronoi diagram), so a query only scans the most relevant cells.
Algorithm: IndexIVFFlat
Training Required: Index needs to be trained with a subset of embeddings.
nprobe: Parameter that sets how many neighboring cells are searched at query time (higher values improve recall at the cost of speed).
nlist = 10                                 # number of Voronoi cells (value assumed for a 1,000-vector dataset)
quantizer = faiss.IndexFlatL2(d)           # how vectors are compared within cells
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(embeddings)                    # k-means on the embeddings learns the cell centroids
index.add(embeddings)
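Searching works the same way as with the flat index, except that nprobe widens the search to more cells, trading speed for recall; a sketch using the same query as above (nprobe values are assumptions):

```python
index.nprobe = 1                      # fastest: only the closest cell is scanned
_, ids_fast = index.search(query, 3)
index.nprobe = 5                      # wider search: better recall, a bit slower
_, ids_wide = index.search(query, 3)
```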
Compression: Uses Product Quantization (PQ) to compress each vector into a short code instead of storing all 1,536 floats, trading a little accuracy for much lower memory use and faster search.
Algorithm: IndexIVFPQ
Training Required: Similar process to IVF Flat, with additional PQ compression.
m, nbits = 8, 8                            # 8 sub-vectors, 8 bits each (assumed values; d must be divisible by m)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(embeddings)                    # learns the IVF centroids and the PQ codebooks
index.add(embeddings)
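To make the compression concrete, a back-of-the-envelope comparison of per-vector storage with the parameters above:

```python
raw_bytes_per_vector = d * 4                          # 1,536 float32 values ≈ 6 KB per vector
pq_bytes_per_vector = m * nbits // 8                  # 8 sub-codes × 8 bits = 8 bytes per vector
print(raw_bytes_per_vector // pq_bytes_per_vector)    # ≈ 768× smaller, ignoring index overhead
```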
Setup: Requires a Pinecone API key and account setup.
Serverless Database: Easy creation and handling of vector store databases without managing infrastructure.
Index Creation: Defining index dimensions and metric (e.g., cosine similarity).
from pinecone import Pinecone, ServerlessSpec   # pinecone-client v3+ (serverless API)
pc = Pinecone(api_key='YOUR_API_KEY')
pc.create_index(name='example-index', dimension=d, metric='cosine',
                spec=ServerlessSpec(cloud='aws', region='us-east-1'))   # cloud/region are assumptions
Data Insertion & Query: Supports batch upsert operations for efficient data insertion, and similarity search with metadata retrieval.
index = pc.Index('example-index')
index.upsert(vectors=vectors)   # vectors: a list of {"id", "values", "metadata"} dicts
results = index.query(vector=query_vector, top_k=3, include_metadata=True)
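For this dataset, the upsert payload and the returned matches might be built and read as in the sketch below (column names, id scheme, and batch size are assumptions):

```python
# Build the upsert payload from the dataframe (hypothetical column names and id scheme)
vectors = [
    {"id": str(i), "values": row["embedding"], "metadata": {"sentence": row["sentence"]}}
    for i, row in df.iterrows()
]
for start in range(0, len(vectors), 100):             # batch to stay under request size limits
    index.upsert(vectors=vectors[start:start + 100])

# Query with an embedded question and read back the matched sentences with their scores
query_vector = get_embedding("a coming-of-age story set at sea")
results = index.query(vector=query_vector, top_k=3, include_metadata=True)
for match in results.matches:
    print(round(match.score, 3), match.metadata["sentence"])
```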