🤖

Creating AI Agents with Open Source Tools

Feb 15, 2025

Building AI Agents with Open Source Data Extraction

Introduction

  • Importance of giving AI agents access to data for specific knowledge about a company or problem.
  • Many online tools exist, often closed source, requiring API keys.
  • Open-source alternatives like DocLink are available.
  • Focus on Python for building a document extraction pipeline using DocLink.

Objectives

  • Build a fully open-source document extraction pipeline.
  • Parse PDFs and websites for a chat application.
  • Utilize extraction, parsing, chunking, embedding, and retrieval techniques.

Setup and Requirements

  • GitHub Repository: Contains necessary files and information.
  • Environment Setup: Requires Python environment, OpenAI API key.
  • Install dependencies from requirements.txt.

Document Extraction Using DocLink

  • Open-source library from IBM, popular in the AI engineering community.
  • Capable of parsing multiple formats: PDF, PowerPoint, DOCX, websites.
  • Converts files into a DocLink document object for unified data handling.
  • Supports Markdown and JSON exports.
  • Excels in table extraction, often a challenge for other libraries.

Step-by-Step Process

1. Document Extraction

  • Use DocLink to convert PDFs into structured data objects.
  • Supports Optical Character Recognition (OCR) for data modeling.

2. HTML and Website Extraction

  • Convert HTML to DocLink objects.
  • Use sitemap.xml to extract all URLs from a website.
  • DocLink's convert_all method extracts entire websites.

3. Chunking

  • Splits documents into logical chunks using DocLink's hierarchical and hybrid chunking.
  • Hybrid chunking fits chunks to the context size of embedding models.

4. Embedding

  • Store document embeddings in a vector database using Lansdb.
  • Use OpenAI for embedding creation.
  • Pydantic Models: Define structure for vector database entries.

5. Vector Database and Retrieval

  • Use Lansdb for simple vector storage and retrieval.
  • Store document text, embeddings, and metadata.
  • Query database using similarity search with embeddings.

Application Development

Interactive Chat Application

  • Built using Streamlit for demo purposes.
  • Connects to vector database for context retrieval.
  • Provides interactive chat interface with retrieval and citation of sources.

Conclusion

  • Successfully built a knowledge extraction system using open-source tools.
  • Demonstrated extraction, chunking, embedding, and retrieval techniques.
  • Application scales easily by adding more data.
  • Encouraged to expand and customize further.

Resources

  • Consider joining free courses and communities to learn more about Python for AI.
  • Look into freelance opportunities if interested in expanding skills beyond videos.