🤖

Creating AI Agents with Open Source Tools

Feb 15, 2025

Building AI Agents with Open Source Data Extraction

Introduction

Importance of giving AI agents access to data for specific knowledge about a company or problem.
Many online tools exist, often closed source, requiring API keys.
Open-source alternatives like DocLink are available.
Focus on Python for building a document extraction pipeline using DocLink.

Objectives

Build a fully open-source document extraction pipeline.
Parse PDFs and websites for a chat application.
Utilize extraction, parsing, chunking, embedding, and retrieval techniques.

Setup and Requirements

GitHub Repository: Contains necessary files and information.
Environment Setup: Requires Python environment, OpenAI API key.
Install dependencies from requirements.txt.

Document Extraction Using DocLink

Open-source library from IBM, popular in the AI engineering community.
Capable of parsing multiple formats: PDF, PowerPoint, DOCX, websites.
Converts files into a DocLink document object for unified data handling.
Supports Markdown and JSON exports.
Excels in table extraction, often a challenge for other libraries.

Step-by-Step Process

1. Document Extraction

Use DocLink to convert PDFs into structured data objects.
Supports Optical Character Recognition (OCR) for data modeling.

2. HTML and Website Extraction

Convert HTML to DocLink objects.
Use sitemap.xml to extract all URLs from a website.
DocLink's convert_all method extracts entire websites.

3. Chunking

Splits documents into logical chunks using DocLink's hierarchical and hybrid chunking.
Hybrid chunking fits chunks to the context size of embedding models.

4. Embedding

Store document embeddings in a vector database using Lansdb.
Use OpenAI for embedding creation.
Pydantic Models: Define structure for vector database entries.

5. Vector Database and Retrieval

Use Lansdb for simple vector storage and retrieval.
Store document text, embeddings, and metadata.
Query database using similarity search with embeddings.

Application Development

Interactive Chat Application

Built using Streamlit for demo purposes.
Connects to vector database for context retrieval.
Provides interactive chat interface with retrieval and citation of sources.

Conclusion

Successfully built a knowledge extraction system using open-source tools.
Demonstrated extraction, chunking, embedding, and retrieval techniques.
Application scales easily by adding more data.
Encouraged to expand and customize further.

Resources

Consider joining free courses and communities to learn more about Python for AI.
Look into freelance opportunities if interested in expanding skills beyond videos.

Full transcript