Chat Completion Process with LLMs

Aug 5, 2024

Overview

Prerequisite: Familiarity with the chat completion process using large language models (LLMs).
Focus: OpenAI's LLMs, specifically for building chatbots.

Ingestion Phase

Initial Step: Convert documents (usually PDFs) to text files.
Chunking Requirement: If a text file exceeds the model's context window, it must be divided into smaller chunks.
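
As a rough illustration of the chunking requirement, the sketch below estimates a text's token count and checks it against a context window. The function names, the 4-characters-per-token rule of thumb, and the 500-token reply reserve are assumptions made for this example, not values from the source; an exact count would need the model's own tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text (heuristic, not an exact count)."""
    return int(len(text) / chars_per_token) + 1


def needs_chunking(text: str, context_window: int, reserved_for_reply: int = 500) -> bool:
    """Return True when the text would not fit alongside room for the model's reply.

    reserved_for_reply is an assumed budget for the completion itself.
    """
    return estimate_tokens(text) > context_window - reserved_for_reply


if __name__ == "__main__":
    long_text = "word " * 10_000  # 50,000 characters
    print(needs_chunking(long_text, context_window=4096))
    print(needs_chunking("short text", context_window=4096))
```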

Chunking Methodology

Existing solutions: Many Python scripts are available (splitting by sentence, paragraph, etc.).
Developed Methodology: Semantic chunking.
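
For contrast with semantic chunking, the simpler paragraph-based splitting mentioned above might look like the following. This is a minimal sketch: the function name, the blank-line paragraph delimiter, and the 2,000-character limit are assumptions for illustration, not taken from any particular existing script.

```python
def chunk_by_paragraph(text: str, max_chars: int = 2000) -> list[str]:
    """Pack consecutive paragraphs into chunks of at most max_chars characters.

    Paragraphs are assumed to be separated by blank lines. A single paragraph
    longer than max_chars is kept whole rather than split mid-paragraph.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this paragraph would overflow the chunk: start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

This kind of splitting respects size limits but ignores the document's hierarchy, so a chunk may begin or end in the middle of a topic.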

Semantic Chunking

Purpose: To embed large PDF documents into a vector store for querying during chat completions.
Context Window Limits:
- OpenAI's GPT-3.5-Turbo: 4096 tokens (~8 pages).
- GPT-4: 8000 tokens (~16 pages), expanding to 32,000 tokens (~50 pages).
Cost Consideration:
- GPT-4 8K: $0.06 per 1000 tokens.
- GPT-4 32K: $0.12 per 1000 tokens.
- GPT-3.5-Turbo: $0.002 per 1000 tokens (preferred for cost-effectiveness).
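
To make the cost comparison concrete, the sketch below applies the per-1000-token prices quoted above to a given token count. The model keys and function name are illustrative assumptions, the prices are simply those listed in these notes, and actual pricing may differ or distinguish between prompt and completion tokens.

```python
# Prices in USD per 1000 tokens, as quoted in the notes (not current list prices).
PRICE_PER_1K_TOKENS = {
    "gpt-4-8k": 0.06,
    "gpt-4-32k": 0.12,
    "gpt-3.5-turbo": 0.002,
}


def estimate_cost(num_tokens: int, model: str) -> float:
    """Estimated cost in USD of processing num_tokens with the given model."""
    return num_tokens / 1000 * PRICE_PER_1K_TOKENS[model]


if __name__ == "__main__":
    for model in PRICE_PER_1K_TOKENS:
        print(f"{model}: ${estimate_cost(4000, model):.3f} for 4000 tokens")
```

At these prices, 4000 tokens cost about $0.008 with GPT-3.5-Turbo versus $0.24 with GPT-4 8K, a 30-fold difference.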

Document Types and Structure

Working with various document types:
- Legal contracts, regulatory codes, policy manuals, research papers, news articles, blog posts.
Common Hierarchical Structures:
- Regulatory codes: Title, division, part, chapter, article, section.
- News articles: Title and subtopics.
- Legal agreements: Chapter, sub-chapter, articles.

Semantic Schema

Goal: Break down documents into basic semantic ideas represented in their hierarchy.
Application: Create a question and answer knowledge base for California real estate law.
- Target users: Individuals studying for the California real estate exam, existing Realtors/Brokers, and legal experts.
- Resource: California Department of Real Estate website, focusing on real estate law book and reference guide.

Implementation Steps for Semantic Chunking

Determine Semantic Hierarchy of the document.
Identify Base Semantic Element for chunking (e.g., chapter, article).
Export Document to plain text format.
Implement Code to chunk text according to specifications.
- Challenge: Requires knowledge of text parsing.
- Background: Previous experience with Perl scripts for marking up regulatory documents.
- Approach: Use regular expressions to implement the parsing.
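
The steps above can be sketched with a regular expression that splits plain text at the chosen base semantic element. This is a hedged illustration in Python rather than Perl and is not the author's actual code: the function name, the heading pattern (a line such as "Article 1. Definitions"), and the output format are all assumptions, and a real document would need a pattern matched to its own heading style.

```python
import re


def chunk_by_heading(text: str, element: str = "Article") -> list[dict]:
    """Split plain text into chunks, one per heading of the given element type.

    Assumes each heading sits on its own line and starts with the element name
    followed by a number, e.g. "Article 1. Definitions" or "Section 2.3 Scope".
    Text before the first heading is ignored. Body lines that happen to start
    the same way (e.g. "Article 5 applies...") would be mistaken for headings.
    """
    pattern = re.compile(
        rf"^{re.escape(element)}\s+\d+(?:\.\d+)*\b.*$",
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        # A chunk runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({
            "heading": match.group(0).strip(),
            "text": text[match.end():end].strip(),
        })
    return chunks


if __name__ == "__main__":
    sample = (
        "Preamble text\n"
        "Article 1. Definitions\n"
        "A broker is a person.\n\n"
        "Article 2. Licensing\n"
        "A license is required.\n"
    )
    for chunk in chunk_by_heading(sample):
        print(chunk["heading"], "->", chunk["text"])
```

Keeping the heading with each chunk preserves its place in the hierarchy, which can later be stored as metadata alongside the chunk.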

Alternatives and Conclusion

Other chunking approaches and existing code are available.
Semantic chunking is presented as a unique and effective method.
The speaker expresses excitement about the journey and about exploring this methodology.
