Chat Completion Process with LLMs
Aug 5, 2024
Chat Completion Process Using Large Language Models
Overview
Assumes familiarity with the chat completion process for large language models (LLMs).
Focus on OpenAI's LLMs, specifically for creating chatbots.
Ingestion Phase
Initial Step: Convert documents (usually PDFs) to text files.
Chunking Requirement: If a text file exceeds the model's context window, it needs to be divided into smaller chunks.
Chunking Methodology
Existing solutions: Many Python scripts are available (chunking by sentence, paragraph, etc.).
Developed Methodology: Semantic chunking.
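For contrast with semantic chunking, the paragraph-based approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not any particular existing script: the function name `chunk_by_paragraph` is hypothetical, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
def chunk_by_paragraph(text: str, max_tokens: int = 4096) -> list[str]:
    """Greedily pack paragraphs into chunks that stay within max_tokens."""

    def approx_tokens(s: str) -> int:
        # Assumption: one whitespace-separated word is roughly one token.
        # A real tokenizer will usually report a higher count.
        return len(s.split())

    chunks: list[str] = []
    current: list[str] = []
    current_size = 0

    # Paragraphs are assumed to be separated by blank lines.
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        size = approx_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_size + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        # Note: a single paragraph larger than max_tokens still becomes
        # its own (oversized) chunk; this sketch does not split it further.
        current.append(para)
        current_size += size

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A chunker like this ignores the document's structure, so a chunk boundary can fall in the middle of an article or section, which is the problem semantic chunking is meant to avoid.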
Semantic Chunking
Purpose: To embed large PDF documents into a vector store for querying during chat completions.
Context Window Limits:
OpenAI's GPT-3.5-Turbo: 4,096 tokens (~8 pages).
GPT-4: 8,000 tokens (~16 pages), expanding to 32,000 tokens (~50 pages).
Cost Consideration:
GPT-4 8k: $0.06 per 1,000 tokens.
GPT-4 32k: $0.12 per 1,000 tokens.
GPT-3.5-Turbo: $0.002 per 1,000 tokens (preferred for cost-effectiveness).
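To make the cost difference concrete, here is a small arithmetic sketch using only the per-1,000-token prices listed above. The dictionary keys are hypothetical labels for this example rather than official API model names, and the prices are the ones quoted in these notes, which may no longer be current.

```python
# Prices per 1,000 tokens, as quoted in the notes (may be outdated).
# Keys are illustrative labels, not official API model identifiers.
PRICE_PER_1K_TOKENS = {
    "gpt-3.5-turbo": 0.002,
    "gpt-4-8k": 0.06,
    "gpt-4-32k": 0.12,
}


def estimate_cost(num_tokens: int, model: str) -> float:
    """Return the estimated cost in USD for processing num_tokens."""
    return num_tokens / 1000 * PRICE_PER_1K_TOKENS[model]


if __name__ == "__main__":
    for model, tokens in [("gpt-3.5-turbo", 4096), ("gpt-4-8k", 8000)]:
        print(f"{model}: {tokens} tokens cost about ${estimate_cost(tokens, model):.4f}")
```

At these rates, the same number of tokens costs 30 times more on the GPT-4 8k tier than on GPT-3.5-Turbo, which is why the notes prefer the cheaper model.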
Document Types and Structure
Working with various document types:
Legal contracts, regulatory codes, policy manuals, research papers, news articles, blog posts.
Common Hierarchical Structures:
Regulatory codes: Title, division, part, chapter, article, section.
News articles: Title and subtopics.
Legal agreements: Chapter, sub-chapter, articles.
Semantic Schema
Goal: Break down documents into basic semantic ideas represented in their hierarchy.
Application: Create a question and answer knowledge base for California real estate law.
Target users: Individuals studying for the California real estate exam, existing Realtors/Brokers, and legal experts.
Resource: California Department of Real Estate website, focusing on real estate law book and reference guide.
Implementation Steps for Semantic Chunking
Determine Semantic Hierarchy of the document.
Identify Base Semantic Element for chunking (e.g., chapter, article).
Export Document to plain text format.
Implement Code to chunk text according to specifications.
Challenge: Requires knowledge of text parsing.
The approach draws on previous experience writing Perl scripts to mark up regulatory documents.
Regular expressions are used to implement the parsing.
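The steps above can be sketched with regular expressions roughly as follows. This is not the presenter's actual code: the heading patterns (`Chapter N`, `Article N`), the function name `semantic_chunks`, and the sample text are all assumptions made for illustration, with "article" chosen as the base semantic element.

```python
import re

# Hypothetical heading patterns; a real document needs patterns
# matched to its own hierarchy (title, division, part, section, ...).
CHAPTER_RE = re.compile(r"^Chapter\s+(\d+)\b.*$", re.IGNORECASE)
ARTICLE_RE = re.compile(r"^Article\s+(\d+)\b.*$", re.IGNORECASE)


def semantic_chunks(text: str) -> list[dict]:
    """Split plain text into one chunk per article, tagged with its chapter."""
    chunks: list[dict] = []
    chapter = None   # most recent chapter heading seen
    current = None   # chunk currently being filled

    for line in text.splitlines():
        stripped = line.strip()
        if CHAPTER_RE.match(stripped):
            chapter = stripped
            current = None  # text before the next article heading is dropped
            continue
        if ARTICLE_RE.match(stripped):
            current = {"chapter": chapter, "article": stripped, "lines": []}
            chunks.append(current)
            continue
        if current is not None and stripped:
            current["lines"].append(stripped)

    return [
        {"chapter": c["chapter"], "article": c["article"], "text": " ".join(c["lines"])}
        for c in chunks
    ]


if __name__ == "__main__":
    # Invented sample text, for illustration only.
    sample = (
        "Chapter 1 General\n"
        "Article 1 Definitions\n"
        "A broker is licensed.\n"
        "Article 2 Scope\n"
        "Applies statewide.\n"
    )
    for chunk in semantic_chunks(sample):
        print(chunk)
```

Keeping the parent heading with each chunk preserves the hierarchy as metadata, so a retrieved chunk can be traced back to its place in the document. A body line that happens to begin with "Article 5" would be misread as a heading here, which is one reason real parsing code tends to need stricter patterns.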
Alternatives and Conclusion
Other chunking approaches and existing code are available.
Presented semantic chunking as a unique and effective method.
Excitement about the journey and exploration of this methodology.