Detecting Hallucinations in AI Models
May 22, 2025
Lecture Notes on Detecting Hallucinations in LLM Applications
Introduction to the Speaker and Topic
Speaker: Alex Thomas, Principal Data Scientist at Wisecube.
Topic: Detecting hallucinations in AI and measurement systems for detecting them.
Background on LLMs and Hallucination
LLMs: Large Language Models used to process natural language, which have advanced significantly in recent years.
Problem: LLMs can hallucinate, i.e., generate false information, which can:
Spread misinformation.
Erode trust and brand reputation.
Lead to legal and customer trust issues.
Causes: Bias inherent in LLMs and user-framing issues.
Examples of Hallucinations Impact
Example: Air Canada's chatbot invented a refund policy.
Impact: Legal and brand reputation damage.
Measuring Hallucinations
Terminology:
Application: The LLM-based application (e.g., chatbot, summarization system).
Measurement System: The system used to measure the LLM application.
Types of LLM Tasks
Zero Context
Basic chatbot experience with small inputs/outputs.
Challenges: External data source needed for factual accuracy.
Susceptible to user framing.
RAG Q&A (Retrieval Augmented Generation)
Combines user input with retrieved text to produce grounded answers (see the prompt sketch after this list).
Challenges: Multiple sources of error (LLM hallucination, incorrect retrieved data).
Summarization
Large input, smaller output.
Accuracy is a more complex concept for this task.
Challenges: Maintaining factual integrity when summarizing potentially false articles.
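To make the RAG Q&A setup concrete, here is a minimal sketch of how a grounded prompt might be assembled. The prompt wording, example passage, and function name are illustrative assumptions, not the speaker's implementation; the retriever and model call are omitted.

```python
# Minimal sketch of assembling a grounded RAG Q&A prompt (illustrative only).

def build_rag_prompt(question, passages):
    # Number the retrieved passages so the answer can later be checked against them.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the passages below. "
        "If the passages do not contain the answer, say you don't know.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

passages = ["Air Canada's bereavement fare policy requires requests to be made before travel."]
print(build_rag_prompt("Can I claim a bereavement refund after my flight?", passages))
```

This setup has the two failure modes noted above: the model can answer unfaithfully to the passages, or the retrieved passages themselves can be wrong.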
Establishing Benchmarks and Metrics
Challenges:
Lack of established metrics or conventions.
The focus should be on factual accuracy rather than text fluency.
Measurement Strategies:
Pythia Strategy: Extracts claims as (subject, predicate, object) triples for evaluation (see the sketch after this list).
Grading Strategy: Simple letter-grade evaluation.
Lynx Strategy: Pass/fail evaluation focused on faithfulness.
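Below is a hedged sketch of the triple-based idea: score an answer by the fraction of its claim triples that also appear in triples extracted from the reference context. The extraction step, the scoring rule, and the example triples are assumptions for illustration, not Pythia's actual implementation.

```python
# Sketch of a triple-based hallucination check, assuming an upstream claim
# extractor has already turned text into (subject, predicate, object) triples.

from typing import Set, Tuple

Triple = Tuple[str, str, str]

def normalize(triple: Triple) -> Triple:
    # Lowercase and strip so trivially different surface forms still match.
    return tuple(part.lower().strip() for part in triple)

def support_score(answer_triples: Set[Triple], context_triples: Set[Triple]) -> float:
    """Fraction of claims in the answer that are supported by the context."""
    if not answer_triples:
        return 1.0  # no claims made, so nothing to contradict
    answer = {normalize(t) for t in answer_triples}
    context = {normalize(t) for t in context_triples}
    supported = sum(1 for t in answer if t in context)
    return supported / len(answer)

# Hypothetical example triples.
context = {("air canada", "offers", "bereavement fares"),
           ("refund requests", "must be filed", "before travel")}
answer = {("air canada", "offers", "bereavement fares"),
          ("refunds", "can be claimed", "after travel")}  # unsupported claim

print(support_score(answer, context))  # 0.5 -> half the claims are grounded
```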
Models Tested
Models: GPT-4o and LLaMA 3.1 (70 billion parameters).
Measuring Measurement Systems
Comparison with NLP Benchmarking:
Older benchmarks were specialized by task and domain.
Newer LLM benchmarks are murkier and less established.
Data Sets for Measuring
Summarization Data Sets:
CNN/Daily Mail and BBC articles.
Labels from experts vs. Mechanical Turk workers.
RAG Q&A Data Set:
Uses binary pass/fail labels.
Metrics for Evaluation
Spearman Correlation: Problematic with low-granularity labels.
Mean Absolute Error: Better suited to summarization evaluation.
Accuracy: Used for binary classification in RAG Q&A.
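As a quick illustration of these metrics, the sketch below compares a measurement system's scores against human labels. The numbers are made up; in practice the scores would come from a strategy such as Pythia, grading, or Lynx, and the labels from the benchmark datasets above.

```python
# Computing the three evaluation metrics mentioned above on toy data.

from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, accuracy_score

# Summarization-style example: continuous system scores vs. coarse human labels.
system_scores = [0.9, 0.4, 0.75, 0.2, 0.6]
human_labels  = [1.0, 0.5, 0.75, 0.0, 0.5]  # low-granularity labels (few distinct values)

rho, _ = spearmanr(system_scores, human_labels)
mae = mean_absolute_error(human_labels, system_scores)
print(f"Spearman correlation: {rho:.2f}")  # rank agreement; ties in coarse labels hurt it
print(f"Mean absolute error:  {mae:.2f}")  # average distance from the human grade

# RAG Q&A-style example: binary pass/fail labels, so plain accuracy applies.
predicted_pass = [1, 0, 1, 1, 0]
labeled_pass   = [1, 0, 0, 1, 0]
print(f"Accuracy: {accuracy_score(labeled_pass, predicted_pass):.2f}")
```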
Conclusion
Complexity: Measuring LLM applications requires sophisticated measurement systems.
Understanding: You need to understand your data, your metrics, and how the system is tuned.
Future Work: Testing more strategies and models, and exploring calibration.
Questions and Recommendations
Data Comparison: Compare the distribution of open datasets with your own data (a quick sketch follows this list).
Industry-Specific Use: Pythia can be adapted to various industries, though specialization is beneficial.
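One simple way to act on the data-comparison recommendation is to compare input-length distributions between an open benchmark and your own data, for example with a two-sample Kolmogorov-Smirnov test. The placeholder documents below are assumptions; this is a rough sanity check, not a method from the talk.

```python
# Rough sketch: compare document-length distributions of a benchmark vs. your data.

from scipy.stats import ks_2samp

def lengths(texts):
    # Word counts as a simple proxy for input-size distribution.
    return [len(t.split()) for t in texts]

open_benchmark_docs = ["Example benchmark article text ...", "Another benchmark article ..."]
your_docs = ["An input from your own application ...", "Another of your inputs ..."]

stat, p_value = ks_2samp(lengths(open_benchmark_docs), lengths(your_docs))
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3f}")
# A large statistic (small p-value) suggests the benchmark's inputs look quite
# different from yours, so its scores may not transfer directly.
```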
Final Notes
Resources: Links are available for trying out the systems.
Webinar Continuation: Stay tuned for future developments.