
Detecting Hallucinations in AI Models

May 22, 2025

Lecture Notes on Detecting Hallucinations in LLM Applications

Introduction to the Speaker and Topic

  • Speaker: Alex Thomas, Principal Data Scientist at Wisecube.
  • Topic: Detecting hallucinations in LLM applications and how to evaluate the measurement systems that detect them.

Background on LLMs and Hallucination

  • LLMs: Large language models have driven major advances in natural language processing.
  • Problem: LLMs can hallucinate, i.e., generate false information, which can:
    • Spread misinformation.
    • Erode trust and brand reputation.
    • Lead to legal liability and customer-trust issues.
  • Causes: Biases inherent in LLMs and the way users frame their prompts.

Examples of Hallucinations Impact

  • Example: Air Canada's support chatbot invented a bereavement refund policy, and a tribunal held the airline to it.
  • Impact: Legal liability and brand reputation damage.

Measuring Hallucinations

  • Terminology:
    • Application: The LLM-based application being evaluated (e.g., a chatbot or summarization system).
    • Measurement System: The system that scores the application's outputs, e.g., for hallucinations.

Types of LLM Tasks

  1. Zero Context

    • Basic chatbot experience with small inputs and outputs and no supporting documents.
    • Challenges: Checking factual accuracy requires an external data source, since the model relies only on what it learned in training.
    • Susceptible to user framing (leading or loaded questions).
  2. RAG Q&A (Retrieval Augmented Generation)

    • Combines the user's question with retrieved text so answers are grounded in source documents (see the sketch after this list).
    • Challenges: Multiple sources of error (LLM hallucination, incorrect or irrelevant retrieved data).
  3. Summarization

    • Large input, smaller output.
    • Accuracy is a more complex notion here: the summary should be faithful to the source, even if the source itself contains false claims.
    • Challenges: Maintaining factual integrity when summarizing potentially false articles.
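
A minimal sketch of the RAG Q&A flow referenced above, using hypothetical helper names (retrieve, generate) that are not from the talk, makes the two independent error sources concrete: the retriever can return wrong or irrelevant passages, and the LLM can still hallucinate beyond whatever it was given.

```python
# Toy RAG Q&A sketch (hypothetical helpers, not the talk's implementation).
# Error source 1: retrieval returns wrong/irrelevant passages.
# Error source 2: generation goes beyond (hallucinates past) the retrieved text.

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Toy retriever: rank passages by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]

def generate(query: str, passages: list[str]) -> str:
    """Stand-in for the LLM call; a real system would send this prompt to a model
    and instruct it to answer using ONLY the retrieved passages."""
    context = "\n".join(passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )  # replace with an actual LLM call in practice

def rag_qa(query: str, corpus: list[str]) -> str:
    passages = retrieve(query, corpus)  # error source 1
    return generate(query, passages)    # error source 2

# Usage:
# rag_qa("What is the refund policy?",
#        ["Refunds are issued within 30 days.", "Shipping takes 5 business days."])
```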

Establishing Benchmarks and Metrics

  • Challenges:
    • Lack of established metrics or conventions for hallucination detection.
    • Traditional NLP metrics reward text fluency; the focus here must be factual accuracy.
  • Measurement Strategies (their output granularities are compared in the sketch below):
    • Pythia Strategy: Extracts the claims in an output as subject-predicate-object triples and checks each against the reference.
    • Grading Strategy: Assigns a simple letter grade to the output.
    • Lynx Strategy: A pass/fail judgment focused on faithfulness.
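
One practical difference between these strategies is output granularity: Pythia yields a fraction of supported claims, the grading strategy a letter, and Lynx a binary verdict. A small sketch (the normalization below is my own illustration, not the talk's exact scheme) shows how each could be mapped onto a common 0-to-1 scale for comparison against the same reference labels.

```python
# Illustrative normalization of the three strategies' outputs to a [0, 1] scale.
# The exact mappings are assumptions for the sake of the example.

def pythia_score(supported_claims: int, total_claims: int) -> float:
    """Pythia-style: fraction of extracted claim triples supported by the reference."""
    return supported_claims / total_claims if total_claims else 1.0

def grade_score(letter: str) -> float:
    """Grading-style: map a letter grade to a number (assumed A-F scale)."""
    return {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}[letter.upper()]

def lynx_score(passed: bool) -> float:
    """Lynx-style: binary faithfulness judgment."""
    return 1.0 if passed else 0.0

# The same summary evaluated under all three strategies:
print(pythia_score(supported_claims=7, total_claims=8))  # 0.875
print(grade_score("B"))                                   # 0.75
print(lynx_score(True))                                   # 1.0
```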

Models Tested

  • Models: GPT-4o and Llama 3.1 (70 billion parameters).

Measuring Measurement Systems

  • Comparison with NLP Benchmarking:
    • Older NLP benchmarks were specialized by task and domain.
    • Benchmarks for LLM applications and their measurement systems are murkier and less established.

Data Sets for Measuring

  • Summarization Data Sets:
    • CNN/Daily Mail, BBC articles.
    • Expert vs. Mechanical Turker labels.
  • RAG Q&A Data Set:
    • Based on binary pass/fail labels.

Metrics for Evaluation

  • Spearman Correlation: Problematic when labels take only a few distinct values (low granularity).
  • Mean Absolute Error: Better suited to the graded summarization labels.
  • Accuracy: Used for the binary pass/fail classification in RAG Q&A (all three are computed in the example below).
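
For concreteness, these three metrics can be computed with standard scipy / scikit-learn calls; the numbers below are made up, and real scores would come from the data sets above.

```python
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, mean_absolute_error

# Summarization-style evaluation: graded factuality scores in [0, 1].
predicted = [0.9, 0.4, 0.7, 1.0]
reference = [1.0, 0.5, 0.5, 1.0]
rho, _ = spearmanr(predicted, reference)         # rank correlation; unstable when labels take few distinct values
mae = mean_absolute_error(reference, predicted)  # average absolute gap; handles coarse labels better

# RAG Q&A-style evaluation: binary pass/fail labels.
pred_pass = [1, 0, 1, 1]
ref_pass = [1, 0, 0, 1]
acc = accuracy_score(ref_pass, pred_pass)        # fraction of matching pass/fail decisions

print(f"Spearman rho: {rho:.2f}, MAE: {mae:.2f}, accuracy: {acc:.2f}")
```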

Conclusion

  • Complexity: Measuring LLM applications requires sophisticated systems.
  • Understanding: Need to understand data, metrics, and system tuning.
  • Future Work: Testing more strategies, models, and exploring calibration.

Questions and Recommendations

  • Data Comparison: Compare the distribution of open benchmark data sets with your own data before relying on benchmark results (see the sketch below).
  • Industry-Specific Use: Pythia can be adapted to various industries, though domain specialization is beneficial.
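
A minimal sketch of the data-comparison recommendation, assuming document length as a proxy feature and using made-up numbers: a two-sample Kolmogorov-Smirnov test flags when an open benchmark's distribution looks unlike your own data.

```python
from scipy.stats import ks_2samp

# Illustrative values: token lengths of benchmark articles vs. your own documents.
benchmark_lengths = [312, 540, 1280, 890, 450, 760, 1020]
your_doc_lengths = [95, 120, 80, 200, 150, 110, 60]

result = ks_2samp(benchmark_lengths, your_doc_lengths)
if result.pvalue < 0.05:
    print(f"Distributions differ (KS={result.statistic:.2f}); benchmark results may not transfer.")
else:
    print(f"No strong evidence of a difference (KS={result.statistic:.2f}).")
```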

Final Notes

  • Resources: Links available for trying out systems.
  • Webinar Continuation: Stay tuned for future developments.