
Workshop on LLM Hallucination Evaluation

May 22, 2025

LLM Hallucinations and Evaluation Framework Workshop

Introduction

  • Host: Diana Chan Morgan from deeplearning.ai
  • Guests: Special guests from Galileo
  • Session Info:
    • Recorded and available for replay
    • Use chat link for questions
    • Developer Evangelists Setu and Jonathan available in chat

Event and Workshop Overview

  • Focus on:
    • Research-backed metrics for evaluating input/output quality
    • Hallucinations in LLM applications
    • Evaluation and experimentation framework
    • Prompt engineering with RAG
    • Fine-tuning with personal data
    • Demo for implementation

Speakers Introduction

  • Vikram Chatterji: Co-founder and CEO at Galileo
    • Background: Product management at Google AI and Google Pay
  • Atindriyo Sanyal: Co-founder and CTO at Galileo
    • Background: Engineering leader at Uber AI (Michelangelo) and on Siri at Apple

Workshop Agenda

  1. Need for LLM Experimentation and Evaluation Frameworks

    • Challenges with LLMs: complexity of inputs and outputs
    • Importance of iterative experimentation and evaluation
  2. Different Evaluation Methods and Usage

    • Output evaluation requires domain-specific metrics
    • Existing limitations in current methods
  3. Emerging Hallucination Detection Metrics

    • Development of new metrics tailored for LLMs
    • Significance of understanding and mitigating hallucinations
  4. Demo Walkthrough

    • Application of the discussed methods and metrics
    • Implementation of mitigation strategies

LLMs Experimentation Details

  • Input Side:
    • Complexity in model parameters, prompt templates, RAG context, etc.
  • Output Side:
    • Diverse use cases require different evaluation methods
  • Framework Necessity:
    • Concrete metrics are needed to guide experimentation and evaluation (see the experiment-grid sketch below)
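
As a minimal sketch of what the input side of such a framework can look like, the snippet below enumerates an experiment grid over a few knobs. The knob names (temperature, prompt template, retrieval depth) are illustrative assumptions, not Galileo's API.

```python
from itertools import product

# Hypothetical input-side knobs to sweep; the names are illustrative only.
temperatures = [0.0, 0.3, 0.7]
prompt_templates = ["concise_v1", "detailed_v2"]
retrieval_k = [3, 5, 10]  # number of RAG chunks injected into the prompt

# Each combination is one experiment run, scored with the same evaluation metrics.
runs = [
    {"temperature": t, "prompt_template": p, "retrieval_k": k}
    for t, p, k in product(temperatures, prompt_templates, retrieval_k)
]
print(f"{len(runs)} configurations to evaluate")  # 3 * 2 * 3 = 18
```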

Galileo’s LLM Studio

  • Components:
    1. Prompt: Accelerate prompt engineering
    2. Fine-Tune: Fine-tuning on high-quality data
    3. Monitor: Real-time observability
  • Guardrail Metrics Store:
    • Research-backed metrics to reduce hallucinations

Hallucination Detection Techniques

  1. N-gram Matching: Measures the overlap between the output and a reference text (see the overlap sketch after this list)

    • Limitations: Requires ground-truth references; adapts poorly to open-ended generation
  2. Self-checking via LLM: LLMs assess own outputs

    • Limitations: Black-box approach that lacks explainability
  3. Statistical Properties: Token-level confidence analysis

    • Limitations: Operates at token granularity; few LLM APIs expose token log-probabilities
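
To make the first technique concrete, here is a minimal sketch of an n-gram overlap score: the fraction of the output's n-grams that also appear in a reference answer. It is an illustration of the idea, not Galileo's implementation, and it shows why the method needs ground truth.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(output: str, reference: str, n: int = 2) -> float:
    """Fraction of output n-grams that also appear in the reference.

    Low overlap hints the output strays from the ground truth, which is
    exactly why the method breaks down when no reference text exists.
    """
    out_grams = ngrams(output.lower().split(), n)
    ref_grams = set(ngrams(reference.lower().split(), n))
    if not out_grams:
        return 0.0
    return sum(g in ref_grams for g in out_grams) / len(out_grams)

print(ngram_overlap("the capital of France is Paris",
                    "Paris is the capital of France"))  # 0.6
```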

ChainPoll Method

  • Approach: Combines the strengths of the existing methods
  • Phases:
    1. Chaining: Detailed chain-of-thought prompting of a judge LLM
    2. Polling: Ensembling the judge’s verdicts to detect hallucinations
  • Benefits:
    • High efficacy, low latency, low compute cost (a rough sketch follows this list)
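
The sketch below captures the chain-then-poll idea under some assumptions: a judge LLM is prompted with chain-of-thought instructions to check whether the answer is supported by the context, the call is repeated several times, and the fraction of "yes" verdicts becomes the score. The call_llm function is a placeholder for whatever client you use; this is not Galileo's implementation of ChainPoll.

```python
JUDGE_PROMPT = """You are checking an answer for hallucinations.

Context:
{context}

Answer:
{answer}

Think step by step about whether every claim in the answer is supported by
the context, then end with one line: VERDICT: YES or VERDICT: NO
(YES means fully supported)."""

def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def chainpoll_score(context: str, answer: str, polls: int = 5) -> float:
    """Chain: chain-of-thought judging prompt. Poll: repeat and average the votes."""
    votes = []
    for _ in range(polls):
        reply = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
        votes.append("VERDICT: YES" in reply.upper())
    # Fraction of polls judging the answer as supported; lower values flag hallucination.
    return sum(votes) / polls
```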

Demonstration Highlights

  • Prompt Engineering with RAG: Walkthrough of the steps and metrics used (a minimal prompt-assembly sketch follows)
  • Fine-Tuning with Your Own Data: Importance of data quality
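
As a rough illustration of the RAG prompt-assembly step, the template below stuffs retrieved chunks into the prompt and instructs the model to abstain when the context is insufficient. The template wording and the build_rag_prompt helper are illustrative assumptions, not what was shown in the demo.

```python
RAG_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Join the retrieved chunks into the context slot of the template."""
    return RAG_TEMPLATE.format(context="\n\n".join(retrieved_chunks),
                               question=question)

print(build_rag_prompt("What was the workshop about?",
                       ["<retrieved chunk 1>", "<retrieved chunk 2>"]))
```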

Observability in Production

  • Key Metrics (per request; see the logging sketch below):
    • Cost, latency, API failure rates
    • LLM correctness, context adherence
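
A minimal sketch of how such per-request records could be assembled; the field names and the flat token price are illustrative assumptions, not Galileo's monitoring schema.

```python
import time

def log_llm_call(model: str, prompt_tokens: int, completion_tokens: int,
                 start: float, ok: bool, price_per_1k_tokens: float = 0.002) -> dict:
    """Build one observability record per LLM call.

    The flat price_per_1k_tokens is an assumption; real pricing varies by
    model and by prompt vs. completion tokens.
    """
    return {
        "model": model,
        "latency_s": round(time.time() - start, 3),
        "cost_usd": (prompt_tokens + completion_tokens) / 1000 * price_per_1k_tokens,
        "api_failure": not ok,
        # Quality metrics such as correctness and context adherence would be
        # attached later by an evaluation service.
    }

start = time.time()
# ... call the LLM here ...
record = log_llm_call("my-model", prompt_tokens=512, completion_tokens=128,
                      start=start, ok=True)
print(record)
```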

Conclusion

  • Takeaway:
    • Metrics are crucial for guiding LLM experimentation
    • Emphasis on continuous improvement and real-time monitoring

Q&A Highlights

  • Addressing hallucinations in real-time outputs
  • Ensuring reliability with non-deterministic LLM behavior

Closing Remarks

  • Survey for feedback
  • Encouragement for continued learning and exploration in AI

Note: This summary captures key points from the workshop for effective study and review of LLM hallucination evaluation and mitigation strategies.