
Understanding Observability and Hallucinations

May 22, 2025

Lecture Notes: Observability and Hallucinations in Machine Learning

Introduction

  • Presenter: AAR, co-founder at Arize
  • Guest Speaker: Trevor, Machine Learning Solutions Engineer at Arize
    • Role involves working closely with customers on observability: data ingestion and setting up monitors and dashboards.

Overview of Lecture

  • Focus on hallucinations in machine learning models.
  • Discuss detection, measurement, and evaluation of hallucinations.
  • Practical hands-on session with a Colab notebook.

Understanding Hallucinations

  • Definition: The model outputs text that is not factually correct or logically consistent.
  • Types of Hallucinations:
    1. Public Data Hallucinations: Errors rooted in the foundation model's training data. Example: incorrectly stating historical facts.
    2. Private Data Hallucinations: Errors that occur when the model answers from private context supplied to it, such as documents retrieved in a RAG system.

Importance of Addressing Hallucinations

  • Risks:
    • Providing incorrect product details can have legal or customer service implications.
    • Pulling irrelevant information or exposing sensitive customer data.

Causes of Hallucinations

  • Poor Ordering of Retrieved Chunks:
    • Relevant information may be buried deep in the context window, so the model answers from less relevant chunks instead.

Detecting Hallucinations

  • Limited Manual Checking: It is impractical to manually evaluate every response at production scale.
  • Automated Detection with LLMs:
    • Use one LLM to evaluate the outputs of another ("LLM-as-a-judge"); a sketch follows this list.
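
Below is a minimal LLM-as-a-judge sketch. It assumes the `openai` Python package (v1+) and an `OPENAI_API_KEY` environment variable; the `judge_hallucination` helper, the model choice, and the template wording are illustrative assumptions, not the exact template used in the lecture.

```python
# Minimal LLM-as-a-judge sketch: one LLM labels another LLM's answer
# as "factual" or "hallucinated" given the retrieved context.
# Assumes the `openai` package (v1+) and an OPENAI_API_KEY env var;
# the template wording is illustrative, not a tuned eval prompt.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are checking a RAG answer against its context.
Context: {context}
Question: {question}
Answer: {answer}
Respond with exactly one word: "factual" if the answer is supported
by the context, or "hallucinated" if it is not."""

def judge_hallucination(context: str, question: str, answer: str) -> str:
    """Ask a judge model to label one response; returns the raw label."""
    prompt = JUDGE_TEMPLATE.format(context=context, question=question, answer=answer)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model; any capable model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # deterministic labels for eval stability
    )
    return response.choices[0].message.content.strip().lower()
```

Constraining the judge to a closed label set ("factual" / "hallucinated") keeps the output parseable and makes it easy to score against a golden dataset later.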

Evaluations (Evals)

  • Task vs Model-Based Evals:
    • Model Evals: Compare performance across different models, as in public leaderboards.
    • Task Evals: Evaluate how well a fixed model-plus-prompt setup performs your specific task, with consistent settings.
  • Examples: retrieval Q&A correctness, user frustration, toxicity, summarization quality.

Pre-tested Eval Benchmarks

  • Use QA and RAG datasets to benchmark hallucination evals before trusting them.
  • Prefer confusion matrices over simple accuracy: hallucination labels are usually imbalanced, so a judge that never flags anything can still score high accuracy (illustrated below).
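
The following sketch shows why a confusion matrix beats raw accuracy. It assumes scikit-learn; the labels are made-up benchmark data, not results from the lecture.

```python
# Why a confusion matrix beats raw accuracy for hallucination evals:
# if only ~20% of responses hallucinate, a judge that always says
# "factual" scores 80% accuracy while catching nothing.
# Assumes scikit-learn; the labels below are made-up benchmark data.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 1 = hallucinated, 0 = factual (golden labels vs. the eval LLM's labels)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print("precision:", precision_score(y_true, y_pred))  # of flagged, how many real
print("recall:   ", recall_score(y_true, y_pred))     # of real, how many caught
```

Here accuracy is 80%, yet the judge misses half the real hallucinations (recall 0.5), which is exactly what the confusion matrix exposes.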

Mitigating Hallucinations

  • Improving Retrieval and Ranking: Re-rank retrieved chunks by relevance to the query (see the sketch after this list).
  • Enhancing Chunking Strategy: Tune chunk size and boundaries so each chunk is cohesive and self-contained.
  • Prompt Engineering: Iterate on prompts to refine model responses.
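
A toy re-ranking sketch follows. The bag-of-words cosine similarity here is a stand-in for a real embedding model or cross-encoder reranker; only the ranking pattern itself reflects the lecture's point.

```python
# Toy re-ranking sketch: order retrieved chunks by similarity to the
# query so the most relevant text sits at the top of the prompt context.
# The bag-of-words cosine is a stand-in for a real embedding model or
# cross-encoder reranker; the ranking pattern is the point.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rerank(query: str, chunks: list[str]) -> list[str]:
    """Return chunks sorted most-relevant-first for prompt assembly."""
    q = Counter(query.lower().split())
    return sorted(chunks,
                  key=lambda c: cosine(q, Counter(c.lower().split())),
                  reverse=True)

chunks = [
    "Our refund policy allows returns within 30 days.",
    "The company was founded in 2015.",
    "Refunds are issued to the original payment method.",
]
print(rerank("what is the refund policy", chunks))
```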

Industry Trends and Research

  • Research on hallucinations is increasing, especially hallucinations over private data.
  • Emergence of guardrail frameworks such as NVIDIA's NeMo Guardrails.

Practical Demonstration

  • Eval Creation Process:
    1. Define the metric and build a golden dataset of human-labeled examples.
    2. Choose the evaluation (judge) LLM.
    3. Construct eval prompt templates.
    4. Benchmark the eval against the golden dataset and iterate.
  • Running Evals on Data:
    • Use confusion matrices to assess eval performance.
    • Evaluate templates on benchmark datasets first, then in production environments (see the end-to-end sketch after this list).
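
The sketch below ties the four steps together: a small golden dataset, a judge, and a confusion matrix to score the eval before it touches production. The golden examples are made up, and `judge_hallucination` is a deterministic stub standing in for the LLM-as-a-judge call sketched earlier.

```python
# End-to-end sketch of the eval creation loop from the demo outline:
# build a golden dataset, run the eval through a judge, and score it
# with a confusion matrix before trusting it in production.
from sklearn.metrics import confusion_matrix

def judge_hallucination(context: str, question: str, answer: str) -> str:
    """Stand-in for the LLM-as-a-judge call sketched earlier; a real run
    would send the eval template to a judge model instead."""
    return "factual" if answer in context else "hallucinated"

golden = [
    # (context, question, answer, label) with 1 = hallucinated
    ("Returns are accepted within 30 days.",
     "What is the return window?", "30 days.", 0),
    ("Returns are accepted within 30 days.",
     "What is the return window?", "90 days.", 1),
    ("Refunds go to the original payment method.",
     "How are refunds issued?", "Store credit only.", 1),
    ("Refunds go to the original payment method.",
     "How are refunds issued?", "the original payment method.", 0),
]

y_true = [label for *_, label in golden]
y_pred = [1 if judge_hallucination(c, q, a) == "hallucinated" else 0
          for c, q, a, _ in golden]

# Rows are true labels, columns are predicted: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))
```

Iterate on the template until precision and recall on the golden set are acceptable, then run the same template over production traces.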

Conclusion

  • Participation: The audience is encouraged to explore the notebook and give feedback.
  • Resources: Access to open-source templates and documentation.