
Understanding Observability and Hallucinations

May 22, 2025

Lecture Notes: Observability and Hallucinations in Machine Learning

Introduction

  • Presenter: AAR, co-founder at Arize
  • Guest Speaker: Trevor, Machine Learning Solutions Engineer at Arize
    • Role involves working closely with customers on observability: data ingestion and setting up monitors and dashboards.

Overview of Lecture

  • Focus on hallucinations in machine learning models.
  • Discuss detection, measurement, and evaluation of hallucinations.
  • Practical hands-on session with a Colab notebook.

Understanding Hallucinations

  • Definition: The model outputs text that is not factually correct or logically consistent.
  • Types of Hallucinations:
    1. Public Data Hallucinations: Errors rooted in the foundation model's training data. Example: incorrectly stating historical facts.
    2. Private Data Hallucinations: Errors that occur when the model answers from private context supplied to it, such as documents retrieved in a RAG system.

Importance of Addressing Hallucinations

  • Risks:
    • Providing incorrect product details can have legal or customer service implications.
    • Pulling irrelevant information or exposing sensitive customer data.

Causes of Hallucinations

  • Poor Ordering of Retrieved Chunks:
    • Relevant information may be buried deep in the context window, so the model answers from less relevant chunks instead.

Detecting Hallucinations

  • Limited Manual Checking: It is impractical to manually evaluate every response at production scale.
  • Automated Detection with LLMs:
    • Use one LLM to evaluate the outputs of another ("LLM-as-a-judge"); a sketch follows this list.
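
Below is a minimal LLM-as-a-judge sketch. It assumes the `openai` Python package (v1+) and an `OPENAI_API_KEY` environment variable; the `judge_hallucination` helper, the model choice, and the template wording are illustrative assumptions, not the exact template used in the lecture.

```python
# Minimal LLM-as-a-judge sketch: one LLM labels another LLM's answer
# as "factual" or "hallucinated" given the retrieved context.
# Assumes the `openai` package (v1+) and an OPENAI_API_KEY env var;
# the template wording is illustrative, not a tuned eval prompt.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are checking a RAG answer against its context.
Context: {context}
Question: {question}
Answer: {answer}
Respond with exactly one word: "factual" if the answer is supported
by the context, or "hallucinated" if it is not."""

def judge_hallucination(context: str, question: str, answer: str) -> str:
    """Ask a judge model to label one response; returns the raw label."""
    prompt = JUDGE_TEMPLATE.format(context=context, question=question, answer=answer)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model; any capable model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # deterministic labels for eval stability
    )
    return response.choices[0].message.content.strip().lower()
```

Constraining the judge to a closed label set ("factual" / "hallucinated") keeps the output parseable and makes it easy to score against a golden dataset later.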

Evaluations (Evals)

  • Task vs Model-Based Evals:
    • Model Evals: Compare performance across different models, as in public leaderboards.
    • Task Evals: Evaluate how well a fixed model-plus-prompt setup performs your specific task, with consistent settings.
  • Examples: retrieval Q&A correctness, user frustration, toxicity, summarization quality.

Pre-tested Eval Benchmarks

  • Use QA and RAG datasets to benchmark hallucination evals before trusting them.
  • Prefer confusion matrices over simple accuracy: hallucination labels are usually imbalanced, so a judge that never flags anything can still score high accuracy (illustrated below).
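
The following sketch shows why a confusion matrix beats raw accuracy. It assumes scikit-learn; the labels are made-up benchmark data, not results from the lecture.

```python
# Why a confusion matrix beats raw accuracy for hallucination evals:
# if only ~20% of responses hallucinate, a judge that always says
# "factual" scores 80% accuracy while catching nothing.
# Assumes scikit-learn; the labels below are made-up benchmark data.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 1 = hallucinated, 0 = factual (golden labels vs. the eval LLM's labels)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print("precision:", precision_score(y_true, y_pred))  # of flagged, how many real
print("recall:   ", recall_score(y_true, y_pred))     # of real, how many caught
```

Here accuracy is 80%, yet the judge misses half the real hallucinations (recall 0.5), which is exactly what the confusion matrix exposes.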

Mitigating Hallucinations

  • Improving Retrieval and Ranking: Re-rank retrieved chunks by relevance to the query (see the sketch after this list).
  • Enhancing Chunking Strategy: Tune chunk size and boundaries so each chunk is cohesive and self-contained.
  • Prompt Engineering: Iterate on prompts to refine model responses.
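
A toy re-ranking sketch follows. The bag-of-words cosine similarity here is a stand-in for a real embedding model or cross-encoder reranker; only the ranking pattern itself reflects the lecture's point.

```python
# Toy re-ranking sketch: order retrieved chunks by similarity to the
# query so the most relevant text sits at the top of the prompt context.
# The bag-of-words cosine is a stand-in for a real embedding model or
# cross-encoder reranker; the ranking pattern is the point.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rerank(query: str, chunks: list[str]) -> list[str]:
    """Return chunks sorted most-relevant-first for prompt assembly."""
    q = Counter(query.lower().split())
    return sorted(chunks,
                  key=lambda c: cosine(q, Counter(c.lower().split())),
                  reverse=True)

chunks = [
    "Our refund policy allows returns within 30 days.",
    "The company was founded in 2015.",
    "Refunds are issued to the original payment method.",
]
print(rerank("what is the refund policy", chunks))
```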

Industry Trends and Research

  • Research on hallucinations is increasing, especially hallucinations over private data.
  • Emergence of guardrail frameworks such as NVIDIA's NeMo Guardrails.

Practical Demonstration

  • Eval Creation Process:
    1. Define the metric and build a golden dataset of human-labeled examples.
    2. Choose the evaluation (judge) LLM.
    3. Construct eval prompt templates.
    4. Benchmark the eval against the golden dataset and iterate.
  • Running Evals on Data:
    • Use confusion matrices to assess eval performance.
    • Evaluate templates on benchmark datasets first, then in production environments (see the end-to-end sketch after this list).
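
The sketch below ties the four steps together: a small golden dataset, a judge, and a confusion matrix to score the eval before it touches production. The golden examples are made up, and `judge_hallucination` is a deterministic stub standing in for the LLM-as-a-judge call sketched earlier.

```python
# End-to-end sketch of the eval creation loop from the demo outline:
# build a golden dataset, run the eval through a judge, and score it
# with a confusion matrix before trusting it in production.
from sklearn.metrics import confusion_matrix

def judge_hallucination(context: str, question: str, answer: str) -> str:
    """Stand-in for the LLM-as-a-judge call sketched earlier; a real run
    would send the eval template to a judge model instead."""
    return "factual" if answer in context else "hallucinated"

golden = [
    # (context, question, answer, label) with 1 = hallucinated
    ("Returns are accepted within 30 days.",
     "What is the return window?", "30 days.", 0),
    ("Returns are accepted within 30 days.",
     "What is the return window?", "90 days.", 1),
    ("Refunds go to the original payment method.",
     "How are refunds issued?", "Store credit only.", 1),
    ("Refunds go to the original payment method.",
     "How are refunds issued?", "the original payment method.", 0),
]

y_true = [label for *_, label in golden]
y_pred = [1 if judge_hallucination(c, q, a) == "hallucinated" else 0
          for c, q, a, _ in golden]

# Rows are true labels, columns are predicted: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))
```

Iterate on the template until precision and recall on the golden set are acceptable, then run the same template over production traces.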

Conclusion

  • Participation: The audience is encouraged to explore the notebook and give feedback.
  • Resources: Access to open-source templates and documentation.