
Evaluating RAG Metrics in Telecom QA

Nov 27, 2024

Evaluation of RAG Metrics for Question Answering in the Telecom Domain

Authors and Affiliation

  • Sujoy Roychowdhury
  • Sumit Soman
  • H G Ranjani
  • Neeraj Gunda
  • Vansh Chhabra
  • Sai Krishna Bala
  • Affiliated with Ericsson R&D, Bangalore, India

Abstract

  • Retrieval Augmented Generation (RAG): Used for QA in specialized domains.
  • Challenges: Evaluation of generated responses in specialized domains.
  • RAGAS Framework: Used for evaluation but lacks transparency in metric derivation.
  • Contributions: Modified RAGAS to provide intermediate outputs of metrics for detailed analysis.
  • Analysis: Expert evaluation in the telecom domain, exploration of metrics under different conditions, and adaptation techniques.

Introduction

  • RAG: Enables QA from specific domains leveraging LLMs.
  • Enhancements to RAG: Techniques such as tuning chunk length and the order of retrieved context.
  • Evaluation Challenges: Need metrics considering factualness, relevance, semantic similarity.
  • Existing Metrics: BLEU, ROUGE, METEOR, BERTScore, etc., have limitations in contextuality and granularity.
  • RAGAS Framework: Provides metrics such as faithfulness, context relevance, and answer relevance (a minimal usage sketch follows this list).
  • Focus: Evaluation of these metrics in the telecom domain using public datasets and domain-adapted retrievers.
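
For readers unfamiliar with RAGAS, below is a minimal sketch of scoring a single RAG output, assuming a ragas release around 0.1.x (metric objects and expected column names differ across versions). The sample question, answer, and contexts are invented for illustration and are not from the paper's dataset.

```python
# Minimal RAGAS scoring sketch (assumes ragas ~0.1.x; metric and column
# names vary across releases). RAGAS calls an LLM and an embedding backend
# (OpenAI by default), so API credentials must be configured beforehand.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_similarity,
    answer_correctness,
)

# One illustrative row: question, generated answer, retrieved contexts, reference answer.
rows = {
    "question": ["What does the RRC layer do?"],
    "answer": ["RRC handles connection establishment and radio resource configuration."],
    "contexts": [["The RRC protocol manages connection establishment, "
                  "reconfiguration and release of radio resources."]],
    "ground_truth": ["RRC manages establishment, configuration and release of radio resources."],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, answer_similarity, answer_correctness],
)
print(result)  # per-metric scores in [0, 1]
```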

Research Questions

  1. RAGAS Step-by-Step Evaluation: How does RAGAS compute its metrics, step by step?
  2. Appropriateness for Telecom QA: Are the RAGAS metrics appropriate for question answering in the telecom domain?
  3. Effect of Retriever Performance and Embeddings: How do retriever performance and the choice of embeddings affect the RAGAS metrics?

Contributions

  1. Enhanced RAGAS to provide intermediate outputs for metric computation.
  2. Manual evaluation of intermediate outputs for telecom domain.
  3. Established Factual Correctness and Faithfulness as good indicators for expert evaluation.
  4. Observed improvements in metrics with instruction fine-tuning.

Experimental Setup

  • Dataset: Subset of Tele-QuAD from 3GPP Release 15 documents.
  • Retriever Models: Different embedding models evaluated, including domain-pre-trained and fine-tuned variants (a retrieval sketch follows this list).
  • Generator Models: Evaluated with Mistral-7B and GPT-3.5.
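
To make the retriever setup concrete, the sketch below shows a generic dense-retrieval loop: chunks and a query are embedded, ranked by cosine similarity, and top-k retrieval accuracy is checked against a gold chunk. The sentence-transformers model name and the toy chunks are illustrative stand-ins, not the domain-pre-trained or fine-tuned models evaluated in the paper.

```python
# Illustrative dense-retrieval loop: embed 3GPP-style chunks and a query,
# rank by cosine similarity, and check top-k retrieval accuracy.
# "all-MiniLM-L6-v2" is a generic stand-in for the paper's embedding models.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The RRC protocol manages establishment and release of radio resources.",
    "The PDCP sublayer performs header compression and ciphering.",
    "NAS signalling handles registration and session management.",
]
query = "Which layer performs header compression?"
gold_chunk_idx = 1  # index of the chunk that answers the query

chunk_emb = model.encode(chunks, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)[0]

scores = chunk_emb @ query_emb           # cosine similarity (embeddings are normalized)
top_k = np.argsort(-scores)[:2]          # indices of the 2 highest-scoring chunks
hit_at_k = int(gold_chunk_idx in top_k)  # 1 if the gold chunk was retrieved
print(top_k, hit_at_k)
```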

Metrics Evaluated

  • Faithfulness: Fraction of statements in the generated answer that are supported by the retrieved context.
  • Answer Relevance: Cosine similarity between the embedding of the original question and embeddings of questions generated back from the answer.
  • Context Relevance: Proportion of the retrieved context that is needed to answer the question.
  • Answer Similarity: Embedding similarity between the generated answer and the ground-truth answer.
  • Factual Correctness: F1 score over answer statements that an LLM classifies as TP, FP, or FN against the ground truth.
  • Answer Correctness: Weighted sum of Factual Correctness and Answer Similarity (a worked sketch follows this list).
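
As a back-of-the-envelope illustration of how these scores combine, the sketch below computes faithfulness as a supported-statement ratio, Factual Correctness as an F1 over TP/FP/FN counts, Answer Similarity as a cosine similarity, and Answer Correctness as their weighted sum. The statement counts, random embeddings, and the 0.75/0.25 weighting are assumptions for illustration; in RAGAS the statement extraction and classification are performed by an LLM.

```python
# Back-of-the-envelope sketch of how the RAGAS-style scores combine.
# Statement counts and the 0.75/0.25 weighting are illustrative assumptions.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Faithfulness: fraction of answer statements supported by the retrieved context.
supported_statements, total_statements = 3, 4
faithfulness_score = supported_statements / total_statements     # 0.75

# Factual Correctness: F1 over statements labelled TP/FP/FN by the judging LLM.
tp, fp, fn = 3, 1, 1
factual_correctness = 2 * tp / (2 * tp + fp + fn)                # 0.75

# Answer Similarity: cosine similarity between answer and ground-truth embeddings
# (random vectors stand in for real sentence embeddings here).
rng = np.random.default_rng(0)
answer_emb, ground_truth_emb = rng.normal(size=384), rng.normal(size=384)
answer_similarity = cosine(answer_emb, ground_truth_emb)

# Answer Correctness: weighted sum of the two (weights are configurable).
w_factual, w_similarity = 0.75, 0.25
answer_correctness = w_factual * factual_correctness + w_similarity * answer_similarity
print(round(faithfulness_score, 3), round(factual_correctness, 3), round(answer_correctness, 3))
```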

Results and Discussion

  • Retriever Performance: Accuracies improved with fine-tuning.
  • RAG Evaluation Results: Show better concordance with human evaluation for Factual Correctness and Faithfulness.
  • Issues with Some Metrics: Variability and interpretability issues with Answer Relevance and Context Relevance.

Conclusions and Future Work

  • Enhanced RAGAS for better insight into metric computation.
  • Challenges: Answer Similarity (AnsSim), Context Relevance (ConRel), and Answer Relevance (AnsRel) are less reliable as standalone metrics.
  • Domain Adaptation: Improves metric alignment with SME evaluations.
  • Further study of other RAGAS-dependent libraries is recommended.

References

  • Includes works on RAG, evaluation metrics, embeddings, and domain adaptation.

Appendices

  • Details of RAGAS metric computation and sample outputs.