Evaluating RAG Metrics in Telecom QA
Nov 27, 2024
Evaluation of RAG Metrics for Question Answering in the Telecom Domain
Authors and Affiliation
Sujoy Roychowdhury
Sumit Soman
H G Ranjani
Neeraj Gunda
Vansh Chhabra
Sai Krishna Bala
Affiliated with Ericsson R&D, Bangalore, India
Abstract
Retrieval Augmented Generation (RAG):
Used for QA in specialized domains.
Challenges:
Evaluation of generated responses in specialized domains.
RAGAS Framework:
Used for evaluation but lacks transparency in metric derivation.
Contributions:
Modified RAGAS to provide intermediate outputs of metrics for detailed analysis.
Analysis:
Expert evaluation in the telecom domain, exploration of metrics under different conditions, and adaptation techniques.
Introduction
RAG:
Enables QA over domain-specific data by leveraging LLMs.
Enhancements to RAG:
Techniques such as tuning chunk length and the order of retrieved contexts.
Evaluation Challenges:
Metrics must capture factual accuracy, relevance, and semantic similarity.
Existing Metrics:
BLEU, ROUGE, METEOR, BERTScore, etc., have limitations in contextuality and granularity.
RAGAS Framework:
Provides metrics such as faithfulness, context relevance, and answer relevance.
Focus:
Evaluation of these metrics in telecom domain using public datasets and domain-adapted retrievers.
Research Questions
RAGAS Step-by-Step Evaluation:
How RAGAS computes each metric step by step.
Appropriateness for Telecom QA:
Evaluate RAGAS metrics.
Effect of Retriever Performance and Embeddings:
How retriever performance and the choice of embeddings affect RAGAS metrics.
Contributions
Enhanced RAGAS to provide intermediate outputs for metric computation.
Manual evaluation of intermediate outputs for the telecom domain.
Established Factual Correctness and Faithfulness as good indicators for expert evaluation.
Observed improvements in metrics with instruction fine-tuning.
Experimental Setup
Dataset:
Subset of Tele-QuAD from 3GPP Release 15 documents.
Retriever Models:
Different embedding models evaluated, including domain-pre-trained and fine-tuned.
Generator Models:
Evaluated with Mistral-7B and GPT-3.5 (a minimal evaluation sketch follows).
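As a point of reference, below is a minimal sketch of how such an evaluation could be run with the stock ragas library (API names as of roughly ragas 0.1; they may differ across versions, and the paper's modified RAGAS, which exposes intermediate outputs, is not part of the public package). The sample data is illustrative only, not taken from TeleQuAD.

```python
# Minimal RAGAS evaluation sketch (assumes ragas ~0.1.x; metric and column
# names may differ across versions). By default ragas calls an LLM and an
# embedding model under the hood (OpenAI unless configured otherwise), so
# OPENAI_API_KEY must be set.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    answer_similarity,
    answer_correctness,
)

# Illustrative telecom-style QA sample; real data would come from the
# TeleQuAD subset used in the paper.
data = {
    "question": ["What does 3GPP Release 15 specify for NR?"],
    "answer": ["Release 15 defines the first phase of the 5G NR specifications."],
    "contexts": [[
        "3GPP Release 15 contains the first set of 5G NR specifications."
    ]],
    "ground_truth": ["Release 15 provides the initial 5G NR specifications."],
}
dataset = Dataset.from_dict(data)

# Returns one score per metric, averaged over the dataset.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_relevancy,
             answer_similarity, answer_correctness],
)
print(result)
```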
Metrics Evaluated
Faithfulness:
Fraction of statements in the generated answer that are supported by the retrieved context.
Answer Relevance:
Mean cosine similarity between the original question's embedding and embeddings of questions generated from the answer.
Context Relevance:
Proportion of sentences in the retrieved context that are relevant to the question.
Answer Similarity:
Semantic similarity between the generated answer and the ground-truth answer.
Factual Correctness:
F1-score over answer statements classified by an LLM as TP, FP, or FN against the ground truth.
Answer Correctness:
Weighted sum of Factual Correctness and Answer Similarity (all six metrics are formalized in the sketch below).
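For reference, a sketch of how these metrics are defined, following the RAGAS paper; the Answer Correctness weights shown are the ragas library defaults and are an assumption here, not a value reported in this paper.

```latex
% Faithfulness: S = statements extracted from the answer;
% V \subseteq S = the statements supported by the retrieved context c(q).
\[ \mathrm{Faith} = \frac{|V|}{|S|} \]

% Answer relevance: q_i are n questions generated from the answer;
% E(\cdot) is the embedding model, sim is cosine similarity.
\[ \mathrm{AnsRel} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{sim}\big(E(q), E(q_i)\big) \]

% Context relevance: S_ext = sentences extracted from c(q) as relevant.
\[ \mathrm{ConRel} = \frac{|S_{\mathrm{ext}}|}{\text{total sentences in } c(q)} \]

% Factual correctness: answer statements classified as TP/FP/FN by an
% LLM against the ground truth, scored as an F1.
\[ \mathrm{FactCor} = \frac{\mathrm{TP}}{\mathrm{TP} + \tfrac{1}{2}(\mathrm{FP} + \mathrm{FN})} \]

% Answer correctness: weighted sum; 0.75/0.25 are ragas library
% defaults, assumed here rather than taken from the paper.
\[ \mathrm{AnsCor} = 0.75 \cdot \mathrm{FactCor} + 0.25 \cdot \mathrm{AnsSim} \]
```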
Results and Discussion
Retriever Performance:
Retrieval accuracy improved with domain-specific fine-tuning of the embedding models.
RAG Evaluation Results:
Factual Correctness and Faithfulness show the strongest concordance with expert (SME) evaluation; a sketch of how such concordance can be quantified follows below.
Issues with Some Metrics:
Answer Relevance and Context Relevance exhibit high variability and are difficult to interpret.
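To make "concordance with human evaluation" concrete: one standard way to quantify it is rank correlation between a metric's scores and SME ratings. The sketch below is illustrative, with made-up numbers, and is not the paper's exact analysis.

```python
# Illustrative concordance check between an automatic metric and SME
# ratings via Spearman rank correlation (not the paper's exact analysis).
from scipy.stats import spearmanr

# Hypothetical per-question scores: RAGAS metric vs. expert rating (1-5).
factual_correctness = [0.91, 0.45, 0.78, 0.30, 0.66]
sme_rating = [5, 2, 4, 1, 3]

rho, p_value = spearmanr(factual_correctness, sme_rating)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```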
Conclusions and Future Work
Enhanced RAGAS for better insight into metric computation.
Challenges:
Answer Similarity (AnsSim), Context Relevance (ConRel), and Answer Relevance (AnsRel) proved unreliable as standalone metrics.
Domain Adaptation:
Improves metric alignment with SME evaluations.
Further study of other RAGAS-dependent libraries is recommended.
References
Includes works on RAG, evaluation metrics, embeddings, and domain adaptation.
Appendices
Details of RAGAS metric computation and sample outputs.
Source:
https://openreview.net/pdf?id=L74piNoToX