Evaluating RAG Metrics in Telecom QA
Nov 27, 2024
Evaluation of RAG Metrics for Question Answering in the Telecom Domain
Authors and Affiliation
Sujoy Roychowdhury
Sumit Soman
H G Ranjani
Neeraj Gunda
Vansh Chhabra
Sai Krishna Bala
Affiliated with Ericsson R&D, Bangalore, India
Abstract
Retrieval Augmented Generation (RAG):
Used for QA in specialized domains.
Challenges:
Evaluation of generated responses in specialized domains.
RAGAS Framework:
Used for evaluation but lacks transparency in metric derivation.
Contributions:
Modified RAGAS to provide intermediate outputs of metrics for detailed analysis.
Analysis:
Expert evaluation in the telecom domain, exploration of metrics under different conditions, and adaptation techniques.
Introduction
RAG:
Enables QA over domain-specific data by leveraging LLMs.
Enhancements to RAG:
Techniques such as tuning chunk length and the order of retrieved contexts.
Evaluation Challenges:
Metrics must capture factual accuracy, relevance, and semantic similarity.
Existing Metrics:
BLEU, ROUGE, METEOR, BERTScore, etc., have limitations in contextuality and granularity.
RAGAS Framework:
Provides metrics such as faithfulness, context relevance, and answer relevance.
Focus:
Evaluation of these metrics in telecom domain using public datasets and domain-adapted retrievers.
Research Questions
RAGAS Step-by-Step Evaluation:
How RAGAS computes each metric step by step.
Appropriateness for Telecom QA:
Evaluate RAGAS metrics.
Effect of Retriever Performance and Embeddings:
How retriever performance and the choice of embeddings affect RAGAS metrics.
Contributions
Enhanced RAGAS to provide intermediate outputs for metric computation.
Manual evaluation of intermediate outputs for the telecom domain.
Established Factual Correctness and Faithfulness as good indicators for expert evaluation.
Observed improvements in metrics with instruction fine-tuning.
Experimental Setup
Dataset:
Subset of Tele-QuAD from 3GPP Release 15 documents.
Retriever Models:
Different embedding models evaluated, including domain-pre-trained and fine-tuned.
Generator Models:
Evaluated with Mistral-7B and GPT-3.5 (a minimal evaluation sketch follows).
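As a point of reference, below is a minimal sketch of how such an evaluation could be run with the stock ragas library (API names as of roughly ragas 0.1; they may differ across versions, and the paper's modified RAGAS, which exposes intermediate outputs, is not part of the public package). The sample data is illustrative only, not taken from TeleQuAD.

```python
# Minimal RAGAS evaluation sketch (assumes ragas ~0.1.x; metric and column
# names may differ across versions). By default ragas calls an LLM and an
# embedding model under the hood (OpenAI unless configured otherwise), so
# OPENAI_API_KEY must be set.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    answer_similarity,
    answer_correctness,
)

# Illustrative telecom-style QA sample; real data would come from the
# TeleQuAD subset used in the paper.
data = {
    "question": ["What does 3GPP Release 15 specify for NR?"],
    "answer": ["Release 15 defines the first phase of the 5G NR specifications."],
    "contexts": [[
        "3GPP Release 15 contains the first set of 5G NR specifications."
    ]],
    "ground_truth": ["Release 15 provides the initial 5G NR specifications."],
}
dataset = Dataset.from_dict(data)

# Returns one score per metric, averaged over the dataset.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_relevancy,
             answer_similarity, answer_correctness],
)
print(result)
```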
Metrics Evaluated
Faithfulness:
Fraction of statements in the generated answer that are supported by the retrieved context.
Answer Relevance:
Mean cosine similarity between the original question's embedding and embeddings of questions generated from the answer.
Context Relevance:
Proportion of sentences in the retrieved context that are relevant to the question.
Answer Similarity:
Semantic similarity between the generated answer and the ground-truth answer.
Factual Correctness:
F1-score over answer statements classified by an LLM as TP, FP, or FN against the ground truth.
Answer Correctness:
Weighted sum of Factual Correctness and Answer Similarity (all six metrics are formalized in the sketch below).
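For reference, a sketch of how these metrics are defined, following the RAGAS paper; the Answer Correctness weights shown are the ragas library defaults and are an assumption here, not a value reported in this paper.

```latex
% Faithfulness: S = statements extracted from the answer;
% V \subseteq S = the statements supported by the retrieved context c(q).
\[ \mathrm{Faith} = \frac{|V|}{|S|} \]

% Answer relevance: q_i are n questions generated from the answer;
% E(\cdot) is the embedding model, sim is cosine similarity.
\[ \mathrm{AnsRel} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{sim}\big(E(q), E(q_i)\big) \]

% Context relevance: S_ext = sentences extracted from c(q) as relevant.
\[ \mathrm{ConRel} = \frac{|S_{\mathrm{ext}}|}{\text{total sentences in } c(q)} \]

% Factual correctness: answer statements classified as TP/FP/FN by an
% LLM against the ground truth, scored as an F1.
\[ \mathrm{FactCor} = \frac{\mathrm{TP}}{\mathrm{TP} + \tfrac{1}{2}(\mathrm{FP} + \mathrm{FN})} \]

% Answer correctness: weighted sum; 0.75/0.25 are ragas library
% defaults, assumed here rather than taken from the paper.
\[ \mathrm{AnsCor} = 0.75 \cdot \mathrm{FactCor} + 0.25 \cdot \mathrm{AnsSim} \]
```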
Results and Discussion
Retriever Performance:
Retrieval accuracy improved with domain-specific fine-tuning of the embedding models.
RAG Evaluation Results:
Factual Correctness and Faithfulness show the strongest concordance with expert (SME) evaluation; a sketch of how such concordance can be quantified follows below.
Issues with Some Metrics:
Answer Relevance and Context Relevance exhibit high variability and are difficult to interpret.
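To make "concordance with human evaluation" concrete: one standard way to quantify it is rank correlation between a metric's scores and SME ratings. The sketch below is illustrative, with made-up numbers, and is not the paper's exact analysis.

```python
# Illustrative concordance check between an automatic metric and SME
# ratings via Spearman rank correlation (not the paper's exact analysis).
from scipy.stats import spearmanr

# Hypothetical per-question scores: RAGAS metric vs. expert rating (1-5).
factual_correctness = [0.91, 0.45, 0.78, 0.30, 0.66]
sme_rating = [5, 2, 4, 1, 3]

rho, p_value = spearmanr(factual_correctness, sme_rating)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```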
Conclusions and Future Work
Enhanced RAGAS for better insight into metric computation.
Challenges:
Answer Similarity (AnsSim), Context Relevance (ConRel), and Answer Relevance (AnsRel) proved unreliable as standalone metrics.
Domain Adaptation:
Improves metric alignment with SME evaluations.
Further study of other RAGAS-dependent libraries is recommended.
References
Includes works on RAG, evaluation metrics, embeddings, and domain adaptation.
Appendices
Details of RAGAS metric computation and sample outputs.
Source:
https://openreview.net/pdf?id=L74piNoToX