Jun 4, 2025
Manual Evaluation
Automated Tools and Libraries
Evaluation Harness
Optimum Benchmark
Setup Instructions
Task List and Selection
HellaSwag (Commonsense NLI Dataset)
TruthfulQA
General Steps
Analysis and Results
Full transcript