AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Authors and Affiliations
Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano
Affiliated with McGill University, Mila Quebec AI Institute, Google DeepMind, Canada CIFAR AI Chair, Polytechnique Montréal, ServiceNow Research
Abstract
Purpose: To study how well Large Language Models (LLMs) can serve as automatic evaluators of web agent trajectories.
Problem: Rule-based methods are hard to extend and may not recognize all successful trajectories, while human evaluations are time-consuming and costly.
Solution: Propose "AgentRewardBench" to benchmark LLMs in evaluating web agents.
Content: 1,302 trajectories collected across 5 benchmarks and 4 LLM agents; every trajectory is reviewed by an expert.
Findings: No single LLM excels across all benchmarks, and rule-based evaluations tend to underreport success.
Introduction
Web Agents: Enable users to perform tasks in web browsers via natural language.
LLM Capability: LLMs can interact with web browsers, extending beyond a chat interface.
Benchmark Need: A well-designed benchmark should include realistic tasks across various websites.
Example Task: Find a Google Pixel phone listing and submit an offer. Many different trajectories can legitimately complete this task, so a fixed rule cannot enumerate every valid end state; rule-based methods are inadequate here (see the sketch below).
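To see why, here is a hedged sketch of a typical rule-based check for this task; the URLs, strings, and function name are hypothetical illustrations, not from any benchmark.

```python
# Hypothetical rule-based evaluator for the "submit an offer" task.
# Exact-match rules like these miss trajectories that succeed through a
# different but valid final state.

def rule_based_success(final_url: str, page_text: str) -> bool:
    # Passes only if the agent lands on one specific confirmation URL and
    # the page contains one specific phrase.
    return (
        final_url == "https://marketplace.example/offers/confirmation"
        and "Your offer has been submitted" in page_text
    )

# An agent that submits the offer via the seller's listing page (a valid
# alternative flow) ends on a different URL, so the rule reports failure
# even though the task was completed.
print(rule_based_success(
    "https://marketplace.example/listings/google-pixel?offer=sent",
    "Offer sent to seller",
))  # False, despite a successful trajectory
```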
Related Works
Web Agents and Environments
Early methods: program-based heuristics.
Recent focus: reinforcement learning (RL) models, language models, and multimodal models.
Evolution of benchmarks: from simplified to realistic web environments.
LLM Judges
LLMs can evaluate the outputs of chat models with verdicts that align closely with human judgments.
Recent works use LLMs to assess web agent trajectories.
Trajectory Synthesis
Recent approaches generate trajectories without human supervision, leveraging LLM judges.
AgentRewardBench
Assessment Framework
Trajectory Definition: Defined as a sequence of observations and actions.
Annotation Setup: Expert annotators review trajectories to label success, side effects, and repetition.
Judge Model: Evaluates a trajectory and outputs judgments that can be used for automatic evaluation or as reward signals for RL; illustrative data structures are sketched below.
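To make the framework concrete, the sketch below gives minimal Python data structures for a trajectory and a judgment; all field names are assumptions for illustration, not the paper's exact schema.

```python
# Illustrative data structures for the assessment framework described above.
from dataclasses import dataclass

@dataclass
class Step:
    observation: str   # e.g., a screenshot path or an accessibility tree
    action: str        # e.g., "click('buy-button')"

@dataclass
class Trajectory:
    goal: str          # the natural-language task
    steps: list[Step]  # ordered sequence of observations and actions

@dataclass
class Judgment:
    success: bool      # did the agent complete the task?
    side_effect: bool  # did it cause unintended changes along the way?
    repetition: bool   # did it loop over the same actions?
```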
Tasks and Environments
Five benchmarks are used: WebArena, VisualWebArena, AssistantBench, WorkArena, and WorkArena++.
Web Agents Design
Agents: Built with two commercial LLMs (GPT-4o and Claude 3.7 Sonnet) and two open-weight LLMs.
Agents Platform: Agents are designed with AgentLab and executed in BrowserGym; a minimal loop is sketched below.
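A minimal sketch of how an agent might run in BrowserGym's gym-style interface, following its README; the start URL, step cap, and query_llm helper are illustrative assumptions, not the paper's actual AgentLab configuration.

```python
import gymnasium as gym
import browsergym.core  # noqa: F401  (registers the browsergym environments)

def query_llm(goal: str, obs) -> str:
    """Hypothetical placeholder: prompt an LLM with the goal and current
    observation, returning the next action as a string."""
    return "noop()"  # stand-in; a real agent returns e.g. "click('a42')"

env = gym.make("browsergym/openended",
               task_kwargs={"start_url": "https://example.com"})
obs, info = env.reset()
for _ in range(10):  # cap episode length for this sketch
    action = query_llm("Find a Google Pixel listing and submit an offer", obs)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```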
LLM Judges for Web Tasks
Judge Implementations
AER and NNetNav: Existing LLM judge implementations.
Simplified Judge: A newly proposed judge that predicts success, side effects, and repetition in a single pass (sketched below).
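A hedged sketch of what such a single-call judge could look like: one prompt requesting three boolean verdicts, parsed from JSON. The prompt wording and output format are assumptions, not the paper's exact implementation.

```python
import json

# Hypothetical prompt template for a single-call judge.
JUDGE_PROMPT = """You are evaluating a web agent trajectory.
Goal: {goal}
Trajectory: {trajectory}
Answer in JSON with boolean fields: success, side_effect, repetition."""

def parse_judgment(llm_output: str) -> dict:
    """Parse the judge's JSON reply into the three binary verdicts."""
    verdict = json.loads(llm_output)
    return {k: bool(verdict[k]) for k in ("success", "side_effect", "repetition")}

print(parse_judgment('{"success": true, "side_effect": false, "repetition": false}'))
```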
Evaluation
Precision is the primary metric, with recall and F1 as auxiliary scores: a judge whose positive calls feed downstream filtering or reward signals must keep false positives low (see the worked example below).
Judge Performance: No judge consistently excels across all benchmarks.
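A worked example of how these metrics are computed against expert labels; the numbers are illustrative, not results from the paper.

```python
# Treat expert annotations as ground truth; precision measures how often
# the judge's "success" calls are actually correct.
expert = [1, 1, 0, 0, 1, 0, 0, 0]   # 1 = expert says the trajectory succeeded
judge  = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = judge says it succeeded

tp = sum(e == 1 and j == 1 for e, j in zip(expert, judge))   # 2 true positives
fp = sum(e == 0 and j == 1 for e, j in zip(expert, judge))   # 2 false positives
fn = sum(e == 1 and j == 0 for e, j in zip(expert, judge))   # 1 false negative

precision = tp / (tp + fp)                          # 0.50
recall = tp / (tp + fn)                             # 0.67
f1 = 2 * precision * recall / (precision + recall)  # 0.57
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```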
Impact of Input Representation
Screenshots versus accessibility trees: the input representation given to the judge has a measurable effect on its performance.
Revisiting Task Success Rate Evaluation
LLM judges often overestimate success rates, whereas rule-based evaluations underestimate them.
Error Analysis
Identifies common errors in LLM judgments, such as grounding mismatch and misleading agent reasoning.
Conclusion
AgentRewardBench shows where current LLM judges perform well and where they fall short, informing the design of better automatic evaluators and reward models for web agents.
Acknowledgments
Funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and a Google-Mila grant.
Thanks to Alexandre Lacoste, Shikhar Murty, and the McGill NLP group for discussions.