AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Authors and Affiliations
Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano
Affiliated with McGill University, Mila Quebec AI Institute, Google DeepMind, Canada CIFAR AI Chair, Polytechnique Montréal, ServiceNow Research
Abstract
Purpose: To study how well Large Language Models (LLMs) can serve as automatic evaluators of web agent trajectories.
Problem: Rule-based methods are hard to extend and may not recognize all successful trajectories, while human evaluations are time-consuming and costly.
Solution: Propose "AgentRewardBench" to benchmark LLMs in evaluating web agents.
Content: 1,302 trajectories collected across 5 benchmarks and 4 LLM agents; every trajectory is reviewed by an expert.
Findings: No single LLM excels across all benchmarks, and rule-based evaluations tend to underreport success.
Introduction
Web Agents: Enable users to perform tasks in web browsers via natural language.
LLM Capability: LLMs can interact with web browsers, extending beyond a chat interface.
Benchmark Need: A well-designed benchmark should include realistic tasks across various websites.
Example Task: Find a Google Pixel phone listing and submit an offer. Many different trajectories can legitimately complete this task, so a fixed rule cannot enumerate every valid end state; rule-based methods are inadequate here (see the sketch below).
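To see why, here is a hedged sketch of a typical rule-based check for this task; the URLs, strings, and function name are hypothetical illustrations, not from any benchmark.

```python
# Hypothetical rule-based evaluator for the "submit an offer" task.
# Exact-match rules like these miss trajectories that succeed through a
# different but valid final state.

def rule_based_success(final_url: str, page_text: str) -> bool:
    # Passes only if the agent lands on one specific confirmation URL and
    # the page contains one specific phrase.
    return (
        final_url == "https://marketplace.example/offers/confirmation"
        and "Your offer has been submitted" in page_text
    )

# An agent that submits the offer via the seller's listing page (a valid
# alternative flow) ends on a different URL, so the rule reports failure
# even though the task was completed.
print(rule_based_success(
    "https://marketplace.example/listings/google-pixel?offer=sent",
    "Offer sent to seller",
))  # False, despite a successful trajectory
```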
Related Works
Web Agents and Environments
Early methods: program-based heuristics.
Recent focus: reinforcement learning (RL) models, language models, and multimodal models.
Evolution of benchmarks: from simplified to realistic web environments.
LLM Judges
LLMs can evaluate the outputs of chat models with verdicts that align closely with human judgments.
Recent works use LLMs to assess web agent trajectories.
Trajectory Synthesis
Recent approaches generate trajectories without human supervision, leveraging LLM judges.
AgentRewardBench
Assessment Framework
Trajectory Definition: Defined as a sequence of observations and actions.
Annotation Setup: Expert annotators review trajectories to label success, side effects, and repetition.
Judge Model: Evaluates a trajectory and outputs judgments that can be used for automatic evaluation or as reward signals for RL; illustrative data structures are sketched below.
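To make the framework concrete, the sketch below gives minimal Python data structures for a trajectory and a judgment; all field names are assumptions for illustration, not the paper's exact schema.

```python
# Illustrative data structures for the assessment framework described above.
from dataclasses import dataclass

@dataclass
class Step:
    observation: str   # e.g., a screenshot path or an accessibility tree
    action: str        # e.g., "click('buy-button')"

@dataclass
class Trajectory:
    goal: str          # the natural-language task
    steps: list[Step]  # ordered sequence of observations and actions

@dataclass
class Judgment:
    success: bool      # did the agent complete the task?
    side_effect: bool  # did it cause unintended changes along the way?
    repetition: bool   # did it loop over the same actions?
```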
Tasks and Environments
Five benchmarks are used: WebArena, VisualWebArena, AssistantBench, WorkArena, and WorkArena++.
Web Agents Design
Agents: Built with two commercial LLMs (GPT-4o and Claude 3.7 Sonnet) and two open-weight LLMs.
Agents Platform: Agents are designed with AgentLab and executed in BrowserGym; a minimal loop is sketched below.
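A minimal sketch of how an agent might run in BrowserGym's gym-style interface, following its README; the start URL, step cap, and query_llm helper are illustrative assumptions, not the paper's actual AgentLab configuration.

```python
import gymnasium as gym
import browsergym.core  # noqa: F401  (registers the browsergym environments)

def query_llm(goal: str, obs) -> str:
    """Hypothetical placeholder: prompt an LLM with the goal and current
    observation, returning the next action as a string."""
    return "noop()"  # stand-in; a real agent returns e.g. "click('a42')"

env = gym.make("browsergym/openended",
               task_kwargs={"start_url": "https://example.com"})
obs, info = env.reset()
for _ in range(10):  # cap episode length for this sketch
    action = query_llm("Find a Google Pixel listing and submit an offer", obs)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```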
LLM Judges for Web Tasks
Judge Implementations
AER and NNetNav: Existing LLM judge implementations.
Simplified Judge: A newly proposed judge that predicts success, side effects, and repetition in a single pass (sketched below).
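A hedged sketch of what such a single-call judge could look like: one prompt requesting three boolean verdicts, parsed from JSON. The prompt wording and output format are assumptions, not the paper's exact implementation.

```python
import json

# Hypothetical prompt template for a single-call judge.
JUDGE_PROMPT = """You are evaluating a web agent trajectory.
Goal: {goal}
Trajectory: {trajectory}
Answer in JSON with boolean fields: success, side_effect, repetition."""

def parse_judgment(llm_output: str) -> dict:
    """Parse the judge's JSON reply into the three binary verdicts."""
    verdict = json.loads(llm_output)
    return {k: bool(verdict[k]) for k in ("success", "side_effect", "repetition")}

print(parse_judgment('{"success": true, "side_effect": false, "repetition": false}'))
```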
Evaluation
Precision is the primary metric, with recall and F1 as auxiliary scores: a judge whose positive calls feed downstream filtering or reward signals must keep false positives low (see the worked example below).
Judge Performance: No judge consistently excels across all benchmarks.
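A worked example of how these metrics are computed against expert labels; the numbers are illustrative, not results from the paper.

```python
# Treat expert annotations as ground truth; precision measures how often
# the judge's "success" calls are actually correct.
expert = [1, 1, 0, 0, 1, 0, 0, 0]   # 1 = expert says the trajectory succeeded
judge  = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = judge says it succeeded

tp = sum(e == 1 and j == 1 for e, j in zip(expert, judge))   # 2 true positives
fp = sum(e == 0 and j == 1 for e, j in zip(expert, judge))   # 2 false positives
fn = sum(e == 1 and j == 0 for e, j in zip(expert, judge))   # 1 false negative

precision = tp / (tp + fp)                          # 0.50
recall = tp / (tp + fn)                             # 0.67
f1 = 2 * precision * recall / (precision + recall)  # 0.57
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```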
Impact of Input Representation
Screenshots versus accessibility trees: the input representation given to the judge has a measurable effect on its performance.
Revisiting Task Success Rate Evaluation
LLM judges often overestimate success rates, whereas rule-based evaluations underestimate them.
Error Analysis
Identifies common errors in LLM judgments, such as grounding mismatch and misleading agent reasoning.
Conclusion
AgentRewardBench shows where current LLM judges perform well and where they fall short, informing the design of better automatic evaluators and reward models for web agents.
Acknowledgments
Funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and a Google-Mila grant.
Thanks to Alexandre Lacoste, Shikhar Murty, and the McGill NLP group for discussions.