
Evaluating Web Agents with LLMs

Apr 25, 2025

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Authors and Affiliations

  • Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano
  • Affiliated with McGill University, Mila Quebec AI Institute, Google DeepMind, Canada CIFAR AI Chair, Polytechnique Montréal, ServiceNow Research

Abstract

  • Purpose: To assess how reliably Large Language Models (LLMs) can automatically evaluate web agent trajectories.
  • Problem: Rule-based methods are hard to extend and may not recognize all successful trajectories, while human evaluations are time-consuming and costly.
  • Solution: Propose "AgentRewardBench" to benchmark LLMs in evaluating web agents.
  • Content: Contains 1302 trajectories from 4 LLM agents across 5 benchmarks, each trajectory reviewed by an expert.
  • Findings: No single LLM excels across all benchmarks, and rule-based evaluations tend to underreport success.

Introduction

  • Web Agents: Enable users to perform tasks in a web browser via natural language.
  • LLM Capability: LLMs can interact with web browsers, extending beyond a chat interface.
  • Benchmark Need: A well-designed benchmark should include realistic tasks across various websites.
  • Example Task: Find a Google Pixel phone listing and submit an offer. A hand-written rule cannot anticipate every valid listing or way of submitting the offer, so rule-based evaluation is inadequate here.

Related Works

Web Agents and Environments

  • Early methods: program-based heuristics
  • Recent focus: reinforcement learning (RL) models, language models, and multimodal models.
  • Evolution of benchmarks: from simplified to realistic environments

LLM Judges

  • LLMs can evaluate the outputs of chat models with judgments that align closely with human ratings.
  • Recent works use LLMs to assess web agent trajectories.

Trajectory Synthesis

  • Recent approaches generate trajectories without human supervision, leveraging LLM judges.

AgentRewardBench

Assessment Framework

  • Trajectory Definition: Defined as a sequence of observations and actions.
  • Annotation Setup: Expert annotators review trajectories to label success, side effects, and repetition.
  • Judge Model: Evaluates trajectories and produces judgments that can be used as reward signals for RL or for automatic evaluation (a minimal data-structure sketch follows below).
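The framework above maps naturally onto simple record types. The sketch below is illustrative only; the class and field names are assumptions, not the paper's code:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    observation: str   # e.g. an accessibility tree dump or a screenshot path
    action: str        # e.g. 'click(bid="42")' or 'fill(bid="7", "Pixel 6")'

@dataclass
class Trajectory:
    task: str                             # natural-language goal given to the agent
    steps: List[Step] = field(default_factory=list)

@dataclass
class Annotation:
    success: bool       # did the agent complete the task?
    side_effect: bool   # did it make unintended changes along the way?
    repetition: bool    # did it needlessly repeat the same actions?
```

An expert annotator produces one Annotation per Trajectory; a judge model is asked to predict the same three labels.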

Tasks and Environments

  • Five benchmarks are used: WebArena, VisualWebArena, AssistantBench, WorkArena, and WorkArena++.

Web Agents Design

  • Utilizes two commercial LLMs (GPT-4o and Claude 3.7 Sonnet) and two open-weight LLMs.
  • Agent Platform: Agents are built and executed with AgentLab and BrowserGym (a schematic interaction loop is sketched below).
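BrowserGym exposes web tasks through a gymnasium-style interface; the loop below is a schematic of that interaction. The environment id, keyword arguments, and placeholder policy are assumptions for illustration, not verified API details:

```python
import gymnasium as gym

def choose_action(obs) -> str:
    # Placeholder policy: a real agent would prompt an LLM with the current
    # observation and parse the returned action string.
    return "noop()"

# Assumed environment id and kwargs; consult the BrowserGym docs for the
# exact task registry and configuration options.
env = gym.make("browsergym/openended", headless=True)
obs, info = env.reset()
done = False
while not done:
    action = choose_action(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```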

LLM Judges for Web Tasks

Judge Implementations

  • AER and NNetNav: Existing LLM judge implementations.
  • Simplified Judge: The paper proposes a simplified judge that predicts success, side effects, and repetition directly from the trajectory (a rough call sketch follows this list).
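As a rough illustration of how such a judge could be invoked, here is a sketch assuming an OpenAI-compatible chat API; the prompt wording and JSON schema are assumptions, not the paper's implementation:

```python
import json
from openai import OpenAI  # any OpenAI-compatible client works

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a web agent.
Task: {task}
Trajectory (observations and actions): {trajectory}
Respond in JSON with boolean fields "success", "side_effect", and "repetition"."""

def judge(task: str, trajectory: str, model: str = "gpt-4o") -> dict:
    # Ask the judge model for a structured verdict on the trajectory.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, trajectory=trajectory)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```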

Evaluation

  • Precision is the primary metric, with recall and F1 reported as auxiliary scores (computed as sketched after this list).
  • Judge Performance: No judge consistently excels across all benchmarks.
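For reference, precision, recall, and F1 follow directly from comparing the judge's binary success predictions against the expert labels. This is the standard computation, not code from the paper:

```python
def precision_recall_f1(predictions: list[bool], labels: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for binary success predictions vs. expert labels."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: the judge predicts success on 3 of 4 trajectories; experts confirm 2 of those.
print(precision_recall_f1([True, True, True, False], [True, True, False, False]))
# -> (0.667, 1.0, 0.8)
```

A high-precision judge rarely rewards a failed trajectory, which is why precision is prioritized when the judgments are meant to serve as reward signals.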

Impact of Input Representation

  • Screenshots versus accessibility trees: the judge's input representation affects its performance, with the effect varying across benchmarks.

Revisiting Task Success Rate Evaluation

  • LLM judges tend to overestimate success rates, while rule-based evaluations underestimate them relative to expert annotations.

Error Analysis

  • Identifies common errors in LLM judgments, such as grounding mismatch and misleading agent reasoning.

Conclusion

  • AgentRewardBench provides insights into LLM performance and areas needing improvement.
  • Enhances the design of automatic evaluators and reward models for web agents.

Acknowledgments

  • Funding from Natural Sciences and Engineering Research Council of Canada (NSERC) and Google-Mila grant.
  • Thanks to Alexandre Lacoste, Shikhar Murty, and the McGill NLP group for discussions.