Editor's pick: ClawBench
Open-source benchmark that evaluates AI browser agents on 153 everyday tasks across 144 live websites, with request interception and full behavioral trace capture.
Free
Editor's pick: WebArena
Reproducible benchmark of 812 long-horizon web tasks across self-hosted realistic websites (e-commerce, forum, GitLab, CMS, maps) – one of the most-cited agent evals in 2024-2026.
Free
Editor's pick: Mind2Web
Generalist web agent benchmark with 2,350 tasks across 137 real websites in 31 domains – measures cross-site, cross-domain transfer rather than single-site mastery.
Free
Editor's pick: OSWorld
Computer-use benchmark with 369 real tasks in an environment that supports Ubuntu, Windows, and macOS – the reference eval for agents that act on full operating systems, not just browsers.
Free
Editor's pick: WebVoyager
Live-web benchmark of 643 tasks across 15 real websites (Allrecipes, Amazon, Apple, ArXiv, BBC News, GitHub, Google Search, Google Maps, Google Flights, and others) for end-to-end multimodal web agents.
Free