
Mind2Web

Generalist web agent benchmark with 2,350 tasks across 137 real websites in 31 domains – measures cross-site, cross-domain transfer rather than single-site mastery.
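To get a feel for the task distribution before committing to an evaluation run, the dataset itself is public. A minimal sketch in Python, assuming the `osunlp/Mind2Web` release on Hugging Face and the field names shown below (the field names are an assumption here; verify them against the dataset card):

```python
# Minimal sketch: inspect Mind2Web tasks from the public Hugging Face release.
# Assumes the `osunlp/Mind2Web` dataset and the field names below; names may
# differ between releases, so check the dataset card before relying on them.
from datasets import load_dataset

ds = load_dataset("osunlp/Mind2Web", split="train")

task = ds[0]
print(task["website"])         # site the task was recorded on
print(task["domain"])          # top-level domain category (Travel, Shopping, ...)
print(task["confirmed_task"])  # natural-language goal the agent must complete
print(len(task["actions"]))    # length of the annotated gold action sequence
```

Each record pairs a natural-language goal with a gold action sequence recorded on a real site; the paper's held-out splits (unseen tasks, unseen websites, unseen domains) are what let the benchmark measure transfer rather than memorization.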

Maintained by Nathan Kessler

Benchmarks are how you separate marketing claims from measured reality. Instead of trusting vendor-reported numbers, benchmarks run the same tasks against every system under a shared methodology and publish the results. For AI product builders picking an agentic extraction or search stack, a trustworthy benchmark is the single best input to the build-vs-buy decision – and a fast way to spot when a category is still too immature to rely on.

Features

JS Rendering
Structured Output
Open Source
Self-Hosted Option
Pricing: Free

Editorial assessment

The benchmark to use when you care whether your agent generalizes across the web, not just whether it can solve a fixed set of pre-recorded tasks. The original Mind2Web uses snapshotted DOMs (reproducible but stale); Online-Mind2Web (2025) runs the same tasks against the live web and is the current state of the art for measuring real transfer. Frontier models clear ~50% on Online-Mind2Web vs. 97%+ on saturated benchmarks like WebVoyager – the gap is the signal. For AI builders shipping general-purpose web agents, this is the most honest benchmark available. For domain-specific agents (e-commerce, GitHub, calendaring), pair it with a vertical benchmark like WebArena's task subsets.
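To make "solving a task" concrete: Mind2Web-style scoring checks each predicted step against the annotated gold action, and a task only counts as solved when every step matches. A simplified sketch of that logic; the `Action` type and exact-match rule are illustrative simplifications, not the official evaluation harness (which also scores element ranking and typed values):

```python
# Simplified sketch of Mind2Web-style step/task scoring.
# Illustrative only: the official harness handles element candidate ranking,
# operation values (e.g. the text typed into a field), and per-step metrics.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    element_id: str   # id of the target DOM element
    operation: str    # CLICK, TYPE, or SELECT

def step_success(pred: Action, gold: Action) -> bool:
    # A step counts only if both the element and the operation match.
    return pred.element_id == gold.element_id and pred.operation == gold.operation

def task_success(pred_seq: list[Action], gold_seq: list[Action]) -> bool:
    # Task-level success requires every step in the gold sequence to match.
    return len(pred_seq) == len(gold_seq) and all(
        step_success(p, g) for p, g in zip(pred_seq, gold_seq)
    )

# Example: one wrong operation on the last step fails the whole task.
gold = [Action("node-14", "CLICK"), Action("node-72", "TYPE")]
pred = [Action("node-14", "CLICK"), Action("node-72", "CLICK")]
steps = sum(step_success(p, g) for p, g in zip(pred, gold))
print(f"step success: {steps}/{len(gold)}, task solved: {task_success(pred, gold)}")
```

The all-or-nothing task metric is why step-level and task-level numbers diverge so sharply on long action sequences, and why a ~50% task success rate on Online-Mind2Web is a much harder bar than it sounds.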

How Mind2Web compares

WebArena

WebArena is more reproducible but tests vertical depth on a fixed set of self-hosted sites – complementary to Mind2Web's breadth.

WebVoyager

WebVoyager is the older live-web benchmark; Mind2Web has largely superseded it as the harder, less-saturated option.

ClawBench

ClawBench overlaps on live-web evaluation and adds full behavioral trace capture for failure analysis.
