serp.fast

Mind2Web

Generalist web agent benchmark with 2,350 tasks across 137 real websites in 31 domains – measures cross-site, cross-domain transfer rather than single-site mastery.

By Nathan Kessler

Each tool is evaluated against our methodology using public docs, vendor demos, and hands-on testing.

Benchmarks are how you separate marketing claims from measured reality. Instead of trusting vendor-reported numbers, benchmarks run the same tasks against every system under a shared methodology and publish the results. For AI product builders picking an agentic extraction or search stack, a trustworthy benchmark is a strong input to the build-vs-buy decision – and a fast way to spot when a category is still too immature to rely on.

Features

JS Rendering
Structured Output
Open Source
Self-Hosted Option
Pricing: Free

Editorial assessment

The benchmark to use when you care whether your agent generalizes across the web, not just whether it has mastered a fixed set of sites. The original Mind2Web uses snapshotted DOMs (reproducible but stale); Online-Mind2Web (2025) runs a curated set of tasks against the live web and is the stronger choice for measuring real transfer. Frontier models reach roughly 50% on Online-Mind2Web vs. 97%+ on saturated benchmarks like WebVoyager – the gap is the signal. For AI builders shipping general-purpose web agents, this is the most honest benchmark available. For domain-specific agents (e-commerce, GitHub, calendaring), pair it with a vertical benchmark such as WebArena's task subsets.

How Mind2Web compares

WebArena

WebArena is more reproducible but tests vertical depth on a fixed set of self-hosted sites – complementary to Mind2Web's breadth.

WebVoyager

WebVoyager is an earlier live-web benchmark that is now largely saturated; Online-Mind2Web has largely succeeded it as the harder, less-saturated option.

ClawBench

ClawBench overlaps on live-web evaluation and adds full behavioral trace capture for failure analysis.

Frequently asked questions

What is Mind2Web?

Mind2Web is a benchmark for evaluating generalist web agents. It contains 2,350 crowdsourced tasks spanning 137 real websites across 31 domains, with recorded action sequences for each task. Unlike single-site benchmarks, it measures whether an agent transfers to sites and domains it was not trained on. It came out of the OSU NLP Group and was presented at NeurIPS 2023.
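As a rough illustration of how recorded action sequences can be scored, the sketch below compares an agent's predicted steps against a reference sequence. The `Action` fields, the function names, and the sample data are assumptions made for illustration; this is not the official Mind2Web harness or dataset schema. The scoring rule is a simplified version of step-level and task-level success: a step counts only if the element, operation, and value all match, and a task counts only if every step does.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Action:
    """One agent step. Field names are hypothetical, not the dataset schema."""
    element_id: str   # identifier of the target page element
    operation: str    # e.g. "CLICK", "TYPE", "SELECT"
    value: str = ""   # typed text or chosen option; empty for clicks


def score_task(predicted: List[Action], reference: List[Action]) -> Tuple[float, bool]:
    """Return (step success rate, task success) for a single task."""
    if not reference:
        raise ValueError("reference sequence must contain at least one step")
    # Compare steps position by position; dataclass equality checks all fields.
    matches = [p == r for p, r in zip(predicted, reference)]
    # Reference steps the agent never produced count as failures.
    step_success_rate = sum(matches) / len(reference)
    # The task succeeds only if lengths agree and every step matched.
    task_success = len(predicted) == len(reference) and all(matches)
    return step_success_rate, task_success


# Hypothetical example: the agent gets the first two steps right,
# then picks the wrong option in the final step.
reference = [
    Action("search_box", "TYPE", "new york"),
    Action("search_button", "CLICK"),
    Action("sort_menu", "SELECT", "Price"),
]
predicted = [
    Action("search_box", "TYPE", "new york"),
    Action("search_button", "CLICK"),
    Action("sort_menu", "SELECT", "Rating"),
]

step_sr, task_ok = score_task(predicted, reference)
print(f"step success rate: {step_sr:.2f}, task success: {task_ok}")
```

Scoring at both levels is useful because an agent can get most individual steps right and still fail nearly every task, which is the pattern that separates partial competence from end-to-end reliability.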

Is Mind2Web free and open source?

Yes. Mind2Web is free and open source. The dataset and code are published on GitHub and Hugging Face; the dataset is released under a Creative Commons Attribution 4.0 (CC BY 4.0) license and the code under the MIT license, so you can use and modify both with attribution. There is no paid tier or commercial plan. As an academic benchmark rather than a hosted product, the only cost is the compute you spend running your own agent against the tasks.

What is the difference between Mind2Web and Online-Mind2Web?

The original Mind2Web evaluates against snapshotted DOMs, which is reproducible but goes stale as sites change. Online-Mind2Web, released in 2025, runs a curated set of tasks against the live web, so it tests real interaction with dynamic pages. It is harder than the static version and gives more honest transfer numbers. Use Online-Mind2Web when you want current results on real websites.

How does Mind2Web compare to WebVoyager?

WebVoyager is largely saturated. Frontier models clear 97% or more on it, so it no longer separates strong agents from weak ones. Mind2Web, especially the Online-Mind2Web variant, is far from saturated. Frontier models land around 50% on the live version. That gap is the point. If you want a benchmark that still shows daylight between general-purpose web agents, Mind2Web is the more discriminating choice.

When should I use Mind2Web versus WebArena or ClawBench?

Use Mind2Web when you ship a general-purpose web agent and care whether it generalizes across unfamiliar sites and domains. For domain-specific agents, pair it with WebArena's task subsets, which run in controlled sandboxed environments. ClawBench overlaps with Mind2Web on live-web evaluation and adds behavioral trace capture, which helps when you need failure analysis. Mind2Web is the breadth-of-transfer test, while WebArena probes depth within a fixed environment.
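One way to check breadth of transfer in your own runs is to break results down by domain instead of reporting a single pooled number. The sketch below is a minimal example under stated assumptions: the domain labels, the `(domain, success)` result format, and the function names are hypothetical and not part of any Mind2Web tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_by_group(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Success rate per group (e.g. per domain or per website)."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for group, ok in results:
        totals[group] += 1
        wins[group] += int(ok)
    return {group: wins[group] / totals[group] for group in totals}


def macro_average(rates: Dict[str, float]) -> float:
    """Unweighted mean over groups, so small domains count as much as large ones."""
    return sum(rates.values()) / len(rates)


# Hypothetical per-task outcomes: (domain label, task succeeded?)
results = [
    ("shopping", True), ("shopping", True), ("shopping", True), ("shopping", True),
    ("travel", True), ("travel", False),
    ("info", False), ("info", False),
]

per_domain = success_by_group(results)
pooled = sum(ok for _, ok in results) / len(results)  # micro average over all tasks
macro = macro_average(per_domain)

print(per_domain)
print(f"pooled: {pooled:.3f}, macro: {macro:.3f}")
```

In this made-up data the pooled rate looks better than the macro average because one well-handled domain contributes most of the tasks. A wide spread between domains is a sign that the agent has depth in some verticals but does not yet generalize.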

Can Mind2Web be self-hosted?

Mind2Web is not a hosted service you self-deploy. It is a dataset plus evaluation code that you download and run in your own environment, so the data and harness already live wherever you put them. The original tasks run against cached page snapshots, while Online-Mind2Web requires live internet access to reach the real websites it tests. There is no managed cloud version to opt out of.
