serp.fast

WebArena

Reproducible benchmark of 812 long-horizon web tasks across self-hosted realistic websites (e-commerce, forum, GitLab, CMS, maps) – the most-cited agent eval in 2024-2026.

By Nathan Kessler

Each tool is evaluated against our methodology using public docs, vendor demos, and hands-on testing.

Benchmarks are how you separate marketing claims from measured reality. Instead of trusting vendor-reported numbers, benchmarks run the same tasks against every system under a shared methodology and publish the results. For AI product builders picking an agentic extraction or search stack, a trustworthy benchmark is a strong input to the build-vs-buy decision – and a fast way to spot when a category is still too immature to rely on.

Features

JS Rendering
Structured Output
Open Source
Self-Hosted Option
Pricing: Free

Editorial assessment

WebArena is the reference benchmark for evaluating long-horizon web agents and has been widely cited in agent papers since 2023. Self-hosted Docker images for a shopping site, a forum, a GitLab instance, a CMS, and OpenStreetMap remove the flakiness of testing against the live web. The 812 tasks are written by humans and scored on functional success criteria: a run passes only if the agent reaches the intended end state or answer. The trade-off is that "realistic websites" still aren't the real web – the sites never update, anti-bot defenses are absent, and high scores don't transfer cleanly to production. WebArena-Verified (2025) re-curates the task set after the original was found to contain ambiguous and broken tasks. Pair WebArena with a live-web benchmark like Online-Mind2Web before drawing product conclusions.

How WebArena compares

WebVoyager

WebVoyager is the live-web counterpart – tasks against real production sites instead of self-hosted clones, but harder to reproduce.

Mind2Web

Mind2Web tests cross-site generalization on a curated set of real sites, complementary to WebArena's vertical-deep evaluation.

ClawBench

ClawBench runs live-web tasks with full behavioral trace capture and request interception – more honest signal than self-hosted benchmarks.

Frequently asked questions

Is WebArena free and open source?

Yes. WebArena is free and open source. You run it yourself instead of paying per request or per seat, so the only cost is the compute to host the Docker images plus whatever model API calls your agent makes during evaluation. There is no pricing page or commercial tier. It is a research benchmark rather than a hosted product, so plan for self-hosting infrastructure, not a subscription.

Can I self-host WebArena?

Yes. WebArena ships as Docker images you host yourself: a shopping site, a Reddit-style forum, a GitLab instance, a CMS, and OpenStreetMap. Self-hosting is the point. It removes the flakiness of testing agents against the live web. The sites are fully interactive and JavaScript-rendered, so agents do real navigation and form interactions rather than scraping static HTML.
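As a rough sketch of what self-hosting means for an evaluation harness, the snippet below builds a map from site name to base URL and flags any entry that is not a usable address. The site keys, the host, and the port numbers are illustrative assumptions, not values taken from WebArena's documentation; substitute whatever your own deployment exposes.

```python
from urllib.parse import urlparse

# Hypothetical site names and ports, for illustration only. Replace them
# with the addresses your own Docker deployment actually exposes.
EXAMPLE_PORTS = {
    "shopping": 7770,
    "forum": 9999,
    "gitlab": 8023,
    "cms": 7780,
    "map": 3000,
}


def build_site_urls(host: str, ports: dict[str, int]) -> dict[str, str]:
    """Return one base URL per self-hosted site, all on a single host."""
    return {name: f"http://{host}:{port}" for name, port in ports.items()}


def invalid_sites(urls: dict[str, str]) -> list[str]:
    """List sites whose URL lacks a scheme, a hostname, or a port."""
    bad = []
    for name, url in urls.items():
        parsed = urlparse(url)
        if not (parsed.scheme and parsed.hostname and parsed.port):
            bad.append(name)
    return sorted(bad)


if __name__ == "__main__":
    urls = build_site_urls("localhost", EXAMPLE_PORTS)
    for name, url in urls.items():
        print(f"{name:10s} {url}")
    print("invalid:", invalid_sites(urls))
```

Keeping the addresses in one place makes it easier to point the same agent at a fresh set of containers between runs.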

What does WebArena actually measure?

WebArena measures whether a web agent can complete 812 long-horizon tasks across five self-hosted sites, scored on the functional correctness of the outcome rather than on whether the agent's actions match a reference trace. Tasks are written by humans with concrete success criteria, so a run counts only if the agent reached the intended end state or returned the expected answer. In the original paper, the GPT-4 agent finished roughly 14 percent of tasks against about 78 percent for humans, which is why WebArena became the standard difficulty reference.
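To make the headline metric concrete, here is a minimal sketch of how a success rate could be computed from per-task outcomes. The `TaskResult` record and its fields are hypothetical and are not WebArena's own result format; the only assumption is that each task yields a single pass/fail verdict from its success check.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: int
    site: str
    success: bool  # True only if the task's success check passed


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks that passed; 0.0 when there are no results."""
    if not results:
        return 0.0
    return sum(r.success for r in results) / len(results)


def success_rate_by_site(results: list[TaskResult]) -> dict[str, float]:
    """Success rate computed separately for each site."""
    by_site: dict[str, list[TaskResult]] = {}
    for r in results:
        by_site.setdefault(r.site, []).append(r)
    return {site: success_rate(rs) for site, rs in by_site.items()}


if __name__ == "__main__":
    # Made-up outcomes, purely to exercise the functions.
    demo = [
        TaskResult(1, "shopping", True),
        TaskResult(2, "shopping", False),
        TaskResult(3, "gitlab", True),
        TaskResult(4, "forum", False),
    ]
    print(success_rate(demo))
    print(success_rate_by_site(demo))
```

For scale, a 14 percent success rate over 812 tasks works out to roughly 114 completed tasks, so a per-site breakdown helps show where an agent gains or loses ground.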

What is WebArena-Verified and should I use it instead?

WebArena-Verified, released in 2025, re-curates all 812 tasks after the original set was found to contain ambiguous instructions and broken evaluations. It repairs misaligned scoring and tightens the comparators, so reported success rates are more trustworthy. If you are publishing or comparing agent numbers now, use the verified version. The original remains useful mainly for reproducing older published baselines.

How does WebArena compare to WebVoyager or Mind2Web?

WebArena tests self-hosted, reproducible sites with functional scoring, so runs stay deterministic, but the sites never change and have no anti-bot defenses. WebVoyager runs against live websites, which is more realistic but flakier and harder to grade. Mind2Web covers a broader set of real sites for action prediction. Use WebArena for stable regression testing, then validate on a live-web benchmark before drawing production conclusions.

When should I choose WebArena over other agent benchmarks?

Choose WebArena when you need a stable, reproducible scoreboard to track an agent across iterations and compare against published numbers, because most web-agent papers since 2023 report it. It is a weak proxy for production: the sites are static and lack live defenses, so high scores do not transfer cleanly. Pair it with a live-web benchmark such as Online-Mind2Web before trusting results for real deployments.
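Tracking an agent across iterations usually comes down to comparing per-task outcomes between two runs. The sketch below assumes each run is stored as a mapping from task ID to pass/fail; that storage format and the function name `diff_runs` are assumptions made for illustration, not part of WebArena.

```python
def diff_runs(
    baseline: dict[int, bool], candidate: dict[int, bool]
) -> tuple[list[int], list[int]]:
    """Compare two runs over the tasks they share.

    Returns (regressions, fixes): task IDs that passed in the baseline but
    fail in the candidate, and task IDs that failed before but pass now.
    """
    shared = baseline.keys() & candidate.keys()
    regressions = sorted(t for t in shared if baseline[t] and not candidate[t])
    fixes = sorted(t for t in shared if not baseline[t] and candidate[t])
    return regressions, fixes


if __name__ == "__main__":
    # Made-up outcomes for two hypothetical agent versions.
    old_run = {1: True, 2: False, 3: True, 4: False}
    new_run = {1: False, 2: True, 3: True, 4: False}
    regressions, fixes = diff_runs(old_run, new_run)
    print("regressions:", regressions)
    print("fixes:", fixes)
```

Looking at the two lists side by side is often more informative than the aggregate score, since an unchanged success rate can hide equal numbers of regressions and fixes.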
