WebArena
Benchmarks are how you separate marketing claims from measured reality. Rather than relying on vendor-reported numbers, a benchmark runs the same tasks against every system under a shared methodology and publishes the results. For AI product builders picking an agentic extraction or search stack, a trustworthy benchmark is a strong input to the build-vs-buy decision, and a fast way to spot when a category is still too immature to rely on.
Frequently asked questions
Is WebArena free and open source?
Yes. WebArena is free and open source. You run it yourself instead of paying per request or per seat, so the only cost is the compute to host the Docker images plus whatever model API calls your agent makes during evaluation. There is no pricing page or commercial tier. It is a research benchmark rather than a hosted product, so plan for self-hosting infrastructure, not a subscription.
Can I self-host WebArena?
Yes. WebArena ships as Docker images you host yourself: a shopping site, a Reddit-style forum, a GitLab instance, a CMS, and OpenStreetMap. Self-hosting is the point. It removes the flakiness of testing agents against the live web. The sites are fully interactive and JavaScript-rendered, so agents do real navigation and form interactions rather than scraping static HTML.
What does WebArena actually measure?
WebArena measures whether a web agent can complete 812 long-horizon tasks across five self-hosted sites, scored by functional success rather than text matching. Tasks are written by humans with concrete success criteria, so an answer counts only if the agent reached the intended end state. In the original paper, GPT-4 completed roughly 14 percent of tasks against about 78 percent for humans, a gap that helped make WebArena the standard difficulty reference.
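To make the idea of functional scoring concrete, here is a minimal sketch in Python. It is not WebArena's actual evaluator: the Task class, the check_end_state callables, and the state dictionaries are all hypothetical names invented for illustration. The point is only that a task passes when the final site state satisfies a check, not when the agent's text output matches a string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    task_id: int
    intent: str
    # Hypothetical checker: inspects the final site state, returns True on success.
    check_end_state: Callable[[Dict[str, str]], bool]


def success_rate(tasks: List[Task], final_states: Dict[int, Dict[str, str]]) -> float:
    """Fraction of tasks whose final state satisfies the task's checker."""
    if not tasks:
        return 0.0
    passed = sum(
        1 for t in tasks if t.check_end_state(final_states.get(t.task_id, {}))
    )
    return passed / len(tasks)


# Two illustrative tasks (invented, not taken from the benchmark).
tasks = [
    Task(1, "Set the order status to shipped",
         lambda s: s.get("order_status") == "shipped"),
    Task(2, "Rename the repository to demo",
         lambda s: s.get("repo_name") == "demo"),
]

# Final states observed after the agent ran. Task 2 did not reach the
# intended end state, so it fails regardless of what the agent reported.
final_states = {
    1: {"order_status": "shipped"},
    2: {"repo_name": "old-name"},
}

rate = success_rate(tasks, final_states)
print(f"{rate:.0%}")  # prints "50%"
```

On the full benchmark the same arithmetic applies: a success rate of about 14 percent over 812 tasks corresponds to roughly 117 completed tasks.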
What is WebArena-Verified and should I use it instead?
WebArena-Verified, released in 2025, re-curates all 812 tasks after the original set was found to contain ambiguous instructions and broken evaluations. It repairs misaligned scoring and tightens the comparators, so reported success rates are more trustworthy. If you are publishing or comparing agent numbers now, use the verified version. The original remains useful mainly for reproducing older published baselines.
How does WebArena compare to WebVoyager or Mind2Web?
WebArena tests self-hosted, reproducible sites with functional scoring, so runs stay deterministic, but the sites never change and have no anti-bot defenses. WebVoyager runs against live websites, which is more realistic but flakier and harder to grade. Mind2Web covers a broader set of real sites for action prediction. Use WebArena for stable regression testing, then validate on a live-web benchmark before drawing production conclusions.
When should I choose WebArena over other agent benchmarks?
Choose WebArena when you need a stable, reproducible scoreboard to track an agent across iterations and compare against published numbers, because many web-agent papers published since 2023 report it. It is a weak proxy for production: the sites are static and lack live defenses, so high scores do not transfer cleanly. Pair it with a live-web benchmark such as Online-Mind2Web before trusting results for real deployments.
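As a rough illustration of tracking an agent across iterations, the sketch below compares per-task pass/fail results from two runs and lists what regressed and what was fixed. The function names and the result format (a mapping from task ID to a boolean) are assumptions made for this example, not part of WebArena's tooling.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[int, bool], candidate: Dict[int, bool]) -> List[int]:
    """Task IDs that passed in the baseline run but fail or are missing in the candidate."""
    return sorted(
        task_id
        for task_id, passed in baseline.items()
        if passed and not candidate.get(task_id, False)
    )


def find_fixes(baseline: Dict[int, bool], candidate: Dict[int, bool]) -> List[int]:
    """Task IDs that pass in the candidate run but did not pass in the baseline."""
    return sorted(
        task_id
        for task_id, passed in candidate.items()
        if passed and not baseline.get(task_id, False)
    )


# Invented results for three tasks from two agent versions.
baseline = {1: True, 2: False, 3: True}
candidate = {1: True, 2: True, 3: False}

print(find_regressions(baseline, candidate))  # prints [3]
print(find_fixes(baseline, candidate))        # prints [2]
```

Both runs here have the same aggregate success rate, which is why a per-task comparison is worth keeping alongside the headline number.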
