WebVoyager
Benchmarks are how you separate marketing claims from measured reality. A benchmark replaces vendor-reported numbers with results from running the same tasks against every system under a shared methodology, then publishes them. For AI product builders picking an agentic extraction or search stack, a trustworthy benchmark is a strong input to the build-vs-buy decision, and a fast way to spot when a category is still too immature to rely on.
Frequently asked questions
Is WebVoyager free?
Yes. WebVoyager is a free, open-source research benchmark released under the Apache-2.0 license, so there is no purchase or subscription. The benchmark code and task set cost nothing to download. Running the evaluation against a model does carry indirect cost, because the original setup drives a multimodal model through an external API, and those inference calls are billed by whichever model provider you use.
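The indirect cost can be roughed out with simple arithmetic. The sketch below is a back-of-envelope estimate only: the function name, the steps per task, the token counts, and the prices are all placeholder assumptions, not figures from WebVoyager or any provider. Only the 643-task count comes from the benchmark itself.

```python
def estimate_run_cost(
    num_tasks: int,
    avg_steps_per_task: float,
    input_tokens_per_step: float,
    output_tokens_per_step: float,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate the API bill for one full benchmark run, in USD.

    Assumes one model call per agent step; every argument except the
    task count is a guess you should replace with your own measurements.
    """
    calls = num_tasks * avg_steps_per_task
    input_cost = calls * input_tokens_per_step * usd_per_million_input / 1_000_000
    output_cost = calls * output_tokens_per_step * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# Placeholder inputs: 643 tasks (from the benchmark), everything else assumed.
cost = estimate_run_cost(
    num_tasks=643,
    avg_steps_per_task=10,        # assumption
    input_tokens_per_step=2000,   # assumption, screenshots can cost far more
    output_tokens_per_step=200,   # assumption
    usd_per_million_input=5.0,    # assumption, check your provider's pricing
    usd_per_million_output=15.0,  # assumption
)
print(f"Estimated cost per run: ${cost:.2f}")
```

Swap in your provider's current prices and token counts measured from a handful of real tasks before relying on the result.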
Is WebVoyager open source?
Yes. WebVoyager is open source under the Apache-2.0 license. The agent code, prompts, and 643-task benchmark are published on GitHub by the original research team, so you can read them, fork the repository, and adapt the evaluation harness for your own agents. It is a benchmark and reference agent, not a hosted product, so there is no commercial vendor or proprietary tier behind it.
What does WebVoyager actually test?
WebVoyager measures end-to-end web agents on 643 tasks across 15 live production sites, among them Amazon, Apple, Allrecipes, ArXiv, BBC News, GitHub, and several Google services. Tasks run against the real web rather than sandboxed clones, and cover shopping, search, and reservation flows. The agent works from screenshots with labeled interactive elements, so the benchmark tests multimodal navigation rather than HTML parsing alone.
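To make "screenshots with labeled interactive elements" concrete, here is a minimal illustrative sketch of the idea. It is not taken from the WebVoyager code: the Element type, the reading-order numbering, and the text format are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A hypothetical interactive element found on the page."""
    tag: str
    text: str
    box: tuple  # (x, y, width, height) in screenshot pixels


def label_elements(elements):
    """Assign numeric labels in reading order (top to bottom, left to right).

    The ordering rule is an assumption made for this sketch.
    """
    ordered = sorted(elements, key=lambda e: (e.box[1], e.box[0]))
    return {i: e for i, e in enumerate(ordered)}


def describe(labeled):
    """Render the labels as text that could accompany the marked-up screenshot."""
    return "\n".join(f"[{i}] <{e.tag}> {e.text}" for i, e in labeled.items())


page = [
    Element("button", "Search", (200, 10, 50, 20)),
    Element("input", "", (10, 10, 180, 20)),
    Element("a", "Cart", (10, 50, 40, 20)),
]
print(describe(label_elements(page)))
```

The model can then answer with an action that names a label, such as clicking element 1, instead of emitting raw coordinates or parsing the page's HTML.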
Is WebVoyager still a useful benchmark in 2026?
It is now mostly a smoke test. The original GPT-4V baseline scored roughly 55 percent, but by 2026 frontier models exceed 97 percent. The benchmark is effectively saturated and no longer separates strong agents from average ones. The practical read is a floor: if your agent cannot clear about 80 percent on WebVoyager, it is not production-ready. Above that floor the score stops being informative.
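The floor check is easy to automate. The sketch below assumes your harness yields one (site, passed) pair per task. That result format and the function names are hypothetical, and the 0.80 threshold is the rule of thumb above, not an official cutoff.

```python
from collections import defaultdict

FLOOR = 0.80  # rule-of-thumb floor from the text, not an official cutoff


def success_rates(results):
    """Return (overall_rate, per_site_rates) from (site, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for site, passed in results:
        totals[site] += 1
        passes[site] += int(passed)
    if not totals:
        raise ValueError("no results to score")
    per_site = {site: passes[site] / totals[site] for site in totals}
    overall = sum(passes.values()) / sum(totals.values())
    return overall, per_site


def clears_floor(overall, floor=FLOOR):
    """True if the overall success rate meets the smoke-test floor."""
    return overall >= floor


# Toy data for illustration only, not real benchmark results.
results = [
    ("Amazon", True), ("Amazon", False),
    ("GitHub", True), ("GitHub", True),
]
overall, per_site = success_rates(results)
print(overall, per_site, clears_floor(overall))
```

Reporting the per-site breakdown alongside the overall number helps show whether a failure to clear the floor is concentrated on one or two sites.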
What is the best alternative to WebVoyager?
For ranking modern agents that already saturate WebVoyager, move to harder live-web suites. Online-Mind2Web and the ClawBench live-task suite still discriminate between current frontier agents. WebArena and Mind2Web are other common reference points, though WebArena uses self-hosted site clones rather than the live web. Use WebVoyager as an entry-level check, then switch to one of these for a meaningful head-to-head comparison.
WebVoyager vs WebArena: how do they differ?
Both evaluate web agents, but the testbed differs. WebVoyager runs tasks against 15 real production websites, so results reflect live, changing pages. WebArena instead uses self-hosted, reproducible clones of sites such as a shopping app and a forum app, which makes runs deterministic and offline but less representative of the open web. Choose WebVoyager for realism and WebArena for repeatable, controlled scoring without external site drift.
