serp.fast

WebVoyager

Live-web benchmark of 643 tasks across 15 real websites (Allrecipes, Amazon, Apple, ArXiv, BBC News, GitHub, several Google services, and others) for end-to-end multimodal web agents.

By Nathan Kessler

Each tool is evaluated against our methodology using public docs, vendor demos, and hands-on testing.

Benchmarks are how you separate marketing claims from measured reality. Instead of trusting vendor-reported numbers, benchmarks run the same tasks against every system under a shared methodology and publish the results. For AI product builders picking an agentic extraction or search stack, a trustworthy benchmark is a strong input to the build-vs-buy decision – and a fast way to spot when a category is still too immature to rely on.

Features

JS Rendering
Structured Output
Open Source
Self-Hosted Option
Pricing: Free

Editorial assessment

The first widely adopted live-web benchmark – tasks run against real production sites, not sandboxed clones. The original GPT-4V baseline scored 55%; by 2026 frontier models exceed 97%, so the benchmark is effectively saturated and no longer discriminates between new agents. Still useful as a smoke test – if your agent can't clear 80% on WebVoyager, it's not ready for production. For meaningful comparison between modern agents, move to harder live-web benchmarks such as Online-Mind2Web or the ClawBench live-task suite.
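To make the saturation point concrete, here is a rough sketch of the sampling noise on a 643-task run. It assumes each task is an independent pass/fail trial, which is a simplification for live sites, and uses a normal-approximation interval. The function name and the two pass counts are illustrative, not measured results.

```python
import math

TOTAL_TASKS = 643  # task count stated for the benchmark


def success_rate_interval(successes: int, total: int, z: float = 1.96):
    """Return (rate, low, high) using a normal-approximation interval.

    Assumes independent pass/fail tasks, which is a simplification.
    """
    rate = successes / total
    std_err = math.sqrt(rate * (1 - rate) / total)
    return rate, max(0.0, rate - z * std_err), min(1.0, rate + z * std_err)


# Two hypothetical agents: one near 97%, one roughly a point higher.
for passed in (624, 630):
    rate, low, high = success_rate_interval(passed, TOTAL_TASKS)
    print(f"{passed}/{TOTAL_TASKS}: {rate:.1%} (approx. 95% interval {low:.1%} to {high:.1%})")
```

Under these assumptions the two intervals overlap, so a one-point gap near the ceiling is hard to tell apart from noise. That is the practical meaning of a saturated benchmark.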

How WebVoyager compares

WebArena

WebArena is the reproducible self-hosted alternative – better for ablation studies and reliable scoring, worse for measuring real-world transfer.

Mind2Web

Mind2Web offers the more current live-web benchmark through its Online-Mind2Web variant, which hasn't saturated.

ClawBench

ClawBench is the live-web benchmark of choice if you want behavioral traces and not just success rates.

Frequently asked questions

Is WebVoyager free?

Yes. WebVoyager is a free, open-source research benchmark released under the Apache-2.0 license, so there is no purchase or subscription. The benchmark code and task set cost nothing to download. Running the evaluation against a model does carry indirect cost, because the original setup drives a multimodal model through an external API, and those inference calls are billed by whichever model provider you use.
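A back-of-the-envelope sketch of that indirect cost follows. Every number in it other than the 643-task count is a placeholder: the step count, token counts, and per-million-token prices are hypothetical and should be replaced with your own measurements and your provider's current rates, including whatever it charges for screenshot input.

```python
def estimate_run_cost(
    tasks: int,
    steps_per_task: float,
    input_tokens_per_step: float,
    output_tokens_per_step: float,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate the API bill in USD for one full benchmark pass."""
    total_steps = tasks * steps_per_task
    input_cost = total_steps * input_tokens_per_step / 1_000_000 * usd_per_million_input
    output_cost = total_steps * output_tokens_per_step / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Placeholder inputs only; none of these are real prices or measured averages.
cost = estimate_run_cost(
    tasks=643,
    steps_per_task=10,
    input_tokens_per_step=2_000,
    output_tokens_per_step=200,
    usd_per_million_input=5.0,
    usd_per_million_output=15.0,
)
print(f"Estimated cost for one pass: ${cost:.2f}")
```

Because the estimate scales linearly with steps per task, agents that retry often or take long trajectories will cost proportionally more per run.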

Is WebVoyager open source?

Yes. WebVoyager is open source under the Apache-2.0 license. The agent code, prompts, and 643-task benchmark are published on GitHub by the original research team, so you can read it, fork it, and adapt the evaluation harness for your own agents. It is a benchmark and reference agent, not a hosted product, so there is no commercial vendor or proprietary tier behind it.

What does WebVoyager actually test?

WebVoyager measures end-to-end web agents on 643 tasks across 15 live production sites, among them Amazon, Apple, Allrecipes, ArXiv, BBC News, GitHub, and several Google services. Tasks run against the real web rather than sandboxed clones, and cover shopping, search, and reservation flows. The agent works from screenshots with labeled interactive elements, so the benchmark tests multimodal navigation rather than HTML parsing alone.
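Because the tasks span 15 sites, it can help to break a run down by site as well as reporting one overall number. The sketch below assumes you have already reduced each task to a hypothetical `(site, succeeded)` pair; that record shape is an illustration, not the benchmark's own output format.

```python
from collections import defaultdict


def success_by_site(records):
    """Map each site name to its share of successful tasks."""
    counts = defaultdict(lambda: [0, 0])  # site -> [successes, attempts]
    for site, succeeded in records:
        counts[site][1] += 1
        if succeeded:
            counts[site][0] += 1
    return {site: wins / attempts for site, (wins, attempts) in counts.items()}


# Hypothetical results for illustration only.
records = [
    ("Amazon", True),
    ("Amazon", False),
    ("GitHub", True),
    ("ArXiv", True),
    ("ArXiv", True),
]
for site, rate in success_by_site(records).items():
    print(f"{site}: {rate:.0%}")
```

A per-site view like this can surface a site where an agent fails consistently, which an aggregate score may hide.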

Is WebVoyager still a useful benchmark in 2026?

It is now mostly a smoke test. The original GPT-4V baseline scored roughly 55 percent, but by 2026 frontier models exceed 97 percent. The benchmark is effectively saturated and no longer separates strong agents from average ones. The practical read is a floor: if your agent cannot clear about 80 percent on WebVoyager, it is not production-ready. Above that floor the score stops being informative.
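As a minimal sketch, the floor can be wired into an evaluation script as a simple gate. The function name, the list-of-booleans result format, and the example run are assumptions for illustration; the 0.80 default mirrors the rule of thumb above.

```python
def passes_smoke_test(results: list[bool], floor: float = 0.80) -> bool:
    """Return True when the share of successful tasks meets the floor."""
    if not results:
        raise ValueError("no task results to score")
    return sum(results) / len(results) >= floor


# Hypothetical run: 530 successes out of 643 tasks (about 82%).
run = [True] * 530 + [False] * 113
print(passes_smoke_test(run))
```

Treat the result as a go/no-go check only. Once an agent clears the floor, the exact score says little about how it ranks against other agents.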

What is the best alternative to WebVoyager?

For ranking modern agents that already saturate WebVoyager, move to harder live-web suites. Online-Mind2Web and the ClawBench live-task suite still discriminate between current frontier agents. WebArena and Mind2Web are other common reference points, though WebArena uses self-hosted site clones rather than the live web. Use WebVoyager as an entry-level check, then switch to one of these for a meaningful head-to-head comparison.

WebVoyager vs WebArena: how do they differ?

Both evaluate web agents, but the testbed differs. WebVoyager runs tasks against 15 real production websites, so results reflect live, changing pages. WebArena instead uses self-hosted, reproducible clones of sites such as a shopping app and a forum app, which makes runs deterministic and offline but less representative of the open web. Choose WebVoyager for realism and WebArena for repeatable, controlled scoring without external site drift.
