WebVoyager

Live-web benchmark of 643 tasks across 15 real websites (Allrecipes, Amazon, Apple, ArXiv, BBC News, GitHub, Google variants, etc.) for end-to-end multimodal web agents.

Maintained by Nathan Kessler · Updated May 7, 2026

Benchmarks are how you separate marketing claims from measured reality. Rather than relying on vendor-reported numbers, a benchmark runs the same tasks against every system under a shared methodology and publishes the results. For AI product builders picking an agentic extraction or search stack, a trustworthy benchmark is the single best input to the build-vs-buy decision, and a fast way to spot when a category is still too immature to rely on.

Visit WebVoyager →

Features

✓JS Rendering

✓Structured Output

✓Open Source

—Self-Hosted Option

Pricing: Free

Editorial assessment

WebVoyager was the first widely adopted live-web benchmark: tasks run against real production sites, not sandboxed clones. The original GPT-4V baseline scored 55%; by 2026 frontier models exceed 97%, so the benchmark is effectively saturated and no longer discriminates between new agents. It remains useful as a smoke test: an agent that can't clear 80% on WebVoyager is not ready for production. For meaningful comparison between modern agents, move to a harder live-web benchmark such as Online-Mind2Web or the ClawBench live-task suite.
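As a rough illustration of the smoke-test idea, the sketch below computes a success rate over a run of 643 tasks and compares it with the 80% bar mentioned above. The TaskResult type, the task identifiers, and the 520-success example run are hypothetical and are not part of WebVoyager's own tooling; how each task is judged as successful is assumed to happen elsewhere.

```python
from dataclasses import dataclass

# The 80% bar comes from the assessment above; everything else is illustrative.
SMOKE_TEST_THRESHOLD = 0.80


@dataclass
class TaskResult:
    task_id: str   # hypothetical identifier, not WebVoyager's real task IDs
    success: bool  # whether a judge marked the task as completed


def success_rate(results):
    """Return the fraction of tasks marked successful (0.0 for an empty run)."""
    if not results:
        return 0.0
    return sum(r.success for r in results) / len(results)


def passes_smoke_test(results, threshold=SMOKE_TEST_THRESHOLD):
    """Return True if the run meets or exceeds the threshold."""
    return success_rate(results) >= threshold


# Made-up run: 643 tasks, of which the first 520 succeed.
results = [TaskResult(f"task-{i}", i < 520) for i in range(643)]
print(f"{success_rate(results):.1%}", passes_smoke_test(results))
```

With 643 tasks, clearing 80% requires at least 515 successes, so a handful of flaky live-site failures can move an agent across the line; averaging over repeated runs is one way to reduce that noise.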

How WebVoyager compares

WebArena

WebArena is the reproducible self-hosted alternative – better for ablation studies and reliable scoring, worse for measuring real-world transfer.

Mind2Web

Mind2Web, in its Online-Mind2Web variant, is the more current live-web benchmark and has not yet saturated.

ClawBench

ClawBench is the live-web benchmark of choice if you want behavioral traces and not just success rates.

