serp.fast

ClawBench

Open-source benchmark that evaluates AI browser agents on 153 everyday tasks across 144 live websites, with request interception and full behavioral trace capture.

By Nathan Kessler

Each tool is evaluated against our methodology using public docs, vendor demos, and hands-on testing.

Benchmarks are how you separate marketing claims from measured reality. Instead of trusting vendor-reported numbers, benchmarks run the same tasks against every system under a shared methodology and publish the results. For AI product builders picking an agentic extraction or search stack, a trustworthy benchmark is the single best input to the build-vs-buy decision – and a fast way to spot when a category is still too immature to rely on.

Features

JS Rendering
Structured Output
Open Source
Self-Hosted Option
Pricing: Free

Editorial assessment

A rare benchmark that runs on real production websites rather than in sandboxed environments. Five layers of behavioral data (session replay, screenshots, HTTP traffic, reasoning traces, and browser actions) make failure analysis tractable, while a request interceptor blocks irreversible actions such as payments and bookings before they fire. It installs with `pip install clawbench-eval` and comes with an interactive leaderboard and trace viewer at claw-bench.com. The current numbers are sobering: Claude Sonnet 4.6 leads at 33.3%, GLM-5 trails at 24.2%, and no model exceeds 50% in any category. Finance and academic tasks are easier; travel and dev tasks are much harder. If you're building or picking an agentic extraction stack, this is the honest scoreboard to test against.
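To make the request-interception idea concrete, here is a minimal sketch of how a blocker might decide which requests to stop. It is not ClawBench's implementation: the function names (`should_block`, `filter_requests`), the keyword list, and the example URLs are all hypothetical, and a real interceptor would likely need per-site rules rather than path keywords.

```python
from urllib.parse import urlparse

# Hypothetical path fragments that hint at an irreversible action.
# Illustrative assumptions only, not ClawBench's actual rules.
IRREVERSIBLE_HINTS = ("checkout", "payment", "purchase", "booking", "reserve")

# HTTP methods that can change server-side state.
STATE_CHANGING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def should_block(method: str, url: str) -> bool:
    """Return True if the request looks like an irreversible action."""
    if method.upper() not in STATE_CHANGING_METHODS:
        return False  # reads such as GET are allowed through
    path = urlparse(url).path.lower()
    return any(hint in path for hint in IRREVERSIBLE_HINTS)


def filter_requests(requests):
    """Split (method, url) pairs into allowed and blocked lists."""
    allowed, blocked = [], []
    for method, url in requests:
        target = blocked if should_block(method, url) else allowed
        target.append((method, url))
    return allowed, blocked
```

Under these assumptions, an agent can browse a checkout page freely (a GET passes through), but the final state-changing submission is stopped and can still be recorded in the trace for scoring.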
