Question 1

Is ClawBench free to use?

Accepted Answer

Yes. ClawBench is open source and free, with no paid tier or subscription. You install the evaluation harness with pip install clawbench-eval and run it against your own agent. The leaderboard and an interactive trace viewer are public at claw-bench.com. Because it is a benchmark rather than a hosted product, your only real cost is the model API usage you spend running the tasks.

Question 2

Is ClawBench open source and can I self-host it?

Accepted Answer

Yes on both counts. ClawBench is open source, and the evaluation harness installs locally through pip install clawbench-eval, so you run it on your own infrastructure instead of calling a hosted service. That lets you score in-house agents privately, inspect the task definitions, and reproduce results without sending your runs to a third party.

Question 3

What does ClawBench actually measure?

Accepted Answer

ClawBench measures whether AI browser agents can finish everyday online tasks on real, live websites rather than sandboxed clones. The corpus covers 153 tasks across 144 sites, such as booking travel, ordering food, and applying for jobs. It captures five layers of behavioral data per run: session replay, screenshots, HTTP traffic, reasoning traces, and browser actions. A request interceptor blocks irreversible actions like payments before they fire.
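The answer above does not say how the request interceptor decides what to block, so the following is only a minimal sketch of the general idea, not ClawBench's implementation. The Request type, the should_block function, and the path hints are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical markers of an irreversible step. These are assumptions for
# illustration, not rules taken from ClawBench.
BLOCKED_PATH_HINTS = ("/checkout/pay", "/payment", "/order/confirm")

# Only requests that can change server-side state are candidates for blocking.
STATE_CHANGING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass(frozen=True)
class Request:
    """A simplified outgoing HTTP request (hypothetical type)."""
    method: str
    url: str


def should_block(request: Request) -> bool:
    """Return True if the request looks like an irreversible action."""
    if request.method.upper() not in STATE_CHANGING_METHODS:
        return False  # Reads such as GET are always allowed through.
    path = urlparse(request.url).path.lower()
    return any(hint in path for hint in BLOCKED_PATH_HINTS)


if __name__ == "__main__":
    print(should_block(Request("POST", "https://shop.example/checkout/pay")))
    print(should_block(Request("POST", "https://shop.example/cart/add")))
```

In a sketch like this, the check runs before the request leaves the browser, so the agent can still reach the final confirmation step and be scored on it without a real payment being sent.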

Question 4

How do AI agents score on ClawBench?

Accepted Answer

Low, which is the point. The top system, Claude Sonnet 4.6, reaches 33.3 percent, and no model clears 50 percent in any category. Finance and academic tasks tend to be easier, while travel and developer tasks are much harder. The leaderboard is updated as newer models and harnesses are submitted, so check claw-bench.com for current standings rather than relying on a fixed figure.
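To make the percentages concrete, here is a minimal sketch of how per-category and overall success rates could be computed, assuming a score is simply the share of tasks completed. That definition is an assumption, as are the function names and the sample data; the answer above does not specify ClawBench's scoring formula.

```python
from collections import defaultdict


def success_rates(results):
    """Map each category to its pass rate in percent.

    `results` is an iterable of (category, passed) pairs. This shape is a
    hypothetical one, not ClawBench's output format.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: round(100 * passes[c] / totals[c], 1) for c in totals}


def overall_rate(results):
    """Pass rate in percent across all tasks, regardless of category."""
    outcomes = [passed for _, passed in results]
    return round(100 * sum(outcomes) / len(outcomes), 1)


if __name__ == "__main__":
    # Made-up outcomes for illustration only.
    sample = [("travel", True), ("travel", False), ("finance", True)]
    print(success_rates(sample))
    print(overall_rate(sample))
```

Under the same assumption, 33.3 percent of a 153-task corpus would correspond to 51 completed tasks, since 51 / 153 is exactly one third.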

Question 5

How does ClawBench compare to Browser Use?

Accepted Answer

They solve different problems. Browser Use is an agent framework for building browser automation, whereas ClawBench is the benchmark you test such agents against. The two are complementary: you would run a Browser Use agent through the ClawBench harness to see how it handles live sites. If you are choosing a stack, Stagehand and Skyvern are the closest agent-framework alternatives to Browser Use. For evaluating agents on live production websites, ClawBench itself has no real substitute.

Question 6

Who should use ClawBench?

Accepted Answer

Teams building or selecting an agentic web-extraction or automation stack. If you are deciding between agent frameworks or model backends for browser tasks, ClawBench gives you a reproducible test on real sites instead of vendor demos. It is less useful if you only need a finished scraper, since it is an evaluation tool, not a data API. Pair it with a framework like Browser Use, Stagehand, or Skyvern to generate the runs you score.

Frequently asked questions