OSWorld
Benchmarks are how you separate marketing claims from measured reality. Instead of trusting vendor-reported numbers, benchmarks run the same tasks against every system under a shared methodology and publish the results. For AI product builders picking an agentic extraction or search stack, a trustworthy benchmark is the single best input to the build-vs-buy decision – and a fast way to spot when a category is still too immature to rely on.
✓ JS Rendering
✓ Structured Output
✓ Open Source
✓ Self-Hosted Option
Pricing: Free
OSWorld is the most-cited computer-use benchmark, the one Anthropic, OpenAI, and Google quote when announcing agent capabilities. Tasks span Chrome, VS Code, Thunderbird, LibreOffice, GIMP, file managers, and OS-level workflows like file format conversions and calendar setup. Each task ships with an executable evaluation script, so scoring is deterministic.
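Those executable evaluation scripts are what make the scores reproducible: after an agent finishes, a checker inspects the resulting machine state and returns pass or fail without a human or model judge. The Python sketch below illustrates the idea under simplified assumptions; the config fields ("evaluator", "expected", "min_bytes") and the file-existence check are illustrative, not OSWorld's actual schema or API.

```python
import tempfile
from pathlib import Path

def evaluate_task(task: dict, output_dir: Path) -> float:
    """Deterministically score one task: 1.0 if the produced artifact
    matches the task's expectations, else 0.0. (Illustrative schema.)"""
    expected = task["evaluator"]["expected"]
    produced = output_dir / expected["file"]
    if not produced.exists():
        return 0.0
    # Deterministic checks on the final state: existence and minimum size.
    if produced.stat().st_size < expected.get("min_bytes", 0):
        return 0.0
    return 1.0

if __name__ == "__main__":
    # Hypothetical task config in the spirit of "export this document to PDF".
    task = {
        "instruction": "Export report.odt to PDF",
        "evaluator": {"expected": {"file": "report.pdf", "min_bytes": 16}},
    }
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        # Simulate the artifact an agent would have produced.
        (out / "report.pdf").write_bytes(b"%PDF-1.4 placeholder bytes")
        print(f"task score: {evaluate_task(task, out)}")
```

Real OSWorld evaluators inspect richer state inside the VM, but the principle is the same: a script, not a judge, decides the score.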
OSWorld remains hard for frontier models: Claude Sonnet 4.5 scored 61.4% in late 2025, and a year later top systems still fall well short of 80% on the harder splits. For AI products that need to span apps (not just web pages), OSWorld is the right benchmark to test against; browser-only agents should stick to WebArena, Mind2Web, or ClawBench.