OSWorld
OSWorld is a benchmark for computer-use agents, released in 2024 by researchers from the University of Hong Kong, Salesforce Research, and collaborators. Unlike WebArena, which is confined to a browser, OSWorld evaluates agents that operate a full desktop operating system: opening applications, clicking through GUIs, editing documents, manipulating files, and chaining actions across Chrome, LibreOffice, VS Code, GIMP, and the system terminal. The benchmark contains 369 real-world tasks, built mainly on Ubuntu with a smaller supplementary Windows set, each paired with an execution-based grader that checks whether the resulting OS state matches the goal.
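To make the task-plus-grader structure concrete, here is a minimal sketch of driving the open-source OSWorld harness (the xlang-ai/OSWorld repository exposes a Gym-style DesktopEnv). The task definition below is illustrative rather than an actual benchmark task, and the evaluator fields follow the project's quickstart schema as best understood, so treat it as a sketch rather than a verbatim recipe.

```python
# Minimal sketch of an OSWorld-style evaluation loop, modeled on the
# public xlang-ai/OSWorld quickstart. The task config below is
# illustrative; real task definitions ship with the benchmark.
from desktop_env.desktop_env import DesktopEnv

task = {
    "id": "demo-install-spotify",  # hypothetical task ID
    "instruction": "Install Spotify on the current system.",
    "config": [],  # setup commands run in the VM before the agent starts
    "evaluator": {  # execution-based grader: inspects the final OS state
        "func": "check_include_exclude",
        "result": {"type": "vm_command_line", "command": "which spotify"},
        "expected": {
            "type": "rule",
            "rules": {"include": ["spotify"], "exclude": ["not found"]},
        },
    },
}

# Actions are raw pyautogui statements executed inside the VM.
env = DesktopEnv(action_space="pyautogui")
obs = env.reset(task_config=task)  # obs carries a screenshot of the desktop
obs, reward, done, info = env.step("pyautogui.click(960, 540)")
score = env.evaluate()  # grader score for the final state (per the harness API)
```

The key design point is that grading never relies on an LLM judge: the evaluator runs commands or inspects files inside the VM, so success is defined by the actual OS state the agent leaves behind.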
OSWorld is positioned as the canonical evaluation for the "computer use" capability that Anthropic, OpenAI, and Google have been racing to ship. The original paper found that the best model at the time completed roughly 12 percent of tasks, against a human baseline of about 72 percent; Claude 3.5 Sonnet's computer-use mode, launched later in 2024, scored around 15 percent. By 2025, frontier models (Claude 4, GPT-4.1, Gemini 2.5) have closed some of the gap, but reliability on multi-step desktop tasks remains the dominant failure mode. The benchmark is significant because it tests the most general form of agent capability: a model that can drive a computer can, in principle, perform most knowledge-work tasks an employee performs.
For AI builders, OSWorld scores matter when evaluating computer-use APIs (Anthropic's computer use, OpenAI's Operator, Google's Project Mariner) or end-to-end automation products. The benchmark is harder than WebArena because it requires perception of arbitrary GUIs rather than well-structured web pages, and because errors compound across longer action sequences. Teams considering computer-use agents for production should treat published OSWorld numbers as a ceiling: real-world reliability is typically lower, since production environments involve unfamiliar applications, custom workflows, and stricter accuracy requirements.
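The compounding point is worth making precise. Under a toy model where each GUI action succeeds independently with per-step reliability p, a task requiring n actions succeeds with probability p^n. The sketch below is illustrative only (real agents can retry, which helps, or corrupt state, which hurts), but it shows how quickly success decays on long tasks.

```python
# Toy model of error compounding: per-step reliability p, task length n.
# Assumes independent steps with no recovery, which real agents violate
# in both directions (retries help, cascading state corruption hurts).
for p in (0.90, 0.95, 0.99):
    for n in (5, 20, 50):
        print(f"p={p:.2f}, n={n:2d} steps -> task success ~ {p**n:.1%}")
# Even 99% per-step reliability yields only ~61% success at 50 steps.
```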
Tools that handle OSWorld
Five tools in the serp.fast directory are commonly used for OSWorld-style workflows, spanning benchmarks, browser infrastructure, and agentic extraction. Each is reviewed independently with pricing and an editorial assessment.
Open-source benchmark evaluating AI browser agents on 153 everyday tasks across 144 live websites, with request interception and full behavioral trace capture.
Open-source Python framework making websites accessible to AI agents – the #1 browser automation project by GitHub stars.
TypeScript SDK by Browserbase for building AI-powered web automation – act, extract, and observe with natural language commands.
AI agent for browser-based workflow automation – uses computer vision and LLMs to navigate, interact with, and extract data from websites.
Live-web benchmark of 643 tasks across 15 real websites (Allrecipes, Amazon, Apple, ArXiv, BBC News, GitHub, Google variants, etc.) for end-to-end multimodal web agents.