serp.fast

Cloud Browser Infrastructure for AI Agents


An HTTP request returns raw HTML. For roughly 40% of the web, that is not enough. JavaScript-rendered single-page applications, content loaded dynamically via API calls, authentication flows requiring cookies and sessions, interactive elements that respond to clicks and scrolls — these all require a browser.

AI agents face this problem constantly. A research agent needs to read a page that renders its content client-side. An automation agent needs to fill out a form and submit it. A monitoring agent needs to log into a dashboard and check a status. None of these work with curl.

This guide covers why browsers matter for AI agents, how the cloud browser category works, and the decision framework for when you need one.

Why AI agents need browsers

JavaScript rendering

The modern web runs on JavaScript. React, Vue, Angular, Next.js, and their equivalents power a large and growing share of websites. These frameworks render content in the browser, not on the server. When an HTTP client requests such a page, it gets a shell — a minimal HTML document with JavaScript bundles that, when executed in a browser, produce the actual content.

For AI agents, this means that fetching a URL with a simple HTTP library often returns an empty or partial page. The product listing, the article text, the dashboard data — it is all generated by JavaScript that never executes without a browser runtime.
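A minimal illustration of the problem (the shell markup below is invented for the example): extract the visible text from a typical SPA shell, and all you get is the page title, because the actual content only exists after the JavaScript bundle runs in a browser.

```python
from html.parser import HTMLParser

# A typical SPA "shell": the HTML an HTTP client actually receives.
# The content itself is produced by main.*.js executing in a browser.
SPA_SHELL = """
<!doctype html>
<html>
  <head><title>Store</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.3f2a1b.js"></script>
  </body>
</html>
"""

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script and style content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(visible_text(SPA_SHELL))  # → "Store" (the title is all there is)
```

The product listings, prices, and article text a scraper actually wants are nowhere in this response.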

Many scraping APIs handle this transparently. ScraperAPI and Firecrawl both render JavaScript as part of their scraping pipeline. But the rendering is a black box — you get the final HTML, not control over the browser that produced it.

When you need to control the rendering process — wait for specific elements, interact with the page during rendering, capture intermediate states — you need direct browser access.

Authentication and sessions

A significant portion of valuable web data sits behind login walls. Business intelligence dashboards, internal tools, partner portals, customer accounts — accessing this data requires authentication. Authentication means maintaining cookies, handling redirects, managing session tokens, and sometimes completing multi-factor authentication flows.

HTTP-based tools can handle simple cookie-based sessions, but modern authentication is rarely simple. OAuth flows, CAPTCHA challenges, two-factor authentication via SMS or authenticator apps, WebAuthn hardware keys — these require a browser that can render UI, interact with forms, and maintain complex state across multiple page loads.
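Once a browser has authenticated, the resulting session state can be persisted and reused so the agent does not repeat the login flow on every run. A sketch of that pattern, using a file shape loosely modeled on Playwright's storage_state format (the exact shape is an assumption to check against your framework's docs, and the cookie values are hypothetical):

```python
import json
import os
import tempfile
import time

def save_state(path, cookies):
    # Persist cookies in a storage_state-style shape so a later
    # browser session can resume an authenticated session.
    with open(path, "w") as f:
        json.dump({"cookies": cookies}, f)

def load_state(path, now=None):
    now = time.time() if now is None else now
    with open(path) as f:
        state = json.load(f)
    # Drop expired cookies; expires == -1 conventionally marks a session cookie.
    fresh = [c for c in state["cookies"]
             if c.get("expires", -1) == -1 or c["expires"] > now]
    return {"cookies": fresh}

# Demo with hypothetical cookie values:
path = os.path.join(tempfile.gettempdir(), "agent_session_state.json")
save_state(path, [
    {"name": "session_id", "value": "abc123", "domain": ".example.com",
     "expires": time.time() + 3600},
    {"name": "stale", "value": "x", "domain": ".example.com", "expires": 1.0},
])
restored = load_state(path)
print([c["name"] for c in restored["cookies"]])  # → ['session_id']
```

Cloud providers expose the same idea as persistent or resumable sessions, which matters for multi-factor flows that are expensive to repeat.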

Multi-step interaction

Some agent tasks are not read operations. They are interaction sequences: navigate to a page, click a button, wait for a modal, fill in a form, upload a file, confirm a submission, read the result. Each step depends on the previous one. The page state changes with every action.

This is fundamentally different from making API calls. There is no endpoint to hit. The "API" is the website's user interface, and the only way to use it is through a browser.
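The order-dependent nature of these sequences can be sketched with a stand-in page object (the class, selectors, and workflow below are invented for illustration): each action mutates page state, and later steps only work because earlier ones ran.

```python
class FakePage:
    """Stand-in for a browser page; records actions and tracks simple state."""
    def __init__(self):
        self.log = []
        self.modal_open = False

    def goto(self, url):
        self.log.append(("goto", url))

    def click(self, selector):
        self.log.append(("click", selector))
        if selector == "#open-form":
            self.modal_open = True  # this action changes the page state

    def fill(self, selector, value):
        if not self.modal_open:
            raise RuntimeError("form not visible yet")  # order matters
        self.log.append(("fill", selector, value))

def run_workflow(page):
    page.goto("https://example.com/requests")
    page.click("#open-form")                    # opens the modal
    page.fill("#email", "agent@example.com")    # only valid after the click
    page.click("#submit")
    return page.log

steps = run_workflow(FakePage())
print(len(steps))  # → 4 recorded actions
```

Reordering the click and the fill raises an error, which is exactly the property that makes these workflows impossible to express as independent API calls.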

Screenshot and visual capture

AI agents increasingly use vision capabilities. A model with image understanding can look at a screenshot of a web page and extract information from its visual layout — charts, tables, infographics, page structure. This requires rendering the page in a browser and capturing a screenshot, which is a native browser operation that HTTP clients cannot perform.

Cloud browser providers

Running browser instances at scale is an infrastructure problem. A single Chrome instance consumes 200-500 MB of memory. Running hundreds concurrently requires significant server resources — CPU, memory, network — plus orchestration logic for session management, cleanup, and fault recovery.
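A back-of-the-envelope capacity estimate (illustrative numbers only, using the midpoint of the 200-500 MB range) shows how quickly this becomes an infrastructure problem:

```python
def max_concurrent_browsers(server_ram_gb, per_browser_mb=350, headroom=0.25):
    """Estimate how many Chrome instances a server can hold.

    per_browser_mb: midpoint of the 200-500 MB range cited above.
    headroom: fraction of RAM reserved for the OS and orchestration.
    """
    usable_mb = server_ram_gb * 1024 * (1 - headroom)
    return int(usable_mb // per_browser_mb)

print(max_concurrent_browsers(64))   # → 140
print(max_concurrent_browsers(16))   # → 35
```

On memory alone, a 64 GB server tops out around 140 concurrent instances, before accounting for CPU contention, network, or crash recovery.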

Cloud browser providers solve this by running browser fleets as a service. You request a session, get a browser instance in the cloud, control it via an API, and release it when done. The provider handles infrastructure, scaling, and maintenance.

Browserbase

The most funded company in the category, with $67.5 million raised, including a $40 million Series B at a $300 million valuation. CEO Paul Klein IV has framed the thesis clearly: AI agents browsing the web need purpose-built infrastructure, and Chromium was not designed for programmatic control at scale.

Browserbase has served over 50 million browser sessions across more than 1,000 companies and 20,000 developers. Its customer list includes Perplexity, 11x, and Vercel. Notable Capital's Glenn Solomon called it "the Stripe for browser automation."

The platform provides cloud browser sessions with built-in stealth capabilities (fingerprinting, proxy rotation), session recording for debugging, and an MCP server for direct integration with AI agents. Browserbase also develops Stagehand, an open-source SDK that lets AI agents control browsers using natural language descriptions of actions rather than CSS selectors.

For product teams, Browserbase is the enterprise-grade default. The funding, customer base, and ecosystem (Stagehand, MCP integration) make it the most established option for AI agent browser infrastructure.

Steel.dev

Steel.dev takes an open-source approach to the cloud browser problem. Its distinguishing focus is reducing the token volume sent to LLMs — a real cost concern when browser-based agents send entire page DOMs to models for analysis.

Steel.dev's approach strips page content to essentials before the model sees it. Instead of sending a 50,000-token DOM tree to the LLM, Steel.dev processes it down to the relevant content, reducing token consumption by up to 80%, by the company's own figures.
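This is not Steel.dev's actual implementation, but the general idea can be sketched in a few lines: drop script and style blocks, comments, and tag attributes before tokenizing, and estimate tokens with the common rule of thumb of roughly four characters per token.

```python
import re

def strip_dom(html: str) -> str:
    # Remove script/style blocks, comments, and tag attributes; none of
    # these help an LLM answer questions about the page content.
    html = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    html = re.sub(r"<(\w+)[^>]*>", r"<\1>", html)  # drop attributes
    return re.sub(r"\s+", " ", html).strip()

def est_tokens(text: str) -> int:
    return len(text) // 4  # rule of thumb: ~4 characters per token

page = ('<div class="a b c" data-x="1"><script>var s="...";</script>'
        '<p style="x">Hello</p></div>')
before, after = est_tokens(page), est_tokens(strip_dom(page))
print(before, after)
```

On real pages, class names, inline styles, analytics scripts, and tracking attributes often dominate the byte count, which is why this kind of stripping can cut token volume so sharply.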

The open-source model means teams can self-host if they prefer to avoid per-session cloud costs. For early-stage products where cost sensitivity is high and the team has infrastructure capability, this is a meaningful differentiator.

Browserless

The most mature player in the category, founded in November 2017 — years before "AI agents" entered the mainstream vocabulary. Founder Joel Griffith bootstrapped it on $500 and built it into a sustainable business serving traditional scraping and automation use cases long before the AI wave arrived.

Browserless offers a Docker-based deployment model with flat monthly pricing from $25 to $200. It handles the operational burden of running headless Chrome: concurrency limits, resource management, session cleanup, crash recovery. The pricing is predictable in a way that per-session models are not.

For teams that need managed Chromium infrastructure without the AI-specific abstractions and higher price points of newer entrants, Browserless remains a solid and proven option. It does what it says, and it has been doing it for eight years.

Headless browser frameworks

Cloud browser providers build on top of open-source headless browser frameworks. Understanding the framework layer helps contextualize the cloud services.

Playwright

Playwright has become the default framework for new browser automation projects. Microsoft maintains it actively, and the adoption numbers are decisive: roughly 37 million weekly npm downloads, 83,000-plus GitHub stars, and a 2025 survey showing 45.1% adoption among QA professionals — more than double Selenium's 22.1%.

For AI agent applications, Playwright's advantages are specific. Auto-wait reduces flakiness when pages load unpredictably. Network interception lets you filter requests and modify responses mid-flight. Multi-browser support (Chromium, Firefox, WebKit) lets you test across rendering engines. The Python, Node.js, Java, and .NET bindings mean it integrates with whatever language your agent system uses.
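A sketch of what those features look like in Playwright's Python bindings (assumes `pip install playwright` followed by `playwright install chromium`; the target selector and blocking rules are invented for the example). The routing predicate is kept as a pure function so the blocking logic is testable without a browser.

```python
def is_blockable(url: str) -> bool:
    """Pure routing predicate: abort heavy asset requests the agent doesn't need."""
    return url.split("?")[0].endswith((".png", ".jpg", ".gif", ".woff2", ".mp4"))

def fetch_rendered(target: str) -> str:
    # Imported lazily so the predicate above is usable without a browser install.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Network interception: abort asset requests mid-flight to cut bandwidth.
        page.route("**/*", lambda route: route.abort()
                   if is_blockable(route.request.url) else route.continue_())
        page.goto(target)
        page.wait_for_selector("main")  # auto-wait for the rendered content
        html = page.content()
        browser.close()
        return html
```

The `wait_for_selector` call is the auto-wait behavior in action: instead of sleeping for a fixed interval and hoping the page has rendered, the script blocks until the element actually exists.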

Playwright also has an official MCP server, maintained by Microsoft, which lets AI agents control local browser instances through the MCP protocol. This is useful for development and testing: the agent uses the same tool interface it would use with a cloud provider, but the browser runs locally.

Puppeteer

Google's Puppeteer was the first major Node.js browser automation library. It maintains roughly 17 million combined weekly npm downloads (puppeteer plus puppeteer-core) and over 89,000 GitHub stars — technically the highest star count in the category.

Puppeteer is Chromium-only, which is both a constraint and a simplification. If you are working exclusively with Chrome and need deep Chrome DevTools Protocol access, Puppeteer provides it with less abstraction than Playwright.

For new projects, Playwright is the standard recommendation. For existing Puppeteer-based systems, the migration cost often outweighs the benefits of switching. Both tools work well. Playwright is where the momentum is.

When to use a framework directly

Running Playwright or Puppeteer directly — without a cloud browser provider — makes sense in specific situations.

Development and testing. During development, running a local browser is faster, cheaper, and easier to debug than cloud sessions. You can watch the browser, set breakpoints, inspect state.

Low volume. Under a few thousand browser sessions per day, running your own browser instances on a modest server is straightforward. The infrastructure overhead is manageable, and you avoid per-session costs.

Full control requirements. If you need to install browser extensions, modify browser binaries, or control network conditions at the OS level, direct framework access gives you capabilities that cloud providers abstract away.

Use cases in detail

Web scraping with browser rendering

The most common use case: fetch a page that requires JavaScript rendering, wait for the content to load, extract the text. This is the bread and butter of browser-based scraping.

For this use case, the question is whether you need a cloud browser or whether a scraping API with built-in rendering is sufficient. ScraperAPI, Firecrawl, and others render JavaScript as part of their scraping pipeline. You send a URL, they return rendered content. No browser session management required.

Use a cloud browser when: you need to control the rendering process (wait for specific elements, interact with the page), you need to handle authentication, or the page requires actions beyond a simple page load.

Use a scraping API when: you just need the rendered content from a URL, with no interaction, no authentication, and no special rendering requirements.

AI agent web interaction

An AI agent needs to book a flight, file a form, check a status page, or complete a multi-step web workflow. This requires a browser session that persists across multiple actions.

The pattern: request a cloud browser session from Browserbase or Steel.dev. Use Stagehand or Skyvern to translate the agent's intent into browser actions. Execute the workflow step by step, with the LLM reasoning about each step's result before deciding the next action.

Stagehand works by letting the agent describe actions in natural language — "click the login button," "type the email address," "select the second option from the dropdown." It translates these descriptions to Playwright commands using the page's DOM.

Skyvern approaches the same problem through computer vision. It captures screenshots and uses visual understanding to determine how to interact with the page. This makes it more resilient to page layout changes but adds latency for the vision processing step.
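Whichever translation layer is used, the control flow underneath is an observe-decide-act loop. A minimal sketch with stubbed components (the stub page, its state transitions, and the stand-in decision function are all invented for illustration; in a real system `decide` would be an LLM call and `page` a Stagehand- or Skyvern-driven browser session):

```python
def run_agent(page, decide, goal, max_steps=10):
    """Observe-decide-act loop: the model sees the page state, picks the
    next action, and the loop stops when it declares the goal done."""
    history = []
    for _ in range(max_steps):
        observation = page.observe()
        action = decide(goal, observation, history)  # LLM call in a real system
        if action == "done":
            break
        page.act(action)
        history.append(action)
    return history

class StubPage:
    def __init__(self):
        self.state = "login_form"
    def observe(self):
        return self.state
    def act(self, action):
        # hypothetical transitions standing in for real browser effects
        self.state = {"fill_credentials": "ready_to_submit",
                      "click_submit": "dashboard"}.get(action, self.state)

def stub_decide(goal, observation, history):
    return {"login_form": "fill_credentials",
            "ready_to_submit": "click_submit",
            "dashboard": "done"}[observation]

trace = run_agent(StubPage(), stub_decide, "log in")
print(trace)  # → ['fill_credentials', 'click_submit']
```

The `max_steps` cap matters in practice: an agent that misreads the page can otherwise loop indefinitely, burning browser session time and model tokens.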

Screenshot and visual analysis

With multimodal models that understand images, screenshots become a data extraction method. Render a page, capture a screenshot, send it to the model, ask it to extract information from the visual layout.
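The mechanics are simple: base64-encode the screenshot and attach it to a chat request. The payload below follows the OpenAI-style image-input format; the shape and the model name are assumptions to check against whichever multimodal API you actually use.

```python
import base64

def vision_request(screenshot_png: bytes, question: str) -> dict:
    # Encode the screenshot as a base64 data URL. The payload shape follows
    # the OpenAI-style chat format and may differ for other providers.
    b64 = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "model": "gpt-4o",  # hypothetical model choice
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

req = vision_request(b"\x89PNG...", "Extract the revenue figure from this chart.")
```

In a real pipeline the bytes would come from the browser's screenshot call rather than the placeholder used here.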

This works surprisingly well for pages that resist traditional scraping — complex dashboards, infographics, tables with unusual layouts, pages that use canvas or WebGL rendering. The browser renders the page exactly as a human would see it, and the model interprets the visual output.

The limitation is cost and latency. A screenshot-based extraction requires rendering the page (browser cost), capturing the image (trivial), and processing it through a vision model (model cost, additional latency). For high-volume use cases, this is expensive relative to DOM-based extraction.

Automated testing

This predates the AI wave but remains a significant use case. Cloud browsers run test suites at scale — hundreds of concurrent sessions testing web applications across different browsers and viewports. Browserless has served this use case since 2017. Browserbase's scale makes it viable for large test suites.

Pricing and scaling

Cloud browser pricing models vary, and the cost structure matters for production planning.

Per-session pricing. Browserbase and most newer entrants charge per browser session, typically $0.01-0.10 per session depending on duration and features. This is economical at low volumes but becomes significant at scale. Ten thousand sessions per day at $0.05 per session is $500 per day, $15,000 per month.

Flat monthly pricing. Browserless charges $25-200 per month with concurrency limits. This is more predictable but constrains throughput. If you exceed the concurrency limit, sessions queue or fail.

Self-hosted. Running your own browser fleet using Playwright or Puppeteer has no per-session cost, but infrastructure costs (servers, maintenance, operations) replace it. At sufficient scale — typically over 10,000 sessions per day — self-hosting becomes more economical than per-session cloud pricing.

Token costs. An often-overlooked component: the cost of sending page content to an LLM. A typical web page's DOM can be 20,000-50,000 tokens. If your agent sends full page content to the model for analysis, the model API costs can exceed the browser session costs. Steel.dev's focus on reducing token volume before model processing addresses this directly.
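Both cost components are easy to model together. A sketch using the figures above (the model price is an assumed $2.50 per million input tokens; substitute your provider's actual rate):

```python
def daily_cost(sessions_per_day, session_price, dom_tokens, price_per_mtok):
    """Split daily spend into browser-session cost and model-input cost."""
    browser = sessions_per_day * session_price
    model = sessions_per_day * dom_tokens / 1_000_000 * price_per_mtok
    return browser, model

browser, model = daily_cost(
    sessions_per_day=10_000,
    session_price=0.05,    # mid-range per-session price from above
    dom_tokens=35_000,     # midpoint of the 20k-50k DOM estimate
    price_per_mtok=2.50,   # assumed model input price; check your provider
)
print(browser, model)  # → 500.0 875.0 (dollars per day)
```

Under these assumptions the model bill ($875/day) already exceeds the browser bill ($500/day), which is exactly the dynamic that makes pre-model token reduction worth the engineering effort.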

When you need a cloud browser vs. when a scraping API suffices

The decision tree is straightforward.

You need a cloud browser when:

  • The task requires interaction (clicking, typing, form submission)
  • You need to handle authentication flows
  • You need to maintain state across multiple page actions
  • You need screenshots or visual capture
  • You need fine-grained control over the rendering process

A scraping API suffices when:

  • You need rendered content from URLs
  • The pages are publicly accessible
  • No interaction is required
  • You need structured data extraction from the page content
  • You are processing many URLs and need a simple, scalable interface

A simple HTTP client suffices when:

  • The pages serve content as server-rendered HTML
  • No JavaScript rendering is required
  • The data is in the initial HTML response
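The tree above can be expressed directly as routing logic (a sketch; the task flags are invented for illustration):

```python
def choose_layer(needs_interaction=False, needs_auth=False, stateful=False,
                 needs_screenshot=False, needs_js=True):
    """Route a task to the cheapest layer that can handle it."""
    if needs_interaction or needs_auth or stateful or needs_screenshot:
        return "cloud_browser"
    if needs_js:
        return "scraping_api"
    return "http_client"

print(choose_layer(needs_auth=True))    # → cloud_browser
print(choose_layer())                   # → scraping_api (rendered content only)
print(choose_layer(needs_js=False))     # → http_client (server-rendered HTML)
```

The ordering encodes the cost gradient: only escalate to a browser session when a cheaper layer genuinely cannot do the job.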

Most production agent systems end up using multiple layers. Scraping APIs (Firecrawl, ScraperAPI) handle the 80% case — rendered content from public URLs. Cloud browsers (Browserbase, Steel.dev, Browserless) handle the 20% that requires interaction or authentication. Direct HTTP requests handle the simplest cases.

The cost difference between layers is significant. An HTTP request costs fractions of a cent. A scraping API call costs between $0.001 and $0.01. A cloud browser session costs $0.01 to $0.10. Matching the right tool to the right task keeps infrastructure costs manageable as agent systems scale.

Making the decision

For product leaders evaluating browser infrastructure, three factors dominate the decision.

Start with the use case. If your agents only need to read web content, start with scraping APIs. Add browser infrastructure only when you encounter tasks that require it. Most teams overestimate how much browser interaction they need.

Plan for cost at scale. Per-session browser costs are manageable for prototypes and low-volume applications. At production scale, they compound. Model the cost of your expected session volume across providers, and consider self-hosted Playwright or Puppeteer as a cost ceiling reference.

Evaluate the AI integration layer. Cloud browsers are infrastructure. What makes them useful for AI agents is the integration layer — MCP servers, SDKs like Stagehand, vision-based interaction tools like Skyvern. Evaluate these tools alongside the infrastructure, because the integration quality determines how effectively your agents can use the browsers.

The browser infrastructure market is among the fastest-growing segments in web data infrastructure, with over $100 million in funding in the past year. The tools are maturing quickly, costs are coming down, and the integration with AI agent frameworks is improving with each release. The category will look different in a year. But the fundamental need — AI agents require browsers for a meaningful subset of web tasks — is structural and growing.

