The web is closing as the agents arrive: what bot-majority traffic means for web data access

As of June 2026, automated bots generate 57.4% of requests to the websites Cloudflare measures, against 42.6% from humans. That is the first time bots have outnumbered people, and it arrived ahead of schedule. Cloudflare CEO Matthew Prince, who had expected the crossover near the end of 2027, summarized it on June 4: "Welp, that happened faster than I predicted."

For anyone building on web data, the number matters less than the reaction it is provoking. The open-access default that most AI products quietly depend on, the assumption that a public page will still be fetchable next quarter, is eroding. Sites are adding authentication, metering developer access, and signing licensing deals with AI companies. If your RAG pipeline, agent, or search feature pulls from the open web, the supply side is shifting underneath you, and the shift looks structural rather than seasonal.

What follows is an account of what is actually happening to web data access for AI agents in mid-2026, drawn from the clearest public cases, and of what it implies for how you source data.

Two curves crossing

Demand for web data is climbing on a curve with no recent precedent. Adobe reported that AI-driven traffic to US retail sites rose 138% year over year in May 2026, and is up 1,324% since October 2024. Claude's web visits grew more than 360% between January and May 2026, according to Forbes. In late May and June, Robinhood and then Coinbase shipped agents that can act on a user's behalf, including trading and paying for premium research. The agents are not demos anymore. They are traffic, and increasingly traffic that spends money.

Agent load is not just more of the same load. It is shaped differently. NBC, reporting the Cloudflare figure, put the asymmetry well: a human might visit five websites before making a purchase, while an AI service might visit five thousand. A person reads one product page; an agent comparison-shops the whole category, refetches for freshness, and retries on failure. The same task that once produced a handful of human pageviews now produces thousands of automated ones.
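
The arithmetic behind that asymmetry is easy to sketch. The five and five thousand page counts come from the NBC comparison above; the refetch count and retry rate are illustrative assumptions, not measurements, and `requests_per_task` is a hypothetical helper.

```python
def requests_per_task(pages: int, refetches_per_page: int = 0,
                      retry_rate: float = 0.0) -> float:
    """Estimate the requests one task generates.

    pages: distinct pages visited for the task
    refetches_per_page: extra fetches per page for freshness (assumed)
    retry_rate: fraction of requests that get retried once (assumed)
    """
    base = pages * (1 + refetches_per_page)
    return base * (1 + retry_rate)


# Page counts from the comparison above; the other inputs are assumptions.
human = requests_per_task(pages=5)
agent = requests_per_task(pages=5000, refetches_per_page=1, retry_rate=0.05)

print(f"human: {human:.0f} requests")
print(f"agent: {agent:.0f} requests")
print(f"amplification: {agent / human:.0f}x")
```

Even modest refetch and retry assumptions push the multiplier well past the raw thousand-fold gap in pages visited, which is why the load looks different in shape and not only in size.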

That is the crossing point. Demand for programmatic access is compounding at the exact moment the cost of serving it becomes visible to the sites paying for bandwidth and origin compute. When one side of a market grows by more than 1,300% and the other side starts itemizing the bill, something gives.

The supply side is pulling up the drawbridge

Strava is the cleanest case to date. On June 1, 2026, ahead of its planned IPO, the company moved data that used to be public, including profiles and fitness club listings, behind authentication. It introduced a flat $11.99 per month fee for developer API access, with pricing that may vary by region, and began retiring some API endpoints with a 90-day grace period. Its developer community had grown from 185,000 to 241,000 members in a year, so this is not a small or unloved surface being quietly deprecated. It is a deliberate tightening of a heavily used one.

Strava's CEO, Michael Martin, was blunt about the motive: "AI companies are ruthlessly scraping public websites, given their endless need for training data, which is degrading site performance." Read past the framing and the mechanics generalize. A site notices unsustainable automated load, then reaches for the levers every site has: authentication walls, rate caps, per-developer pricing, endpoint sunsets, and content licensing deals struck directly with AI companies. None of these is new on its own. What is new is a flagship consumer platform deploying them together, on a public schedule, and saying out loud that AI scraping is the reason.
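
On the client side, several of those levers tend to surface as recognizable HTTP signals. The sketch below is a heuristic, assuming a site uses standard status codes and, for retirements, the `Sunset` response header defined in RFC 8594. Sites are free to signal differently or not at all, and the `classify_access` helper and its labels are hypothetical.

```python
from typing import Mapping, Optional

# Status codes that commonly accompany the levers described above.
# This mapping is an assumption about typical behavior, not a guarantee.
ACCESS_SIGNALS = {
    401: "authentication wall",
    402: "payment required",
    403: "blocked or forbidden",
    410: "endpoint retired",
    429: "rate cap hit",
}


def classify_access(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return a short label if a response suggests access is tightening."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status in ACCESS_SIGNALS:
        return ACCESS_SIGNALS[status]
    if "sunset" in lowered:
        # A Sunset header can arrive on a successful response,
        # announcing a retirement date in advance.
        return f"sunset announced for {lowered['sunset']}"
    return None


# Hypothetical responses; the date is made up for illustration.
print(classify_access(429, {}))
print(classify_access(200, {"Sunset": "Thu, 31 Dec 2026 23:59:59 GMT"}))
print(classify_access(200, {}))
```

Logging these labels per source over time gives an early view of which sites are tightening, before a feature actually breaks.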

Strava will not be the last. Expect the pattern wherever a site holds data that is valuable to models and runs a business that does not depend on being freely crawlable. The takeaway for builders is not that scraping is doomed; it is that durable, free, unauthenticated access to any given site can no longer be an assumption baked into your architecture. The access patterns worth understanding are changing, and our guide to agentic web access tracks how.

MCP as a gate, not just a pipe

The most instructive detail in the Strava move is easy to miss. Alongside the walls and fees, the company said it plans to add support for the Model Context Protocol, specifically to control exactly what gets shared and how.

That reframes MCP. Most builders treat it as a convenience layer, a tidy way to wire tools into an assistant, and our primer on MCP and tool use covers that side of it. Strava is treating it as access control. The protocol becomes the sanctioned, metered front door, opened on the vendor's terms, while the unsanctioned side entrance of open scraping gets bricked up. You get structured, permissioned access to what the vendor chooses to expose, and you lose the ability to take whatever a logged-out browser could see.

That is a reasonable trade for the vendor and a real constraint for you. A sanctioned channel is more stable and more legible, but its scope is set by someone whose incentives are not yours. Plan for the door to be narrower than the open web was.

What this means if you build on web data

Public is not the same as permanently available. Treat access to any specific site as a depreciating asset. A source that is free and unauthenticated today can move behind a login or a fee, or lose its endpoint altogether, within a 90-day window, which is exactly the grace period Strava gave. If a feature cannot survive losing its top source on short notice, that is a design risk, not an edge case.
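
One way to make that risk visible is to measure how concentrated each feature's fetches are. The sketch below is a minimal version of that check. The feature names, hostnames, and the 50% threshold are all hypothetical, and a real audit would read from your own request logs.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_source_share(fetch_log: List[str]) -> Tuple[str, float]:
    """Return the most-used source and its share of all fetches."""
    counts = Counter(fetch_log)
    source, hits = counts.most_common(1)[0]
    return source, hits / len(fetch_log)


def concentrated_features(
    logs: Dict[str, List[str]], threshold: float = 0.5
) -> Dict[str, Tuple[str, float]]:
    """Flag features whose top source exceeds the (assumed) threshold."""
    flagged = {}
    for feature, fetch_log in logs.items():
        if not fetch_log:  # nothing fetched yet, nothing to measure
            continue
        source, share = top_source_share(fetch_log)
        if share > threshold:
            flagged[feature] = (source, share)
    return flagged


# Hypothetical fetch logs: one hostname per request made by each feature.
logs = {
    "route_search": ["fitness.example"] * 8 + ["maps.example"] * 2,
    "news_digest": ["a.example", "b.example", "c.example", "d.example"],
}
flagged = concentrated_features(logs)
for feature, (source, share) in flagged.items():
    print(f"{feature}: {share:.0%} of fetches depend on {source}")
```

Here only `route_search` is flagged, because a single host serves most of its requests. The threshold is a judgment call, not a standard.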

Sanctioned channels are gaining value over raw scraping. Official APIs, MCP servers, licensed feeds, and AI-search APIs built over an owned index are becoming the more durable way in. Tools such as Exa, Tavily, and Firecrawl sell access as a product, which makes that access contractual rather than tolerated. The contract is the point. An index that someone maintains and stands behind does not vanish when a single target site changes its robots policy. Our overview of independent search indexes for AI covers why owning, or buying from someone who owns, an index matters here.

The cost of access is moving from engineering effort to line item. For years the price of web data was mostly labor: writing and maintaining scrapers, rotating around blocks, parsing messy HTML. That cost is not going away, but a second one is landing next to it: explicit fees, per-developer pricing, and licensing. Budget for access as a recurring cost, not a one-time build. It changes the build-versus-buy calculation for a lot of teams, because the "free" side of build just got more expensive and less certain.
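
A rough budget comparison can put both costs in the same units. In the sketch below, only the $11.99 monthly fee comes from the Strava example; the seat counts, maintenance hours, and hourly rate are placeholder assumptions to replace with your own figures.

```python
from dataclasses import dataclass


@dataclass
class AccessOption:
    name: str
    monthly_fee: float        # explicit fee per developer seat
    seats: int                # number of developer accounts
    monthly_eng_hours: float  # ongoing maintenance labor (assumed)
    hourly_rate: float        # loaded engineering cost per hour (assumed)

    def annual_cost(self) -> float:
        """Recurring yearly cost: explicit fees plus maintenance labor."""
        fees = self.monthly_fee * self.seats * 12
        labor = self.monthly_eng_hours * self.hourly_rate * 12
        return fees + labor


# Every figure except the 11.99 fee is an illustrative placeholder.
paid_api = AccessOption("paid API", monthly_fee=11.99, seats=3,
                        monthly_eng_hours=2, hourly_rate=100)
scraper = AccessOption("in-house scraper", monthly_fee=0.0, seats=0,
                       monthly_eng_hours=20, hourly_rate=100)

for option in (paid_api, scraper):
    print(f"{option.name}: ${option.annual_cost():,.2f} per year")
```

The specific totals matter less than the structure: both options carry a recurring cost, and the labor term in the scraper option is the one most likely to grow as sites tighten access.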

Resilience comes from diversification. Do not single-source a site that can wall you out on a quarter's notice. Spread critical data across multiple providers and channels, and know your fallback in advance for when one of them closes. The teams that handle the next round of lockdowns well will be the ones who already assumed it was coming.
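
Knowing the fallback in advance can be as simple as an ordered list of providers tried in turn. This is a minimal sketch, assuming each provider is wrapped in a callable that either returns data or raises. The provider names and stub functions are hypothetical stand-ins for real clients.

```python
from typing import Callable, Iterable, List, Tuple

Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def fetch_with_fallback(query: str, providers: Iterable[Provider]) -> Tuple[str, str]:
    """Try providers in priority order; return (provider name, result)."""
    errors: List[str] = []
    for name, fetch in providers:
        try:
            return name, fetch(query)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stubs standing in for real provider clients.
def walled_site(query: str) -> str:
    raise PermissionError("401: login now required")


def licensed_feed(query: str) -> str:
    return f"results for {query!r} from licensed feed"


used, result = fetch_with_fallback(
    "club listings",
    [("walled_site", walled_site), ("licensed_feed", licensed_feed)],
)
print(used, "->", result)
```

Recording which provider actually served each request also shows how often the primary source is failing, which is an early signal that a fallback is becoming the main path.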

The uncomfortable symmetry

It is worth being honest about the other side of this. If bots are 57.4% of traffic, AI builders account for a large share of that 57.4%. The performance degradation Strava describes is not invented; aggressive, redundant crawling does impose real costs on the sites being crawled. The squeeze is partly a response to behavior the industry chose. Saying so is not hand-wringing. It is the reason the sanctioned, metered model is likely to win: it gives sites a way to say yes that does not mean absorbing unbounded load for free. Builders who route through those channels end up paying their share of a bill that used to be hidden.
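
On the builder's side, the cheapest way to cut redundant crawling is not to refetch what was fetched a moment ago. The sketch below wraps any fetch function in a small time-based cache. The `CachedFetcher` class, the 60-second lifetime, and the stub fetch function are illustrative assumptions, and a production client would also honor the cache headers a site sends.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFetcher:
    """Serve repeat requests from memory until an entry is older than the TTL."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no request reaches the site
        body = self._fetch(url)
        self._store[url] = (now, body)
        return body


# Stub fetch function that records each real request (hypothetical URL).
origin_hits = []


def stub_fetch(url: str) -> str:
    origin_hits.append(url)
    return f"<html>{url}</html>"


fetcher = CachedFetcher(stub_fetch, ttl_seconds=60)
for _ in range(3):
    fetcher.get("https://site.example/product/1")
print(f"requests sent to origin: {len(origin_hits)}")
```

Three lookups produce one request to the origin here. The injectable `clock` parameter exists so the expiry behavior can be tested without waiting.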

What to watch over the next 6 to 12 months

Three things are worth tracking. Pay-per-crawl arrangements, where access is priced and permissioned at the edge, will probably move from experiment to normal. More sites with model-relevant data and crawl-independent businesses will follow Strava behind authentication. And MCP, or whatever sanctioned-access standard consolidates, will increasingly act as a gate rather than a convenience, with vendors deciding exactly what passes through.

The through line for builders is easy to state and harder to execute: assume open access is contracting, and move the data you depend on onto channels you can count on before the choice gets made for you.