Who owns the index? The question AI builders skip when evaluating search APIs
When an AI product team evaluates a search API, the process is predictable. Pick a representative query set. Run it through the candidate APIs. Compare recall, latency at the 95th percentile, and cost per thousand queries. The vendor that clears the recall bar at the best price gets the contract.
The question that almost never appears in this evaluation: where does the underlying index come from?
It is, by a large margin, the most consequential question that teams skip. Not because the benchmark results are unimportant – they aren't – but because index ownership determines vendor durability, and vendor durability determines whether the dependency you're building this month will still work in eighteen months.
The AI search API market has grown fast enough in 2025–2026 that independent-index providers, Bing licensees, and SERP scrapers all compete for the same deals, often at similar price points, often with similar recall on standard benchmark sets. They are not the same kind of product, and understanding the differences is basic due diligence.
Three tiers of index ownership
The market currently breaks into three structurally distinct tiers.
Tier 1: Fully independent indexes. These vendors operate their own crawlers and maintain their own web index. They do not license results from a third party. Brave Search is the clearest example: it launched in 2021 on a largely in-house index and, in April 2023, publicly removed its last fallback to a third-party index, making it fully self-sufficient. Its crawler, operated entirely in-house, indexes the open web continuously. Exa – formerly Metaphor Systems, rebranded in late 2023 – built a neural-first index from scratch beginning in 2022, designed for semantic and entity-based recall rather than keyword matching.
LinkUp, a French startup that entered the market in 2024, maintains its own index built from publisher-licensed content, deliberately designed around GDPR obligations that affect where and how data can be retained. It is smaller than Brave or Exa but is genuinely independent – it does not route queries through a third-party search engine.
Fully independent providers own their data and their roadmap. They can change retrieval algorithms, expand crawl coverage, and adjust data retention policies without asking anyone's permission. The cost is infrastructure: operating a web-scale crawler is expensive, and smaller independent indexes have real coverage limitations compared to Google and Bing.
Tier 2: Hybrid and licensed architectures. Perplexity's Sonar API draws from Perplexity's own real-time crawler, which is designed for near-real-time indexing of news and high-velocity domains. But Perplexity also integrates structured data from partners and leverages data that originates outside its own crawl. The resulting API is fast and fresh, particularly for recent events, but its independence is partial rather than complete.
Bing's Web Search API is in a category of its own: not truly independent (you're dependent on Microsoft's index decisions and pricing), but not a scraper either. Microsoft publishes API terms, runs a legitimate business around the index, and has historically been less aggressive than Google in restricting downstream API access. Several AI search startups have used Bing's API as the foundation for higher-level products. You.com, for instance, combines Bing-licensed results with proprietary processing. The risk profile is real – licensing arrangements can change – but it is meaningfully lower than the tier below.
Tier 3: SERP scrapers. SerpApi, ValueSerp, Serpstack, and a long tail of similar providers are not search APIs in the index sense. They send queries to Google or Bing, scrape the returned search engine results pages, and return the extracted data through an API interface. This is a legitimate product category – for SERP-specific use cases (rank tracking, competitor keyword intelligence, shopping results, local results), a SERP scraper is exactly the right tool. It is not a substitute for an AI-native search API.
SERP scrapers are structurally dependent on continued access to search engines that do not officially permit that access at scale. Google's terms of service explicitly prohibit automated querying of its search results. The practical enforcement of that prohibition has been inconsistent, but it has not been absent. Relying on a SERP scraper as the search backbone of a production AI product is accepting a dependency on continued tolerance from a platform that has every incentive to push developers toward paid products.
Why the benchmark doesn't tell you this
The problem with using recall benchmarks to select search APIs is that recall benchmarks measure recall, not durability. A SERP scraper can achieve excellent recall on a test query set – it is, after all, returning Google results. The test cannot distinguish between "this API has a robust, independently sustainable data layer" and "this API has excellent results because it is piggybacking on Google's infrastructure without authorization."
Similarly, a smaller independent index might underperform on the benchmark's navigational queries while dramatically outperforming on the entity-recall or semantic-similarity queries that matter most for an AI application's actual use case. Benchmark construction matters as much as benchmark scores.
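To make the benchmark-construction point concrete, here is a minimal sketch of an evaluation harness that reports recall@k per query category alongside p95 latency, so navigational and semantic recall are scored separately instead of being averaged into one number. The `evaluate` function and its provider interface are hypothetical, assumed for illustration rather than taken from any vendor's real SDK.

```python
import time
import statistics
from typing import Callable

def evaluate(search: Callable[[str], list[str]],
             queries: dict[str, list[tuple[str, set[str]]]],
             k: int = 10) -> dict:
    """Score a search callable per query category (hypothetical harness).

    `queries` maps a category name (e.g. "navigational", "semantic")
    to (query, relevant_urls) pairs; recall@k is the fraction of
    relevant URLs that appear in the top-k results.
    """
    report = {}
    latencies = []
    for category, pairs in queries.items():
        recalls = []
        for query, relevant in pairs:
            start = time.perf_counter()
            results = search(query)[:k]
            latencies.append(time.perf_counter() - start)
            hits = len(relevant & set(results))
            recalls.append(hits / len(relevant) if relevant else 0.0)
        report[category] = sum(recalls) / len(recalls)
    # 95th percentile of observed latency (19th of 19 cut points)
    report["p95_latency_s"] = statistics.quantiles(latencies, n=20)[-1]
    return report
```

A harness shaped like this makes the gap described above visible: a provider can post a strong aggregate score while its per-category breakdown shows weak recall on exactly the query type your application depends on.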
The Twitter API is the canonical cautionary tale for platform dependency. When Twitter (now X) reversed its API policy in early 2023 – ending free tier access, repricing the paid tiers by an order of magnitude, and restricting previously permitted use cases – thousands of products that had been built on that data had to scramble for alternatives. Some had no viable alternative. The API they had selected on performance and price grounds ceased to exist in its original form within weeks.
No one evaluating the Twitter API in 2021 would have predicted the 2023 outcome. But the structural dependency was always there. A dependency on a platform's continued goodwill is a dependency that can be withdrawn.
What independent actually costs
Independent indexes are not a free upgrade. They come with real trade-offs that teams should account for before selecting.
Coverage. Independent indexes are smaller than Google's and Bing's. Brave Search indexes billions of pages; Google is estimated to index hundreds of billions. For broad navigational queries or long-tail domains with sparse backlinks, an independent index will miss pages that Google would find. For research-oriented, technical documentation, or academic content queries – which describe most AI application workloads – the gap is much smaller and often imperceptible.
Freshness for niche domains. For high-traffic domains, independent crawlers keep up well. For the long tail – small business sites, infrequently updated microsites, new domains – crawl freshness in an independent index can lag by weeks or months. This matters for AI applications that need current data across broad domain categories. It matters less for applications focused on specific, high-traffic content categories.
Query type fit. Exa's neural index was built for entity and concept retrieval – "find me documents about X," "find me sources discussing Y." It is not optimized for navigational queries ("find me the official site for Z"). Teams using Exa as a Google substitute for general queries will notice recall gaps. Teams using it as a semantic document retrieval layer for RAG pipelines will often find it outperforms Bing-based alternatives on the queries that matter.
Pricing variability. Independent providers generally have higher per-query costs than Bing-licensed alternatives, because they are paying for crawl infrastructure without being able to spread that cost across Google's or Microsoft's broader business. The pricing gap has narrowed as these providers have scaled, and it should continue to narrow – but it exists.
The regulatory angle
A factor that has begun to matter in 2026 and will matter more: EU AI Act and GDPR data origin requirements are creating compliance questions for AI applications that process web data at scale. How that data was crawled, where it is stored, and what rights exist over it are questions that regulators have started asking.
Fully independent EU-based indexes – LinkUp's GDPR-by-design architecture is the clearest example – have an answer to those questions. APIs built on top of infrastructure operated by US-based hyperscalers, or scraped from platforms without explicit authorization, have a more complicated answer. This is not an acute risk for most teams today. It will be a real compliance dimension within the next eighteen months for teams operating in regulated industries or deploying in EU markets.
What to ask before selecting
Six questions that should be standard in any search API procurement evaluation:
Does the vendor operate a crawler? Ask directly. Evasive answers – "we use multiple data sources," "we aggregate from various providers" – signal a hybrid or reseller architecture. A vendor with an independent crawler can describe it specifically: how many pages indexed, crawl frequency, coverage methodology.
What is the upstream licensing arrangement? For any API that licenses index access from a third party, ask what the contract terms say about uptime commitments, data access guarantees, and pricing change windows. These terms are often not disclosed publicly, which is itself informative.
What is the crawl coverage and freshness policy? Specific numbers: approximate indexed page count, crawl intervals for popular vs. long-tail domains, recency of the most recent full index refresh.
What is the API's own history? Has the vendor changed pricing, shut down API tiers, or restricted access in the past eighteen months? How much notice did they give? This is especially important for newer entrants that haven't yet had a pricing cycle.
Is the data origin compliant with your legal requirements? For EU deployments or regulated data categories, understand where the crawl data is stored and what rights the vendor holds over it.
What is the fallback plan if this API becomes unavailable? If the honest answer is "we haven't thought about this," that is useful information.
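The fallback question above is partly an architectural one, and it can be answered before procurement rather than after an outage. A minimal sketch, assuming a provider-agnostic interface with an ordered fallback chain; every name here is a placeholder, not a real vendor SDK:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    """One search backend behind a uniform callable (illustrative)."""
    name: str
    search: Callable[[str], list[str]]

class SearchClient:
    """Tries providers in preference order; falls through on failure."""

    def __init__(self, providers: list[Provider]):
        self.providers = providers  # ordered: primary first

    def search(self, query: str) -> list[str]:
        errors = []
        for provider in self.providers:
            try:
                return provider.search(query)
            except Exception as exc:  # real code would narrow this
                errors.append(f"{provider.name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The design point is that application code depends only on `SearchClient`, so demoting a vendor after a pricing change, or removing it entirely, is a one-line edit to the provider list rather than a rewrite.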
The AI search API market in 2026 offers more genuinely good options than it did two years ago. The field has expanded from a handful of experimental APIs to a competitive market with vendors across all three tiers. That expansion also means that the surface-level benchmark differences between tiers have compressed – it is possible to pick a SERP scraper that performs as well as an independent index on a test query set, at lower initial cost, and only discover the structural problem later.
Teams that understand the index ownership question before they sign a contract will make better-calibrated decisions about what they are actually buying. The benchmark tells you what the API does today. The index ownership question tells you whether it will still do it in two years.
Know which question you are answering.