serp.fast

Independent Search Indexes for AI: Querying the Web Without Google

Brave, Common Crawl, Mojeek, Stract, Marginalia – the independent web indexes AI builders use when querying Google directly is not an option.

Nathan Kessler

Most AI products that touch the live web start by calling Google – either directly through a SERP API or indirectly through a search wrapper. That is the path of least resistance. It is also the path with the most exposure: scraping Google's SERPs has been litigated (SerpApi was sued by Google in 2024), terms can change unilaterally (Microsoft retired the Bing Search API in August 2025 with limited notice, breaking products built on it), per-query costs add up fast at scale, and your roadmap depends on a single supplier you do not control.

The alternative is to query a search index that is not Google. This guide is about which independent indexes actually exist, what each one does well, and how AI builders layer them into production systems.

Why bypass Google in the first place

Five forces push AI builders toward independent indexes:

  1. Legal exposure. Google's terms of service prohibit automated scraping of its SERPs. SerpApi-style services operate in a contested grey zone. If your product cites Google as a source, that is a procurement question your enterprise customers may eventually ask.
  2. Supplier risk. The Bing Search API shutdown in August 2025 broke production systems with weeks of notice. Single-supplier dependency on Google is a strategic vulnerability for any AI product whose core value depends on web data.
  3. Cost at scale. SERP APIs typically run $1 to $5 per 1,000 queries. At 100,000 queries per day, that compounds into a meaningful line item. Some independent indexes are cheaper per query; others (Common Crawl) are free if you absorb the infrastructure cost.
  4. Rate limits. Google's published quotas constrain high-volume agentic workloads. Independent indexes vary, but most have higher published QPS ceilings or negotiable enterprise limits.
  5. Data sovereignty. European customers increasingly ask where the underlying index lives and under whose jurisdiction. A US-only Google supply chain is harder to defend than an index choice that includes Mojeek (UK) or Qwant (France).
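The cost argument in point 3 is easy to sanity-check. A quick sketch of monthly spend at the per-1,000-query rates quoted above (the function and its numbers are simple arithmetic, not a published price sheet):

```python
# Monthly SERP API spend at a given per-1,000-query rate.
def monthly_cost(queries_per_day: int, price_per_1k: float, days: int = 30) -> float:
    """Return the monthly spend in dollars."""
    return queries_per_day / 1000 * price_per_1k * days

# 100,000 queries/day at the $1-$5 per 1K range:
low = monthly_cost(100_000, 1.0)   # 3000.0
high = monthly_cost(100_000, 5.0)  # 15000.0
print(f"${low:,.0f} to ${high:,.0f} per month")
```

At 100,000 queries per day that is $3,000 to $15,000 per month, which is the point where "just call a SERP API" stops being a rounding error.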

None of these arguments mean Google is wrong. They mean the question is no longer "should I use Google" but "what should my fallback look like, and when does the fallback become primary."

What "independent" really means

The word "independent" gets used loosely in this category. A calibration:

  • Fully independent. Operates its own crawler, builds its own index, ranks its own results. No upstream dependency on Google or Bing. Brave, Mojeek, Yandex, Stract, Marginalia, Common Crawl, Alexandria, and Gigablast all qualify.
  • Partially independent. Operates a crawler but supplements with Bing or Google results when its own index is thin. Qwant falls in this bucket – the verdict in our directory notes it relies partially on Bing.
  • Not independent. Re-ranks or proxies results from Google or Bing without its own crawl. DuckDuckGo, Startpage, and most consumer "privacy-focused" search engines fall here. They are valuable for end users but do not solve the AI builder's dependency problem.
  • AI search APIs that pair with independent indexes. Tavily is built on Brave plus its own extraction layer. Exa runs its own embeddings-based neural index. Linkup licenses content directly from publishers. These are layers above the indexes, not indexes themselves.

The litmus test: if Google or Microsoft pulled the rug tomorrow, would this provider still return results? If yes, it is independent. If no, it is a wrapper.

The independent indexes worth knowing

Brave Search API: the production default

Brave Search API is the only independent Western search index of meaningful scale that still ships a public developer surface. 40 billion+ pages, ~100 million added daily, fed by signals from the Brave browser's 100 million+ monthly active users. After Microsoft retired the Bing Search API, Brave became the sole at-scale alternative to Google for Western web search data.

Pricing is $5 per 1,000 requests on the Search Plan with 50 QPS and $5 in free monthly credits. The Answers Plan adds LLM-generated answers at $4 per 1,000 requests plus $5 per million tokens, capped at 2 QPS. That is the high end of the SERP API category, but no other provider in the category runs its own index at this scale. The trade-off is explicit: pay more, own the data path.
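Integration is a single authenticated GET. A minimal sketch, assuming the documented `web/search` endpoint and `X-Subscription-Token` header – check Brave's current API docs before shipping, since parameters and plans change:

```python
from urllib.parse import urlencode
from urllib.request import Request

BRAVE_ENDPOINT = "https://api.search.brave.com/res/v1/web/search"

def build_brave_request(query: str, api_key: str, count: int = 10) -> Request:
    """Build (but do not send) an authenticated Brave Search request."""
    url = f"{BRAVE_ENDPOINT}?{urlencode({'q': query, 'count': count})}"
    return Request(url, headers={
        "Accept": "application/json",
        "X-Subscription-Token": api_key,
    })

req = build_brave_request("independent search indexes", api_key="YOUR_KEY")
# Send with urllib.request.urlopen(req) and parse the JSON body
# for result URLs, titles, and snippets.
print(req.full_url)
```

In production you would wrap the send in retry and rate-limit handling to stay under the plan's QPS ceiling.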

Best fit: agents that cite sources publicly, RAG pipelines that need a defensible corpus, and any product where a Bing-style API sunset would break your roadmap. The index skews toward English-language and high-traffic sites, so coverage on long-tail non-English queries should be measured rather than assumed.

Common Crawl: the open archive

Common Crawl is the foundational dataset behind 64% of major LLMs. 9.5 PB of freely available web data, 2.4 billion pages crawled monthly, distributed as WARC files via S3. Roughly 80% of GPT-3's training tokens came from Common Crawl.

This is not an API. There is no query endpoint. You download the slice you need, build your own index (Lucene, OpenSearch, vector stores), and run the infrastructure. Free in dollar cost, expensive in engineering hours. Best used for training data, one-off corpus building, or as the raw substrate behind a custom RAG system. Not viable as the live retrieval layer for an interactive AI product unless you have already invested in the index infrastructure.
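"Build your own index" is less mysterious than it sounds: the core structure is an inverted index mapping terms to the documents that contain them. A toy sketch – real deployments use Lucene or OpenSearch, and the documents would come from parsed WARC records rather than a hardcoded dict:

```python
from collections import defaultdict

def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase term to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query: str) -> set[str]:
    """AND-semantics lookup: return docs containing every query term."""
    terms = query.lower().split()
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {
    "warc-001": "brave search api independent index",
    "warc-002": "common crawl open web archive",
    "warc-003": "open source search index stract",
}
idx = build_inverted_index(docs)
print(search(idx, "search index"))  # {'warc-001', 'warc-003'}
```

The gap between this sketch and production is exactly the engineering cost the paragraph above describes: tokenization, ranking, freshness, and petabyte-scale storage.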

Mojeek: the privacy-first European index

Mojeek operates its own crawler and ~3.6 billion page index out of the UK. The first search engine to pledge no tracking, principled about its independence, and the most credible non-US, non-Russian independent index.

The index is roughly an order of magnitude smaller than Brave's, and result quality reflects that. Mojeek is best used as a privacy-sensitive complement to Brave – a secondary data source for queries where a UK or European data path matters, or as a fallback if Brave's coverage is thin on a specific query. Not a primary index for high-volume Western use cases.

Stract: the open-source vision

Stract is fully open-source and non-profit, with user-customizable ranking signals and a small but growing independent index. Philosophically aligned with how transparent search ought to work. The codebase is on GitHub and self-hostable.

Practically, Stract is research-stage. The index is too small for production search, and its ranking layer has not been battle-tested against adversarial SEO. Track it as the most credible open-source attempt at a fresh independent index, not as something you ship today.

Marginalia Search: the small-web specialist

Marginalia Search deliberately indexes small, personal, text-heavy websites and excludes most commercial content. It surfaces the "old web" that Google de-prioritized over the last decade. Built and maintained by one developer, open-source, free to use.

For business-relevant search this is unusable by design. For research agents that need to find personal essays, niche technical writeups, or non-commercial primary sources, Marginalia returns content nothing else does. A useful supplemental endpoint, not a primary index.

The rest, briefly

  • Yandex Search API. Independent, Russia-based, dominant in Russian-language search. Geopolitical risk and sanctions exposure make this a fraught choice for Western companies. Use only for Russian-market workloads.
  • Qwant. French, GDPR-native, partially independent but supplements with Bing. Increasingly problematic as Bing winds down its API.
  • Webz.io. Pre-processed structured feeds for news, blogs, forums, and dark web. Enterprise pricing, no self-serve. Best for threat intelligence and media monitoring rather than general AI search.
  • Alexandria. Decentralized search project. Ideologically important, very early-stage, not production-viable.
  • Gigablast. One of the longest-lived independent crawlers, run by a single developer since 2000. Index is small and aging. Historical curiosity and educational reference rather than infrastructure.

Pairing with AI search APIs

Most production AI products do not query a raw index directly. They query an AI search API that wraps an index with extraction, ranking, and LLM-friendly formatting. The layering matters because it affects what you actually depend on.

  • Tavily is built on Brave Search API plus Tavily's own extraction and ranking layer. Choosing Tavily means accepting Brave as your transitive index supplier, with Tavily as the integration and content-extraction layer on top. Native LangChain and LlamaIndex retrievers ship out of the box.
  • Exa runs its own embeddings-based neural index. Independent of Brave and Google. The strength is semantic matching; the trade-off is that Exa's coverage and ranking philosophy are different from a keyword-style index.
  • Linkup builds an index from publisher-licensed content. The smallest in coverage but the cleanest in legal posture, since the data is licensed rather than crawled.
  • You.com aggregates across multiple sources rather than maintaining a single proprietary index. Useful for breadth, less useful when you need a single accountable data path.

For deeper coverage of the AI search API layer, see our AI search APIs guide and choosing a search API for AI.

The practical pattern most production teams converge on: pick one AI search API as the primary retrieval surface, and know which underlying index it depends on. If that dependency is unacceptable for your use case (legal, sovereignty, supplier risk), drop down a layer and call the index directly.
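That pattern is a few lines of plumbing. A sketch with hypothetical `search_tavily` and `search_brave` callables standing in for real client calls (neither function name is a real SDK method):

```python
from typing import Callable

def search_with_fallback(query: str,
                         providers: list[Callable[[str], list[dict]]]) -> list[dict]:
    """Try each provider in order; fall through on errors or empty results."""
    for provider in providers:
        try:
            results = provider(query)
            if results:
                return results
        except Exception:
            continue  # provider down or rate-limited; try the next layer
    return []

# Hypothetical stand-ins for real clients:
def search_tavily(q):  # primary: the AI search API layer
    raise ConnectionError("upstream outage")

def search_brave(q):   # fallback: the index, called directly
    return [{"url": "https://example.com", "title": q}]

print(search_with_fallback("independent indexes", [search_tavily, search_brave]))
```

The ordering encodes the policy: the convenient layer first, the index you actually trust underneath it.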

A decision tree

The fastest way to pick:

  1. You need a real-time queryable index, want ownership, and can spend $5+/1K queries. Brave Search API.
  2. You want the simplest agentic integration and accept Brave as the transitive supplier. Tavily.
  3. You need semantic / conceptual search and accept a different index philosophy. Exa.
  4. You need legally licensed content with strict provenance. Linkup.
  5. You are building training data or a custom RAG corpus and have infrastructure budget. Common Crawl.
  6. You need an EU-based privacy-first complementary index. Mojeek.
  7. You are building a research agent that needs small-web or non-commercial sources. Marginalia.
  8. You are operating in the Russian market specifically and accept the geopolitical exposure. Yandex.
  9. You want to support open-source search philosophy without depending on it for production. Stract or Alexandria, alongside a production index.
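For teams that want the same logic as configuration, the tree above collapses to an ordered rule table where the first matching requirement wins (the requirement keys are illustrative, not a standard taxonomy):

```python
# Ordered rules mirroring the decision tree; first match wins.
RULES = [
    ("realtime_owned_index", "Brave Search API"),
    ("simplest_agentic_integration", "Tavily"),
    ("semantic_search", "Exa"),
    ("licensed_content", "Linkup"),
    ("training_corpus", "Common Crawl"),
    ("eu_privacy_complement", "Mojeek"),
    ("small_web_research", "Marginalia"),
    ("russian_market", "Yandex"),
    ("open_source_support", "Stract or Alexandria"),
]

def pick_index(requirements: set[str]) -> str:
    for key, provider in RULES:
        if key in requirements:
            return provider
    return "Brave Search API"  # the production default from the tree

print(pick_index({"semantic_search"}))  # Exa
```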

What not to pick as a "Google alternative"

Three categories of provider get marketed as Google alternatives but do not actually bypass Google:

  • SERP scrapers. SerpApi, ScraperAPI's Google endpoint, and similar services hit Google's SERPs on your behalf. They do not solve the dependency problem; they amplify it by adding a legal grey zone.
  • Google Custom Search Engine. This is Google. It is the cheapest way to query Google's index programmatically, but it is still Google.
  • Bing wrappers. DuckDuckGo, Startpage, and several "privacy" search engines re-rank Bing results. With Bing's developer API gone, this category is also under stress.

If your goal is to reduce dependency on Google, only providers that operate their own crawler and index qualify. Everything else is a re-skin.

For broader context on how independent search fits into agentic systems, see our agentic web access guide.

Frequently asked

Why would I use an independent search index instead of Google?
Five reasons: legal exposure (scraping Google's SERPs has been litigated; SerpApi was sued by Google in 2024), terms-of-service risk (Google can change pricing or pull access unilaterally, as Microsoft did with the Bing Search API in August 2025), rate limits, per-query cost at scale, and dependency risk on a single supplier. AI builders shipping products that depend on web search increasingly want the option to own their data path.
Which independent indexes are actually usable in production today?
Three. Brave Search API (40B+ pages, $5/1K queries, real consumer browser feeding crawl signal), Common Crawl (9.5 PB free open archive, but you run the infrastructure), and Mojeek (~3.6B pages, paid API). Stract, Marginalia, Alexandria, and Gigablast exist as ideologically interesting projects but are not yet production-grade.
Is DuckDuckGo an independent search engine?
No. DuckDuckGo's web results come primarily from Bing under a commercial agreement. It is a privacy-respecting front end, not an independent index. The same applies to most consumer 'Google alternative' search engines – they re-rank Bing or Google results rather than crawling the web themselves.
How do AI search APIs like Tavily and Exa relate to the underlying indexes?
Tavily is built on top of Brave Search API plus its own extraction and ranking layer. Exa runs its own embeddings-based crawl and is independent of Brave or Google. Linkup builds an index from publisher-licensed content. Most production AI products query one of these layers rather than calling Brave or Common Crawl directly, because the layer handles content extraction, ranking, and LLM-friendly formatting in one call.
Can I just use Common Crawl as my search index?
You can, but it is not a queryable API. Common Crawl is a 9.5 PB monthly archive of raw web pages distributed via S3 and HTTP. To search it you need to download relevant slices, build your own index (typically Lucene, OpenSearch, or a vector store), and maintain that infrastructure. Free in cost, expensive in engineering. Most teams use Common Crawl for training data or one-off corpus building, not live search.
