Web Data for LLM Training: Sources, Quality & Scale
Every major large language model is trained primarily on web data. GPT-3's training corpus was over 80% web-sourced. Llama, Mistral, Gemini, Claude — all depend on massive collections of text crawled from the internet. The quality, scale, and legality of this data directly determine model capability.
For product leaders building on top of these models, the web data supply chain matters for two reasons. First, it explains why models behave the way they do — their strengths, biases, and blind spots are artifacts of training data. Second, it clarifies the infrastructure landscape: the same tools and companies that crawl the web for training data also power the retrieval, grounding, and augmentation pipelines that make AI products useful in production.
Why the web is the primary training source
Language models need text. Lots of it. GPT-4-class models train on datasets measured in trillions of tokens. The only source of text at that scale is the web.
Books, academic papers, and curated datasets contribute meaningfully — they tend to be higher quality per token — but they are finite. The web is effectively unbounded. It contains specialized knowledge across every domain, in every language, at every level of formality. A model trained only on books would lack conversational fluency. One trained only on Wikipedia would lack depth on niche topics. The web provides breadth that no other source matches.
The practical consequence is that web crawling infrastructure is not peripheral to the AI industry. It is foundational. The organizations that control access to web data at scale — whether through proprietary indexes, crawling infrastructure, or data licensing agreements — occupy a critical position in the AI supply chain.
Common Crawl: the public web archive
Common Crawl is a 501(c)(3) nonprofit that has been crawling the web since 2008 and maintains an open archive of over 9.5 petabytes. It crawls roughly 2.4 billion pages per month and makes the data freely available on AWS.
The numbers on Common Crawl's role in LLM training are striking. According to the organization's own tracking, 64% of major LLMs used Common Crawl data in their training sets. Over 80% of GPT-3's training tokens came from Common Crawl-derived data (filtered and processed into datasets like C4 and The Pile). Meta's Llama models, Google's T5, and numerous other foundation models all list Common Crawl as a primary data source.
Common Crawl's value is its openness and scale. Its limitation is data quality. The raw archive contains enormous amounts of spam, boilerplate navigation text, cookie banners, duplicate content, and low-quality pages. Every team that uses Common Crawl for training spends significant engineering effort on filtering and deduplication. The dataset is a starting point, not a finished product.
This quality gap is the opportunity that commercial data providers address.
Data quality challenges at scale
The gap between "all the text on the web" and "text useful for training a language model" is enormous. Several categories of quality problems dominate.
Spam and SEO content. A significant fraction of the web consists of content generated primarily for search engine manipulation — keyword-stuffed articles, auto-generated pages, link farms. Including this in training data degrades model quality. Detecting and filtering it at the scale of billions of pages is a non-trivial engineering problem.
Duplication. The same content appears at multiple URLs. News articles get syndicated across hundreds of sites. Press releases appear verbatim on wire services, company blogs, and aggregators. Training on duplicated content overweights those topics and perspectives in the model. Near-duplicate detection — content that is similar but not identical — is computationally expensive at web scale.
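A common first pass at near-duplicate detection is shingling: hash overlapping word n-grams from each document and compare the resulting sets with Jaccard similarity. The sketch below is a minimal pure-Python illustration with invented example documents; real pipelines use MinHash and locality-sensitive hashing to avoid comparing every pair of billions of pages.

```python
import hashlib
import re

def shingles(text: str, k: int = 3) -> set[int]:
    """Hash every k-word shingle of the text to an integer."""
    words = re.findall(r"\w+", text.lower())
    return {
        int(hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest(), 16)
        for i in range(max(len(words) - k + 1, 1))
    }

def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity between two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Syndicated copies of a press release differ only slightly:
doc1 = "Acme Corp today announced record quarterly earnings driven by strong cloud demand."
doc2 = "Acme Corp today announced record quarterly earnings, driven by very strong cloud demand."
doc3 = "The committee voted to postpone the infrastructure bill until next session."

print(round(jaccard(shingles(doc1), shingles(doc2)), 2))  # high: near-duplicate
print(round(jaccard(shingles(doc1), shingles(doc3)), 2))  # near zero: unrelated
```

Documents scoring above a tuned threshold (often around 0.5) are collapsed to a single copy before training.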
Boilerplate and navigation. Every web page contains text that is not content: navigation menus, footers, cookie consent banners, sidebar widgets, ad copy. Extracting the actual content from a page — the article text, the product description, the forum post — requires understanding page structure. This is where extraction tools and content parsers become relevant to the training data pipeline.
Toxic and harmful content. The web contains content that, if included in training data, produces models that generate harmful outputs. Filtering for toxicity at scale requires its own classifiers, which themselves need to be trained and maintained.
Freshness decay. Web content goes stale. Pages get updated, deleted, or redirected. A training dataset crawled six months ago contains dead links, outdated information, and content that no longer reflects the current state of its source. This is less of a problem for pretraining (models do not need to know today's news) but critical for fine-tuning and RAG applications.
Language and encoding issues. The web contains text in hundreds of languages, often mixed within single pages. Character encoding errors, transliteration artifacts, and language identification mistakes all degrade data quality.
No single tool solves all of these. Training data preparation involves multi-stage pipelines combining crawling, extraction, deduplication, language identification, quality scoring, and toxicity filtering. The teams that do this well treat it as a core engineering function, not an afterthought.
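The pipeline shape described above can be sketched as a chain of document-level predicates, each rejecting one class of bad data. The filters and thresholds here are invented for illustration; real systems use learned quality classifiers alongside heuristics like these.

```python
import re
from collections import Counter

def long_enough(doc: str) -> bool:
    """Drop very short fragments (menus, captions, stubs)."""
    return len(doc.split()) >= 20

def low_symbol_ratio(doc: str) -> bool:
    """Drop documents dominated by markup debris or encoding noise."""
    clean_chars = sum(c.isalnum() or c.isspace() for c in doc)
    return clean_chars / max(len(doc), 1) > 0.8

def not_keyword_stuffed(doc: str) -> bool:
    """Drop documents where one token dominates (crude SEO-spam signal)."""
    words = re.findall(r"\w+", doc.lower())
    if not words:
        return False
    top_count = Counter(words).most_common(1)[0][1]
    return top_count / len(words) < 0.2

FILTERS = [long_enough, low_symbol_ratio, not_keyword_stuffed]

def clean(corpus: list[str]) -> list[str]:
    """Keep only documents that pass every quality filter."""
    return [doc for doc in corpus if all(f(doc) for f in FILTERS)]

good = ("The study examined how language models acquire factual knowledge "
        "from large text corpora and found that duplication strongly influences "
        "memorization rates across several model sizes tested.")
spam = "buy cheap shoes " * 10

print(len(clean([good, spam])))  # the spam document is filtered out
```

In practice each stage is run at massive scale, and the order matters: cheap filters run first so that expensive ones (classifiers, deduplication) see fewer documents.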
Commercial web data for training
Several companies have built businesses specifically around providing high-quality web data at the scale required for LLM training.
Bright Data operates the largest proxy infrastructure in the industry, with over 150 million residential IPs across 195 countries. Its relevance to LLM training is both direct and indirect. Directly, Bright Data offers structured datasets and web scraping infrastructure used by AI companies for training data collection. Indirectly, its proxy network underpins much of the web scraping activity that feeds into training pipelines. The company has positioned its Data4AI initiative to serve this market explicitly.
Oxylabs, the other major proxy and web data provider, serves a similar market segment. Both companies provide the infrastructure layer — proxies, SERP APIs, scraping APIs — that enables large-scale web data collection. Their enterprise clients include companies building foundation models, though specific customer names are typically not disclosed.
At a different layer, Firecrawl and Crawl4AI address the extraction problem. Raw HTML is wasteful as training data. What training pipelines need is clean text — article bodies, product descriptions, documentation — stripped of boilerplate and formatting artifacts. Firecrawl converts websites to LLM-ready markdown via API, handling JavaScript rendering, proxy management, and content extraction in one call. Crawl4AI, fully open-source, provides similar capabilities for teams that prefer self-hosted infrastructure.
Apify's marketplace of over 10,000 pre-built scraping Actors enables domain-specific data collection. Rather than building custom scrapers for each target site, teams can compose Actors for specific sources — job boards, e-commerce sites, social platforms, news outlets — into data collection pipelines.
Zyte brings fifteen-plus years of web data extraction experience, including the creation and maintenance of Scrapy, the most widely used Python scraping framework. Their enterprise data-as-a-service offering handles the full pipeline from crawling to structured data delivery.
Diffbot takes a fundamentally different approach, using computer vision and NLP to understand web pages structurally. Its knowledge graph contains over one trillion facts extracted from the web, making it a source for structured entity data rather than raw text.
The role of web indexes
Companies that maintain their own crawled index of the web occupy a strategically important position in the training data supply chain.
Brave Search API provides access to an independent index of over 40 billion pages, with 100 million new pages added daily. Brian Brown, Brave's Chief Business Officer, has stated that the Brave Search API "currently supplies most of the top 10 AI LLMs with real-time Web search data." For teams that need access to web-scale data without scraping Google or Bing, Brave's index is one of very few options.
Exa maintains its own embeddings-based web index, which means it has crawled and indexed a substantial portion of the web independently. This data asset serves its search API but also positions the company in the training data supply chain.
Linkup is building a proprietary index with a distinctive approach — licensing content from publishers directly. This addresses one of the most contentious issues in the training data landscape: whether crawling and using web content for AI training constitutes fair use or requires explicit permission.
The strategic value of maintaining an independent web index has increased significantly since Microsoft's shutdown of the Bing Search API in August 2025 and Google's DMCA lawsuit against SerpAPI in December 2025. Companies that depend on scraping Google for web data face both legal risk and supply risk. Those with independent indexes face neither.
Legal considerations
The legal landscape around web data for AI training is unsettled and evolving rapidly. Product leaders need to understand the current state, even though the final legal framework is far from established.
Copyright. The central legal question is whether training AI models on copyrighted web content constitutes fair use (in the US) or falls under equivalent exceptions in other jurisdictions. Multiple lawsuits are working through the courts, including cases brought by the New York Times, Getty Images, and various authors' groups against model developers. No definitive precedent has been set at the appellate level.
Robots.txt. The robots.txt standard tells crawlers which pages a site operator does not want crawled. Historically, this was an informal convention primarily relevant to search engines. As AI companies began crawling the web aggressively for training data, many publishers updated their robots.txt files to block AI crawlers. Whether robots.txt has legal force — and whether ignoring it creates liability — remains debated.
Terms of service. Most websites' terms of service prohibit automated access. Whether these terms create enforceable contractual obligations against parties who never affirmatively agreed to them is jurisdiction-dependent and factually complex.
The DMCA angle. Google's lawsuit against SerpAPI introduced a new legal theory: that bypassing anti-scraping technical measures violates the DMCA's anti-circumvention provisions (Section 1201). If this theory prevails, it would create criminal and civil liability for circumventing anti-bot systems, regardless of the purpose of the scraping.
Data licensing. The emerging alternative to the legal uncertainty of scraping is direct licensing. Several major publishers have signed data licensing agreements with AI companies — the Associated Press with OpenAI, Reddit with Google, various news organizations with multiple model developers. Linkup's publisher-licensing model reflects this trend.
For product leaders, the practical guidance is: know your data provenance. If your product depends on a foundation model, understand where its training data came from. If you are collecting web data for fine-tuning or RAG, understand the legal basis for that collection. The landscape is shifting toward a regime where data access requires either explicit licensing or strong fair-use arguments, and "we just crawled it" may not be sufficient.
How this connects to your AI product
The web data supply chain matters to product leaders at three levels.
Foundation model selection. The training data behind your chosen model affects its capabilities. Models trained on more diverse, higher-quality web data tend to perform better on real-world tasks. Understanding data provenance helps explain why one model outperforms another on your specific use case.
RAG and grounding. Most production AI products supplement the base model with real-time web data through retrieval-augmented generation. The same infrastructure that serves training data — search APIs, scraping APIs, extraction tools — also powers RAG pipelines. The RAG market is projected at $1.9 billion in 2025, growing at 35-49% CAGR, specifically because models with access to current web data produce dramatically better results. Published evaluations report that RAG can cut hallucination rates by 70-90%, though the figures vary widely by task and implementation.
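The core of a RAG pipeline is simple: retrieve relevant snippets, then place them in the prompt with instructions to answer from them. The sketch below shows that grounding step only; the snippet format, citation convention, and example data are illustrative, not any particular vendor's API.

```python
def build_grounded_prompt(question: str, snippets: list[dict]) -> str:
    """Assemble a prompt that grounds the model in retrieved web snippets."""
    context = "\n\n".join(
        f"[{i + 1}] {s['title']} ({s['url']})\n{s['text']}"
        for i, s in enumerate(snippets)
    )
    return (
        "Answer using ONLY the sources below. Cite them as [1], [2], ...\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Snippets would come from a search or scraping API in production.
snippets = [
    {"title": "Example Crawl Overview", "url": "https://example.com/crawl",
     "text": "The archive publishes monthly crawls of roughly 2.4 billion pages."},
]
print(build_grounded_prompt("How large is a monthly crawl?", snippets))
```

The effectiveness of this pattern depends almost entirely on retrieval quality, which is why the search and extraction tools discussed in this guide sit directly in the RAG critical path.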
Fine-tuning. If you fine-tune models on domain-specific web data, the quality of that data directly determines the quality of the fine-tuned model. All the data quality challenges described above — spam, duplication, boilerplate, staleness — apply to your fine-tuning corpus.
The tools covered in this guide — from Common Crawl for raw scale, to Bright Data and Oxylabs for infrastructure, to Firecrawl and Crawl4AI for extraction, to Exa and Brave Search API for indexed search — form a supply chain that feeds directly into every layer of the AI stack. Understanding that supply chain is not optional for leaders building AI products. It is the difference between treating the model as a black box and understanding the machinery that makes it work.
The market ahead
Two dynamics will shape the web data training market over the next several years.
First, consolidation. The Tavily acquisition by Nebius ($275-400 million for a fifteen-month-old company) and Jina AI's acquisition by Elastic signal that web data infrastructure is being absorbed into larger platform companies. Independent providers will either grow large enough to remain independent or be acquired.
Second, the shift from scraping to licensing. As legal pressure increases and publishers become more sophisticated about the value of their content to AI companies, the training data market will increasingly operate through formal data agreements rather than unilateral crawling. Companies positioned on the licensing side of this transition — those with publisher relationships, independent indexes, or ethical data sourcing practices — will have structural advantages.
For product leaders evaluating web data infrastructure, the question is not just which tool works today. It is which tool's data sourcing model will remain viable as the legal and commercial landscape matures.