Web Content Extraction Benchmarks: How to Evaluate Scraping Quality

WCXB, WebMainBench, article-extraction-benchmark, SWDE – open benchmarks measuring how well extraction tools handle articles, forums, products, and collections.

Nathan Kessler

Each tool referenced is evaluated against our methodology using public docs, vendor demos, and hands-on testing.

Picking a web content extraction library by reading GitHub stars or vendor marketing produces unreliable results. The quality gap between tools is real, and it is unevenly distributed across page types. Open benchmarks have quantified this gap – but the benchmarks themselves differ in scope, metric, and what they actually measure. Here is what each one tells you and what it does not.

Why extraction quality is harder to assess than it looks

Most developers test extraction tools on a handful of news articles. News articles are the easy case: the main content is clearly delimited, the markup is mostly semantic, and every major extractor performs well. Move to e-commerce product pages, forum threads, documentation sites, or content aggregators and the picture changes. Tools that look equivalent on article-only tests diverge by 20–30 F1 points on mixed-type corpora.

The Web Content Extraction Benchmark (WCXB) – the most comprehensive open benchmark as of 2026 – quantified this specifically. Across 2,008 annotated pages spanning articles, blogs, forums, products, service pages, and collections, the top heuristic extractors cluster within 2–3 F1 points on articles (0.91–0.93) but spread by 20+ points on forums, products, and collections. If your AI pipeline ingests anything other than pure news articles, an article-only benchmark is the wrong signal.

The main open benchmarks

WCXB – Web Content Extraction Benchmark

Source: webcontentextraction.org | GitHub | Hugging Face

The largest open benchmark for web content extraction, boilerplate removal, and main content detection. The dataset contains 2,008 manually annotated pages from 1,613 domains, split into 1,497 development pages and a 511-page held-out test set. Pages are tagged across 7 structural types; 47% are non-articles.

Metric: word-level F1 – the harmonic mean of precision and recall measured by word-level overlap between extracted text and the human-labeled ground truth.
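
In code, the metric is a few lines. The sketch below is a minimal interpretation: it tokenizes on whitespace and lowercases, which are simplifying assumptions – WCXB's own scorer may normalize text differently.

```python
from collections import Counter

def word_f1(extracted: str, ground_truth: str) -> float:
    """Word-level F1: precision and recall of the word overlap
    (multiset intersection) between extracted text and ground truth."""
    ext = Counter(extracted.lower().split())
    gold = Counter(ground_truth.lower().split())
    overlap = sum((ext & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)
```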

What it evaluates: 12+ extractors – 10 heuristic-based (Trafilatura, rs-trafilatura, Mozilla Readability, newspaper4k, jusText, boilerpipe, inscriptis, and others) and 2 neural (MinerU-HTML and a hybrid pipeline).

Key results (April 2026 leaderboard):

  • rs-trafilatura alone: F1 0.893 on the held-out set
  • rs-trafilatura + MinerU-HTML fallback (~8% of pages routed to the LLM): F1 0.910
  • MinerU-HTML helps on articles, forums, and service pages but degrades on collections and products – illustrating that neural extractors are not uniformly better
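
The routing pattern behind the top entry is easy to prototype. The sketch below is not the WCXB pipeline: it substitutes the Python Trafilatura package for rs-trafilatura, uses a word-count threshold as a crude stand-in for a real confidence signal, and leaves `neural_extract` as a hypothetical callable for an LLM-based extractor such as MinerU-HTML.

```python
import trafilatura

def extract_with_fallback(html: str, neural_extract, min_words: int = 50):
    """Heuristic-first extraction with a neural fallback.

    `neural_extract` is a placeholder for whatever LLM-based extractor
    you route low-confidence pages to; the word-count threshold is a
    stand-in for a real confidence signal."""
    text = trafilatura.extract(html)  # returns None when extraction fails
    if text is None or len(text.split()) < min_words:
        return neural_extract(html), "neural"    # the minority of pages
    return text, "heuristic"
```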

WCXB is the right benchmark to cite when choosing a library for a general-purpose AI pipeline. Run your candidate tools through the development set yourself; the leaderboard tracks held-out results to prevent overfitting.

Scrapinghub article-extraction-benchmark

Source: GitHub – scrapinghub/article-extraction-benchmark | Hugging Face mirror

Published by Zyte (formerly Scrapinghub), this benchmark focuses specifically on news and blog article body extraction. It evaluates commercial services (Zyte Automatic Extraction, Diffbot) and open-source libraries (newspaper4k, readability-lxml, dragnet, boilerpipe, Trafilatura, go-trafilatura, Mozilla Readability, Readability.js, news-please, Goose3, inscriptis, html2text, jusText, BeautifulSoup).

Metric: token-level F1 on article body text.

Key results:

  • dragnet: F1 0.907 ± 0.014
  • boilerpipe: F1 0.860 ± 0.016
  • Trafilatura consistently near or above dragnet on most subsets
  • dragnet is no longer maintained and is included for reference only

The Scrapinghub benchmark is the right reference for news-article pipelines. Its scope is narrower than WCXB – it does not cover forum threads, product pages, or collections – but the labeled dataset is higher quality and more directly comparable across the specific libraries it covers.

WebMainBench

Source: Hugging Face – opendatalab/WebMainBench | GitHub – MinerU-HTML

A large-scale benchmark built by the MinerU-HTML team (OpenDataLab). Version 1.1 contains 7,809 annotated pages with corresponding ground-truth Markdown. A 100-page public evaluation subset is available; the full dataset is used internally.

Metric: ROUGE-N, measuring n-gram overlap between extracted Markdown and the human-labeled Markdown ground truth.
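
For intuition, the sketch below computes a bare-bones ROUGE-N F1 from n-gram overlap (bigrams by default). Whitespace tokenization is a simplifying assumption; WebMainBench's scorer may normalize the Markdown differently before counting.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate_md: str, reference_md: str, n: int = 2) -> float:
    """ROUGE-N F1 over n-gram overlap between extracted Markdown and the
    ground-truth Markdown. Contiguous n-grams penalize reordering and
    formatting drift that unigram (word-level) F1 would not notice."""
    cand = ngram_counts(candidate_md.split(), n)
    ref = ngram_counts(reference_md.split(), n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```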

Key results (from MinerU-HTML evaluation):

  • MinerU-HTML (fine-tuned Qwen3-0.6B): 81.8% ROUGE-N F1
  • Trafilatura (Python): 63.6% ROUGE-N F1

The ROUGE-N gap between Trafilatura and a neural extractor is larger on this benchmark than the word-level F1 gap on WCXB because ROUGE-N penalizes differences in formatting, structure, and element ordering that word-level F1 ignores. WebMainBench is the more relevant benchmark when your pipeline cares about Markdown fidelity – for RAG chunking, LLM training data generation, or agentic extraction pipelines where formatting affects how the model reads the content.

SWDE – Structured Web Data Extraction Dataset

Source: Academic Torrents | LM Evaluation Harness task

The classic benchmark for structured attribute extraction. SWDE contains 124,291 pages across 80 websites and 8 verticals – auto, book, camera, job, movie, NBA players, restaurant, and university. The task is attribute-level extraction: given a page, extract specific named fields (price, model, author, date) rather than freeform body text.

When to use it: SWDE is not a benchmark for boilerplate removal or Markdown output. Use it when your pipeline needs structured fields from web pages and you want to compare cross-site generalization. Tools like Diffbot, ScrapeGraphAI, and AgentQL are the relevant comparison class here, not Trafilatura or Readability.

Commercial API extraction quality

The open benchmarks above cover open-source libraries. Commercial extraction APIs – Firecrawl, Jina AI Reader, Crawl4AI, ScrapingBee, and others – are not systematically evaluated by any of the open benchmarks. A 2026 commercial comparison by WebPeel tested Firecrawl, Jina Reader, Exa, Tavily, LinkUp, and ScrapingBee across 30 URLs in 6 categories; the gaps between services were real, and no service won every category.

The practical implication: open-benchmark rankings tell you which library algorithm is more accurate, but they do not tell you whether Firecrawl or Jina Reader is better for your specific corpus. Those services add rendering, anti-bot bypass, and post-processing on top of extraction – factors that matter for your production accuracy but are outside the scope of library benchmarks run on already-fetched static HTML.

For a more grounded comparison of how these tools fit AI pipelines specifically, see the web scraping API guide for AI agents.

Running a domain-specific eval

Benchmark results on third-party corpora are a starting point, not a final answer. The WCXB 2,008-page corpus is well-stratified, but your actual target corpus may skew toward page types the benchmark under-represents, so its aggregate rankings may not match yours.

A practical domain eval takes one or two engineer-days:

  1. Sample 50–200 pages from your real target corpus. Include pages from your most common domains and a tail of long-tail domains.
  2. Manually annotate the ground-truth main content for each page. Label the exact text the model should see, not the full HTML.
  3. Run your top two or three candidate extractors. Trafilatura and Mozilla Readability are the right baselines; add whatever commercial API you're evaluating.
  4. Score with word-level F1 (a scoring sketch follows this list). Stratify by page type and domain – don't average across article pages and product pages, because the tools behave differently on each type.
  5. Pick the extractor that wins on your specific distribution. If no single tool dominates, use the winner per page type and route accordingly.
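
A minimal harness for steps 3 and 4 might look like the sketch below. It assumes a manifest file you create yourself (`pages.csv` with `page_type`, `html_path`, and `gold_path` columns), runs Trafilatura as the single candidate, and reuses the word-level F1 scorer shown earlier; swap in Readability or a commercial API's output the same way.

```python
import csv
from collections import Counter, defaultdict
from pathlib import Path

import trafilatura

def word_f1(extracted: str, gold: str) -> float:
    ext, ref = Counter(extracted.lower().split()), Counter(gold.lower().split())
    overlap = sum((ext & ref).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(ext.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)

# pages.csv is your own manifest: page_type, html_path, gold_path per row
scores = defaultdict(list)
with open("pages.csv", newline="") as f:
    for row in csv.DictReader(f):
        html = Path(row["html_path"]).read_text(encoding="utf-8")
        gold = Path(row["gold_path"]).read_text(encoding="utf-8")
        extracted = trafilatura.extract(html) or ""
        scores[row["page_type"]].append(word_f1(extracted, gold))

# Report per page type; never average articles and products together
for page_type, vals in sorted(scores.items()):
    print(f"{page_type:<12} n={len(vals):>4} mean F1={sum(vals) / len(vals):.3f}")
```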

This is the methodology WCXB describes. For RAG pipelines and LLM training data, the accuracy of the extraction step directly impacts downstream quality – running a domain eval is cheaper than diagnosing why your embeddings are underperforming.

What the benchmarks do not cover

A few gaps worth noting for AI product teams:

JavaScript-rendered content. All three open benchmarks (WCXB, Scrapinghub, WebMainBench) operate on static HTML. They measure extraction quality given correct HTML, not whether the tool can handle JS-heavy pages. If your target sites render via React or other client-side frameworks, raw extraction quality is only part of the problem – see extract clean markdown from any URL for the rendering-plus-extraction picture.

Anti-bot success rates. Not in scope for any open benchmark. If target sites use Cloudflare, Akamai, or DataDome, extraction quality is irrelevant until you've solved the access problem.

Structured output accuracy. The Scrapinghub and WCXB benchmarks measure how well a tool extracts freeform body text. SWDE measures attribute-level extraction. Neither covers schema-based structured extraction – whether Firecrawl Extract, Diffbot, or ScrapeGraphAI correctly fills a specific JSON schema. That is a separate eval against your own target schema, not a generalized benchmark task.
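
If you need that eval, a field-level scorer is small. The sketch below is a minimal illustration that assumes flat dicts keyed by your schema's field names; real schemas usually need per-field normalization (currencies, dates, whitespace) before exact-match comparison makes sense.

```python
def field_accuracy(predicted: dict, gold: dict) -> dict:
    """Exact-match score per schema field for one page.
    Average each field's scores across pages for a corpus-level number."""
    def norm(value):
        return " ".join(str(value).split()).lower() if value is not None else None
    return {
        field: float(norm(predicted.get(field)) == norm(expected))
        for field, expected in gold.items()
    }

# Hypothetical product schema – illustration only
gold = {"title": "Acme Widget 3000", "price": "19.99", "brand": "Acme"}
pred = {"title": "Acme Widget 3000", "price": "$19.99", "brand": "Acme"}
print(field_accuracy(pred, gold))  # price scores 0.0 without currency normalization
```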

For the benchmarks that evaluate browser agent task completion rather than content extraction, see the AI web agent benchmarks guide and the benchmarks category in the directory.

Frequently asked

What is the best benchmark for comparing web content extraction tools?
WCXB (webcontentextraction.org) is the most comprehensive open benchmark – 2,008 pages across 7 page types scored by word-level F1. It is the most revealing because it covers forums, products, and collections where tools diverge most, not just news articles where most tools perform similarly. For article-only pipelines, the Scrapinghub article-extraction-benchmark is more targeted.
Which web content extraction library performs best on benchmarks?
As of April 2026, rs-trafilatura with a MinerU-HTML fallback for low-confidence pages achieves the best WCXB held-out F1 (0.910). Trafilatura (Python) achieves the best mean F1 across a combined 8-dataset benchmark (0.883 per the SIGIR 2023 paper). Mozilla Readability has the highest median in that study (0.970) but lower mean, meaning it's consistent on simple article pages but degrades on complex ones.
How do Firecrawl, Jina Reader, and other commercial APIs compare on extraction quality?
Commercial extraction APIs are not currently covered by the open benchmarks, which focus on open-source libraries. A 2026 commercial comparison by WebPeel found meaningful quality gaps between hosted services on non-article content. The most reliable approach is to run a sample of your own target URLs through two or three commercial services and score them against human-labeled output – the open-benchmark rankings don't transfer directly.
What is SWDE and when should I use it?
SWDE (Structured Web Data Extraction Dataset) is a classic benchmark of 124,291 pages across 80 websites and 8 verticals (auto, book, camera, etc.). It measures attribute-level extraction accuracy – whether a tool correctly identifies a product price, title, or specification – rather than freeform body text extraction. Use SWDE when your pipeline needs structured fields, not Markdown.
How do I run my own extraction quality eval?
Select 50–200 pages from your real target domains. Manually annotate the ground-truth main content for each page. Run your candidate extractors and score with word-level F1 (precision and recall of words in extracted vs. ground-truth text). Stratify by page type – don't average across articles and forums, since the distributions differ. This methodology follows WCXB and takes one or two engineer-days on a well-defined corpus.
benchmarks · web scraping · rag · extraction · tutorials
