serp.fast

Trafilatura

Python library for main-content extraction – takes HTML you've already fetched and returns clean text or markdown stripped of nav, ads, and chrome.

Maintained by Nathan Kessler

Open-source scraping frameworks give engineering teams full control over their web data pipeline. You choose where to deploy, how to scale, and what data to collect, with no vendor lock-in or per-request pricing. The trade-off is that you take on infrastructure maintenance and anti-bot engineering, which commercial APIs handle for you.

Features

JS Rendering: No
Structured Output: Yes
Open Source: Yes
Self-Hosted Option: Yes
Pricing: Free

Editorial assessment

A focused content extractor built around the algorithm described in the ACL 2021 demo paper. It wins most accuracy benchmarks against open-source competitors and outputs plain text, markdown, or XML, which makes it the right default for any Python pipeline that already has the HTML. Use it when you can fetch the HTML yourself and the page doesn't need JS rendering. Reach for Mozilla Readability instead if your stack is Node, Beautiful Soup if you need raw parsing rather than article extraction, or Firecrawl/Crawl4AI when you also need fetching, rendering, and anti-bot bypass in one call.
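As a rough illustration of the "already has the HTML" workflow, the sketch below passes an HTML string to `trafilatura.extract` and asks for a specific output format. The wrapper name `extract_main_content` and the `SAMPLE_HTML` page are made up for this example. The `"markdown"` value for `output_format` assumes a reasonably recent Trafilatura release; `"txt"` and `"xml"` are the more conservative choices.

```python
from typing import Optional

# Hypothetical sample page: navigation and footer chrome wrapped around one article.
SAMPLE_HTML = """<html><head><title>Release notes</title></head><body>
<nav><a href="/">Home</a> <a href="/pricing">Pricing</a></nav>
<article>
<h1>Release notes</h1>
<p>This release reworks the caching layer so that repeated requests for the
same document are served from memory instead of hitting the database again.</p>
<p>It also fixes a long-standing bug in the retry logic that could cause
duplicate writes when a request timed out at exactly the wrong moment.</p>
</article>
<footer>Copyright notice, cookie settings, social links</footer>
</body></html>"""


def extract_main_content(html: str, output_format: str = "markdown") -> Optional[str]:
    """Return the main content of an already-fetched HTML document.

    Returns None when the extractor finds nothing it considers main content.
    """
    # Imported inside the function so this module still loads on machines
    # where trafilatura is not installed (pip install trafilatura).
    import trafilatura

    # extract() performs no network I/O and no JS rendering: it only parses
    # the string it is given and drops boilerplate such as nav and footers.
    return trafilatura.extract(html, output_format=output_format)
```

Calling `extract_main_content(SAMPLE_HTML, "txt")` should return the article text without the nav and footer lines, or `None` if the extractor decides the page has no usable main content, so callers should handle both cases. Fetching is deliberately left outside the function: any HTTP client, or a headless browser for JS-heavy pages, can supply the HTML string.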

How Trafilatura compares

Mozilla Readability

Mozilla Readability is the Node equivalent – similar quality, different language ecosystem.

Crawl4AI

Crawl4AI bundles fetching, JS rendering, and LLM-ready output that Trafilatura intentionally leaves out.

Beautiful Soup

Beautiful Soup is a general HTML parser, not an article extractor – use it when you need full control over selection.
