serp.fast

Trafilatura

Python library for main-content extraction – takes HTML you've already fetched and returns clean text or markdown stripped of nav, ads, and chrome.

By Nathan Kessler

Each tool is evaluated against our methodology using public docs, vendor demos, and hands-on testing.

Open source scraping frameworks give engineering teams full control over their web data pipeline. You choose where to deploy, how to scale, and what data to collect – with no vendor lock-in or per-request pricing. The trade-off is infrastructure maintenance and anti-bot engineering, which commercial APIs handle for you.

Features

JS Rendering: No
Structured Output: Yes
Open Source: Yes
Self-Hosted Option: Yes
Pricing: Free

Editorial assessment

A focused content extractor built around the algorithm described in the ACL 2021 demo paper. It leads most published accuracy benchmarks against open-source competitors and outputs plain text, Markdown, XML, and several other formats – the right default for any Python pipeline that already has the HTML. Use it when you can fetch the HTML yourself and the page doesn't need JS rendering. Reach for Mozilla Readability instead if your stack is Node, Beautiful Soup if you need raw parsing rather than article extraction, or Firecrawl or Crawl4AI when you also need fetching, rendering, and anti-bot bypass in one call.

How Trafilatura compares

Mozilla Readability

Mozilla Readability is the Node equivalent – similar quality, different language ecosystem.

Crawl4AI

Crawl4AI bundles fetching, JS rendering, and LLM-ready output that Trafilatura intentionally leaves out.

Beautiful Soup

Beautiful Soup is a general HTML parser, not an article extractor – use it when you need full control over selection.

Frequently asked questions

Is Trafilatura free?

Yes. Trafilatura is a free, open-source Python package. There is no paid tier, no license fee, and no hosted service to subscribe to. You install it with pip install trafilatura and run it as a library or command-line tool on your own machine. It processes HTML locally, so your only costs are the compute you already run and whatever you spend fetching pages. Trafilatura does not bill you for either.
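As a rough illustration of library use, the sketch below downloads a static page and extracts its main text. It assumes the package is installed; `fetch_and_extract` and the URL check are illustrative helpers, not part of Trafilatura, while `fetch_url` and `extract` are the library's own functions.

```python
from typing import Optional
from urllib.parse import urlparse


def fetch_and_extract(url: str) -> Optional[str]:
    """Download a static page and return its main content as plain text."""
    # Reject anything that is not a plain web URL before touching the network.
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"expected an http(s) URL, got {url!r}")

    # Imported inside the function so this module still loads
    # on machines where trafilatura is not installed yet.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download fails
    if downloaded is None:
        return None
    # extract() returns None when no usable main content is found.
    return trafilatura.extract(downloaded)
```

Both steps run on your own machine, and the function returns None rather than raising when the download or the extraction comes back empty.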

Is Trafilatura open source?

Yes. Trafilatura is open source, developed in the open by Adrien Barbaresi, with the source on GitHub. Recent releases use the Apache 2.0 license. Versions before 1.8.0 used GPLv3. It is a self-hosted Python library that runs inside your own pipeline, so there is no vendor account and no dependency on an external service. You can read, audit, or fork the extraction code directly.

Does Trafilatura render JavaScript?

No. Trafilatura does not run a browser or execute JavaScript. It takes HTML you have already fetched, or downloads a static page, then extracts the main article text, metadata, and structure while stripping nav, ads, and boilerplate. For client-side-rendered pages whose content only appears after JavaScript runs, fetch the rendered HTML elsewhere first, or use a tool like Crawl4AI or Firecrawl that handles rendering itself.
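One way to decide whether a page needs rendering before extraction is to check how much visible text the raw HTML carries. The sketch below is a standard-library heuristic of our own, not something Trafilatura provides; the name `looks_client_rendered` and the 200-character threshold are assumptions you would tune for your sites.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that sits outside script, style, and similar tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def looks_client_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Guess that a page is a JavaScript shell if it has little visible text.

    Heuristic only: the threshold is an arbitrary assumption.
    """
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so indentation does not count as content.
    visible = " ".join(" ".join(parser.chunks).split())
    return len(visible) < min_text_chars
```

If the check returns True, the static HTML is probably an empty application shell, and it is worth fetching a rendered copy before handing it to the extractor.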

What output formats does Trafilatura support?

Trafilatura returns the extracted main content as plain text, Markdown, CSV, JSON, HTML, XML, or XML-TEI. It also pulls metadata such as title, author, date, and site name, and preserves structure like paragraphs, headings, lists, and code blocks. The Markdown and JSON output make it a practical default for feeding cleaned page content into LLM pipelines or storage without writing your own post-processing layer.
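The formats listed above correspond to values of the `output_format` argument of `trafilatura.extract`. The sketch below maps the labels used in this answer to those values; `OUTPUT_FORMATS` and `extract_as` are illustrative names, and the code assumes a recent release in which Markdown output is available.

```python
from typing import Optional

# Labels as written in this article, mapped to output_format values.
OUTPUT_FORMATS = {
    "plain text": "txt",
    "markdown": "markdown",
    "csv": "csv",
    "json": "json",
    "html": "html",
    "xml": "xml",
    "xml-tei": "xmltei",
}


def extract_as(html: str, label: str, with_metadata: bool = True) -> Optional[str]:
    """Extract main content from already-fetched HTML in the requested format."""
    key = label.strip().lower()
    if key not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported format: {label!r}")

    # Imported lazily so the mapping above is usable without the package.
    import trafilatura

    # with_metadata asks for title, author, date, and similar fields
    # to be included alongside the extracted text.
    return trafilatura.extract(
        html,
        output_format=OUTPUT_FORMATS[key],
        with_metadata=with_metadata,
    )
```

For an LLM pipeline you would typically call `extract_as(html, "Markdown")` or `extract_as(html, "JSON")` and store the returned string, checking for None when no main content is found.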

How does Trafilatura compare to Beautiful Soup?

They solve different problems. Beautiful Soup is a general HTML and XML parser. You write selectors to pull specific elements, and it has no opinion about what counts as the main content. Trafilatura is purpose-built article extraction that decides which text is the body and discards the chrome automatically. Use Beautiful Soup when you need raw parsing or precise field-level scraping. Use Trafilatura when you want clean main-content text without hand-tuning selectors per site.

What is the best alternative to Trafilatura?

It depends on your stack. If you work in Node rather than Python, Mozilla Readability is the closest article-extraction equivalent. If you need raw element-level parsing instead of main-content extraction, use Beautiful Soup. If you also need fetching, JavaScript rendering, and anti-bot handling in one step, which Trafilatura does not do, reach for Crawl4AI or Firecrawl. Trafilatura stays the strongest choice when you already have the HTML.
