Trafilatura
Open-source scraping frameworks give engineering teams full control over their web data pipeline. You choose where to deploy, how to scale, and what data to collect, with no vendor lock-in or per-request pricing. The trade-off is that you take on the infrastructure maintenance and anti-bot engineering that commercial APIs handle for you.
Frequently asked questions
Is Trafilatura free?
Yes. Trafilatura is a free, open-source Python package. There is no paid tier, no license fee, and no hosted service to subscribe to. You install it with pip install trafilatura and run it as a library or command-line tool on your own machine. It processes HTML locally, so your only costs are the compute you already run and whatever you spend fetching pages. Trafilatura does not bill you for either.
Is Trafilatura open source?
Yes. Trafilatura is open source, developed in the open by Adrien Barbaresi, with the source on GitHub. Releases from 1.8.0 onward use the Apache 2.0 license; earlier versions used GPLv3. It is a self-hosted Python library that runs inside your own pipeline, so there is no vendor account and no dependency on an external service. You can read, audit, or fork the extraction code directly.
Does Trafilatura render JavaScript?
No. Trafilatura does not run a browser or execute JavaScript. It takes HTML you have already fetched, or downloads a static page, then extracts the main article text, metadata, and structure while stripping nav, ads, and boilerplate. For client-side-rendered pages whose content only appears after JavaScript runs, fetch the rendered HTML elsewhere first, or use a tool like Crawl4AI or Firecrawl that handles rendering itself.
What output formats does Trafilatura support?
Trafilatura returns the extracted main content as plain text, Markdown, CSV, JSON, HTML, XML, or XML-TEI. It also pulls metadata such as title, author, date, and site name, and preserves structure like paragraphs, headings, lists, and code blocks. The Markdown and JSON outputs make it a practical default for feeding cleaned page content into LLM pipelines or storage without writing your own post-processing layer.
How does Trafilatura compare to Beautiful Soup?
They solve different problems. Beautiful Soup is a general HTML and XML parser. You write selectors to pull specific elements, and it has no opinion about what counts as the main content. Trafilatura is a purpose-built article extractor that decides which text is the body and discards the page chrome automatically. Use Beautiful Soup when you need raw parsing or precise field-level scraping. Use Trafilatura when you want clean main-content text without hand-tuning selectors per site.
What is the best alternative to Trafilatura?
It depends on your stack. If you work in Node rather than Python, Mozilla Readability is the closest article-extraction equivalent. If you need raw element-level parsing instead of main-content extraction, use Beautiful Soup. If you need JavaScript rendering and anti-bot handling in the same step as fetching, which Trafilatura does not provide, reach for Crawl4AI or Firecrawl. Trafilatura remains the strongest choice when you already have the HTML.
