Question 1

Is Trafilatura free?

Accepted Answer

Yes. Trafilatura is a free, open-source Python package. There is no paid tier, no license fee, and no hosted service to subscribe to. You install it with pip install trafilatura and run it as a library or command-line tool on your own machine. It processes HTML locally, so your only costs are the compute you already run and whatever you spend fetching pages. Trafilatura does not bill you for either.

Question 2

Is Trafilatura open source?

Accepted Answer

Yes. Trafilatura is open source, developed in the open by Adrien Barbaresi, with the source on GitHub. Recent releases use the Apache 2.0 license. Versions before 1.8.0 used GPLv3+. It is a self-hosted Python library that runs inside your own pipeline, so there is no vendor account and no dependency on an external service. You can read, audit, or fork the extraction code directly.

Question 3

Does Trafilatura render JavaScript?

Accepted Answer

No. Trafilatura does not run a browser or execute JavaScript. It takes HTML you have already fetched, or downloads a static page, then extracts the main article text, metadata, and structure while stripping nav, ads, and boilerplate. For client-side-rendered pages whose content only appears after JavaScript runs, fetch the rendered HTML elsewhere first, or use a tool like Crawl4AI or Firecrawl that handles rendering itself.

Question 4

What output formats does Trafilatura support?

Accepted Answer

Trafilatura returns the extracted main content as plain text, Markdown, CSV, JSON, HTML, XML, or XML-TEI. It also pulls metadata such as title, author, date, and site name, and preserves structure like paragraphs, headings, lists, and code blocks. The Markdown and JSON outputs make it a practical default for feeding cleaned page content into LLM pipelines or storage without writing your own post-processing layer.

Question 5

How does Trafilatura compare to Beautiful Soup?

Accepted Answer

They solve different problems. Beautiful Soup is a general HTML and XML parser. You write selectors to pull specific elements, and it has no opinion about what counts as the main content. Trafilatura is a purpose-built article extractor that decides which text is the body and discards the surrounding page chrome automatically. Use Beautiful Soup when you need raw parsing or precise field-level scraping. Use Trafilatura when you want clean main-content text without hand-tuning selectors per site.

Question 6

What is the best alternative to Trafilatura?

Accepted Answer

It depends on your stack. If you work in Node rather than Python, Mozilla Readability is the closest article-extraction equivalent. If you need raw element-level parsing instead of main-content extraction, use Beautiful Soup. If you also need fetching, JavaScript rendering, and anti-bot handling in one step, which Trafilatura does not do, reach for Crawl4AI or Firecrawl. Trafilatura stays the strongest choice when you already have the HTML.
