Open Source Frameworks (13)

Self-hosted scraping and crawling frameworks. You run the infrastructure, you own the pipeline.

5 strong choices for AI builders

Selected from 13 tools in this category. Equal-weighted; not ranked 1–5.

Microsoft's cross-browser automation library for end-to-end testing and web scraping – supports Chromium, Firefox, and WebKit.

Google's Node.js library for controlling Chrome/Chromium – a long-established headless browser automation tool in the JavaScript ecosystem.

The original Python web crawling framework – battle-tested, extensible, and the foundation of the modern scraping ecosystem.

Fully open-source, LLM-friendly web crawler designed for RAG and AI agents – the most-starred crawler on GitHub at 50K+ stars.

Full-featured web scraping and browser automation library by Apify – wraps Playwright and Puppeteer with crawling primitives.

Also in this category

The granddaddy of browser automation – supports all major browsers with bindings for Python, Java, C#, Ruby, and JavaScript.

Python HTML/XML parser that turns messy markup into navigable parse trees – the gateway drug for web scraping.

Fast, flexible jQuery-like HTML parser for Node.js – Beautiful Soup's JavaScript equivalent for server-side HTML processing.

Fast and elegant scraping framework for Go – high-performance concurrent crawling with a clean callback-based API.

Python library for automating website interactions – combines Requests and Beautiful Soup for stateful browsing with form submission.

Modern Python HTTP client (HTTPX) paired with Scrapy's extraction library (Parsel) – lightweight async scraping without a framework.

Python library for main-content extraction – takes HTML you've already fetched and returns clean text or markdown stripped of nav, ads, and chrome.

The pure-JavaScript library Firefox uses for Reader Mode – extracts the primary article from an HTML document with no dependencies.