Open Source Frameworks (11)
Self-hosted scraping and crawling frameworks. You run the infrastructure, you own the pipeline.
Scrapy
Most Popular. The original Python web crawling framework: battle-tested, extensible, and the foundation of the modern scraping ecosystem.
Playwright
Editor's Pick. Microsoft's cross-browser automation library for end-to-end testing and web scraping, with support for Chromium, Firefox, and WebKit.
Puppeteer
Google's Node.js library for controlling Chrome/Chromium — the original headless browser automation tool for the JavaScript ecosystem.
Crawlee
Full-featured web scraping and browser automation library by Apify — wraps Playwright and Puppeteer with crawling primitives.
Selenium
The granddaddy of browser automation — supports all major browsers with bindings for Python, Java, C#, Ruby, and JavaScript.
Crawl4AI
Fully open-source LLM-friendly web crawler designed for RAG and AI agents — the most-starred crawler on GitHub at 50K+ stars.
Beautiful Soup
Python HTML/XML parser that turns messy markup into navigable parse trees — the gateway drug for web scraping.
Cheerio
Fast, flexible jQuery-like HTML parser for Node.js — Beautiful Soup's JavaScript equivalent for server-side HTML processing.
Colly
Fast and elegant scraping framework for Go — high-performance concurrent crawling with a clean callback-based API.
MechanicalSoup
Python library for automating website interactions — combines Requests and Beautiful Soup for stateful browsing with form submission.
HTTPX + Parsel
Modern Python HTTP client (HTTPX) paired with Scrapy's extraction library (Parsel) for lightweight async scraping without a framework.