Notable repositories
AI Scraping
The standard for AI-friendly web scraping. Open-source core, excellent hosted API. If you're building RAG or AI data pipelines, start here.
The most-starred AI browser tool by a wide margin. It lets AI agents browse the web the way a human would.
The fastest-growing AI scraping tool. Async-first, produces clean markdown for LLM pipelines. The open-source answer to Firecrawl.
The useful part is source grounding: every extracted field maps back to the exact span it came from, which matters when you need to audit the output. Built around Gemini but applicable to document and text extraction generally. Reach for it when the job is pulling reliable structure out of text you already have rather than scraping live pages.
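The source-grounding idea described above can be illustrated with a small sketch. This is not the library's API: `GroundedField`, `ground`, and `audit` are hypothetical names, and a plain substring search stands in for the model call. The point is only that each extracted value carries the character span it came from, so it can be checked against the source later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedField:
    """An extracted value plus the character span it was taken from (hypothetical type)."""
    name: str
    value: str
    start: int
    end: int


def ground(text: str, name: str, value: str) -> GroundedField:
    """Locate `value` in `text` and record its span; raise if it is not present verbatim."""
    start = text.find(value)
    if start == -1:
        raise ValueError(f"{name!r} value {value!r} does not appear in the source text")
    return GroundedField(name, value, start, start + len(value))


def audit(text: str, field: GroundedField) -> bool:
    """Check that the recorded span still reproduces the extracted value."""
    return text[field.start:field.end] == field.value


if __name__ == "__main__":
    doc = "Invoice 1043 is due on 2024-05-01 for a total of $42.00."
    total = ground(doc, "total", "$42.00")
    print(total, audit(doc, total))
```

A value that cannot be located verbatim is rejected rather than stored, which is the property that makes the output auditable.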
Novel approach: uses LLMs to plan and execute scraping tasks. Impressive for complex extraction but LLM costs add up at scale.
From the Browserbase team. Tell it what to do in plain English and it drives Playwright. Early-stage, but the developer experience is a standout.
You build a scraper by pointing and clicking rather than writing selectors, then get an API or spreadsheet out the other end. Best for non-engineers or teams who would rather not maintain scraper code, with the usual tradeoff that point-and-click breaks more easily than hand-written extraction. One of the more popular self-hosted options in this space.
Prepend a prefix to a URL and get back clean Markdown instead of raw HTML. The obvious comparison is Firecrawl; Reader covers the narrower single-page-to-Markdown job well, with both a hosted endpoint and a self-host option. A reasonable first stop when you just need readable text for a RAG or agent pipeline.
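The prefix pattern described above can be sketched in a few lines of standard-library Python. The prefix value here is a hypothetical placeholder, not the service's real endpoint, and `reader_url` and `fetch_markdown` are illustrative names; substitute the hosted or self-hosted address you actually use.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Hypothetical placeholder: replace with the prefix of the reader endpoint you use.
READER_PREFIX = "https://reader.example/"


def reader_url(target: str, prefix: str = READER_PREFIX) -> str:
    """Build the reader URL by prepending the prefix to an absolute http(s) URL."""
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target!r}")
    return prefix + target


def fetch_markdown(target: str, prefix: str = READER_PREFIX, timeout: float = 30.0) -> str:
    """Fetch the page through the reader endpoint and return the response body as text."""
    request = Request(reader_url(target, prefix), headers={"User-Agent": "reader-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_markdown("https://example.com/post")` would then return whatever the endpoint produces for that page, which under the description above is Markdown.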
You define the output shape with a Zod schema and get typed data back, with the choice of local or hosted models. Because it runs on Playwright, you get real browser rendering rather than raw HTML parsing. The natural pick for TypeScript teams that want schema-validated extraction without leaving their stack.
Document-focused rather than web-focused, with topics pointing at contract analysis and DOCX handling. The pitch is less setup code to get from a document to structured fields. Smaller and newer than the bigger names here, so worth a look when your inputs are files rather than live pages.
Rust-powered Firecrawl alternative with TLS fingerprinting instead of headless browsers. Drop-in v2 API compatibility, 12-tool MCP server. AGPL-3.0 licensed. Early but fast-moving.
The aim is keeping token usage down while still doing LLM-driven extraction, which matters once you are scraping at volume and watching API costs. Playwright underneath handles the browsing. A reasonable lightweight option for LLM scraping in Python, though it is smaller and has seen less recent commit activity than the leaders here.
AI Web Agents
A Manus-style agent built to run entirely on your own hardware, trading hosted-model quality for zero per-task cost and full data control. Reach for it when local-only operation is a hard requirement; if you just want the best output, a cloud-model agent will beat it. One of 2025's fastest-growing agent projects.
The selector-free approach is the point: it reads the page like a user instead of breaking every time the DOM shifts, which suits long-tail sites you do not control. A reasonable default for teams who want browser automation that survives layout changes, with the usual vision-agent cost in latency and model spend per run.
A self-hosted alternative to OpenAI Operator that keeps you on your own model keys and inside your own browser session. A good fit if you want Operator-style automation without handing a vendor your accounts and credentials. Check recent commit activity before betting a product on it.
A full agentic browser rather than an extension or library, aimed at people who want the Atlas/Comet experience without a closed vendor. The Chromium base lets you bring your own models and self-host. Worth a look if you are evaluating agentic browsers but want to avoid lock-in.
A research project from Microsoft built on AutoGen, with a human-in-the-loop focus rather than full autonomy. Treat it as a reference design and a place to study computer-use UX, not a production tool. Useful if you want to see how Microsoft is thinking about agent control surfaces.
An early entrant in the natural-language-to-browser-action space, sitting on the Selenium and Playwright stacks most teams already know. Development has been quiet since early 2025, so newer browser-agent frameworks are the safer choice for fresh projects. Still readable as a clean example of the Large Action Model pattern.
Pitches itself as a vision-first alternative to browser-use and Stagehand, which is the comparison worth running. The vision-first design helps on sites where the DOM is hostile, at the cost of more model calls per step. A reasonable pick for TypeScript teams already on Playwright.
Aims beyond the local-script stage: it runs agents as serverless functions on hosted browser infrastructure, the part most teams underestimate. Of interest if you have outgrown running Playwright on your own boxes and want managed scaling. Smaller and newer than the headline frameworks, so expect a thinner ecosystem.
Not an agent itself but a building block: it tags interactive elements and runs OCR so a vision model can reason about a page. Useful if you are assembling your own agent and need the page-understanding layer. The repo has had no commits since late 2024, so vet it before depending on it.
Playwright with a natural-language layer on top, so you keep the familiar API and drop the brittle hand-written selectors. The obvious fit for teams already on Hyperbrowser's hosted browser infrastructure. Compare it against browser-use and Stagehand before committing.
A research-grade web agent with a hierarchical planning design, tied to Emergence AI's hosted web-automation API. More valuable as a benchmark reference and an architecture to study than as a drop-in library. Look here if you care about how WebVoyager-style agents are structured.
The selling point is on-device inference through tools like Ollama, so pages and prompts never leave your machine. The right choice when privacy or offline operation is non-negotiable, at the cost of being capped by whatever model your hardware can run. Less capable than a cloud-model assistant, by design.
Bills itself as "Cline for web browsing," which captures the idea: a conversational assistant that acts inside your browser rather than a headless automation framework. Better for interactive, ad-hoc tasks than scripted pipelines. Activity slowed in late 2025, so confirm it is still moving before relying on it.
Anti-Detection
The pragmatic answer when your scraper or app hits a Cloudflare wall and you don't want to embed a browser everywhere. You run it as a Docker service and point your HTTP client at it. Anti-bot vendors break it periodically, so treat it as a moving target rather than a fixed solution.
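The "point your HTTP client at it" workflow above might look like the following sketch. It assumes a solver container that accepts FlareSolverr-style JSON commands (a POST to `/v1` with `cmd`, `url`, and `maxTimeout` fields) on port 8191; the endpoint address and the helper names are assumptions, so check them against the version you run.

```python
import json
from urllib.request import Request, urlopen

# Assumed local address of the solver container; adjust to your deployment.
SOLVER_ENDPOINT = "http://localhost:8191/v1"


def build_solver_request(url: str, endpoint: str = SOLVER_ENDPOINT,
                         max_timeout_ms: int = 60000) -> Request:
    """Build the POST request asking the solver to fetch `url` on our behalf."""
    payload = {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}
    return Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_via_solver(url: str, endpoint: str = SOLVER_ENDPOINT) -> dict:
    """Send the request and return the solver's decoded JSON reply."""
    with urlopen(build_solver_request(url, endpoint), timeout=90) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the solver behind one small function makes it easier to swap out when an anti-bot update breaks it.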
Patching the fingerprint inside the browser binary, rather than injecting JavaScript after the fact, is what separates it from the spoof-by-script tools. One of the more credible open stealth browsers for scraping, and the Firefox-side counterpart to the Chromium undetected drivers. The tradeoff is a heavy custom build you have to keep current with upstream.
The plugin system behind most stealth scraping in Node. Its stealth plugin patches the many signals headless Chrome leaks, and works with both Puppeteer and Playwright. Maintenance has slowed, but it is still the default starting point.
A long-standing, lightweight option when you want to stay inside the requests workflow instead of spinning up a browser. It only addresses the JS-challenge layer, so it loses to newer Cloudflare defenses that camoufox or FlareSolverr handle. Commits have slowed, so verify it still clears your target before depending on it.
Clever approach: make curl look like a real browser at the TLS level. Works surprisingly well against Cloudflare and Akamai.
Reach for this when blocks happen at the TLS or fingerprint layer and a browser would be overkill. It gives you requests-like ergonomics with a handshake that looks like Chrome or Safari, which clears a large class of fingerprint-based bot walls. Actively maintained, and the most popular Python entry point to the curl-impersonate work.
Successor to undetected-chromedriver. Removes the webdriver dependency entirely, making detection even harder.
If you already have Playwright code, this is the lowest-friction way to make it harder to fingerprint: same API, fewer automation tells. It targets the leak surface (the Runtime.Enable tell and similar) rather than spoofing fingerprints, so pair it with TLS or fingerprint tooling for the full picture.
The dependency under most Go TLS-impersonation clients, including tls-client. Most teams will use it indirectly through a higher-level wrapper rather than wiring ClientHello control by hand; reach for it directly only when you need control the wrappers don't expose. Maintained, with roots in the anti-censorship community rather than scraping.
The fingerprint-consistency piece of an anti-detection stack: it produces headers, navigator properties, and other signals that agree with each other rather than contradicting one another. Backed by Apify and a sensible fit in a Crawlee-based setup. It handles spoofing, not TLS or CDP leaks, so it covers one layer rather than the whole problem.
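To make "signals that agree with each other" concrete, here is a toy consistency check. It is not this library's API; `Fingerprint`, `PLATFORM_SIGNALS`, and `inconsistencies` are hypothetical names, and the table covers only three desktop platforms and three signals. A real generator works the other way around, producing matching values in the first place.

```python
from dataclasses import dataclass

# Toy table of signals that should agree for a given OS (illustrative, not exhaustive).
PLATFORM_SIGNALS = {
    "windows": {"ua_token": "Windows NT", "navigator_platform": "Win32", "ch_platform": '"Windows"'},
    "macos": {"ua_token": "Macintosh", "navigator_platform": "MacIntel", "ch_platform": '"macOS"'},
    "linux": {"ua_token": "X11; Linux", "navigator_platform": "Linux x86_64", "ch_platform": '"Linux"'},
}


@dataclass
class Fingerprint:
    user_agent: str           # User-Agent request header
    navigator_platform: str   # navigator.platform as seen by page scripts
    sec_ch_ua_platform: str   # Sec-CH-UA-Platform client hint header


def inconsistencies(fp: Fingerprint) -> list[str]:
    """List the signals that disagree with the OS implied by the User-Agent."""
    os_name = next(
        (name for name, sig in PLATFORM_SIGNALS.items() if sig["ua_token"] in fp.user_agent),
        None,
    )
    if os_name is None:
        return ["user agent does not match any known platform"]
    expected = PLATFORM_SIGNALS[os_name]
    problems = []
    if fp.navigator_platform != expected["navigator_platform"]:
        problems.append(f"navigator.platform {fp.navigator_platform!r} does not fit {os_name}")
    if fp.sec_ch_ua_platform != expected["ch_platform"]:
        problems.append(f"Sec-CH-UA-Platform {fp.sec_ch_ua_platform!r} does not fit {os_name}")
    return problems
```

A Windows User-Agent paired with `MacIntel` is the kind of contradiction that detection scripts look for.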
A self-hosted, minimal alternative to running FlareSolverr or paying for a hosted solver, useful for small jobs where you want the bypass logic in your own process. It is a script rather than a managed service, so expect to do more of the upkeep yourself as Cloudflare changes.
A more modern take on the Go impersonation client than tls-client, with HTTP/3 and JA4 coverage the older libraries don't all have. The smaller user base is the tradeoff against bogdanfinn/tls-client's more established footprint. Worth a look if you need QUIC-level fingerprinting from Go.
A common building block for fingerprint-resistant scrapers in Go, and usable from other languages through its bindings. It sits a level above utls, so you select a browser profile instead of hand-building a ClientHello. The well-trodden choice for TLS impersonation without a browser; surf is the newer alternative if you need HTTP/3.
The Puppeteer-side take on the undetected-chromedriver idea: keep your existing automation code, get fewer detection tells. Convenient if you're already on Puppeteer, but commit activity has gone quiet, so confirm it still beats your target's current defenses before committing to it.
The reason to pick this over the Go-only clients is the JavaScript surface: it lets Node scrapers present a browser-identical handshake without leaving the JS ecosystem. Actively maintained. If you're purely in Go, tls-client or surf are the more conventional choices.
A focused fix for a specific, well-documented leak that Cloudflare and DataDome key on, and it can be toggled on or off on demand. It overlaps with patchright in goal; the difference is that this is a patch layer over your existing install rather than a full drop-in fork. Commits have slowed, so check it still covers the current detection signals.
A newer Python option that drives Chrome over the DevTools Protocol with no Selenium or webdriver dependency, which removes a common detection vector. It inherits nodriver's design and adds async ergonomics and containerization. A reasonable pick for Python teams who want a stealth-first browser-driving framework, though it is younger and less proven than the established names.
Browser Automation
Still the most popular browser automation tool by stars. Playwright is technically superior but Puppeteer's ecosystem is massive.
The backbone of modern scraping stacks. Microsoft-backed, fast, reliable. If you're doing JS-rendered scraping, you're probably using this.
Built for the agent use case rather than retrofitted from a testing tool. The high star count owes a lot to Vercel's gravity, so judge it on whether the agent-first ergonomics fit your stack rather than on popularity.
The grandfather of browser automation. Still relevant for legacy projects and teams with existing Selenium infrastructure. Modern projects should pick Playwright.
The bet is rebuilding the browser rather than wrapping Chrome, aiming for lower memory and CPU. Because it speaks CDP it can slot in behind Playwright or Puppeteer clients, but a from-scratch engine will lag Chromium on rendering edge cases. Worth a look if browser resource cost dominates your scraping bill.
The default for JS-heavy scraping in Python: cross-browser and maintained by Microsoft. Reach for it over Selenium unless you have a specific reason not to. It does no anti-bot evasion on its own, so pair it with a stealth layer for protected sites.
Solves the operational half of browser automation: running and pooling browsers at scale instead of writing the scripts. Self-host it or use the hosted cloud. The license is free for non-commercial use only, so check the terms before building a product on it.
The standard for browser automation in Go: dependency-free and the most-used browser automation library in the language. If you are already in a Go codebase this is the natural pick. The main alternative, go-rod, adds higher-level conveniences like auto-waiting, so compare ergonomics first.
Solves a real problem: getting past Cloudflare and similar anti-bot systems. Fragile by nature (Chrome updates break it regularly), but it remains a widely used tool for the job.
The reason scrapers reach for it over plain Selenium or Playwright is its UC/CDP stealth mode against bot detection. The testing-framework heritage means a lot of surface area to learn, but the anti-detection work is the draw if you are hitting protected sites.
A self-hostable browser backend that bundles the infrastructure most agent projects end up rebuilding: sessions, proxies, CAPTCHA. Closest in spirit to browserless but aimed at the AI-agent workflow. Useful if you want to own the browser layer instead of renting a hosted API.
The higher-level alternative to chromedp in Go, with auto-waiting and stealth helpers that cut boilerplate. Pick it over chromedp if you value convenience methods, chromedp if you want a leaner core. Both are actively maintained.
Drives Chromium over CDP directly instead of going through a WebDriver, with anti-detection and CAPTCHA handling built in. A newer, fast-growing entrant in the same space as SeleniumBase's stealth mode. Promising for protected sites, though it has less of a track record than the older frameworks.
A testing tool first, not a scraping library. Reach for it when you want a readable BDD-style API that abstracts over multiple browser drivers. If your goal is data extraction, the underlying drivers like Playwright are a more direct fit.
Once the standard way to add JS rendering to a Scrapy pipeline, exposed as a Lua-scriptable HTTP service. No longer actively developed, with no commits since 2024. New projects should use Playwright or a headless-browser service instead, though it still works for legacy Scrapy setups.
For years the go-to async Puppeteer-style library for Python, but it has gone quiet, with no commits since 2024. The official playwright-python now covers the same ground with active Microsoft backing. Reach for Playwright on anything new; pyppeteer mostly survives in older codebases.
The clear choice for browser automation and dynamic-page scraping in .NET, actively maintained and faithful to the Puppeteer API. If you are in C# there is little reason to look elsewhere.
A focused utility for the concurrency problem: managing many Puppeteer browsers for queued crawl jobs without writing your own pool. Useful glue once you have settled on Puppeteer and need to scale out. It is tied to Puppeteer specifically, so it does not help Playwright users.
CAPTCHA Solving
The de-facto open-source CAPTCHA recognizer in the Chinese scraping world, with the star count to match. Best for everyday text and slider CAPTCHAs where you want results without training a model yourself. It won't touch behavioral challenges like reCAPTCHA v3 or Turnstile.
Broad coverage across the major CAPTCHA types in a single extension, which is rare. The catch is that it's the open-source client for a paid solving service, so the actual solving happens behind an API rather than locally. Reach for it when you need one tool that handles many challenge formats inside an automated browser.
Narrow by design and well maintained: it clears reCAPTCHA's audio path and nothing else. That makes it a clean, locally-running option for accessibility and light automation, but it covers only reCAPTCHA, not hCaptcha or Turnstile. The audio route has long been one of the more durable reCAPTCHA bypasses, which explains the project's following.
The build-your-own option: it trains models, it doesn't solve anything out of the box. Worth it only if you face a bespoke image CAPTCHA that ddddocr can't handle and you have labeled data to train on. For most users the pretrained route is the faster path.
One of the most-starred open-source hCaptcha solvers, and it has moved toward multimodal models rather than the fragile per-challenge classifiers older solvers relied on. The clear pick if hCaptcha specifically is your blocker and you're already automating with Playwright. LLM-backed solving means trading some latency and API cost for accuracy.
The same audio-challenge technique as Buster but as a scriptable Python library rather than a browser extension, which fits a scraping pipeline better. Good fit if you're driving DrissionPage or Selenium and need reCAPTCHA cleared in code. Google tightens the audio path periodically, so expect maintenance churn.
Focused squarely on Cloudflare Turnstile, which most of the broader CAPTCHA tools don't handle. It drives a patched browser to capture a real token, then serves it over an API, with multi-threaded execution for scraping setups. Smallest project of the group and Turnstile-only, so treat it as a targeted add-on rather than a general solver.
Change Detection
An open-source, self-hosted take on IFTTT or Zapier, with agents that run on your own server. Reach for it when you want web-watching plus arbitrary if-this-then-that automation in one place; the tradeoff is a Ruby app you host and learn. One of the most-starred projects in this category, and still actively maintained.
The de facto way to get a clean RSS feed out of a site that doesn't offer one, backed by a large catalog of community-maintained routes. Best when a route already exists for your target or you're willing to write one; for arbitrary page diffing, changedetection.io is the better fit. Very actively developed.
The default open-source pick for watching a specific page and getting alerted when it changes, and a Visualping alternative you can self-host. Strong on diffing modes and notification targets, with RSS output if you want to pipe changes elsewhere. Actively maintained and widely used.
Same core idea as RSSHub, generating feeds for sites that don't publish them, but written in PHP with its own set of bridges. Pick it if PHP fits your stack or it already has a bridge for your target; RSSHub has the larger route catalog. Still maintained.
The CLI-first option for change monitoring, suited to running from cron and filtering down to the exact part of a page you care about. It can also watch command output, which the GUI-oriented tools can't do. Still maintained, though changedetection.io is the friendlier choice if you want a web UI.
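The filter-then-diff idea behind these monitors can be sketched with the standard library. This is a simplified illustration, not any listed tool's implementation: `select_region` and `diff_snapshots` are hypothetical helpers, and a regular expression stands in for the CSS or XPath filters a real tool would offer.

```python
import difflib
import re


def select_region(page_text: str, pattern: str) -> str:
    """Keep only the part of the page captured by the first group of `pattern`."""
    match = re.search(pattern, page_text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def diff_snapshots(old: str, new: str) -> list[str]:
    """Return a unified diff between two snapshots; an empty list means no change."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


if __name__ == "__main__":
    pattern = r"<div id='price'>(.*?)</div>"
    before = "<header>12:00</header><div id='price'>$10</div>"
    after = "<header>12:05</header><div id='price'>$12</div>"
    for line in diff_snapshots(select_region(before, pattern), select_region(after, pattern)):
        print(line)
```

Filtering before diffing is what keeps noise such as timestamps or rotating banners from triggering alerts.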
Crawlers & Search
Narrow and practical: point it at a docs site and get a file you can feed into a custom GPT. One of the most-starred projects in this category, though the last commits date to mid-2025. Reach for it when you want LLM context from a known site quickly, not when you need a general-purpose crawler.
Actively maintained and CLI-first, from the ProjectDiscovery team behind a wider security toolset. The headless mode handles JS-heavy sites that simpler crawlers miss. A reasonable default if you want to discover URLs and endpoints from the terminal.
The standard way to scale Scrapy horizontally: swap in the Redis scheduler and run many spiders against one shared queue. Only relevant if you are already on Scrapy and have outgrown a single process. Still maintained, which matters for plumbing this load-bearing.
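Swapping in the Redis scheduler is mostly a settings change. The fragment below is a sketch that assumes the scrapy-redis package and a Redis instance on localhost; the Redis URL is a placeholder, and the setting names should be checked against the version you install.

```python
# settings.py (fragment): assumes the scrapy-redis package is installed.

# Queue requests in Redis so every spider process pulls from the same place.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all processes using a shared Redis set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Assumed local Redis instance; point this at your own server.
REDIS_URL = "redis://localhost:6379/0"
```

With these in place, several processes running the same spider share one queue instead of each crawling independently.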
Built for recon and bug-bounty work rather than data extraction, so it favors speed and link discovery over structured output. Commits have been quiet since late 2024, so check it against your targets. Useful as an attack-surface mapper; for content crawling, look elsewhere.
A long-running take on search that puts the crawler and index on your own hardware and shares results across a P2P network. The appeal is privacy and independence from commercial engines, not result quality at Google scale. For an intranet search appliance or a self-hosted index you control, it is one of the few real options.
A management layer over Scrapyd: deploy, schedule, and watch spiders from a UI instead of the command line. Commits have been quiet since late 2024, so treat the stack versions with care. Worth a look if you run many Scrapy spiders and want a dashboard rather than building your own.
The crawler behind the Wayback Machine and many national web archives, which is the strongest possible credential for preservation work. It writes WARC and is built for fidelity and scale, not quick data extraction. If your goal is a faithful archive rather than parsed fields, start here.
A foundational large-scale crawler with deep roots in the lineage that produced Common Crawl. The Hadoop dependency makes it heavyweight, so it pays off at genuine web scale and feels like overkill below that. Mature and still maintained, but the operational cost is real.
A widely used crawling library for the PHP ecosystem, from a team with a long track record of maintained packages. Concurrency via Guzzle plus optional headless Chrome covers most site-crawl needs. If you are building in PHP, this is the obvious base rather than rolling your own.
A leading Rust option for crawling, built for high concurrency and used as the engine behind spider.cloud. Headless Chrome support and structured output make it a fit for AI data pipelines as well as plain link discovery. Reach for it when raw throughput matters and you are comfortable in Rust.
Purpose-built for news: it combines crawling with article extraction, so you get clean text and metadata without wiring up a separate parser. Common Crawl News integration makes it useful for building large article datasets. The narrow focus is the point; for general crawling it is the wrong tool.
The core replay engine behind many self-hosted web archives, from the Webrecorder project. It serves and records archives rather than discovering content, so it pairs with a crawler rather than replacing one. If you need to host or replay WARC/WACZ collections, this is the reference implementation.
A lower-level, customizable spider for PHP where you wire up your own discovery and persistence rules. It gives you more control than a batteries-included crawler but expects more setup. A reasonable pick for bespoke PHP crawls, though spatie/crawler is the more active default for most projects.
Data Parsing
The convenient default when you need to flatten a pile of mixed file formats into Markdown for a model. It is broad rather than deep, so for heavy PDF and table work Docling preserves more structure, but for everyday document-to-text prep this is the path of least resistance.
The stronger choice when document layout actually matters, with more attention to PDF structure and tables than the lighter converters. Reach for it when MarkItDown's output is too lossy for your RAG pipeline. Heavier to run, but that is the cost of preserving structure.
The Node.js equivalent of Beautiful Soup. Incredibly fast for server-side HTML parsing. Pairs perfectly with Crawlee or raw HTTP requests.
A long-standing pick for building news corpora and pulling article bodies plus metadata in one call. It still gets commits, but the extraction core shows its age against newer alternatives. For pure article-text accuracy on modern pages, trafilatura tends to do better.
The de facto document-querying layer under most Go scrapers. If you know jQuery selectors, you already know the API. There is no real competitor in the Go ecosystem, so this is simply what you use.
The canonical HTML parser for the JVM and the default building block for nearly all Java scraping. Mature, well-documented, and still actively maintained. If you are parsing HTML in Java, the decision is already made.
The reference implementation for pulling clean main-content text out of a cluttered page, and the one many other extractors are measured against. Battle-tested through Firefox, though the standalone repo moves slowly. Solid when you are already in a JS or browser context.
The backbone of Ruby scraping and the parser nearly every Ruby HTTP client hands off to. Fast because it leans on libxml2, with the usual native-extension build friction that comes with that. For HTML and XML work in Ruby there is no serious alternative.
One of the more accurate options for stripping boilerplate and recovering clean article text, which is why it shows up so often in corpus-building and LLM data-prep work. The CLI makes it easy to script over large crawls. If you want article text rather than full DOM access, start here.
A standards-faithful HTML parser for .NET and a common alternative to HtmlAgilityPack. Building a real DOM rather than a loose tree makes selector and LINQ queries predictable. A sensible default for new C# scraping work.
The low-level engine under Cheerio and a good chunk of the Node scraping stack. Streaming and tolerant of malformed markup, so it holds up on large or broken documents. Most people meet it through Cheerio; use it directly when you want speed and control over the parse.
A scriptable CLI for quick extraction jobs, easy to wire into cron for diff-and-notify monitoring. Useful for one-off and shell-driven work rather than as a library foundation. Note that development has been quiet since late 2024, so treat it as stable but not actively evolving.
An older article-extraction library, the Python port of the original Goose project. It still does the basic job of recovering body text and the lead image, but it sees little active development and newer extractors like trafilatura are generally more accurate on today's pages.
The standards-correct tree builder that jsdom and other tooling lean on, matching the HTML Living Standard closely. You rarely use it directly; you use it through whatever depends on it. Reach for it when spec fidelity matters more than convenience.
A common HTML-to-Markdown converter for Go, used in LLM data-prep pipelines that need readable text out of raw pages. The rule system lets you tune the output when the defaults are not quite right. A sensible building block if your stack is already Go.
When Beautiful Soup is too slow. C-backed, XPath support, handles malformed HTML. The performance choice for heavy parsing workloads.
Every Python developer's first scraping tool. Simple, well-documented, battle-tested. For parsing, not crawling.
MCP Servers
The reference implementation and by far the most-starred MCP repo, so it is the right place to see how a server should be built. The web-data pieces (Fetch, Puppeteer) are deliberately minimal starting points, not production scrapers. Reach for a dedicated server like Firecrawl or Playwright MCP once you outgrow them.
Built by Google's Chrome team, which makes it the credible choice when you want an agent to inspect a live, running page and its runtime state. The DevTools angle, performance traces and debugging, is what separates it from a generic browser-control server. Pick it for diagnosing a live site, not bulk scraping.
The de facto standard for giving an agent real browser control, backed by Microsoft and the Playwright project. The accessibility-tree approach is the key design choice: it feeds the model structured page state instead of pixels, which is cheaper and more reliable than vision. If you need an agent to click through a real site, start here.
Works through your existing browser, so the agent inherits your real sessions and cookies instead of spinning up a fresh headless instance. That helps with tasks behind a login, with the obvious tradeoff that you are handing an agent your authenticated browser. Commits have been quiet since early 2026, so check current activity before depending on it.
Same core idea as the other extension-based servers: drive your real browser to keep logged-in sessions and look less like a bot. The repo has seen little public commit activity since spring 2025, so treat it as stable-but-stale and verify it works against current browser versions before building on it.
The cleanest way to put Firecrawl's scraping and crawling behind an agent, and the natural pick if you already use Firecrawl in your data pipelines. It leans on the hosted Firecrawl API, so it is less a self-contained tool than a bridge to that service. That is the point if you want JS rendering and crawl handling done for you.
A focused conversion layer: point it at almost any file or URL and get back Markdown an LLM can read, OCR included. Useful at the front of a RAG pipeline when your sources are messy formats rather than clean HTML. It does one job, so pair it with a real fetcher or crawler for anything beyond conversion.
The official path to Tavily's search and extraction API, which is built for agent and RAG use rather than general web search. It is an API wrapper, so you are signing up for Tavily's service and key. Worth it if you want search results already shaped for an LLM instead of raw SERP HTML.
The no-API-key pitch is the draw: it scrapes several search engines directly, so you can give an agent search without signing up for a paid service. That convenience comes with the usual fragility of engine scraping, where layout changes and rate limits break things. Fine for hobby and local use, less so for anything you need running unattended.
Turns the Apify Actor catalog into agent-callable tools, so instead of writing a scraper you point the model at an existing Actor for social media, maps, search, or e-commerce. The value is tied to Apify's platform and pricing. Reach for it when you want breadth of ready-made scrapers over building and hosting your own.
A small, single-engine alternative to the multi-engine search servers: DuckDuckGo results plus basic page fetching, no key needed. Good when you want the simplest possible search tool for an agent and can live with one engine's coverage and the fragility of scraping it.
A bridge that puts the browser-use autonomous agent behind MCP, with Docker packaging and a VNC view so you can watch what it does. Pick it when you want goal-driven browsing rather than the step-by-step control a Playwright MCP gives. The autonomy is both the feature and the risk.
A more configurable take on the reference Fetch server: choose your output format and set custom headers, which matters when a site needs specific request handling. It is plain HTTP fetching with no JS rendering, so it suits static pages and APIs. For client-rendered sites you will still need a real browser server.
Proxy & Networking
The reference tool for seeing exactly what a site sends and receives over the wire, and for rewriting that traffic on the fly. If you need to reverse-engineer an undocumented API or debug why a scraper is being blocked, start here. It is an inspection and debugging tool, not a production proxy layer.
A well-known classic for assembling free proxy pools, but no longer actively maintained, with no commits since early 2024. Free public proxies are slow and unreliable by nature, so treat this as a reference rather than something to build a pipeline on. For maintained equivalents, look at proxy-scraper-checker or mubeng.
Run it locally and point your scraper at its API to pull validated free proxies on demand. The honest limit is the one every free-proxy pool faces: the underlying IPs are public, churn constantly, and won't survive serious anti-bot defenses. Fine for low-stakes crawling, not for anything you depend on.
Reach for this when you want to build a custom proxy rather than configure an existing one. The plugin model lets you intercept, inspect, or rewrite requests in your own Python, and the zero-dependency footprint keeps it easy to embed. More a framework to build on than a turnkey proxy.
Validate a proxy list, then put mubeng in front of your scraper as a single rotating endpoint. Running it as a local rotator keeps your scraping code unaware of the pool behind it, and the Go implementation makes it quick. A focused tool for the rotate-and-check job.
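As a rough sketch of what "a single rotating endpoint" means for the client side, the snippet below routes Python's standard-library HTTP client through one local proxy address. The address `http://127.0.0.1:8080` and the helper name `opener_via_rotator` are assumptions made for illustration, not mubeng defaults; substitute whatever address your rotator actually listens on.

```python
import urllib.request

# Hypothetical address of a locally running rotator; adjust to your setup.
ROTATOR_ENDPOINT = "http://127.0.0.1:8080"


def opener_via_rotator(endpoint=ROTATOR_ENDPOINT):
    """Return an opener that sends HTTP and HTTPS traffic through one proxy endpoint."""
    # The scraper only ever sees this single address. Which upstream proxy
    # serves a given request is decided by the rotator, not by this code.
    proxy_handler = urllib.request.ProxyHandler(
        {"http": endpoint, "https": endpoint}
    )
    return urllib.request.build_opener(proxy_handler)
```

Calling `opener_via_rotator().open(url, timeout=10)` then behaves like an ordinary request from the scraper's point of view, which is the point: swapping or re-validating the pool behind the endpoint needs no change to the scraping code.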
An actively maintained, Rust-based take on the same job ProxyBroker did: build and prune a free proxy list quickly, with richer filtering and output options. Good for that workflow, though the usual caveat holds: public proxies are only as good as the sources behind them.
The de facto way to route Node requests through a proxy, and a dependency you are probably already pulling in transitively. When you need an HTTP client or headless browser to honor a proxy, this is the layer that does it. Plumbing, not a product, and that is the point.
Maintained by Apify and used inside its scraping stack, including Crawlee, so it is exercised in real workloads. The draw is chaining and upstream auth: wrap an authenticated upstream proxy in a local endpoint a headless browser can use cleanly. The natural pick if you are already in the Apify or Crawlee world.
The standard answer for proxy rotation inside a Scrapy project. Drop in your proxy list and it handles rotation and ban detection without changing your spiders. If you are already on Scrapy, there is little reason to roll your own.
Scraping Frameworks
Hit 31K stars within months of launch. The adaptive selector engine, which finds elements even after a site redesign, is its standout feature and rare among scraping frameworks. Three fetcher modes, MCP server, BSD-3 licensed.
The OG Python scraping framework. Mature ecosystem, steep learning curve, but nothing matches it for large-scale structured crawling pipelines.
If you're a Go shop, this is the default starting point for a conventional crawling framework. Fast, concurrent, and well-maintained, though its ecosystem is smaller than Scrapy's.
From the Apify team. Best-in-class for JavaScript scraping with built-in Playwright and Cheerio support. The TypeScript-first approach is refreshing.
For years this was the GUI-driven alternative to Scrapy, with a built-in scheduler and result viewer that made it approachable. It is now archived and no longer maintained, so treat it as a reference rather than a base for new work. Reach for Scrapy or Crawlee instead.
Kenneth Reitz's attempt to give requests human-friendly HTML parsing, bundling lxml, pyquery and JS rendering behind one API. It has not seen meaningful updates in years, and the pyppeteer dependency it leans on for JS is itself stale. Fine for quick one-off scripts, but use Playwright or Crawlee for anything you have to maintain.
Essentially Scrapy's architecture rebuilt for the JVM, which makes it the obvious starting point if your stack is already Java. Still maintained, and the component model keeps it readable. If you are not tied to Java, the Python and Node ecosystems move faster.
The Python port of Apify's Crawlee, and one of the more credible modern challengers to Scrapy. It works across raw HTTP, BeautifulSoup, Parsel and Playwright behind one API, with request queues and proxy rotation handled for you, and it is explicitly pitched at feeding AI and RAG pipelines. Actively developed and a sensible default for a new crawler today.
You hand it a sample of what you want and it infers the selectors, which removes the most tedious part of writing a scraper by hand. The tradeoff is fragility: example-driven rules break when page structure shifts, and it does no JS rendering. Best for quick extraction on stable pages, not durable production pipelines.
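To make the example-driven idea and its fragility concrete, here is a toy sketch using only Python's standard library. It is not the tool's actual algorithm: the functions `infer_path` and `extract` are hypothetical names, and the "rule" is simply the chain of tag and class names leading to the sample text.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _TextPathCollector(HTMLParser):
    """Record every non-empty text node together with its ancestor path."""

    def __init__(self):
        super().__init__()
        self._stack = []
        self.nodes = []  # list of (path tuple, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class") or ""
        # A path step is the tag plus its class attribute, e.g. "span.price".
        self._stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy markup.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i].split(".", 1)[0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self._stack), text))


def _collect(html):
    collector = _TextPathCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def infer_path(html, sample):
    """Return the ancestor path of the first text node equal to `sample`, or None."""
    for path, text in _collect(html):
        if text == sample:
            return path
    return None


def extract(html, path):
    """Return the text of every node that sits at exactly the given path."""
    return [text for node_path, text in _collect(html) if node_path == path]
```

Given one sample value from a listing page, `infer_path` yields a rule that `extract` can reuse to pull the sibling values. Rename a single class in the markup and the same rule returns nothing, which is the breakage described above.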
The actively maintained successor to the original sylvinus/node-crawler, with the queueing and rate-limiting plumbing you would otherwise write yourself. The Cheerio integration is comfortable if you already think in jQuery selectors. For larger or browser-heavy jobs, Crawlee is the broader option.
Ferret's bet is its own query language, FQL, which hides whether a page is static HTML or needs a real browser, so you write one declarative query for both. That is an unusual and genuinely useful abstraction if you can commit to the DSL. Actively maintained, and a good fit for Go teams that want extraction logic kept out of imperative code.
Its composable, selector-driven API makes simple structured scrapes very concise, including pagination following and streaming output. It predates the browser-automation era and does no JS rendering, so it suits server-rendered pages. Still seeing commits, but pair it with a headless browser when sites need one.
Simple alternative to Scrapy for small projects. Wraps requests + BeautifulSoup. Perfect for quick scripts, not for production pipelines.
Where most frameworks treat anti-bot evasion as someone else's problem, Botasaurus builds a humanized, stealthed driver in as a first-class feature. That makes it the obvious tool to reach for when Cloudflare, rather than the parsing, is the actual blocker. As with all evasion tooling, expect breakage as detection vendors update, so treat the bypass as a moving target.
A long-lived staple for Ruby scraping, strongest when the work is stateful: logging in, submitting forms, following links while it tracks cookies for you. It does not run JavaScript, so single-page apps are out of scope. Still maintained, and the natural choice if Ruby is your language.
The classic way to render JavaScript pages inside a Scrapy pipeline, plugging Splash in as the rendering backend. It still receives occasional maintenance, but Splash itself has fallen out of favour against Playwright-based renderers. Useful if you already run Splash; for new Scrapy projects, look at the Playwright integration first.
It brings Scrapy's middleware-and-pipeline model to Go, adding JS rendering and caching, so Go teams get a familiar mental model with native concurrency. A reasonable choice when you want Scrapy's structure without leaving the language. Less actively developed than Ferret, another Go option listed here.
This is a site-mirroring tool, not a data-extraction framework: it pulls down whole pages with their assets for offline use, with a plugin system for rewriting links and filtering resources. Reach for it when you need a faithful local copy of a site rather than structured fields out of it. Actively maintained and well-scoped for that job.
Roach ports Scrapy's spider, middleware and pipeline model to PHP, and is framework-agnostic with a Laravel adapter for teams already there. It fills a real gap, since PHP has had little in this class. The clear pick if you want to keep scraping inside an existing PHP or Laravel codebase.