Open source

Notable repositories

Curated selection of search, extraction, and browser automation repositories. Each entry includes editorial context.

AI Scraping

Firecrawl · 132K+ stars · TypeScript

Turn websites into LLM-ready markdown with a single API call.

The standard for AI-friendly web scraping. Open-source core, excellent hosted API. If you're building RAG or AI data pipelines, start here.

Browser Use · 98K+ stars · Python

AI framework that gives LLMs the ability to control a web browser.

The most-starred AI browser tool by a wide margin. It lets AI agents drive a browser the way a human would.

Crawl4AI · 68K+ stars · Python

Open-source async crawler optimized for LLM data extraction.

The fastest-growing AI scraping tool. Async-first, produces clean markdown for LLM pipelines. The open-source answer to Firecrawl.

LangExtract · 36K+ stars · Python

A Python library from Google for extracting structured information from unstructured text using LLMs, with precise source grounding and interactive visualization.

The useful part is source grounding: every extracted field maps back to the exact span it came from, which matters when you need to audit the output. Built around Gemini but applicable to document and text extraction generally. Reach for it when the job is pulling reliable structure out of text you already have rather than scraping live pages.
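
The grounding idea can be sketched without any LLM: each extraction carries the character offsets of the span it came from, and an audit step checks that those offsets reproduce the extracted text. This is an illustration only, not LangExtract's API; the `Extraction` class, the `is_grounded` helper, and the sample sentence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """One extracted field plus the character span it was taken from."""
    label: str
    text: str
    start: int  # inclusive character offset into the source
    end: int    # exclusive character offset into the source


def is_grounded(source: str, extraction: Extraction) -> bool:
    """Return True if the recorded span reproduces the extracted text exactly."""
    return source[extraction.start:extraction.end] == extraction.text


# Made-up sample text, used only to exercise the check.
source = "Invoice 1042 was issued to Acme Corp on 2024-03-01."
start = source.index("Acme Corp")
grounded = Extraction("customer", "Acme Corp", start, start + len("Acme Corp"))
ungrounded = Extraction("customer", "Acme Inc", start, start + len("Acme Inc"))

print(is_grounded(source, grounded))    # the span matches the source text
print(is_grounded(source, ungrounded))  # the text is not what sits at that span
```

An extraction that fails this check is one the model paraphrased or invented, which is exactly the case an audit needs to surface.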

ScrapeGraphAI · 27K+ stars · Python

LLM-orchestrated web scraping with automatic graph-based extraction.

Novel approach: uses LLMs to plan and execute scraping tasks. Impressive for complex extraction but LLM costs add up at scale.

Stagehand · 23K+ stars · TypeScript

Natural language browser automation built on Playwright.

From the Browserbase team. Tell it what to do in plain English and it drives Playwright. Early-stage, but the developer experience is unmatched.

Maxun · 15K+ stars · TypeScript

An open-source, self-hosted no-code platform for web scraping, crawling, and AI data extraction that turns websites into structured APIs.

You build a scraper by pointing and clicking rather than writing selectors, then get an API or spreadsheet out the other end. Best for non-engineers or teams who would rather not maintain scraper code, with the usual tradeoff that point-and-click breaks more easily than hand-written extraction. One of the more popular self-hosted options in this space.

Jina Reader · 11K+ stars · TypeScript

A service that converts any URL into clean, LLM-friendly Markdown by prefixing it with r.jina.ai, and can be self-hosted.

Prepend r.jina.ai to a URL and get back clean Markdown instead of raw HTML. The obvious comparison is Firecrawl; Reader covers the narrower single-page-to-Markdown job well, with both a hosted endpoint and a self-host option. A reasonable first stop when you just need readable text for a RAG or agent pipeline.
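
The prefixing step is simple enough to show in a few lines. This sketch only builds the Reader URL and leaves the fetch to whatever HTTP client you already use; it assumes the hosted endpoint is reached over `https://` and that the target URL is appended unchanged. `reader_url` is a hypothetical helper name.

```python
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Build the Reader URL for a page by prepending the r.jina.ai prefix."""
    if not target_url.startswith(("http://", "https://")):
        raise ValueError("target_url must be an absolute http(s) URL")
    return READER_PREFIX + target_url


print(reader_url("https://example.com/docs/intro"))
```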

llm-scraper · 6.8K+ stars · TypeScript

A TypeScript library that turns any webpage into structured data using LLMs, with Zod schemas defining the output and Playwright handling the browser.

You define the output shape with a Zod schema and get typed data back, with the choice of local or hosted models. Because it runs on Playwright, you get real browser rendering rather than raw HTML parsing. The natural pick for TypeScript teams that want schema-validated extraction without leaving their stack.

ContextGem · 1.8K+ stars · Python

A Python framework for LLM-based structured extraction from documents, aimed at minimal boilerplate.

Document-focused rather than web-focused, with topics pointing at contract analysis and DOCX handling. The pitch is less setup code to get from a document to structured fields. Smaller and newer than the bigger names here, so worth a look when your inputs are files rather than live pages.

Webclaw · 1.4K+ stars · Rust

Fast, local-first web content extraction for LLMs. CLI, REST API, and MCP server – built in Rust.

Rust-powered Firecrawl alternative with TLS fingerprinting instead of headless browsers. Drop-in compatibility with Firecrawl's v2 API, plus a 12-tool MCP server. AGPL-3.0 licensed. Early but fast-moving.

Parsera · 1.3K+ stars · Python

A lightweight Python library for scraping websites with LLMs using minimal code and token-efficient prompts, built on Playwright.

The aim is keeping token usage down while still doing LLM-driven extraction, which matters once you are scraping at volume and watching API costs. Playwright underneath handles the browsing. A reasonable lightweight option for LLM scraping in Python, though it is smaller and has been pushed less recently than the leaders here.

AI Web Agents

AgenticSeek · 26K+ stars · Python

A fully local autonomous agent that browses the web and writes code without API keys or cloud bills.

A Manus-style agent built to run entirely on your own hardware, trading hosted-model quality for zero per-task cost and full data control. Reach for it when local-only operation is a hard requirement; if you just want the best output, a cloud-model agent will beat it. One of 2025's fastest-growing agent projects.

Skyvern · 21K+ stars · Python

An LLM- and computer-vision-driven agent that automates browser workflows across websites without per-site selectors.

The selector-free approach is the point: it reads the page like a user instead of breaking every time the DOM shifts, which suits long-tail sites you do not control. A reasonable default for teams who want browser automation that survives layout changes, with the usual vision-agent cost in latency and model spend per run.

nanobrowser · 13K+ stars · TypeScript

An open-source Chrome extension that runs multi-agent web-automation workflows using your own LLM API key.

A self-hosted alternative to OpenAI Operator that keeps you on your own model keys and inside your own browser session. A good fit if you want Operator-style automation without handing a vendor your accounts and credentials. Check recent commit activity before betting a product on it.

BrowserOS · 11K+ stars · TypeScript

An open-source agentic browser, built on Chromium, positioned against ChatGPT Atlas, Perplexity Comet, and Dia.

A full agentic browser rather than an extension or library, aimed at people who want the Atlas/Comet experience without a closed vendor. The Chromium base lets you bring your own models and self-host. Worth a look if you are evaluating agentic browsers but want to avoid lock-in.

Magentic-UI · 9.9K+ stars · Python

An experimental Microsoft Research agent that operates across the browser and the local file system.

A research project from Microsoft built on AutoGen, with a human-in-the-loop focus rather than full autonomy. Treat it as a reference design and a place to study computer-use UX, not a production tool. Useful if you want to see how Microsoft is thinking about agent control surfaces.

LaVague · 6.4K+ stars · Python

A Large Action Model framework for building web agents that turn natural-language goals into Selenium or Playwright actions.

An early entrant in the natural-language-to-browser-action space, sitting on the Selenium and Playwright stacks most teams already know. Development has been quiet since early 2025, so newer browser-agent frameworks are the safer choice for fresh projects. Still readable as a clean example of the Large Action Model pattern.

Magnitude · 4.1K+ stars · TypeScript

An open-source, vision-first browser agent for the TypeScript and Playwright ecosystem.

Pitches itself as a vision-first alternative to browser-use and Stagehand, which is the comparison worth running. The vision-first design helps on sites where the DOM is hostile, at the cost of more model calls per step. A reasonable pick for TypeScript teams already on Playwright.

Notte · 2K+ stars · Python

An open-source framework for building web agents and deploying serverless web-automation functions on managed browser infrastructure.

Aims past the local-script stage at running agents as serverless functions on hosted browser infra, the part most teams underestimate. Of interest if you have outgrown running Playwright on your own boxes and want managed scaling. Smaller and newer than the headline frameworks, so expect a thinner ecosystem.

Tarsier · 1.8K+ stars · Jupyter Notebook

Vision utilities from Reworkd, including element tagging and OCR, that make webpages legible to multimodal LLM agents.

Not an agent itself but a building block: it tags interactive elements and runs OCR so a vision model can reason about a page. Useful if you are assembling your own agent and need the page-understanding layer. The repo has had no commits since late 2024, so vet it before depending on it.

HyperAgent · 1.4K+ stars · TypeScript

An AI-native browser-automation framework from Hyperbrowser that extends Playwright with natural-language commands.

Playwright with a natural-language layer on top, so you keep the familiar API and drop the brittle hand-written selectors. The obvious fit for teams already on Hyperbrowser's hosted browser infrastructure. Compare it against browser-use and Stagehand before committing.

Agent-E · 1.2K+ stars · Python

A hierarchical browser-automation agent from Emergence AI, used as a reference web agent on the WebVoyager benchmark.

A research-grade web agent with a hierarchical planning design, tied to Emergence AI's hosted web-automation API. More valuable as a benchmark reference and an architecture to study than as a drop-in library. Look here if you care about how WebVoyager-style agents are structured.

NativeMind · 1.1K+ stars · TypeScript

A fully private, open-source browser assistant that runs models on-device for in-page tasks.

The selling point is on-device inference through tools like Ollama, so pages and prompts never leave your machine. The right choice when privacy or offline operation is non-negotiable, at the cost of being capped by whatever model your hardware can run. Less capable than a cloud-model assistant, by design.

BrowserBee · 978 stars · TypeScript

An open-source, AI-powered browser-assistant extension that drives the page through natural language.

Bills itself as "Cline for web browsing," which captures the idea: a conversational assistant that acts inside your browser rather than a headless automation framework. Better for interactive, ad-hoc tasks than scripted pipelines. Activity slowed in late 2025, so confirm it is still moving before relying on it.

Anti-Detection

FlareSolverr · 14K+ stars · Python

Proxy server that solves Cloudflare and DDoS-Guard challenges with a real headless browser and hands back the cookies and user-agent so plain HTTP clients can get through.

The pragmatic answer when your scraper or app hits a Cloudflare wall and you don't want to embed a browser everywhere. You run it as a Docker service and point your HTTP client at it. Anti-bot vendors break it periodically, so treat it as a moving target rather than a fixed solution.
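
A rough sketch of the client side of that hand-off: build the JSON body for a solve request, then turn the returned cookies and user-agent into headers a plain HTTP client can reuse. The `request.get` command and the `solution.cookies` / `solution.userAgent` fields follow my reading of FlareSolverr's v1 API, so verify them against the version you run. The helper names and the sample response are hypothetical, and no network call is made here.

```python
import json


def build_request(url: str, max_timeout_ms: int = 60000) -> bytes:
    """Serialize the JSON body for a FlareSolverr `request.get` command."""
    payload = {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}
    return json.dumps(payload).encode("utf-8")


def session_from_solution(response: dict) -> dict:
    """Turn a solved response into headers a plain HTTP client can reuse."""
    solution = response["solution"]
    cookie_header = "; ".join(
        f"{c['name']}={c['value']}" for c in solution["cookies"]
    )
    return {"User-Agent": solution["userAgent"], "Cookie": cookie_header}


# Hypothetical response, trimmed to the fields used above.
sample = {
    "status": "ok",
    "solution": {
        "userAgent": "Mozilla/5.0 (example)",
        "cookies": [{"name": "cf_clearance", "value": "abc123"}],
    },
}
print(session_from_solution(sample))
```

The body from `build_request` would be POSTed to the service's endpoint; the headers from `session_from_solution` then go on the follow-up requests, and both need to be sent together because the clearance cookie is tied to the user-agent that earned it.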

camoufox · 9.2K+ stars · C++

Custom anti-detect build of Firefox with Playwright bindings that spoofs browser fingerprints at the C++ level.

Patching the fingerprint inside the browser binary, rather than injecting JavaScript after the fact, is what separates it from the spoof-by-script tools. One of the more credible open stealth browsers for scraping, and the Firefox-side counterpart to the Chromium undetected drivers. The tradeoff is a heavy custom build you have to keep current with upstream.

puppeteer-extra · 7.4K+ stars · TypeScript

Plugin framework for Puppeteer (and Playwright via playwright-extra) with stealth and ad-blocking plugins.

The plugin system behind most stealth scraping in Node. Its stealth plugin patches the many signals headless Chrome leaks, and works with both Puppeteer and Playwright. Maintenance has slowed, but it is still the default starting point.

cloudscraper · 6.6K+ stars · Python

Python requests wrapper that bypasses Cloudflare's JavaScript anti-bot interstitial pages.

A long-standing, lightweight option when you want to stay inside the requests workflow instead of spinning up a browser. It only addresses the JS-challenge layer, so it loses to newer Cloudflare defenses that camoufox or FlareSolverr handle. Commits have slowed, so verify it still clears your target before depending on it.

curl-impersonate · 6K+ stars · C

curl with browser TLS fingerprints to bypass anti-bot detection.

Clever approach: make curl look like a real browser at the TLS level. Works surprisingly well against Cloudflare and Akamai.

curl_cffi · 5.8K+ stars · Python

Python HTTP client that binds curl-impersonate to mimic real browser TLS, JA3, and HTTP/2 fingerprints.

Reach for this when blocks happen at the TLS or fingerprint layer and a browser would be overkill. It gives you requests-like ergonomics with a handshake that looks like Chrome or Safari, which clears a large class of fingerprint-based bot walls. Actively maintained, and the most popular Python entry point to the curl-impersonate work.
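
To make the fingerprint layer concrete, here is a minimal sketch of how a JA3 fingerprint is derived from ClientHello fields: five fields joined by commas, values within a field joined by dashes, then an MD5 of the resulting string. It is not curl_cffi code, and the numeric values are illustrative rather than a real browser's handshake. Full implementations also drop GREASE values before hashing, which this sketch omits.

```python
import hashlib
from typing import Sequence


def ja3_string(version: int, ciphers: Sequence[int], extensions: Sequence[int],
               curves: Sequence[int], point_formats: Sequence[int]) -> str:
    """Join ClientHello fields into the comma/dash-delimited JA3 string."""
    fields = [ciphers, extensions, curves, point_formats]
    return ",".join([str(version)] + ["-".join(map(str, f)) for f in fields])


def ja3_hash(ja3: str) -> str:
    """Return the MD5 hex digest that servers compare against known clients."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Illustrative values only (771 is the decimal form of TLS 1.2's 0x0303).
example = ja3_string(771, [4865, 4866, 4867], [0, 10, 11], [29, 23, 24], [0])
print(example)
print(ja3_hash(example))
```

Because the order of ciphers and extensions feeds the hash, a stock Python or Go TLS stack produces a different fingerprint from Chrome even when the HTTP headers are identical, which is the gap impersonation clients close.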

nodriver · 4.4K+ stars · Python

Python web automation without a traditional webdriver dependency.

Successor to undetected-chromedriver. Removes the webdriver dependency entirely, making detection even harder.

patchright (Python) · 3.5K+ stars · TypeScript

Patched, drop-in replacement for Playwright that removes the CDP and runtime leaks bot detectors look for.

If you already have Playwright code, this is the lowest-friction way to make it harder to fingerprint: same API, fewer automation tells. It targets the leak surface (the Runtime.Enable tell and similar) rather than spoofing fingerprints, so pair it with TLS or fingerprint tooling for the full picture.

utls · 2.4K+ stars · Go

Fork of Go's standard TLS library that exposes low-level control over the ClientHello for fingerprint mimicry.

The dependency under most Go TLS-impersonation clients, including tls-client. Most teams will use it indirectly through a higher-level wrapper rather than wiring ClientHello control by hand; reach for it directly only when you need control the wrappers don't expose. Maintained, with roots in the anti-censorship community rather than scraping.

fingerprint-suite · 2.4K+ stars · TypeScript

Apify's TypeScript toolkit that generates and injects realistic, internally consistent browser fingerprints into Playwright and Puppeteer.

The fingerprint-consistency piece of an anti-detection stack: it produces headers, navigator properties, and other signals that agree with each other instead of contradicting. Backed by Apify and a sensible fit in a Crawlee-based setup. It handles spoofing, not TLS or CDP leaks, so it covers one layer rather than the whole problem.

CloudflareBypassForScraping · 2.4K+ stars · Python

Lightweight script that drives a real browser via DrissionPage to pass Cloudflare verification for scraping.

A self-hosted, minimal alternative to running FlareSolverr or paying for a hosted solver, useful for small jobs where you want the bypass logic in your own process. It is a script rather than a managed service, so expect to do more of the upkeep yourself as Cloudflare changes.

surf · 1.7K+ stars · Go

Go HTTP client with Chrome and Firefox impersonation, HTTP/3 QUIC fingerprinting, and JA3/JA4 TLS emulation.

A more modern take on the Go impersonation client than tls-client, with HTTP/3 and JA4 coverage the older libraries don't all have. The smaller user base is the tradeoff against bogdanfinn/tls-client's more established footprint. Worth a look if you need QUIC-level fingerprinting from Go.

tls-client · 1.7K+ stars · Go

Go HTTP client built on utls that spoofs browser TLS, JA3, and HTTP/2 fingerprints, with bindings for other languages.

A common building block for fingerprint-resistant scrapers in Go, and usable from other languages through its bindings. It sits a level above utls, so you select a browser profile instead of hand-building a ClientHello. The well-trodden choice for TLS impersonation without a browser; surf is the newer alternative if you need HTTP/3.

puppeteer-real-browser · 1.6K+ stars · JavaScript

Puppeteer launcher that behaves like a real browser to clear Cloudflare and similar bot-detection captchas while keeping the standard Puppeteer API.

The Puppeteer-side take on the undetected-chromedriver idea: keep your existing automation code, get fewer detection tells. Convenient if you're already on Puppeteer, but commit activity has gone quiet, so confirm it still beats your target's current defenses before committing to it.

CycleTLS · 1.5K+ stars · Go

Library that spoofs TLS and JA3 fingerprints from both Go and JavaScript.

The reason to pick this over the Go-only clients is the JavaScript surface: it lets Node scrapers present a browser-identical handshake without leaving the JS ecosystem. Actively maintained. If you're purely in Go, tls-client or surf are the more conventional choices.

rebrowser-patches · 1.4K+ stars · JavaScript

Patches for Puppeteer and Playwright that strip the CDP and runtime leak signals (such as the Runtime.Enable tell) used to fingerprint automation.

A focused fix for a specific, well-documented leak that Cloudflare and DataDome key on, and it can be toggled on or off on demand. It overlaps with patchright in goal; the difference is that this is a patch layer over your existing install rather than a full drop-in fork. Commits have slowed, so check it still covers the current detection signals.

zendriver · 1.3K+ stars · Python

Async-first, CDP-based undetectable web-automation framework forked from nodriver, with Docker support.

A newer Python option that drives Chrome over the DevTools Protocol with no Selenium or webdriver dependency, which removes a common detection vector. It inherits nodriver's design and adds async ergonomics and containerization. A reasonable pick for Python teams who want a stealth-first browser-driving framework, though it is younger and less proven than the established names.

Browser Automation

Puppeteer · 94K+ stars · TypeScript

Chrome DevTools Protocol automation for Node.js.

Still the most popular browser automation tool by stars. Playwright is technically superior but Puppeteer's ecosystem is massive.

Playwright · 90K+ stars · TypeScript

Cross-browser automation library for Chromium, Firefox, and WebKit.

The backbone of modern scraping stacks. Microsoft-backed, fast, reliable. If you're doing JS-rendered scraping, you're probably using this.

agent-browser · 36K+ stars · Rust

A browser automation CLI from Vercel Labs, written in Rust, for AI agents to drive a real browser.

Built for the agent use case rather than retrofitted from a testing tool. The high star count owes a lot to Vercel's gravity, so judge it on whether the agent-first ergonomics fit your stack rather than on popularity.

Selenium · 34K+ stars · Java

Browser automation framework supporting multiple languages and browsers.

The grandfather of browser automation. Still relevant for legacy projects and teams with existing Selenium infrastructure. Modern projects should pick Playwright.

lightpanda · 31K+ stars · Zig

A headless browser written from scratch in Zig for AI and automation workloads, speaking the Chrome DevTools Protocol.

The bet is rebuilding the browser rather than wrapping Chrome, aiming for lower memory and CPU. Because it speaks CDP it can slot in behind Playwright or Puppeteer clients, but a from-scratch engine will lag Chromium on rendering edge cases. Worth a look if browser resource cost dominates your scraping bill.

microsoft/playwright-python · 14K+ stars · Python

The official Python bindings for Playwright, automating Chromium, Firefox and WebKit through one API.

The default for JS-heavy scraping in Python: cross-browser and maintained by Microsoft. Reach for it over Selenium unless you have a specific reason not to. It does no anti-bot evasion on its own, so pair it with a stealth layer for protected sites.

browserless · 13K+ stars · TypeScript

Dockerized headless-browser infrastructure that exposes Puppeteer and Playwright over a web service.

Solves the operational half of browser automation: running and pooling browsers at scale instead of writing the scripts. Self-host it or use the hosted cloud. The license is free for non-commercial use only, so check the terms before building a product on it.

chromedp/chromedp · 13K+ stars · Go

An idiomatic Go package for driving Chrome DevTools Protocol browsers with no external dependencies.

The standard for browser automation in Go: dependency-free and the most-used browser automation library in the language. If you are already in a Go codebase this is the natural pick. The main alternative, go-rod, adds higher-level conveniences like auto-waiting, so compare ergonomics first.

undetected-chromedriver · 12K+ stars · Python

Custom Selenium chromedriver that avoids detection by anti-bot services.

Solves a real problem: getting past Cloudflare and similar anti-bot systems from existing Selenium code. Fragile by nature (Chrome updates break it regularly); nodriver, listed under Anti-Detection, is its successor.

SeleniumBase · 12K+ stars · Python

A Python framework for UI testing, web scraping and stealth automation built on Selenium with a CDP-based undetected mode.

The reason scrapers reach for it over plain Selenium or Playwright is its UC/CDP stealth mode against bot detection. The testing-framework heritage means a lot of surface area to learn, but the anti-detection work is the draw if you are hitting protected sites.

Steel Browser · 7.2K+ stars · TypeScript

An open-source browser API for AI agents with built-in session management, proxies and CAPTCHA handling.

A self-hostable browser backend that bundles the infrastructure most agent projects end up rebuilding: sessions, proxies, CAPTCHA. Closest in spirit to browserless but aimed at the AI-agent workflow. Useful if you want to own the browser layer instead of renting a hosted API.

go-rod/rod · 7K+ stars · Go

A Chrome DevTools Protocol driver for Go offering high-level web automation and scraping with auto-waiting.

The higher-level alternative to chromedp in Go, with auto-waiting and stealth helpers that cut boilerplate. Pick it over chromedp if you value convenience methods, chromedp if you want a leaner core. Both are actively maintained.

pydoll · 6.9K+ stars · Python

An async Python library that automates Chromium without a WebDriver, with native CAPTCHA bypass and realistic interactions.

Drives Chromium over CDP directly instead of going through a WebDriver, with anti-detection and CAPTCHA handling built in. A newer, fast-growing entrant in the same space as SeleniumBase's stealth mode. Promising for protected sites, though it has less of a track record than the older frameworks.

CodeceptJS · 4.2K+ stars · JavaScript

A Node.js end-to-end testing framework with one high-level API over Playwright, Puppeteer and WebDriver backends.

A testing tool first, not a scraping library. Reach for it when you want a readable BDD-style API that abstracts over multiple browser drivers. If your goal is data extraction, the underlying drivers like Playwright are a more direct fit.

Splash · 4.2K+ stars · Python

A lightweight, scriptable browser-as-a-service with an HTTP API for JavaScript rendering in scraping pipelines.

Once the standard way to add JS rendering to a Scrapy pipeline, exposed as a Lua-scriptable HTTP service. No longer actively developed, with no commits since 2024. New projects should use Playwright or a headless-browser service instead, though it still works for legacy Scrapy setups.

pyppeteer · 3.9K+ stars · Python

An unofficial Python port of Puppeteer for controlling headless Chromium over the DevTools Protocol.

For years the go-to async Puppeteer-style library for Python, but it has gone quiet, with no commits since 2024. The official playwright-python now covers the same ground with active Microsoft backing. Reach for Playwright on anything new; pyppeteer mostly survives in older codebases.

hardkoded/puppeteer-sharp · 3.9K+ stars · C#

A .NET port of the official Puppeteer API for driving headless Chromium and Chrome from C#.

The clear choice for browser automation and dynamic-page scraping in .NET, actively maintained and faithful to the Puppeteer API. If you are in C# there is little reason to look elsewhere.

puppeteer-cluster · 3.5K+ stars · TypeScript

A library that runs a pool of parallel Puppeteer instances with queuing, retries and error handling.

A focused utility for the concurrency problem: managing many Puppeteer browsers for queued crawl jobs without writing your own pool. Useful glue once you have settled on Puppeteer and need to scale out. It is tied to Puppeteer specifically, so it does not help Playwright users.
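
The pattern itself (a fixed worker pool draining a queue, with failed jobs requeued up to a retry limit) can be sketched in a few lines. This is a Python `asyncio` illustration of the idea, not puppeteer-cluster's JavaScript API; `run_pool` and `flaky_fetch` are hypothetical names, and `flaky_fetch` stands in for a real browser task.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def run_pool(jobs: List[str],
                   task: Callable[[str], Awaitable[str]],
                   concurrency: int = 2,
                   max_retries: int = 2) -> Dict[str, str]:
    """Process jobs with a fixed number of workers, retrying failures."""
    queue: asyncio.Queue = asyncio.Queue()
    for job in jobs:
        queue.put_nowait((job, 0))
    results: Dict[str, str] = {}

    async def worker() -> None:
        while True:
            job, attempt = await queue.get()
            try:
                results[job] = await task(job)
            except Exception as exc:
                if attempt < max_retries:
                    queue.put_nowait((job, attempt + 1))  # requeue for a retry
                else:
                    results[job] = f"failed: {exc}"       # give up, record error
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    await queue.join()   # returns once every job, including retries, is done
    for w in workers:
        w.cancel()       # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


attempts: Dict[str, int] = {}


async def flaky_fetch(url: str) -> str:
    """Stand-in for a browser task; fails on the first try for one URL."""
    attempts[url] = attempts.get(url, 0) + 1
    if url.endswith("/b") and attempts[url] == 1:
        raise RuntimeError("timeout")
    return f"ok:{url}"


results = asyncio.run(
    run_pool(["https://example.com/a", "https://example.com/b"], flaky_fetch)
)
print(results)
```

Requeueing before `task_done()` keeps the queue's unfinished count above zero, so `join()` cannot return while a retry is still pending.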

CAPTCHA Solving

ddddocr · 14K+ stars · Python

A pretrained, training-free OCR and object-detection model for recognizing text and slider CAPTCHAs, packaged for pip.

The de-facto open-source CAPTCHA recognizer in the Chinese scraping world, with the star count to match. Best for everyday text and slider CAPTCHAs where you want results without training a model yourself. It won't touch behavioral challenges like reCAPTCHA v3 or Turnstile.

NopeCHA Extension · 10K+ stars · N/A

A browser extension that solves reCAPTCHA, hCaptcha, FunCaptcha, Turnstile and text CAPTCHAs, with hooks for Selenium, Puppeteer and Playwright.

Broad coverage across the major CAPTCHA types in a single extension, which is rare. The catch is that it's the open-source client for a paid solving service, so the actual solving happens behind an API rather than locally. Reach for it when you need one tool that handles many challenge formats inside an automated browser.

Buster · 9.1K+ stars · JavaScript

A Chrome, Edge and Firefox extension that solves reCAPTCHA by running its audio challenge through speech-to-text.

Narrow by design and well maintained: it clears reCAPTCHA's audio path and nothing else. That makes it a clean, locally running option for accessibility and light automation, but it covers only reCAPTCHA, not hCaptcha or Turnstile. The audio route has long been one of the more durable reCAPTCHA bypasses, which explains the project's large following.

captcha_trainer · 3.2K+ stars · Python

A TensorFlow framework using CNN/ResNet/DenseNet with GRU/LSTM and CTC to train custom image-CAPTCHA recognition models.

The build-your-own option: it trains models, it doesn't solve anything out of the box. Worth it only if you face a bespoke image CAPTCHA that ddddocr can't handle and you have labeled data to train on. For most users the pretrained route is the faster path.

hcaptcha-challenger · 2.3K+ stars · Python

A Python library that solves hCaptcha image challenges using multimodal LLMs and YOLO models, usable from Playwright.

One of the most-starred open-source hCaptcha solvers, and it has moved toward multimodal models rather than the fragile per-challenge classifiers older solvers relied on. The clear pick if hCaptcha specifically is your blocker and you're already automating with Playwright. LLM-backed solving means trading some latency and API cost for accuracy.

GoogleRecaptchaBypass · 1.8K+ stars · Python

A Python library that solves reCAPTCHA v2 and v3 via the audio-challenge speech-to-text approach, with DrissionPage and Selenium support.

The same audio-challenge technique as Buster but as a scriptable Python library rather than a browser extension, which fits a scraping pipeline better. Good fit if you're driving DrissionPage or Selenium and need reCAPTCHA cleared in code. Google tightens the audio path periodically, so expect maintenance churn.

Turnstile-Solver · 829 stars · Python

A Python solver that obtains Cloudflare Turnstile tokens through Patchright/Playwright browser automation and exposes them via an API server.

Focused squarely on Cloudflare Turnstile, which most of the broader CAPTCHA tools don't handle. It drives a patched browser to capture a real token, then serves it over an API, with multi-threaded execution for scraping setups. Smallest project of the group and Turnstile-only, so treat it as a targeted add-on rather than a general solver.

Change Detection

Huginn · 49K+ stars · Ruby

A self-hosted Ruby platform for building agents that monitor the web, scrape pages, watch feeds, and act or notify on changes.

An open-source, self-hosted take on IFTTT or Zapier, with agents that run on your own server. Reach for it when you want web-watching plus arbitrary if-this-then-that automation in one place; the tradeoff is a Ruby app you host and learn. One of the most-starred projects in this category, and still actively maintained.

RSSHub · 44K+ stars · TypeScript

An open-source feed generator that turns thousands of sites without RSS into feeds through per-site route adapters.

The de facto way to get a clean RSS feed out of a site that doesn't offer one, backed by a large catalog of community-maintained routes. Best when a route already exists for your target or you're willing to write one; for arbitrary page diffing, changedetection.io is the better fit. Very actively developed.

changedetection.io · 31K+ stars · Python

A self-hosted tool for website change detection and monitoring, with text, XPath, and JSON diffing, restock and price-drop alerts, and notifications across many channels.

The default open-source pick for watching a specific page and getting alerted when it changes, and a Visualping alternative you can self-host. Strong on diffing modes and notification targets, with RSS output if you want to pipe changes elsewhere. Actively maintained and widely used.
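
The core of text-mode change detection fits in a short sketch: compare a hash of the previous and current snapshots, and produce a diff only when they differ. This uses Python's standard `difflib` and is not changedetection.io's implementation; `detect_change` and the sample snapshots are hypothetical.

```python
import difflib
import hashlib
from typing import List


def fingerprint(text: str) -> str:
    """Hash a snapshot so unchanged pages can be skipped cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_change(old: str, new: str) -> List[str]:
    """Return unified-diff lines, or an empty list if nothing changed."""
    if fingerprint(old) == fingerprint(new):
        return []
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    ))


# Made-up snapshots of a product page.
before = "Price: $10\nIn stock"
after = "Price: $8\nIn stock"
for line in detect_change(before, after):
    print(line)
```

In practice the snapshot would first be narrowed with an XPath or JSON filter, so that ads and timestamps elsewhere on the page do not trigger alerts.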

RSS-Bridge · 9K+ stars · PHP

A self-hosted PHP service that generates RSS, Atom, and JSON feeds for sites that lack them, using maintained per-site bridges.

Same core idea as RSSHub, generating feeds for sites that don't publish them, but written in PHP with its own set of bridges. Pick it if PHP fits your stack or it already has a bridge for your target; RSSHub has the larger route catalog. Still maintained.

urlwatch · 3.1K+ stars · Python

A configurable command-line tool that watches parts of webpages or command output and notifies you via email, Telegram, and other channels when something changes.

The CLI-first option for change monitoring, suited to running from cron and filtering down to the exact part of a page you care about. It can also watch command output, which the GUI-oriented tools can't do. Still maintained, though changedetection.io is the friendlier choice if you want a web UI.

Crawlers & Search

gpt-crawler · 22K+ stars · TypeScript

Crawls a site from a starting URL and bundles the content into a knowledge file for building a custom GPT.

Narrow and practical: point it at a docs site and get a file you can feed into a custom GPT. One of the most-starred projects in this category, though the last commits date to mid-2025. Reach for it when you want LLM context from a known site quickly, not when you need a general-purpose crawler.

katana · 17K+ stars · Go

A Go crawling and spidering framework with headless and JavaScript-aware modes.

Actively maintained and CLI-first, from the ProjectDiscovery team behind a wider security toolset. The headless mode handles JS-heavy sites that simpler crawlers miss. A reasonable default if you want to discover URLs and endpoints from the terminal.

rmax/scrapy-redis · 5.6K+ stars · Python

Redis-based components that give Scrapy a shared request queue for distributed crawling across workers.

The standard way to scale Scrapy horizontally: swap in the Redis scheduler and run many spiders against one shared queue. Only relevant if you are already on Scrapy and have outgrown a single process. Still maintained, which matters for plumbing this load-bearing.
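
Swapping in the Redis scheduler is a settings change rather than a code change. The fragment below shows the usual shape of it in a project's `settings.py`, based on the setting names in the scrapy-redis README; the Redis URL is an assumed local instance, and the exact options should be checked against the version you install.

```python
# settings.py (fragment) for a Scrapy project using scrapy-redis.

# Replace Scrapy's in-memory scheduler with the Redis-backed one.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all workers through a shared Redis set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Assumed local Redis instance; point this at your own server.
REDIS_URL = "redis://localhost:6379"
```

Every worker started with these settings pulls from the same queue, so scaling out means launching more processes with the same configuration.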

hakrawler · 5.1K+ stars · Go

A fast Go crawler for discovering endpoints, assets, and JavaScript sources in a web application.

Built for recon and bug-bounty work rather than data extraction, so it favors speed and link discovery over structured output. Commits have been quiet since late 2024, so check it against your targets. Useful as an attack-surface mapper; for content crawling, look elsewhere.

YaCy · 4K+ stars · Java

A decentralized peer-to-peer search engine with its own crawler and index, designed to run without a central server.

A long-running take on search that puts the crawler and index on your own hardware and shares results across a P2P network. The appeal is privacy and independence from commercial engines, not result quality at Google scale. For an intranet search appliance or a self-hosted index you control, it is one of the few real options.

Gerapy/Gerapy3.5K+ stars · Python

A distributed crawler management framework built on Scrapy, Scrapyd, Django, and Vue.js, with a web dashboard for deploying and monitoring spiders.

A management layer over Scrapyd: deploy, schedule, and watch spiders from a UI instead of the command line. Commits have been quiet since late 2024, so treat the stack versions with care. Worth a look if you run many Scrapy spiders and want a dashboard rather than building your own.

Heritrix3 3.2K+ stars · Java

The Internet Archive's open-source, extensible web crawler built for web-scale, archival-quality capture.

The crawler behind the Wayback Machine and many national web archives, which is the strongest possible credential for preservation work. It writes WARC and is built for fidelity and scale, not quick data extraction. If your goal is a faithful archive rather than parsed fields, start here.

apache/nutch3.2K+ stars · Java

An extensible, scalable open-source web crawler in Java, built to run on Hadoop.

A foundational large-scale crawler with deep roots in the lineage that produced Common Crawl. The Hadoop dependency makes it heavyweight, so it pays off at genuine web scale and feels like overkill below that. Mature and still maintained, but the operational cost is real.

spatie/crawler2.8K+ stars · PHP

A concurrent PHP crawler library built on Guzzle that can execute JavaScript via headless Chrome.

A widely used crawling library for the PHP ecosystem, from a team with a long track record of maintained packages. Concurrency via Guzzle plus optional headless Chrome covers most site-crawl needs. If you are building in PHP, this is the obvious base rather than rolling your own.

spider-rs/spider2.5K+ stars · Rust

A low-latency Rust web crawler and data collector with headless rendering and LLM-ready output.

A leading Rust option for crawling, built for high concurrency and used as the engine behind spider.cloud. Headless Chrome support and structured output make it a fit for AI data pipelines as well as plain link discovery. Reach for it when raw throughput matters and you are comfortable in Rust.

news-please2.5K+ stars · Python

An integrated Python crawler and extractor that pulls structured article text and metadata from news sites.

Purpose-built for news: it combines crawling with article extraction, so you get clean text and metadata without wiring up a separate parser. Common Crawl News integration makes it useful for building large article datasets. The narrow focus is the point; for general crawling it is the wrong tool.

pywb1.7K+ stars · JavaScript

A Python web archiving toolkit for recording and replaying WARC and WACZ web archives.

The core replay engine behind many self-hosted web archives, from the Webrecorder project. It serves and records archives rather than discovering content, so it pairs with a crawler rather than replacing one. If you need to host or replay WARC/WACZ collections, this is the reference implementation.

mvdbos/php-spider1.3K+ stars · PHP

A configurable, extensible PHP web spider with depth- and breadth-first traversal, URL filtering, and pluggable discovery and persistence.

A lower-level, customizable spider for PHP where you wire up your own discovery and persistence rules. It gives you more control than a batteries-included crawler but expects more setup. A reasonable pick for bespoke PHP crawls, though spatie/crawler is the more active default for most projects.
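
The difference between the two traversal orders, and where a URL filter sits, can be shown in a few lines. This is a language-neutral sketch written in Python, not php-spider's API; the `LINKS` graph, the `crawl` function, and the `/admin` filter are all hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical in-memory link graph standing in for real pages.
LINKS: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/admin"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/b/1": [],
    "/admin": [],
}


def crawl(start: str, breadth_first: bool, allow: Callable[[str], bool]) -> List[str]:
    """Return URLs in visit order, skipping any the filter rejects."""
    frontier = deque([start])
    seen = {start}
    order: List[str] = []
    while frontier:
        # A queue (popleft) gives breadth-first; a stack (pop) gives depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen and allow(link):
                seen.add(link)
                frontier.append(link)
    return order


def no_admin(url: str) -> bool:
    """URL filter: reject anything under /admin."""
    return not url.startswith("/admin")


bfs = crawl("/", True, no_admin)
dfs = crawl("/", False, no_admin)
print(bfs)  # ['/', '/a', '/b', '/a/1', '/b/1']
print(dfs)  # ['/', '/b', '/b/1', '/a', '/a/1']
```

Only the pop side of the frontier changes between the two modes; discovery and filtering stay the same, which is the kind of seam a pluggable spider exposes.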

Data Parsing

MarkItDown153K+ stars · Python

A Python tool from Microsoft for converting files and Office documents (HTML, PDF, Word, Excel, PowerPoint) to Markdown for LLM ingestion.

The convenient default when you need to flatten a pile of mixed file formats into Markdown for a model. It is broad rather than deep, so for heavy PDF and table work Docling preserves more structure, but for everyday document-to-text prep this is the path of least resistance.

Docling61K+ stars · Python

A document parsing toolkit that converts PDF, HTML, and DOCX into structured Markdown or JSON for RAG and LLM pipelines.

The stronger choice when document layout actually matters, with more attention to PDF structure and tables than the lighter converters. Reach for it when MarkItDown's output is too lossy for your RAG pipeline. Heavier to run, but that is the cost of preserving structure.

Cheerio30K+ stars · TypeScript

Fast, flexible jQuery-like HTML parser for Node.js.

The Node.js equivalent of Beautiful Soup. Incredibly fast for server-side HTML parsing. Pairs perfectly with Crawlee or raw HTTP requests.

codelucas/newspaper15K+ stars · Python

newspaper3k, a Python 3 library for extracting full text, article metadata, and news content from web pages.

A long-standing pick for building news corpora and pulling article bodies plus metadata in one call. It still gets commits, but the extraction core shows its age against newer alternatives. For pure article-text accuracy on modern pages, trafilatura tends to do better.

goquery14K+ stars · Go

A Go library that brings jQuery-style HTML parsing and selection to Go.

The de facto document-querying layer under most Go scrapers. If you know jQuery selectors, you already know the API. There is no real competitor in the Go ecosystem, so this is simply what you use.

jsoup11K+ stars · Java

A Java HTML parser with DOM traversal, CSS-selector extraction, and HTML cleaning for XSS safety.

The canonical HTML parser for the JVM and the default building block for nearly all Java scraping. Mature, well-documented, and still actively maintained. If you are parsing HTML in Java, the decision is already made.

mozilla/readability11K+ stars · JavaScript

A standalone JavaScript version of the article-extraction algorithm behind Firefox's Reader View.

The reference implementation for pulling clean main-content text out of a cluttered page, and the one many other extractors are measured against. Battle-tested through Firefox, though the standalone repo moves slowly. Solid when you are already in a JS or browser context.

nokogiri6.3K+ stars · C

A libxml2-backed Ruby library for parsing HTML and XML with XPath and CSS selectors.

The backbone of Ruby scraping and the parser nearly every Ruby HTTP client hands off to. Fast because it leans on libxml2, with the usual native-extension build friction that comes with that. For HTML and XML work in Ruby there is no serious alternative.

trafilatura6.1K+ stars · Python

A Python library and command-line tool for gathering text and metadata from web pages and feeds, with output to CSV, JSON, HTML, Markdown, TXT, and XML.

One of the more accurate options for stripping boilerplate and recovering clean article text, which is why it shows up so often in corpus-building and LLM data-prep work. The CLI makes it easy to script over large crawls. If you want article text rather than full DOM access, start here.

AngleSharp5.5K+ stars · C#

A .NET library that parses HTML5, MathML, SVG, and CSS into a W3C-spec DOM, queryable with LINQ and CSS selectors.

A standards-faithful HTML parser for .NET and a common alternative to HtmlAgilityPack. Building a real DOM rather than a loose tree makes selector and LINQ queries predictable. A sensible default for new C# scraping work.

htmlparser2 4.8K+ stars · TypeScript

A fast, forgiving streaming HTML and XML parser for Node.js.

The low-level engine under Cheerio and a good chunk of the Node scraping stack. Streaming and tolerant of malformed markup, so it holds up on large or broken documents. Most people meet it through Cheerio; use it directly when you want speed and control over the parse.
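
Streaming here means the parser emits events as chunks arrive rather than building a tree first. A rough sketch of that style follows, using Python's standard-library `html.parser` as a stand-in. It is not htmlparser2's API; the `LinkCollector` class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values as tags stream past; no document tree is built."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


parser = LinkCollector()

# Feed the document in arbitrary chunks, as it might arrive off a socket.
# The first chunk ends mid-tag, and the markup never closes its <li> or
# second <a> elements; the parser tolerates both.
chunks = ['<ul><li><a hr', 'ef="/one">One</a>', '<li><a href="/two">Two</ul>']
for chunk in chunks:
    parser.feed(chunk)
parser.close()

print(parser.links)  # ['/one', '/two']
```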

pipet4.7K+ stars · Go

A Go command-line tool for scraping and extracting data from web pages and JSON using HTML, CSS, and JSON selectors.

A scriptable CLI for quick extraction jobs, easy to wire into cron for diff-and-notify monitoring. Useful for one-off and shell-driven work rather than as a library foundation. Note that development has been quiet since late 2024, so treat it as stable but not actively evolving.

goose (python-goose)4.1K+ stars · HTML

A Python library that extracts the main article body, title, and lead image from HTML pages.

An older article-extraction library, the Python port of the original Goose project. It still does the basic job of recovering body text and the lead image, but it sees little active development and newer extractors like trafilatura are generally more accurate on today's pages.

parse5 3.9K+ stars · TypeScript

A WHATWG HTML5 spec-compliant HTML parsing and serialization toolset for Node.js.

The standards-correct tree builder that jsdom and other tooling lean on, matching the HTML Living Standard closely. You rarely use it directly; you use it through whatever depends on it. Reach for it when spec fidelity matters more than convenience.

html-to-markdown3.7K+ stars · Go

A Go library and CLI that converts HTML into clean Markdown, with rule-based extensibility and support for entire websites.

A common HTML-to-Markdown converter for Go, used in LLM data-prep pipelines that need readable text out of raw pages. The rule system lets you tune the output when the defaults are not quite right. A sensible building block if your stack is already Go.

lxml3K+ stars · Python

High-performance XML and HTML processing library for Python.

When Beautiful Soup is too slow. C-backed, XPath support, handles malformed HTML. The performance choice for heavy parsing workloads.

Beautiful SoupN/A stars · Python

Python library for pulling data out of HTML and XML files.

Every Python developer's first scraping tool. Simple, well-documented, battle-tested. For parsing, not crawling.

MCP Servers

MCP Servers (reference)87K+ stars · TypeScript

The official Model Context Protocol monorepo of reference servers, including the canonical Fetch server (URL to Markdown) and a Puppeteer browser-automation server.

The reference implementation and by far the most-starred MCP repo, so it is the right place to see how a server should be built. The web-data pieces (Fetch, Puppeteer) are deliberately minimal starting points, not production scrapers. Reach for a dedicated server like Firecrawl or Playwright MCP once you outgrow them.

Chrome DevTools MCP43K+ stars · TypeScript

The Chrome team's MCP server that exposes Chrome DevTools to coding agents for browsing, performance tracing, and debugging live web apps.

Built by Google's Chrome team, which makes it the credible choice when you want an agent to inspect a live, running page and its runtime state. The DevTools angle (performance traces and debugging) is what separates it from a generic browser-control server. Pick it for diagnosing a live site, not bulk scraping.

Playwright MCP33K+ stars · TypeScript

Microsoft's official MCP server that lets agents drive a browser through structured accessibility-tree snapshots instead of screenshots.

The de facto standard for giving an agent real browser control, backed by Microsoft and the Playwright project. The accessibility-tree approach is the key design choice: it feeds the model structured page state instead of pixels, which is cheaper and more reliable than vision. If you need an agent to click through a real site, start here.

Chrome MCP Server11K+ stars · TypeScript

A Chrome extension-based MCP server that exposes your real logged-in browser to AI assistants for automation, content analysis, and semantic search.

Works through your existing browser, so the agent inherits your real sessions and cookies instead of spinning up a fresh headless instance. That helps with tasks behind a login, with the obvious tradeoff that you are handing an agent your authenticated browser. Commits have been quiet since early 2026, so check current activity before depending on it.

Browser MCP6.7K+ stars · TypeScript

An MCP server that connects AI applications to your existing local browser through an extension, reusing real sessions and cookies.

Same core idea as the other extension-based servers: drive your real browser to keep logged-in sessions and look less like a bot. The repo has seen little public commit activity since spring 2025, so treat it as stable-but-stale and verify it works against current browser versions before building on it.

Firecrawl MCP Server6.6K+ stars · JavaScript

The official Firecrawl MCP server that adds web scraping, crawling, and search tools to Cursor, Claude, and other MCP clients.

The cleanest way to put Firecrawl's scraping and crawling behind an agent, and the natural pick if you already use Firecrawl in your data pipelines. It leans on the hosted Firecrawl API, so it is less a self-contained tool than a bridge to that service. That is the point if you want JS rendering and crawl handling done for you.

Markdownify MCP2.7K+ stars · TypeScript

An MCP server that converts web pages, PDFs, images, and documents into clean Markdown for LLM ingestion.

A focused conversion layer: point it at almost any file or URL and get back Markdown an LLM can read, OCR included. Useful at the front of a RAG pipeline when your sources are messy formats rather than clean HTML. It does one job, so pair it with a real fetcher or crawler for anything beyond conversion.

Tavily MCP2.1K+ stars · JavaScript

Tavily's official MCP server providing agents with real-time search, extract, map, and crawl tools tuned for LLM consumption.

The official path to Tavily's search and extraction API, which is built for agent and RAG use rather than general web search. It is an API wrapper, so you are signing up for Tavily's service and key. Worth it if you want search results already shaped for an LLM instead of raw SERP HTML.

Open-WebSearch MCP1.4K+ stars · TypeScript

A multi-engine MCP server, CLI, and local daemon that runs agent web search across engines like DuckDuckGo, Bing, and Brave with no API keys.

The no-API-key pitch is the draw: it scrapes several search engines directly, so you can give an agent search without signing up for a paid service. That convenience comes with the usual fragility of engine scraping, where layout changes and rate limits break things. Fine for hobby and local use, less so for anything you need running unattended.

Apify Actors MCP Server1.3K+ stars · TypeScript

Apify's MCP server that exposes thousands of Apify Actors (ready-made scrapers and crawlers) as callable tools for AI agents.

Turns the Apify Actor catalog into agent-callable tools, so instead of writing a scraper you point the model at an existing Actor for social media, maps, search, or e-commerce. The value is tied to Apify's platform and pricing. Reach for it when you want breadth of ready-made scrapers over building and hosting your own.

DuckDuckGo MCP Server1.2K+ stars · Python

A lightweight MCP server providing DuckDuckGo web search plus page-content fetching, with no API key required.

A small, single-engine alternative to the multi-engine search servers: DuckDuckGo results plus basic page fetching, no key needed. Good when you want the simplest possible search tool for an agent and can live with one engine's coverage and the fragility of scraping it.

Browser-Use MCP Server823 stars · Python

An MCP server that wraps the browser-use agent in Docker (with a VNC view) so any MCP client can run autonomous browser tasks.

A bridge that puts the browser-use autonomous agent behind MCP, with Docker packaging and a VNC view so you can watch what it does. Pick it when you want goal-driven browsing rather than the step-by-step control a Playwright MCP gives. The autonomy is both the feature and the risk.

Fetch MCP781 stars · TypeScript

A flexible HTTP fetching MCP server that returns web content as HTML, JSON, plain text, or Markdown with custom headers.

A more configurable take on the reference Fetch server: choose your output format and set custom headers, which matters when a site needs specific request handling. It is plain HTTP fetching with no JS rendering, so it suits static pages and APIs. For client-rendered sites you will still need a real browser server.

Proxy & Networking

mitmproxy43K+ stars · Python

An interactive, TLS-capable intercepting HTTP/HTTPS proxy with a scriptable Python API.

The reference tool for seeing exactly what a site sends and receives over the wire, and for rewriting that traffic on the fly. If you need to reverse-engineer an undocumented API or debug why a scraper is being blocked, start here. It is an inspection and debugging tool, not a production proxy layer.
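
To give a feel for the scriptable side, here is a minimal addon sketch, assuming mitmproxy's module-level `request` and `response` event hooks. The header name `X-Scraper-Debug` and the choice of status codes to flag are illustrative assumptions, not anything mitmproxy prescribes.

```python
# Save as add_header.py and load it with: mitmdump -s add_header.py


def request(flow):
    """Hook called once per client request, before it is forwarded upstream."""
    # Made-up header, used only to show in-flight rewriting.
    flow.request.headers["X-Scraper-Debug"] = "1"


def response(flow):
    """Hook called once per server response; flag statuses that suggest blocking."""
    if flow.response.status_code in (403, 429):
        print(f"possible block: {flow.response.status_code} {flow.request.pretty_url}")
```

Pointing a scraper at the proxy while this script runs shows which requests start coming back with 403 or 429, which is often the quickest way to see when blocking begins.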

ProxyBroker4.2K+ stars · Python

An async Python finder, checker, and server for free public HTTP(S) and SOCKS proxies, including a rotating proxy server mode.

A well-known classic for assembling free proxy pools, but no longer actively maintained, with no commits since early 2024. Free public proxies are slow and unreliable by nature, so treat this as a reference rather than something to build a pipeline on. For maintained equivalents, look at proxy-scraper-checker or mubeng.

Scylla4K+ stars · Python

A self-hosted proxy pool that scrapes, validates, and serves free proxies through a local HTTP API.

Run it locally and point your scraper at its API to pull validated free proxies on demand. The honest limit is the one every free-proxy pool faces: the underlying IPs are public, churn constantly, and won't survive serious anti-bot defenses. Fine for low-stakes crawling, not for anything you depend on.

proxy.py3.5K+ stars · Python

A lightweight, zero-dependency, pluggable HTTP/HTTPS proxy server framework in Python with TLS interception.

Reach for this when you want to build a custom proxy rather than configure an existing one. The plugin model lets you intercept, inspect, or rewrite requests in your own Python, and the zero-dependency footprint keeps it easy to embed. More a framework to build on than a turnkey proxy.

mubeng2.1K+ stars · Go

A Go proxy checker and IP rotator that can run as a rotating proxy server in front of a scraper.

Validate a proxy list, then put mubeng in front of your scraper as a single rotating endpoint. Running it as a local rotator keeps your scraping code unaware of the pool behind it, and the Go implementation makes it quick. A focused tool for the rotate-and-check job.

proxy-scraper-checker1.3K+ stars · Rust

An async Rust tool that scrapes and checks HTTP, SOCKS4, and SOCKS5 proxies with filtering and flexible output.

An actively maintained, Rust-based take on the same job ProxyBroker did: build and prune a free proxy list quickly, with richer filtering and output options. Good for that workflow, though the usual caveat holds: public proxies are only as good as the sources behind them.

proxy-agents1.1K+ stars · TypeScript

A Node.js monorepo of HTTP, HTTPS, and SOCKS proxy agents, including https-proxy-agent.

The de facto way to route Node requests through a proxy, and a dependency you are probably already pulling in transitively. When you need an HTTP client or headless browser to honor a proxy, this is the layer that does it. Plumbing, not a product, and that is the point.

proxy-chain1K+ stars · JavaScript

A Node.js proxy server with SSL, HTTP/HTTPS, SOCKS5, authentication, and upstream proxy chaining.

Maintained by Apify and used inside its scraping stack, including Crawlee, so it is exercised in real workloads. The draw is chaining and upstream auth: wrap an authenticated upstream proxy in a local endpoint a headless browser can use cleanly. The natural pick if you are already in the Apify or Crawlee world.

scrapy-rotating-proxies773 stars · Python

A Scrapy downloader middleware that rotates requests across a list of proxies and bans dead ones.

The standard answer for proxy rotation inside a Scrapy project. Drop in your proxy list and it handles rotation and ban detection without changing your spiders. If you are already on Scrapy, there is little reason to roll your own.
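
A sketch of the "drop in your proxy list" step, based on the settings names in the project's README. The proxy addresses are placeholders, and the middleware priorities shown are the README's suggested values rather than requirements.

```python
# settings.py fragment (sketch). Proxy addresses below are placeholders.
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # Picks a live proxy for each outgoing request.
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    # Marks proxies as dead when responses look like bans.
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

Spiders themselves need no changes; rotation happens in the downloader middleware layer.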

Scraping Frameworks

Scrapling63K+ stars · Python

Adaptive web scraping framework with smart element tracking, anti-bot bypass, and stealth browser mode.

Hit 31K stars within months of launch. The adaptive selector engine, which finds elements even after a site redesign, is something no other framework does. Three fetcher modes, MCP server, BSD-3 licensed.

Scrapy62K+ stars · Python

Fast, high-level web crawling and scraping framework for Python.

The OG Python scraping framework. Mature ecosystem, steep learning curve, but nothing matches it for large-scale structured crawling pipelines.

Colly25K+ stars · Go

Elegant scraping framework for Go with a clean callback API.

If you're a Go shop, this is the most established option. Fast, concurrent, and well-maintained, though its ecosystem is limited compared to Scrapy's.

Crawlee23K+ stars · TypeScript

Web scraping and browser automation library for Node.js.

From the Apify team. Best-in-class for JavaScript scraping with built-in Playwright and Cheerio support. The TypeScript-first approach is refreshing.

pyspider16K+ stars · Python

A distributed Python web crawler system with a web UI, scheduler, script editor and result viewer.

For years this was the GUI-driven alternative to Scrapy, with a built-in scheduler and result viewer that made it approachable. It is now archived and no longer maintained, so treat it as a reference rather than a base for new work. Reach for Scrapy or Crawlee instead.

requests-html13K+ stars · Python

A Pythonic HTML parsing layer over requests with CSS and XPath selectors plus JavaScript rendering via pyppeteer.

Kenneth Reitz's attempt to give requests human-friendly HTML parsing, bundling lxml, pyquery and JS rendering behind one API. It has not seen meaningful updates in years, and the pyppeteer dependency it leans on for JS is itself stale. Fine for quick one-off scripts, but use Playwright or Crawlee for anything you have to maintain.

WebMagic11K+ stars · Java

A scalable, modular web crawler framework for Java built around a downloader, scheduler and pipeline architecture.

Essentially Scrapy's architecture rebuilt for the JVM, which makes it the obvious starting point if your stack is already Java. Still maintained, and the component model keeps it readable. If you are not tied to Java, the Python and Node ecosystems move faster.

Crawlee for Python9.2K+ stars · Python

Apify's Python crawling framework that unifies HTTP and headless-browser scraping with auto-scaling, proxy rotation and request queues.

The Python port of Apify's Crawlee, and one of the more credible modern challengers to Scrapy. It works across raw HTTP, BeautifulSoup, Parsel and Playwright behind one API, with request queues and proxy rotation handled for you, and it is explicitly pitched at feeding AI and RAG pipelines. Actively developed and a sensible default for a new crawler today.

AutoScraper7.2K+ stars · Python

A lightweight Python scraper that learns extraction rules from example data you provide.

You hand it a sample of what you want and it infers the selectors, which removes the most tedious part of writing a scraper by hand. The tradeoff is fragility: example-driven rules break when page structure shifts, and it does no JS rendering. Best for quick extraction on stable pages, not durable production pipelines.
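
The idea of learning a rule from an example, and why it is fragile, can be sketched with the standard library alone. This is not AutoScraper's actual algorithm or API; `PathRecorder`, `learn_rule`, `apply_rule`, and the sample markup are hypothetical, and the "rule" is simply the tag-and-class path to the example text.

```python
from html.parser import HTMLParser


class PathRecorder(HTMLParser):
    """Record each text node together with the (tag, class) path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, class) pairs
        self.texts = []  # (path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.texts.append((tuple(self.stack), data.strip()))


def text_paths(html):
    recorder = PathRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.texts


def learn_rule(html, example):
    """Return the path of the first text node equal to the example value."""
    for path, text in text_paths(html):
        if text == example:
            return path
    raise ValueError("example not found in page")


def apply_rule(html, rule):
    """Return every text node that sits at the learned path."""
    return [text for path, text in text_paths(html) if path == rule]


# Made-up sample page for illustration.
page = (
    '<ul><li><span class="name">Widget</span><span class="price">$5</span></li>'
    '<li><span class="name">Gadget</span><span class="price">$7</span></li></ul>'
)
rule = learn_rule(page, "$5")
prices = apply_rule(page, rule)
print(prices)  # ['$5', '$7']

# A small redesign (renamed class) is enough to make the learned rule miss.
redesigned = page.replace('class="price"', 'class="cost"')
print(apply_rule(redesigned, rule))  # []
```

One example value is enough to recover every price on the original page, and one renamed class is enough to recover none, which is the tradeoff described above.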

node-crawler6.8K+ stars · TypeScript

A Node.js web crawler with server-side jQuery via Cheerio, plus built-in rate limiting, retries and request queueing.

The actively maintained successor to the original sylvinus/node-crawler, with the queueing and rate-limiting plumbing you would otherwise write yourself. The Cheerio integration is comfortable if you already think in jQuery selectors. For larger or browser-heavy jobs, Crawlee is the broader option.

Ferret6K+ stars · Go

A Go-based declarative data extraction engine with its own FQL query language covering both static and browser-rendered pages.

Ferret's bet is its own query language, FQL, which hides whether a page is static HTML or needs a real browser, so you write one declarative query for both. That is an unusual and genuinely useful abstraction if you can commit to the DSL. Actively maintained, and a good fit for Go teams that want extraction logic kept out of imperative code.

x-ray5.9K+ stars · JavaScript

A declarative Node.js scraper with a composable selector DSL that follows pagination and streams results to files or databases.

Its composable, selector-driven API makes simple structured scrapes very concise, including pagination following and streaming output. It predates the browser-automation era and does no JS rendering, so it suits server-rendered pages. Still seeing commits, but pair it with a headless browser when sites need one.

MechanicalSoup4.9K+ stars · Python

Python library for automating interaction with websites.

Simple alternative to Scrapy for small projects. Wraps requests + BeautifulSoup. Perfect for quick scripts, not for production pipelines.

Botasaurus4.8K+ stars · Python

An all-in-one Python scraping framework with built-in anti-detection features aimed at bypassing Cloudflare and similar bot mitigation.

Where most frameworks treat anti-bot evasion as someone else's problem, Botasaurus builds a humanized, stealthed driver in as a first-class feature. That makes it the obvious reach when Cloudflare is the actual blocker rather than the parsing. As with all evasion tooling, expect breakage as detection vendors update, so treat the bypass as a moving target.

Mechanize4.4K+ stars · Ruby

A Ruby library that automates stateful website interaction including forms, links, cookies and history.

A long-lived staple for Ruby scraping, strongest when the work is stateful: logging in, submitting forms, following links while it tracks cookies for you. It does not run JavaScript, so single-page apps are out of scope. Still maintained, and the natural choice if Ruby is your language.

scrapy-splash3.2K+ stars · Python

Scrapy integration for the Splash JavaScript-rendering headless browser service.

The classic way to render JavaScript pages inside a Scrapy pipeline, plugging Splash in as the rendering backend. It still receives occasional maintenance, but Splash itself has fallen out of favour against Playwright-based renderers. Useful if you already run Splash; for new Scrapy projects, look at the Playwright integration first.

Geziyor2.8K+ stars · Go

A concurrent Go web crawling and scraping framework with JS rendering, caching and Scrapy-like middleware pipelines.

It brings Scrapy's middleware-and-pipeline model to Go, adding JS rendering and caching, so Go teams get a familiar mental model with native concurrency. A reasonable choice when you want Scrapy's structure without leaving the language. Less actively developed than Ferret, another notable Go option here.

node-website-scraper1.7K+ stars · JavaScript

A Node.js tool that downloads an entire website to a local directory, including CSS, images and JS so pages render offline.

This is a site-mirroring tool, not a data-extraction framework: it pulls down whole pages with their assets for offline use, with a plugin system for rewriting links and filtering resources. Reach for it when you need a faithful local copy of a site rather than structured fields out of it. Actively maintained and well-scoped for that job.

Roach1.5K+ stars · PHP

A complete Scrapy-inspired web scraping toolkit for PHP with spiders, middleware and item pipelines.

Roach ports Scrapy's spider, middleware and pipeline model to PHP, and is framework-agnostic with a Laravel adapter for teams already there. It fills a real gap, since PHP has had little in this class. The clear pick if you want to keep scraping inside an existing PHP or Laravel codebase.