Web Scraper

A web scraper is a program that extracts specific data from web pages. Where a crawler discovers URLs, a scraper takes URLs (or raw HTML) and pulls out the fields you care about: product price, article body, structured listing data, or whatever the target schema requires. Traditional scrapers use CSS selectors or XPath expressions to navigate the DOM and extract data. Modern scrapers increasingly use LLMs to extract structured output from unstructured pages, which is more robust against site redesigns but expensive at scale.

The technical surface of a scraper covers three concerns: fetching the page (plain HTTP, headless browser, or scraping API), parsing the HTML (Beautiful Soup, Cheerio, lxml, or an LLM), and persisting the result (CSV, database, vector store). Every scraper makes tradeoffs across all three. Cheap, fast scrapers fetch over plain HTTP and parse with selectors; expensive, robust scrapers fetch through a headless browser, parse with an LLM, and self-heal when the layout changes.

For AI builders, the central decision is build versus buy. Tools like Firecrawl, ScrapeGraphAI, Diffbot, and Kadoa sell the entire pipeline as a service with LLM-based extraction. DIY frameworks like Scrapy and Crawlee give you control but require ongoing maintenance. The right choice depends on volume, target stability, and how much engineering time you can spend on the data layer.
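
To make the three concerns (fetch, parse, persist) concrete, here is a minimal sketch of the cheap end of the spectrum: plain HTTP plus selector-style parsing, written to CSV. It uses only the Python standard library, so `html.parser` stands in for Beautiful Soup or lxml. The sample HTML, the class names `product`, `name`, and `price`, the `example-scraper/0.1` user agent, and the helper names `fetch`, `parse`, and `persist` are all assumptions made for illustration, not part of any particular tool.

```python
import csv
import io
import urllib.request
from html.parser import HTMLParser

# Hypothetical page fragment; the class names are assumptions for this sketch.
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">$9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">$24.50</span></div>
</body></html>
"""


def fetch(url, timeout=10.0):
    """Concern 1: fetch over plain HTTP (no JavaScript rendering)."""
    req = urllib.request.Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class ProductParser(HTMLParser):
    """Concern 2: collect the name and price spans inside each product div."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "product" in classes:
            self.rows.append({"name": "", "price": ""})
        elif tag == "span" and self.rows:
            for field in ("name", "price"):
                if field in classes:
                    self._field = field

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] += data.strip()


def parse(html):
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def persist(rows, out):
    """Concern 3: write the extracted rows as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)


# Parse the embedded sample instead of calling fetch(), so the sketch runs offline.
rows = parse(SAMPLE_HTML)
buf = io.StringIO()
persist(rows, buf)
```

In a real run, `parse(fetch(url))` would replace the embedded sample. The parser is tied to the assumed class names, which is exactly the fragility that a layout change exposes and that LLM-based extraction trades cost to avoid.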
