Glossary
Key terms in AI search, web data, and the infrastructure that powers LLMs and AI agents.
Agentic extraction is an approach to web data collection where an AI model actively navigates, interprets, and extracts information from…
Agentic web extraction is the broad category of web data collection where an AI agent – not a hand-written scraper – decides what to…
An AI agent is a software system built around a language model that can autonomously plan, execute multi-step tasks, and interact with…
An AI data pipeline is the end-to-end system that collects, processes, and delivers external data to an AI application.
AI Overview is Google's name for the AI-generated summary that appears at the top of certain search result pages.
An AI search API is a web service that lets applications query the internet and receive results optimized for consumption by large…
Anti-bot detection is the layer of defenses websites use to identify and block automated traffic.
Browser fingerprinting is the technique of identifying a browser by combining many small, individually unremarkable signals into a…
CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart.
Client-side rendering is the pattern where the server ships a near-empty HTML shell and the browser's JavaScript renders the actual page…
The context window is the maximum amount of text – measured in tokens – that a language model can process in a single request, including…
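Because the window is a hard token budget, applications routinely trim older conversation turns to make room for the newest input. A minimal sketch of that trimming, using a crude whitespace token estimate (real tokenizers such as BPE split text differently, so `estimate_tokens` here is only illustrative):

```python
def estimate_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())

def fit_to_window(messages, max_tokens):
    """Drop the oldest messages until the conversation fits the token budget."""
    kept = []
    total = 0
    for msg in reversed(messages):  # walk newest-first
        cost = estimate_tokens(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order

history = ["first long message here", "second message", "latest question"]
print(fit_to_window(history, max_tokens=5))  # ['second message', 'latest question']
```

Walking newest-first ensures the most recent turns always survive the cut, which is the usual priority when a conversation outgrows the window.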
Crawl budget is the term for the maximum number of pages a search engine or other crawler will fetch from a given site within a time window.
A CSS selector is a string syntax for matching elements in an HTML document, originally designed for stylesheets but widely used for DOM…
Data freshness refers to how current the information is that an AI system has access to when generating responses.
A datacenter proxy is an IP address allocated to a server in a commercial cloud or hosting provider.
Embeddings are numerical vector representations of text (or images, audio, and other data) produced by neural networks.
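The practical property of embeddings is that semantic similarity becomes geometric closeness, typically measured with cosine similarity. A minimal sketch with toy 4-dimensional vectors (real models emit hundreds or thousands of dimensions; these values are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vectors' magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: related concepts end up near each other.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.85, 0.15, 0.05, 0.25]
invoice = [0.0, 0.8, 0.6, 0.1]

print(cosine_similarity(cat, kitten))   # close to 1.0: similar meaning
print(cosine_similarity(cat, invoice))  # much lower: unrelated meaning
```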
Generative Engine Optimization, or GEO, is the practice of optimizing content to be cited and synthesized by AI search and AI Overview…
Grounding is the practice of anchoring a language model's output to verifiable external sources.
Hallucination prevention encompasses the techniques and system design patterns used to reduce the rate at which AI models generate…
A headless browser is a real web browser – usually Chromium, Firefox, or WebKit – running without a graphical user interface.
A honeypot trap is a hidden element placed on a web page specifically to catch automated clients.
An HTML parser is a library that turns raw HTML text into a navigable tree structure – a DOM or DOM-like object – so your code can query…
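Python's standard library ships a streaming parser, `html.parser.HTMLParser`, which fires callbacks as it walks the markup. A small sketch that collects every link from a fragment (the HTML here is invented for illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag as the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<ul><li><a href="/docs">Docs</a></li><li><a href="/blog">Blog</a></li></ul>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/docs', '/blog']
```

Tree-building libraries such as lxml or BeautifulSoup layer query interfaces (CSS selectors, XPath) on top of this kind of event stream.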
JavaScript rendering, often abbreviated JS rendering, refers to executing the JavaScript on a web page so the resulting DOM contains the…
LLM-ready data is web content that has been cleaned, structured, and formatted for direct consumption by a large language model.
LLM-ready Markdown is the specific output format that modern scraping APIs produce for AI pipelines: a single Markdown document per…
The llms.txt file is a proposed standard for sites to publish a Markdown-formatted summary of their most important content for…
The Model Context Protocol, or MCP, is an open standard introduced by Anthropic in late 2024 that defines how AI models connect to…
A mobile proxy routes requests through a real mobile carrier IP – typically a 4G or 5G connection on a phone or USB modem.
Online-Mind2Web is the live-web extension of the Mind2Web benchmark, introduced to evaluate web agents on real, public websites rather…
OSWorld is a benchmark for computer-use agents, released in 2024 by researchers from the University of Hong Kong, Salesforce Research…
A proxy pool is a managed collection of proxy IP addresses that a scraper or scraping platform draws from to spread requests.
Rate limiting is the practice of capping how many requests a client can make to a server within a time window.
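A common implementation on both sides of the wire is the token bucket: tokens refill at a steady rate, each request spends one, and bursts are capped by the bucket's capacity. A minimal client-side sketch (the rate and capacity values are illustrative):

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, capacity=2)
print([bucket.allow() for _ in range(3)])  # burst of 2 allowed, third denied
```

Polite scrapers apply the same throttle to themselves so they stay under a site's limits instead of tripping them.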
Real-time web access is the capability of an AI system to retrieve current information from the internet at the moment a user asks a…
A residential proxy is an IP address assigned by a consumer internet service provider to a real home or mobile device, then routed…
Retrieval-augmented generation, or RAG, is an architecture pattern where a language model's response is informed by external documents…
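The core shape of RAG is retrieve-then-prompt: fetch the documents most relevant to the question, then hand them to the model as context. A toy sketch where naive keyword overlap stands in for real vector retrieval (the documents and question are invented for illustration):

```python
# A tiny in-memory corpus; a production system would use a vector index.
documents = {
    "refunds": "Refunds are processed within 5 business days.",
    "shipping": "Orders ship from our warehouse within 24 hours.",
}

def retrieve(question):
    # Naive word-overlap scoring as a stand-in for embedding similarity.
    words = set(question.lower().split())
    return max(documents.values(), key=lambda d: len(words & set(d.lower().split())))

def build_prompt(question):
    context = retrieve(question)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("How fast are refunds processed?")
print(prompt)
```

The model then answers from the supplied context rather than from its parametric memory alone, which is what makes the response groundable and citable.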
The robots.txt file is a plain-text file at the root of a web domain that declares which paths automated agents are allowed or…
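Python's standard library can evaluate these rules directly via `urllib.robotparser`. A minimal sketch with a hypothetical robots.txt body and bot name:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body as a site might serve it (rules invented for illustration).
rules = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("MyBot/1.0", "https://example.com/products"))     # True
print(parser.can_fetch("MyBot/1.0", "https://example.com/admin/users"))  # False
```

A well-behaved crawler checks `can_fetch` before every request; robots.txt is advisory, but honoring it is the baseline of polite automated access.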
A rotating proxy is a proxy service that automatically swaps the outbound IP address on every request, or on a configured time interval.
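The rotation logic itself can be as simple as cycling through a pool. A sketch of the idea with hypothetical proxy endpoints (a real service manages thousands of IPs and handles health checks and retirement for you):

```python
from itertools import cycle

# Hypothetical pool of proxy endpoints.
proxies = cycle([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
])

def fetch(url):
    proxy = next(proxies)  # a different outbound IP on every request
    # A real client would route the HTTP request through `proxy` here.
    return proxy

used = [fetch("https://example.com") for _ in range(4)]
print(used)  # the fourth request wraps around to the first proxy again
```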
Search-augmented generation is a specific form of RAG where the retrieval step queries a live search engine rather than a static…
Semantic search is a retrieval method that finds results based on the meaning of a query rather than exact keyword matches.
A SERP API is a service that programmatically retrieves search engine results pages and returns the data in a structured format –…
Server-side rendering is the pattern where the server generates the full HTML for a page before sending it to the client.
Stealth mode, in the context of web automation, refers to a set of patches applied to headless browsers to hide the fact that they are…
Structured output refers to an LLM's ability to generate responses in a specific, machine-readable format – typically JSON, but also…
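On the consuming side, structured output is only useful if the application parses and validates it before trusting it. A minimal sketch, with a hypothetical model reply and field list:

```python
import json

# A model reply requested in JSON form (content invented for illustration).
raw_reply = '{"product": "Widget", "price": 19.99, "in_stock": true}'

def parse_structured(reply, required_fields):
    """Parse a model's JSON reply and verify the fields the caller depends on."""
    data = json.loads(reply)
    missing = [f for f in required_fields if f not in data]
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return data

record = parse_structured(raw_reply, ["product", "price"])
print(record["product"], record["price"])  # Widget 19.99
```

Production systems typically go further with a formal schema (e.g. JSON Schema) and a retry loop when validation fails.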
TLS fingerprinting identifies a client by the precise pattern of cipher suites, extensions, and elliptic curves it advertises in its TLS…
Tool use, also called function calling, is the ability of a language model to invoke external functions or APIs as part of generating a…
A user agent is the string a client sends in the `User-Agent` HTTP header to identify itself.
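Setting the header explicitly is a one-liner in most HTTP clients. A sketch with Python's `urllib` (the bot name and contact URL are illustrative; a descriptive value with a contact address is considered good crawler etiquette):

```python
from urllib.request import Request

# Identify the client with a descriptive, contactable user agent string.
req = Request(
    "https://example.com/page",
    headers={"User-Agent": "ExampleBot/1.0 (+https://example.com/bot-info)"},
)
print(req.get_header("User-agent"))  # urllib normalizes the header key's case
```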
A vector database is a storage and retrieval system optimized for high-dimensional numeric vectors – the embeddings produced by models…
Vector search is the retrieval technique that finds the most semantically similar items to a query by comparing high-dimensional…
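At its simplest, vector search is a ranked scan: score every stored vector against the query and return the top k. A brute-force sketch with toy 3-dimensional vectors (invented for illustration; production systems use approximate nearest-neighbor indexes instead of full scans):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# A tiny in-memory "index" of document embeddings.
index = {
    "shipping policy": [0.9, 0.1, 0.1],
    "return policy":   [0.8, 0.2, 0.1],
    "press releases":  [0.1, 0.9, 0.3],
}

def search(query_vec, k=2):
    # Rank every stored vector by similarity to the query vector.
    ranked = sorted(index, key=lambda doc: cosine(index[doc], query_vec), reverse=True)
    return ranked[:k]

print(search([0.85, 0.15, 0.1]))  # ['shipping policy', 'return policy']
```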
A web browsing agent is an AI system that can autonomously navigate, interact with, and extract information from websites using a real…
A web crawler – sometimes called a spider or bot – is a program that systematically discovers URLs and downloads pages.
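The discovery loop at a crawler's core is a breadth-first traversal: fetch a page, queue its unseen links, repeat. A sketch over a simulated site, so the logic runs without any network access (the URL graph is invented for illustration):

```python
from collections import deque

# Simulated site: each URL maps to the links found on that page.
link_graph = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/"],
    "/blog/post-1": ["/blog"],
}

def crawl(start):
    """Breadth-first discovery: visit a page, queue its unseen links, repeat."""
    seen = {start}
    frontier = deque([start])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)  # a real crawler would fetch and parse the page here
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("/"))  # ['/', '/about', '/blog', '/blog/post-1']
```

The `seen` set is what keeps the crawler from looping forever on sites whose pages link back to each other; real crawlers add politeness delays, robots.txt checks, and a crawl-budget cap on top.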
A web data extractor (often called a web scraper, the legacy term) is a program that extracts specific data from web pages.
A web index is a searchable database of web pages that has been built by systematically crawling and processing the internet.
WebArena is an academic benchmark released in 2023 by researchers at Carnegie Mellon University for evaluating autonomous web agents on…
An XML sitemap is a structured file – typically at /sitemap.xml – that lists the URLs a site wants search engines and crawlers to…
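Because the format is plain XML with a fixed namespace, extracting the URL list takes a few lines of standard-library code. A sketch with a minimal sitemap document (the URLs are illustrative):

```python
import xml.etree.ElementTree as ET

# A minimal sitemap as a site might serve it at /sitemap.xml.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""

# The sitemap namespace must be mapped for the element lookups to match.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)  # ['https://example.com/', 'https://example.com/pricing']
```

Crawlers often seed their frontier from the sitemap and use `<lastmod>` to prioritize recently changed pages.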
XPath is a query language for navigating XML and HTML documents.
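Python's `xml.etree.ElementTree` supports a useful subset of XPath out of the box, including attribute predicates. A sketch against a small invented document (full XPath engines such as lxml add functions like `text()` and axes like `following-sibling`):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<catalog>"
    "<book lang='en'><title>Dune</title></book>"
    "<book lang='fr'><title>Vendredi</title></book>"
    "</catalog>"
)

# ".//" searches all descendants; "[@lang='en']" filters on an attribute.
titles = [t.text for t in doc.findall(".//book/title")]
english = doc.find(".//book[@lang='en']/title").text

print(titles)   # ['Dune', 'Vendredi']
print(english)  # Dune
```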