
robots.txt

The robots.txt file is a plain-text file at the root of a web domain that declares which paths automated agents may and may not crawl. It is part of the Robots Exclusion Protocol, an informal standard dating to 1994 and now codified in IETF RFC 9309. Well-behaved crawlers fetch robots.txt before requesting other URLs on the host and are expected to honor its directives. Major search engines and most commercial scraping platforms respect it; nothing technically prevents an arbitrary client from ignoring it.

A robots.txt file consists of user-agent blocks (`User-agent: *`, `User-agent: Googlebot`) with `Disallow` and `Allow` directives that scope path patterns within each block. It can also reference sitemaps (`Sitemap: https://example.com/sitemap.xml`) and declare `Crawl-delay` hints, though support for the latter varies. The file expresses preference, not enforcement: it is a courtesy contract between site operators and well-behaved crawlers.

For AI builders, the practical question is whether your scraping respects robots.txt. Honoring it is the default ethical choice and is required for most search engine and AI training use cases. Ignoring it may be necessary for legitimate purposes, such as accessing public data a site has gated behind robots.txt, but it creates legal and reputational risk. The hiQ Labs v. LinkedIn line of cases suggested that robots.txt is not a clear legal barrier to scraping public data, but the operational and ethical bar remains: ignore robots.txt only when you have a defensible reason and document why.
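
As a concrete illustration of the format and of the check a well-behaved crawler performs before fetching a page, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, domain, paths, and bot names are invented for the example; production crawler stacks typically use their own matchers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example.com-style site (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10

User-agent: Googlebot
Allow: /private/reports/
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL.
print(parser.can_fetch("*", "https://example.com/private/page"))                # False
print(parser.can_fetch("Googlebot", "https://example.com/private/reports/q1"))  # True
print(parser.can_fetch("MyScraperBot", "https://example.com/blog/post"))        # True

# Crawl-delay hint (Python 3.6+) and declared sitemaps (Python 3.8+).
print(parser.crawl_delay("MyScraperBot"))  # 10
print(parser.site_maps())                  # ['https://example.com/sitemap.xml']
```

Note that matching details, such as how overlapping `Allow` and `Disallow` rules are resolved, vary between implementations, so the stdlib parser's answers may not match every search engine's behavior exactly.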
