serp.fast

XML Sitemap

An XML sitemap is a structured file, typically served at /sitemap.xml, that lists the URLs a site wants search engines and crawlers to discover, along with optional metadata such as last-modified date, change frequency, and priority. Sitemaps complement crawling: rather than relying on a crawler to find every page through link traversal, a sitemap explicitly enumerates the canonical URL set. For sites with deep navigation, dynamically generated pages, or content that is otherwise hard to reach through links, a sitemap is essential for reliable indexation.

The format is defined by the sitemaps.org protocol. A single sitemap file may contain at most 50,000 URLs (and at most 50 MB uncompressed); larger sites split their URLs across multiple files and reference them from a sitemap index. Google has said it largely ignores the changefreq and priority fields, but does use lastmod when the values are consistently accurate.

Many CMSes generate sitemaps automatically, and static-export frameworks like Next.js expose a `sitemap.ts` convention that emits the file at build time. Search engines also treat sitemap modification timestamps as a freshness signal: a sitemap whose lastmod values move forward weekly tells Google the site is being maintained and worth re-crawling.

For AI builders running web data pipelines, sitemaps are the cleanest discovery mechanism. Instead of crawling a target site link by link and risking spending the crawl budget on duplicate or low-value URLs, fetch the sitemap, filter to the URL pattern you care about, and queue those URLs directly. Most well-run sites publish sitemaps, and robots.txt conventionally references the sitemap location, so checking robots.txt is the standard first step of any scraping project.
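A minimal single-file sitemap in the sitemaps.org 0.9 schema looks like this (the URL and dates are illustrative, not from any real site):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/hello-world</loc>
    <lastmod>2024-05-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```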
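In Next.js (App Router), a file at `app/sitemap.ts` that default-exports a function is picked up at build time and emitted as /sitemap.xml. A minimal sketch follows; the domain and routes are hypothetical, and the real convention types the return value as `MetadataRoute.Sitemap`:

```typescript
// app/sitemap.ts (sketch): Next.js serializes the returned array to
// /sitemap.xml at build time. Domain and routes below are placeholders.
export default function sitemap() {
  const base = "https://example.com"; // hypothetical production origin
  const routes = ["", "/docs", "/pricing"]; // hypothetical routes
  return routes.map((path) => ({
    url: `${base}${path}`,
    lastModified: new Date(), // becomes <lastmod> in the emitted XML
    changeFrequency: "weekly" as const,
    priority: path === "" ? 1.0 : 0.7, // root page weighted highest
  }));
}
```

Because the file runs at build time, dynamic routes (blog posts, product pages) can be enumerated from a database or filesystem in the same function.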
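The discovery pattern described above (read robots.txt for its Sitemap: reference, pull the sitemap, filter to the URL pattern you care about) can be sketched as follows. The robots.txt and sitemap bodies are inlined fixtures for a hypothetical site; a real pipeline would `fetch()` them:

```typescript
// Find the sitemap URL advertised in robots.txt (the "Sitemap:" directive).
function sitemapFromRobots(robotsTxt: string): string | undefined {
  for (const line of robotsTxt.split("\n")) {
    const m = line.match(/^\s*sitemap:\s*(\S+)/i);
    if (m) return m[1];
  }
  return undefined;
}

// Pull every <loc> entry out of a sitemap document.
function extractUrls(sitemapXml: string): string[] {
  const re = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  const urls: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(sitemapXml)) !== null) urls.push(m[1]);
  return urls;
}

// Inlined fixtures standing in for fetched responses (hypothetical site).
const robotsTxt = `User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml`;
const sitemapXml = `
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/blog/post-2</loc></url>
</urlset>`;

const sitemapUrl = sitemapFromRobots(robotsTxt); // https://example.com/sitemap.xml
// Filter to the pattern of interest and queue directly, skipping link crawling.
const queue = extractUrls(sitemapXml).filter((u) => /\/blog\//.test(u));
```

For production use, a dedicated XML parser is safer than regexes (sitemap indexes nest, and entities need decoding), but the shape of the pipeline is the same: robots.txt, then sitemap, then filter, then queue.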