
About This Sitemap Extractor

What This Tool Does

This sitemap extractor fetches and parses any XML sitemap or sitemap index and extracts every URL with its metadata. It supports automatic discovery of sitemap locations from a website's robots.txt, recursive fetching of nested sitemap indexes, and four different views of the extracted data — making it the fastest way to audit a site's full URL inventory without downloading raw XML files.

  • robots.txt auto-discovery — enter a homepage URL and the tool finds the sitemap automatically via robots.txt or common paths
  • Recursive sitemap index support — follows all child sitemaps in a sitemap index file to extract every URL
  • Metadata extraction — captures lastmod, changefreq, and priority fields alongside each URL
  • Deduplication — removes repeated URLs that appear in multiple child sitemaps
  • Configurable limit — optionally cap extraction at 100, 500, 1000, or 5000 URLs
  • 4 result tabs — URL List table, Sitemap Tree hierarchy, Analysis charts, Raw XML viewer
  • Filters and sorting — filter by keyword, change frequency, and priority; sort any column
  • 3 export formats — CSV (with headers), plain TXT (one URL per line), JSON
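The parsing step behind these features can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation: it assumes Python's standard library, the Sitemaps protocol namespace, and a hypothetical `parse_urlset` helper that applies deduplication and an optional URL limit.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_urlset(xml_text, limit=None, dedupe=True):
    """Extract URL entries (loc, lastmod, changefreq, priority) from a urlset."""
    root = ET.fromstring(xml_text)
    seen, entries = set(), []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc or (dedupe and loc in seen):
            continue  # skip empty entries and repeated URLs
        seen.add(loc)
        entries.append({
            "loc": loc,
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
        if limit and len(entries) >= limit:
            break  # configurable cap, e.g. 100/500/1000/5000
    return entries
```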

How to Use This Tool

  • Enter a sitemap URL (ending in .xml) or a website homepage URL — the tool auto-discovers the sitemap
  • Enable Recurse nested sitemaps to follow all child sitemaps in a sitemap index (recommended for large sites)
  • Enable Auto-discover from robots.txt to find sitemap URLs from the site's robots.txt file automatically
  • Enable Deduplicate URLs to remove repeated URLs that appear in multiple child sitemaps
  • Set a URL limit if you only need a sample or want faster results
  • Click Extract URLs — the live progress log shows each sitemap as it is fetched
  • Use the URL List tab to browse, filter, sort, and copy individual URLs
  • Check the Sitemap Tree tab to see the hierarchy of index and urlset files
  • Open the Analysis tab for charts of frequency, priority, URL depth, and file extension distribution
  • Click CSV, TXT, or JSON to download the filtered URL list for use in other tools
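The depth and extension breakdowns shown in the Analysis tab can be reproduced offline. A minimal sketch, assuming depth is counted as path segments and the `analyze` helper name is hypothetical:

```python
import os
from collections import Counter
from urllib.parse import urlparse

def analyze(urls):
    """Depth (path segment count) and file-extension distribution for a URL list."""
    depths, exts = Counter(), Counter()
    for u in urls:
        path = urlparse(u).path
        segments = [s for s in path.split("/") if s]
        depths[len(segments)] += 1
        ext = os.path.splitext(path)[1].lstrip(".").lower() or "(none)"
        exts[ext] += 1
    return depths, exts
```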

XML Sitemap Complete Reference Guide

What Is an XML Sitemap?

An XML sitemap is a file that lists all important URLs on a website along with optional metadata: lastmod (last modification date), changefreq (update frequency hint), and priority (relative importance from 0.0 to 1.0). Search engines like Google, Bing, and Yandex use sitemaps to discover and crawl pages efficiently — especially useful for large sites, new pages without inbound links, or pages behind deep navigation. Sitemaps do not guarantee indexing but significantly improve crawl coverage.
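A minimal urlset file illustrating the format (URLs and dates here are placeholder examples; only loc is required):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2023-11-02</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```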

Sitemap Index Files

Large websites split their URL inventory across multiple files using a sitemap index. The index (sitemapindex element) links to child sitemaps (urlset elements), each containing up to 50,000 URLs and a maximum uncompressed size of 50 MB. This tool automatically detects sitemap index files and, when recursion is enabled, fetches every child sitemap to build the complete URL list. The Sitemap Tree tab visualises the full hierarchy including which child sitemaps belong to which index.
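The recursion can be sketched as follows. This is an illustrative outline, not the tool's code: it distinguishes sitemapindex from urlset by the root tag, takes a pluggable `fetch` callable (so it can be tested or swapped for a proxied fetcher), and the `max_depth` guard is an assumed safeguard against circular indexes.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch(url):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

def extract_urls(sitemap_url, fetch=fetch, max_depth=5):
    """Recursively collect page URLs from a sitemap or sitemap index."""
    if max_depth < 0:
        return []
    root = ET.fromstring(fetch(sitemap_url))
    tag = root.tag.split("}")[-1]  # strip the namespace prefix
    if tag == "sitemapindex":
        urls = []
        for child in root.findall("sm:sitemap/sm:loc", NS):
            # follow each child sitemap listed in the index
            urls += extract_urls(child.text.strip(), fetch, max_depth - 1)
        return urls
    # plain urlset: return the page URLs directly
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
```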

Sitemap Metadata Fields

The loc field (required) is the absolute URL of the page. lastmod (optional) is the ISO 8601 date of the last content change — search engines use this to decide whether to recrawl. changefreq (optional) hints at update frequency: always, hourly, daily, weekly, monthly, yearly, or never. priority (optional) is a decimal from 0.0–1.0 indicating relative importance. Google has stated it treats changefreq and priority as hints only and may ignore them, but SEO teams still use them to audit sitemap quality and identify stale or misconfigured pages.

SEO Use Cases for Sitemap Extraction

Extracting sitemap URLs supports several SEO workflows:

  • Import the URL list into Screaming Frog or Sitebulb to crawl only sitemapped pages and check for 4xx, 5xx, or redirects
  • Compare the sitemap URL list against a crawl to find orphaned pages (indexed but not linked)
  • Filter by lastmod to identify stale content not updated in over a year
  • Audit priority distribution to see whether high-priority pages are correctly marked
  • Export to CSV to track URL count changes over time as a quick measure of content growth or site migrations
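The stale-content check, for instance, is a simple date comparison on lastmod. A hypothetical sketch (the `stale_urls` name and the flag-missing-dates behaviour are assumptions, not the tool's own logic):

```python
from datetime import datetime, timedelta, timezone

def stale_urls(entries, max_age_days=365):
    """Return entries whose lastmod is older than max_age_days, or missing."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    stale = []
    for e in entries:
        lastmod = e.get("lastmod")
        if not lastmod:
            stale.append(e)  # no date at all is worth flagging
            continue
        # handle both date-only values and full ISO 8601 timestamps
        dt = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        if dt < cutoff:
            stale.append(e)
    return stale
```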

Sitemap Extractor FAQ

What is a sitemap extractor?

An XML sitemap is a file that lists all important URLs on a website along with optional metadata: lastmod (last modification date), changefreq (update frequency hint), and priority (relative importance from 0.0 to 1.0). Search engines use sitemaps to discover pages more efficiently. A sitemap extractor fetches this file, parses its XML, and extracts every URL and metadata field into a readable list you can filter, sort, and export — far faster than downloading and manually reading raw XML.

What is a sitemap index?

A sitemap index is a parent XML file that links to multiple child sitemap files rather than listing URLs directly. Large websites use indexes to split their URL inventory across many files (the Sitemaps protocol limits each file to 50,000 URLs and 50 MB). This tool automatically detects index files and, when "Recurse nested sitemaps" is enabled, fetches every child sitemap to extract all URLs. The Sitemap Tree tab shows the full hierarchy of index files and their child urlsets.

How does sitemap auto-discovery work?

When you enter a website homepage URL instead of a direct sitemap URL, auto-discovery works in two stages. First it fetches robots.txt and looks for Sitemap: directives. If found, those URLs are used. If robots.txt has no sitemap declarations or cannot be fetched, the tool probes common paths: /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, /sitemap/sitemap.xml, and /post-sitemap.xml. The first path that returns a valid sitemap is used.
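The two stages described above can be sketched like this. It is an illustrative outline only: the `discover_sitemaps` name is hypothetical, `fetch` is any callable returning the response body as text, and the validity check here is just a crude substring test.

```python
from urllib.parse import urljoin

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml",
                "/sitemap/sitemap.xml", "/post-sitemap.xml"]

def discover_sitemaps(homepage, fetch):
    """Two-stage discovery: robots.txt Sitemap: lines first, then common paths."""
    # Stage 1: Sitemap: directives in robots.txt
    try:
        robots = fetch(urljoin(homepage, "/robots.txt"))
        found = [line.split(":", 1)[1].strip()
                 for line in robots.splitlines()
                 if line.lower().startswith("sitemap:")]
        if found:
            return found
    except Exception:
        pass  # robots.txt missing or unreachable; fall through to probing
    # Stage 2: probe well-known paths, first valid hit wins
    for path in COMMON_PATHS:
        url = urljoin(homepage, path)
        try:
            body = fetch(url)
            if "<urlset" in body or "<sitemapindex" in body:
                return [url]
        except Exception:
            continue
    return []
```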

What do lastmod, changefreq, and priority mean?

These are optional metadata fields in the Sitemaps protocol. lastmod is the ISO 8601 date the page was last modified — search engines use this to decide whether to recrawl a page. changefreq hints at update frequency: always, hourly, daily, weekly, monthly, yearly, or never. Google treats this as a hint only. priority is a decimal from 0.0 to 1.0 indicating relative importance within the site (default 0.5). Google has also stated it ignores priority. Despite this, SEO teams use these fields to audit sitemap health and identify stale or misconfigured entries.

What export formats are available?

Three export buttons appear above the URL table after extraction. CSV exports the filtered list with four columns (URL, Last Modified, Change Frequency, Priority) — ready to open in Excel or Google Sheets. TXT exports one URL per line with no headers, useful for Screaming Frog, Sitebulb, or curl. JSON exports a structured object with a total count and an array of URL objects. Copy All URLs copies all filtered URLs to the clipboard as plain text. All exports apply the current filter, so you can filter before exporting a targeted subset.
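The three formats map directly onto the extracted entries. A minimal sketch of what each export produces (function names are assumptions; the exact field order for CSV follows the columns listed above):

```python
import csv
import io
import json

def to_csv(entries):
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(["URL", "Last Modified", "Change Frequency", "Priority"])
    for e in entries:
        w.writerow([e["loc"], e.get("lastmod") or "",
                    e.get("changefreq") or "", e.get("priority") or ""])
    return buf.getvalue()

def to_txt(entries):
    # one URL per line, no headers
    return "\n".join(e["loc"] for e in entries)

def to_json(entries):
    # total count plus an array of URL objects
    return json.dumps({"total": len(entries), "urls": entries}, indent=2)
```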

Why did my sitemap extraction fail?

Sitemap fetches can fail for several reasons. The most common is CORS: browsers block direct requests to third-party domains, so this tool uses a CORS proxy to relay the request. If the proxy is blocked by the target server, or if the server returns HTTP 403 or 404, the fetch fails. Other causes include sitemaps behind authentication, servers requiring specific User-Agent headers, incorrect sitemap URLs, or network timeouts for very large files. The tool tries multiple CORS proxies automatically. If extraction fails, try opening the sitemap URL directly in your browser to verify it is accessible.

How can I use extracted sitemap URLs for SEO?

Extracted sitemap URLs support several audit workflows: import the CSV into Screaming Frog to crawl only sitemapped pages and check for 4xx/5xx errors or redirects; compare the sitemap URL list to a full crawl to find orphaned pages; filter by lastmod to identify stale content not updated in over a year; look for URLs with no priority or changefreq set, indicating an auto-generated sitemap; and check the Analysis tab for instant breakdowns of frequency, priority, URL depth, and file extension types to spot anomalies without leaving the tool.