
About This Sitemap Extractor

What This Tool Does

This sitemap extractor fetches and parses any XML sitemap or sitemap index and extracts every URL with its metadata. It supports automatic discovery of sitemap locations from a website's robots.txt, recursive fetching of nested sitemap indexes, and four different views of the extracted data — making it the fastest way to audit a site's full URL inventory without downloading raw XML files.

  • robots.txt auto-discovery — enter a homepage URL and the tool finds the sitemap automatically via robots.txt or common paths
  • Recursive sitemap index support — follows all child sitemaps in a sitemap index file to extract every URL
  • Metadata extraction — captures lastmod, changefreq, and priority fields alongside each URL
  • Deduplication — removes repeated URLs that appear in multiple child sitemaps
  • Configurable limit — optionally cap extraction at 100, 500, 1000, or 5000 URLs
  • 4 result tabs — URL List table, Sitemap Tree hierarchy, Analysis charts, Raw XML viewer
  • Filters and sorting — filter by keyword, change frequency, and priority; sort any column
  • 3 export formats — CSV (with headers), plain TXT (one URL per line), JSON
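The parsing step behind these features can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation: it assumes Python's standard library, the Sitemaps protocol namespace, and a hypothetical `parse_urlset` helper that applies deduplication and an optional URL limit.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_urlset(xml_text, limit=None, dedupe=True):
    """Extract URL entries (loc, lastmod, changefreq, priority) from a urlset."""
    root = ET.fromstring(xml_text)
    seen, entries = set(), []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc or (dedupe and loc in seen):
            continue  # skip empty entries and repeated URLs
        seen.add(loc)
        entries.append({
            "loc": loc,
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
        if limit and len(entries) >= limit:
            break  # configurable cap, e.g. 100/500/1000/5000
    return entries
```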

How to Use This Tool

  • Enter a sitemap URL (ending in .xml) or a website homepage URL — the tool auto-discovers the sitemap
  • Enable Recurse nested sitemaps to follow all child sitemaps in a sitemap index (recommended for large sites)
  • Enable Auto-discover from robots.txt to find sitemap URLs from the site's robots.txt file automatically
  • Enable Deduplicate URLs to remove repeated URLs that appear in multiple child sitemaps
  • Set a URL limit if you only need a sample or want faster results
  • Click Extract URLs — the live progress log shows each sitemap as it is fetched
  • Use the URL List tab to browse, filter, sort, and copy individual URLs
  • Check the Sitemap Tree tab to see the hierarchy of index and urlset files
  • Open the Analysis tab for charts of frequency, priority, URL depth, and file extension distribution
  • Click CSV, TXT, or JSON to download the filtered URL list for use in other tools
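The depth and extension breakdowns shown in the Analysis tab can be reproduced offline. A minimal sketch, assuming depth is counted as path segments and the `analyze` helper name is hypothetical:

```python
import os
from collections import Counter
from urllib.parse import urlparse

def analyze(urls):
    """Depth (path segment count) and file-extension distribution for a URL list."""
    depths, exts = Counter(), Counter()
    for u in urls:
        path = urlparse(u).path
        segments = [s for s in path.split("/") if s]
        depths[len(segments)] += 1
        ext = os.path.splitext(path)[1].lstrip(".").lower() or "(none)"
        exts[ext] += 1
    return depths, exts
```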

XML Sitemap Complete Reference Guide

What Is an XML Sitemap?

An XML sitemap is a file that lists all important URLs on a website along with optional metadata: lastmod (last modification date), changefreq (update frequency hint), and priority (relative importance from 0.0 to 1.0). Search engines like Google, Bing, and Yandex use sitemaps to discover and crawl pages efficiently — especially useful for large sites, new pages without inbound links, or pages behind deep navigation. Sitemaps do not guarantee indexing but significantly improve crawl coverage.
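A minimal urlset file illustrating the format (URLs and dates here are placeholder examples; only loc is required):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2023-11-02</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```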

Sitemap Index Files

Large websites split their URL inventory across multiple files using a sitemap index. The index (sitemapindex element) links to child sitemaps (urlset elements), each containing up to 50,000 URLs and a maximum uncompressed size of 50 MB. This tool automatically detects sitemap index files and, when recursion is enabled, fetches every child sitemap to build the complete URL list. The Sitemap Tree tab visualises the full hierarchy including which child sitemaps belong to which index.
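The recursion can be sketched as follows. This is an illustrative outline, not the tool's code: it distinguishes sitemapindex from urlset by the root tag, takes a pluggable `fetch` callable (so it can be tested or swapped for a proxied fetcher), and the `max_depth` guard is an assumed safeguard against circular indexes.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch(url):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

def extract_urls(sitemap_url, fetch=fetch, max_depth=5):
    """Recursively collect page URLs from a sitemap or sitemap index."""
    if max_depth < 0:
        return []
    root = ET.fromstring(fetch(sitemap_url))
    tag = root.tag.split("}")[-1]  # strip the namespace prefix
    if tag == "sitemapindex":
        urls = []
        for child in root.findall("sm:sitemap/sm:loc", NS):
            # follow each child sitemap listed in the index
            urls += extract_urls(child.text.strip(), fetch, max_depth - 1)
        return urls
    # plain urlset: return the page URLs directly
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
```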

Sitemap Metadata Fields

The loc field (required) is the absolute URL of the page. lastmod (optional) is the ISO 8601 date of the last content change — search engines use this to decide whether to recrawl. changefreq (optional) hints at update frequency: always, hourly, daily, weekly, monthly, yearly, or never. priority (optional) is a decimal from 0.0–1.0 indicating relative importance. Google has stated it treats changefreq and priority as hints only and may ignore them, but SEO teams still use them to audit sitemap quality and identify stale or misconfigured pages.

SEO Use Cases for Sitemap Extraction

Extracting sitemap URLs supports several SEO workflows:

  • Import the URL list into Screaming Frog or Sitebulb to crawl only sitemapped pages and check for 4xx, 5xx, or redirects
  • Compare the sitemap URL list against a crawl to find orphaned pages (indexed but not linked)
  • Filter by lastmod to identify stale content not updated in over a year
  • Audit priority distribution to see whether high-priority pages are correctly marked
  • Export to CSV to track URL count changes over time as a quick measure of content growth or site migrations
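The stale-content check, for instance, is a simple date comparison on lastmod. A hypothetical sketch (the `stale_urls` name and the flag-missing-dates behaviour are assumptions, not the tool's own logic):

```python
from datetime import datetime, timedelta, timezone

def stale_urls(entries, max_age_days=365):
    """Return entries whose lastmod is older than max_age_days, or missing."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    stale = []
    for e in entries:
        lastmod = e.get("lastmod")
        if not lastmod:
            stale.append(e)  # no date at all is worth flagging
            continue
        # handle both date-only values and full ISO 8601 timestamps
        dt = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        if dt < cutoff:
            stale.append(e)
    return stale
```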

Sitemap Extractor FAQ

What is a sitemap extractor?

An XML sitemap is a file that lists all important URLs on a website along with optional metadata: lastmod (last modification date), changefreq (update frequency hint), and priority (relative importance from 0.0 to 1.0). Search engines use sitemaps to discover pages more efficiently. A sitemap extractor fetches this file, parses its XML, and extracts every URL and metadata field into a readable list you can filter, sort, and export — far faster than downloading and manually reading raw XML.

What is a sitemap index?

A sitemap index is a parent XML file that links to multiple child sitemap files rather than listing URLs directly. Large websites use indexes to split their URL inventory across many files (the Sitemaps protocol limits each file to 50,000 URLs and 50 MB). This tool automatically detects index files and, when "Recurse nested sitemaps" is enabled, fetches every child sitemap to extract all URLs. The Sitemap Tree tab shows the full hierarchy of index files and their child urlsets.

How does sitemap auto-discovery work?

When you enter a website homepage URL instead of a direct sitemap URL, auto-discovery works in two stages. First it fetches robots.txt and looks for Sitemap: directives. If found, those URLs are used. If robots.txt has no sitemap declarations or cannot be fetched, the tool probes common paths: /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, /sitemap/sitemap.xml, and /post-sitemap.xml. The first path that returns a valid sitemap is used.
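The two stages described above can be sketched like this. It is an illustrative outline only: the `discover_sitemaps` name is hypothetical, `fetch` is any callable returning the response body as text, and the validity check here is just a crude substring test.

```python
from urllib.parse import urljoin

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml",
                "/sitemap/sitemap.xml", "/post-sitemap.xml"]

def discover_sitemaps(homepage, fetch):
    """Two-stage discovery: robots.txt Sitemap: lines first, then common paths."""
    # Stage 1: Sitemap: directives in robots.txt
    try:
        robots = fetch(urljoin(homepage, "/robots.txt"))
        found = [line.split(":", 1)[1].strip()
                 for line in robots.splitlines()
                 if line.lower().startswith("sitemap:")]
        if found:
            return found
    except Exception:
        pass  # robots.txt missing or unreachable; fall through to probing
    # Stage 2: probe well-known paths, first valid hit wins
    for path in COMMON_PATHS:
        url = urljoin(homepage, path)
        try:
            body = fetch(url)
            if "<urlset" in body or "<sitemapindex" in body:
                return [url]
        except Exception:
            continue
    return []
```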

What do lastmod, changefreq, and priority mean?

These are optional metadata fields in the Sitemaps protocol. lastmod is the ISO 8601 date the page was last modified — search engines use this to decide whether to recrawl a page. changefreq hints at update frequency: always, hourly, daily, weekly, monthly, yearly, or never. Google treats this as a hint only. priority is a decimal from 0.0 to 1.0 indicating relative importance within the site (default 0.5). Google has also stated it ignores priority. Despite this, SEO teams use these fields to audit sitemap health and identify stale or misconfigured entries.

What export formats are available?

Three export buttons appear above the URL table after extraction. CSV exports the filtered list with four columns (URL, Last Modified, Change Frequency, Priority) — ready to open in Excel or Google Sheets. TXT exports one URL per line with no headers, useful for Screaming Frog, Sitebulb, or curl. JSON exports a structured object with a total count and an array of URL objects. Copy All URLs copies all filtered URLs to the clipboard as plain text. All exports apply the current filter, so you can filter before exporting a targeted subset.
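The three formats map directly onto the extracted entries. A minimal sketch of what each export produces (function names are assumptions; the exact field order for CSV follows the columns listed above):

```python
import csv
import io
import json

def to_csv(entries):
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(["URL", "Last Modified", "Change Frequency", "Priority"])
    for e in entries:
        w.writerow([e["loc"], e.get("lastmod") or "",
                    e.get("changefreq") or "", e.get("priority") or ""])
    return buf.getvalue()

def to_txt(entries):
    # one URL per line, no headers
    return "\n".join(e["loc"] for e in entries)

def to_json(entries):
    # total count plus an array of URL objects
    return json.dumps({"total": len(entries), "urls": entries}, indent=2)
```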

Why did my sitemap extraction fail?

Sitemap fetches can fail for several reasons. The most common is CORS: browsers block direct requests to third-party domains, so this tool uses a CORS proxy to relay the request. If the proxy is blocked by the target server, or if the server returns HTTP 403 or 404, the fetch fails. Other causes include sitemaps behind authentication, servers requiring specific User-Agent headers, incorrect sitemap URLs, or network timeouts for very large files. The tool tries multiple CORS proxies automatically. If extraction fails, try opening the sitemap URL directly in your browser to verify it is accessible.

How can I use extracted sitemap URLs for SEO?

Extracted sitemap URLs support several audit workflows: import the CSV into Screaming Frog to crawl only sitemapped pages and check for 4xx/5xx errors or redirects; compare the sitemap URL list to a full crawl to find orphaned pages; filter by lastmod to identify stale content not updated in over a year; look for URLs with no priority or changefreq set, indicating an auto-generated sitemap; and check the Analysis tab for instant breakdowns of frequency, priority, URL depth, and file extension types to spot anomalies without leaving the tool.