About This Robots.txt Generator
What This Tool Generates
This robots.txt generator creates complete, valid robots.txt files with all standard directives. Everything runs in your browser — no file is uploaded to any server.
- 6 presets — Allow All Bots, Block All Bots, Standard Website, WordPress, E-Commerce, and Block AI Crawlers, each pre-configured with the right rules
- Multiple User-agent blocks — configure different rules for Googlebot, Bingbot, GPTBot, CCBot, Google-Extended, ClaudeBot, Bytespider, and 11 other named bots, or enter any custom agent name
- Allow and Disallow rules — per block, with path input and a quick-add panel for 15 common paths
- Crawl-delay — set per user-agent block, with a warning when it is applied to bots that ignore the directive (such as Googlebot)
- Multiple sitemaps — a primary sitemap field plus unlimited additional Sitemap: entries
- Import — paste or upload an existing robots.txt file and the tool parses it into editable blocks
- Validation — checks for wildcard coverage, sitemap URL format, path format errors, file size, and crawl-delay warnings
- Generation log — a terminal panel shows timestamped events for every action
How to Use This Tool
- Click a Quick Preset to instantly populate the editor with rules for your site type (WordPress, e-commerce, block AI, etc.)
- Enter your Website URL and Sitemap URL in Site Settings — these are auto-filled into the output
- Use Add User Agent Block to create rules for specific bots (e.g. block GPTBot while allowing Googlebot)
- Click Quick Add path buttons to instantly add common Disallow rules like /admin/, /wp-admin/, or /cart/
- Toggle Add Comments to include a header comment block with your website URL and generation date
- Click Validate to run all checks — fix any red errors before deploying
- Click Generate robots.txt then Download robots.txt to save the file
- Upload the downloaded file to your web server root so it is accessible at yourdomain.com/robots.txt
- Use Import to load and edit an existing robots.txt file without starting from scratch
- Press Ctrl+Enter to generate and copy in one keystroke
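For reference, a downloaded file with Add Comments enabled looks roughly like this. The URL, date, and rules are placeholders, and the exact header wording the tool emits may differ:

```
# robots.txt for https://example.com
# Generated: 2024-01-01

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
```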
Robots.txt Complete Reference Guide
Core Directives
User-agent starts a new block and specifies which bot the following rules apply to. Use * for all bots. Disallow blocks a path — an empty Disallow value means allow everything. Allow overrides a Disallow for a more specific path (e.g. disallow /wp-admin/ but allow /wp-admin/admin-ajax.php which some themes require). Crawl-delay requests a pause between requests — respected by Bing and Yandex, ignored by Google. The Sitemap directive appears at the end and gives crawlers the absolute URL of your XML sitemap.
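Putting the five directives together, a minimal file might look like this (the paths and sitemap URL are placeholder examples):

```
# Rules for all bots
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Bing and Yandex honor Crawl-delay; Google ignores it
User-agent: Bingbot
Crawl-delay: 5
Disallow: /search/

Sitemap: https://example.com/sitemap.xml
```

Note that rules apply to the most specific matching User-agent block: Bingbot follows only its own block here, not the wildcard rules.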
SEO Best Practices
A well-written robots.txt protects your crawl budget — the number of pages Google will crawl per day. For WordPress sites, always disallow /wp-admin/, /wp-includes/, /wp-content/plugins/, and /wp-content/cache/, while allowing /wp-admin/admin-ajax.php for AJAX-dependent themes. For e-commerce sites, disallow /cart/, /checkout/, /account/, and faceted navigation parameters like /*?*sort= and /*?*filter= to prevent duplicate content from inflating your crawl budget.
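As a sketch, the WordPress and e-commerce rules above combined into a single wildcard block (adjust the paths to match your site's structure):

```
User-agent: *
# WordPress internals
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Allow: /wp-admin/admin-ajax.php
# E-commerce pages and faceted navigation
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /*?*sort=
Disallow: /*?*filter=
```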
Blocking AI Crawlers
Since 2023, AI companies have released their own crawlers to collect training data. The major ones are GPTBot (OpenAI/ChatGPT), CCBot (Common Crawl, used by many AI datasets), Google-Extended (Google Gemini and Vertex AI), anthropic-ai and ClaudeBot (Anthropic), and Bytespider (ByteDance/TikTok). To block all of them, use this generator's "Block AI Crawlers" preset, which creates a separate User-agent block with Disallow: / for each one. These bots are generally well-behaved and respect the robots.txt standard.
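Based on the preset description above, the generated file contains one block per bot, along these lines:

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /
```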
Common Mistakes
- Using Disallow: / on the wildcard block — blocks your entire site from all crawlers
- Blocking CSS and JavaScript files — prevents Google from rendering and understanding your pages
- Using robots.txt to hide private content — it is advisory only and provides no security
- Blocking pages with noindex meta tags — crawlers cannot read the noindex if they cannot access the page
- Using relative URLs in the Sitemap directive — the URL must be absolute, starting with https://
- Missing wildcard block — bots not listed get no rules at all
- Setting Crawl-delay for Googlebot — Google ignores it; use Search Console instead
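For the first mistake above, the broken and corrected forms side by side (the /admin/ path is a placeholder):

```
# WRONG: a bare slash blocks the entire site for every crawler
User-agent: *
Disallow: /

# RIGHT: block only specific paths; everything else stays crawlable
User-agent: *
Disallow: /admin/
```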
Robots.txt Generator FAQ
What is a robots.txt file?
A robots.txt file is a plain-text file served at the root of a domain (e.g. https://example.com/robots.txt) that tells search engine crawlers which pages or sections they are allowed or not allowed to crawl. It follows the Robots Exclusion Protocol. Each subdomain requires its own robots.txt — the file at www.example.com does not apply to blog.example.com. The file must return a 200 HTTP status code. A 404 means no restrictions apply; a 5xx server error may cause Google to temporarily pause crawling your entire site.

What is the difference between Disallow and noindex?
Disallow in robots.txt prevents a crawler from visiting a URL, but does not prevent that URL from appearing in search results — Google may still index and rank it based on links from other pages. The noindex meta tag tells Google to crawl the page but exclude it from search results. Never use Disallow on pages that have a noindex tag — the crawler cannot read the noindex instruction if it cannot access the page at all. Use noindex to remove pages from search results, and Disallow to reduce server load from unnecessary crawling.

How do I block AI crawlers?
Add a User-agent block with Disallow: / for each AI bot you want to block. Major AI crawlers include: GPTBot (OpenAI/ChatGPT), CCBot (Common Crawl — used by many AI training datasets), Google-Extended (Google Gemini and Vertex AI), anthropic-ai and ClaudeBot (Anthropic Claude), Bytespider (ByteDance/TikTok), and PerplexityBot (Perplexity AI). Use the "Block AI Crawlers" preset in this generator to add all major AI bots at once. These bots are generally well-behaved and will respect your robots.txt.

What are the most common robots.txt mistakes?
The most common mistakes are using Disallow: / on the wildcard block (blocks your entire site), blocking pages that need their noindex tag to be read, and using robots.txt to try to hide private content (it provides zero security).

How do wildcards work in robots.txt?
The asterisk (*) matches any sequence of characters — Disallow: /search* blocks /search, /search?q=, /search/results/, and any URL starting with /search. The dollar sign ($) matches the end of a URL — Disallow: /*.pdf$ blocks all URLs ending in .pdf while still allowing access to directories containing PDFs. These can be combined: Disallow: /*?*sort= blocks any URL containing the query parameter sort=. Google and all major crawlers support these wildcards.
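The wildcard patterns from the last answer, collected into one example block (the paths are illustrative):

```
User-agent: *
# * matches any sequence of characters
Disallow: /search*
# $ anchors the match to the end of the URL
Disallow: /*.pdf$
# Combined: any URL whose query string contains sort=
Disallow: /*?*sort=
```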