robots.txt Validator

Enter a domain to fetch and validate its robots.txt file and get actionable fix recommendations.

What is robots.txt — and why validate it?

robots.txt is a plain-text file at the root of your site that tells compliant crawlers which URLs they may fetch. It does not hide pages from the public internet; anyone can open https://yourdomain.com/robots.txt. Well-written rules keep crawlers out of staging areas, faceted search traps, and internal tools while keeping important URLs discoverable. Because the file is advisory, anything that must stay private still needs authentication.

Validation matters because a single typo can block your entire site (for example Disallow: /), or accidentally disallow CSS and JavaScript so search engines render empty pages. AI and classic search crawlers increasingly read the same file, so rules for GPTBot, Google-Extended, and traditional Googlebot should stay intentional and documented.

This tool fetches your live robots.txt over HTTPS, checks syntax and risky patterns, lists declared user-agents and sitemaps, and lets you simulate path-level allow/disallow decisions before you ship changes.
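
The same kind of check can be approximated locally. The following is a minimal sketch using Python's standard-library urllib.robotparser, not this tool's implementation; the SAMPLE file and the example.com URLs are placeholders chosen for illustration, and site_maps() assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory so no network is needed.
SAMPLE = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /

Sitemap: https://www.example.com/sitemap_index.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text into a queryable rule set."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(SAMPLE)
print(rules.can_fetch("Googlebot", "https://www.example.com/private/report"))  # False
print(rules.can_fetch("Googlebot", "https://www.example.com/pricing"))         # True
print(rules.can_fetch("GPTBot", "https://www.example.com/pricing"))            # False
print(rules.site_maps())  # ['https://www.example.com/sitemap_index.xml']
```

To check a live file instead, RobotFileParser also offers set_url() followed by read(), which fetches the file over the network.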

Glossary: core directives

User-agent: Names the crawler group the following rules apply to. * is the wildcard for “all bots” unless a more specific block matches.
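Disallow: URL path prefix that must not be fetched. Matching is prefix-based. Under RFC 9309 the longest matching rule wins regardless of order, but some older parsers apply the first matching rule instead, so avoid rule sets whose outcome depends on ordering.
Allow: Carves out exceptions inside a disallowed tree. It is part of RFC 9309 and honored by the major search crawlers, but older or simpler parsers may ignore it or resolve Allow/Disallow conflicts differently.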
Sitemap: Optional line pointing to your XML sitemap URL(s). It does not fix crawl issues by itself but helps discovery when crawling is allowed.
Crawl-delay: Non-standard hint for seconds between requests; support varies. Prefer server rate limits and CDN controls for reliable throttling.
Google-Extended: A Google-specific product token rather than a separate crawler. Disallowing it opts your content out of certain Gemini training and grounding uses; standard Googlebot crawling and Search indexing are unaffected.
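
How Allow and Disallow interact is easier to see in code. The sketch below illustrates longest-match precedence as described in RFC 9309, assuming plain path prefixes only: the * and $ wildcards are deliberately not handled, and is_allowed is a hypothetical helper name, not part of any library.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide whether `path` may be fetched.

    `rules` holds (directive, prefix) pairs from a single user-agent group,
    for example ("Disallow", "/shop/").
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be fetched
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        is_allow = directive.lower() == "allow"
        # The longest prefix wins; on a tie in length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


group = [("Disallow", "/shop/"), ("Allow", "/shop/public/")]
print(is_allowed("/shop/public/catalog", group))  # True
print(is_allowed("/shop/cart", group))            # False
print(is_allowed("/about", group))                # True
```

Note that Python's built-in urllib.robotparser uses first-match ordering rather than longest-match, so the two can disagree on files whose rules overlap.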

Common AI and research crawlers (User-agent)

These strings appear in real robots.txt files. Compliance is voluntary: malicious bots may ignore the file, so use authentication, a WAF, and rate limits to stop abuse.

Product / family	Typical User-agent	What teams use it for
OpenAI	GPTBot	Crawling content that may be used to train OpenAI models (ChatGPT browsing and search use separate user-agents)
Anthropic	ClaudeBot	Claude / Anthropic crawler
Google AI	Google-Extended	Opt-out surface for AI training beyond classic indexing
Perplexity	PerplexityBot	Answer engine crawling
Common Crawl	CCBot	Open web corpus used by many research stacks
Meta	FacebookBot	Crawling for Meta's AI/research pipelines (link-sharing previews are fetched by facebookexternalhit)
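
As an illustration of how the tokens in the table could be used, the sketch below generates one opt-out group per AI crawler while leaving all other bots allowed, then checks the result with urllib.robotparser. The build_ai_optout helper and the choice of agents are assumptions for the example; whether to block any of these crawlers is a policy decision for your site.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the table above.
AI_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot", "CCBot"]


def build_ai_optout(agents: list[str]) -> str:
    """Return robots.txt text that allows everyone except the listed agents."""
    groups = ["User-agent: *\nAllow: /"]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


robots_text = build_ai_optout(AI_AGENTS)

parser = RobotFileParser()
parser.parse(robots_text.splitlines())
print(parser.can_fetch("GPTBot", "/pricing"))     # False
print(parser.can_fetch("Googlebot", "/pricing"))  # True
```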

Minimal good vs. dangerous bad example

Reasonable starter pattern

User-agent: *
Disallow: /wp-admin/
Disallow: /private/
Allow: /public/

Sitemap: https://www.example.com/sitemap_index.xml

Dangerous pattern

User-agent: *
Disallow: /

# Blocks every URL for every crawler that obeys robots.txt

Replace paths with your real admin, API, and checkout routes. Test after CDN or host-level redirects — some platforms serve a different robots.txt per hostname.
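
A site-wide block like the dangerous pattern above is simple to detect before deployment. The following is a rough heuristic sketch, not this tool's logic: find_sitewide_blocks is a hypothetical helper that reports every user-agent whose group contains a bare "Disallow: /", and it does not account for Allow rules that might partially override the block.

```python
def find_sitewide_blocks(text: str) -> list[str]:
    """Return user-agent tokens whose group contains a bare 'Disallow: /'."""
    blocked: list[str] = []
    agents: list[str] = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                for agent in agents:
                    if agent not in blocked:
                        blocked.append(agent)
    return blocked


print(find_sitewide_blocks("User-agent: *\nDisallow: /\n"))  # ['*']
```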

How to use this validator (workflow)

Enter the canonical hostname — Use the domain customers and Google Search Console use (often www or bare domain). Subdomains need their own robots.txt.
Read errors and warnings first — Fix syntax and semantic issues before tuning Allow/Disallow. A site-wide block or missing sitemap line is higher priority than cosmetic comments.
Run crawl preview on money paths — Check /product/, /checkout/, and key landing URLs to ensure the right user-agent groups can fetch them.
Ship, then re-fetch — After deployment, validate again. CDNs and edge configs sometimes cache or inject robots.txt unexpectedly.
Cross-check in Search Console — Use Google’s robots.txt report and URL inspection to confirm live behavior matches your intent.
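
The "money paths" step in this workflow can also run as an automated pre-deploy check. The sketch below assumes the candidate robots.txt is available as text; blocked_money_paths is a hypothetical helper, and the agents and paths are placeholders to replace with your own. It relies on urllib.robotparser, which applies first-match ordering, so treat the result as an approximation of how Google resolves overlapping rules.

```python
from urllib.robotparser import RobotFileParser


def blocked_money_paths(
    robots_text: str, agents: list[str], paths: list[str]
) -> list[tuple[str, str]]:
    """Return every (agent, path) pair the given robots.txt would block."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [
        (agent, path)
        for agent in agents
        for path in paths
        if not parser.can_fetch(agent, path)
    ]


candidate = "User-agent: *\nDisallow: /checkout/\n"
problems = blocked_money_paths(
    candidate, ["Googlebot", "GPTBot"], ["/product/", "/checkout/"]
)
print(problems)  # [('Googlebot', '/checkout/'), ('GPTBot', '/checkout/')]
```

A deployment script could fail the release whenever the returned list is non-empty for paths that must stay crawlable.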

Frequently asked questions

What does a robots.txt validator check?

It validates syntax (unknown directives, missing colons), semantic issues (site-wide Disallow: /), missing sitemaps, conflicting rules, and file size limits — with actionable fix recommendations.
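
A few of these checks are simple enough to sketch. The example below is an illustrative linter, not this validator's rule set: the KNOWN directive list and the lint helper are assumptions, and the 500 KiB threshold reflects the minimum parsing limit named in RFC 9309.

```python
KNOWN = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}
MAX_BYTES = 500 * 1024  # parsers must handle at least 500 KiB per RFC 9309


def lint(text: str) -> list[str]:
    """Return human-readable issues found in robots.txt text."""
    issues: list[str] = []
    if len(text.encode("utf-8")) > MAX_BYTES:
        issues.append("file exceeds 500 KiB; content past the limit may be ignored")
    has_sitemap = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if ":" not in line:
            issues.append(f"line {number}: missing colon")
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN:
            issues.append(f"line {number}: unknown directive '{field}'")
        has_sitemap = has_sitemap or field == "sitemap"
    if not has_sitemap:
        issues.append("no Sitemap line declared")
    return issues


for issue in lint("User-agent: *\nDisalow: /tmp/\nDisallow /x\n"):
    print(issue)
# line 2: unknown directive 'disalow'
# line 3: missing colon
# no Sitemap line declared
```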

What is the crawl preview feature?

Enter a specific page path (e.g. /products/abc) and the tool simulates which crawlers are allowed or disallowed based on the robots.txt rules, helping you confirm important pages are not accidentally blocked.

Does sharing generate a standalone page?

Yes. Each domain auto-generates a static report page (e.g. /toolkit/robots-validator/report/example-com), with both English and Chinese versions indexable by search engines.

Can robots.txt block AI crawlers like ChatGPT?

Yes, for crawlers that choose to comply. The operators of GPTBot (OpenAI), ClaudeBot, and PerplexityBot state that these crawlers honor robots.txt directives, so you can block them with dedicated User-agent groups. As of Q1 2026, 25% of top 1,000 sites block GPTBot, up from 5% in early 2023.

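Does robots.txt prevent pages from being indexed?

No. robots.txt controls crawling, not indexing. Disallowed URLs can still appear in search results if other sites link to them. To prevent indexing, use a noindex meta tag or X-Robots-Tag header, and leave the page crawlable: a crawler that is disallowed from fetching the page never sees the noindex signal.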

How often should I check my robots.txt file?

At minimum after every site update, CMS change, or CDN configuration change. Best practice is a monthly review plus automated change monitoring to catch unauthorized modifications.
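
Automated change monitoring can be as simple as comparing fingerprints of the file between runs. This is a minimal sketch under the assumption that a scheduled job fetches the file and stores the previous fingerprint somewhere; fingerprint and has_changed are hypothetical helper names.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash robots.txt text, ignoring line-ending and trailing-space noise."""
    lines = text.replace("\r\n", "\n").split("\n")
    normalized = "\n".join(line.rstrip() for line in lines).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_text: str) -> bool:
    """Compare the stored fingerprint with the freshly fetched file."""
    return fingerprint(current_text) != previous_fingerprint


baseline = fingerprint("User-agent: *\nDisallow: /private/\n")
print(has_changed(baseline, "User-agent: *\r\nDisallow: /private/  \r\n"))  # False
print(has_changed(baseline, "User-agent: *\nDisallow: /\n"))                # True
```

Normalizing whitespace before hashing avoids false alerts when a CDN or editor changes only line endings.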

What happens if robots.txt is misconfigured?

Misconfigured robots.txt can cause significant SEO damage. Real cases show up to 85% page deindexing from blocking CSS/JS files, and 261% traffic recovery after fixing server-injected robots.txt contamination.

Where should I place my robots.txt file?

Always at the root of your domain: https://yourdomain.com/robots.txt. Each subdomain (blog.example.com, shop.example.com) needs its own separate robots.txt file.
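
Because the file location depends only on the scheme and host, the robots.txt URL governing any page can be derived mechanically. A small sketch using the standard library, with robots_url as a hypothetical helper name and example.com hostnames as placeholders:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain and port); drop the rest.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://blog.example.com/posts/1?ref=home"))
# https://blog.example.com/robots.txt
print(robots_url("https://shop.example.com/cart"))
# https://shop.example.com/robots.txt
```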
