robots.txt Validator

Enter a domain to fetch and validate its robots.txt file with actionable fix recommendations.

What is robots.txt — and why validate it?

robots.txt is a plain-text file at the root of your site that tells compliant crawlers which URLs they may fetch. It does not hide pages from the public internet; anyone can open https://yourdomain.com/robots.txt. Well-written rules keep compliant crawlers out of staging areas, faceted search traps, and internal tools while keeping important URLs discoverable.

Validation matters because a single typo can block your entire site (for example Disallow: /), or accidentally disallow CSS and JavaScript so search engines render empty pages. AI and classic search crawlers increasingly read the same file, so rules for GPTBot, Google-Extended, and traditional Googlebot should stay intentional and documented.

This tool fetches your live robots.txt over HTTPS, checks syntax and risky patterns, lists declared user-agents and sitemaps, and lets you simulate path-level allow/disallow decisions before you ship changes.

Glossary: core directives

User-agent
Names the crawler group the following rules apply to. * is the wildcard for “all bots” unless a more specific block matches.
Disallow
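URL path prefix that must not be fetched. Matching is prefix-based. When several rules match, RFC 9309 says the most specific (longest) match wins; some older parsers apply the first matching rule instead, so rule order can still matter.
Allow
Carves out exceptions inside a disallowed tree. Allow is part of RFC 9309 and is honored by the major search crawlers, but not every crawler implements it the same way.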
Sitemap
Optional line pointing to your XML sitemap URL(s). It does not fix crawl issues by itself but helps discovery when crawling is allowed.
Crawl-delay
Non-standard hint for seconds between requests; support varies. Prefer server rate limits and CDN controls for reliable throttling.
Google-Extended
A Google product token (not a separate crawler with its own HTTP user-agent string) that lets you opt out of certain Gemini/training use cases while still allowing standard Googlebot crawling when configured.
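
The longest-match precedence described for Disallow and Allow can be sketched in a few lines of Python. This is a simplified illustration, not how any particular crawler is implemented: the function name `is_allowed` and the rule representation are assumptions, ties are resolved in favor of Allow, and the `*` and `$` pattern characters are deliberately ignored.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Resolve Allow/Disallow by longest matching prefix (ties go to Allow).

    Hypothetical helper: `rules` is a list of (directive, path_prefix) pairs
    taken from one user-agent group. Wildcards (* and $) are not handled.
    """
    best_length = -1
    allowed = True  # no matching rule means the path may be fetched
    for directive, prefix in rules:
        if not prefix:  # an empty value matches nothing
            continue
        if path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            longer = len(prefix) > best_length
            tie_for_allow = len(prefix) == best_length and is_allow
            if longer or tie_for_allow:
                best_length = len(prefix)
                allowed = is_allow
    return allowed


rules = [("Disallow", "/private/"), ("Allow", "/private/press/")]
print(is_allowed("/private/press/kit.pdf", rules))  # True
print(is_allowed("/private/notes.txt", rules))      # False
print(is_allowed("/blog/", rules))                  # True
```

A first-match parser would give the opposite answer for /private/press/kit.pdf with the rules in this order, which is why listing exceptions before the broader Disallow is a safe habit.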

Common AI and research crawlers (User-agent)

These strings appear in real robots.txt files. Blocking is voluntary compliance — malicious bots may ignore the file; use auth, WAF, and rate limits for abuse.

Product / family | Typical User-agent | What teams use it for
OpenAI | GPTBot | Training and browsing-related crawling for ChatGPT ecosystem
Anthropic | ClaudeBot | Claude / Anthropic crawler
Google AI | Google-Extended | Opt-out surface for AI training beyond classic indexing
Perplexity | PerplexityBot | Answer engine crawling
Common Crawl | CCBot | Open web corpus used by many research stacks
Meta | FacebookBot | Sharing previews and selected AI/research pipelines
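
If you decide to opt out of some of these crawlers, the rule groups can be generated rather than typed by hand. The sketch below is illustrative only: `build_block_rules` is a hypothetical helper, the token list is copied from the table above, and whether each token belongs on your block list (FacebookBot, for instance, also drives sharing previews) is a decision for your own content strategy.

```python
# Tokens taken from the table above; trim the list to match your own policy.
AI_CRAWLER_TOKENS = [
    "GPTBot",
    "ClaudeBot",
    "Google-Extended",
    "PerplexityBot",
    "CCBot",
    "FacebookBot",
]


def build_block_rules(tokens: list[str], disallow: str = "/") -> str:
    """Return one robots.txt group per token, each disallowing `disallow`."""
    groups = [f"User-agent: {token}\nDisallow: {disallow}" for token in tokens]
    return "\n\n".join(groups) + "\n"


print(build_block_rules(AI_CRAWLER_TOKENS[:2]))
```

Emitting one group per token keeps each opt-out easy to audit and remove later; append the output to your existing file rather than replacing the `User-agent: *` group.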

Minimal good vs. dangerous bad example

Reasonable starter pattern

User-agent: *
Disallow: /wp-admin/
Disallow: /private/
Allow: /public/

Sitemap: https://www.example.com/sitemap_index.xml

Dangerous pattern

User-agent: *
Disallow: /

# Blocks every URL for every crawler that obeys robots.txt

Replace paths with your real admin, API, and checkout routes. Test after CDN or host-level redirects — some platforms serve a different robots.txt per hostname.
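
The two patterns above can be compared locally with Python's standard-library `urllib.robotparser` before anything is deployed. This is a rough check rather than a replica of any search engine: the helper name `can_fetch` is an assumption, and the standard-library parser applies the first matching rule in file order instead of the longest match (the two agree for these examples).

```python
from urllib.robotparser import RobotFileParser

STARTER = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /private/
Allow: /public/

Sitemap: https://www.example.com/sitemap_index.xml
"""

DANGEROUS = """\
User-agent: *
Disallow: /
"""


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


base = "https://www.example.com"
print(can_fetch(STARTER, "Googlebot", base + "/public/page"))     # True
print(can_fetch(STARTER, "Googlebot", base + "/private/report"))  # False
print(can_fetch(DANGEROUS, "Googlebot", base + "/public/page"))   # False
```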

How to use this validator (workflow)

  1. Enter the canonical hostname: Use the domain customers and Google Search Console use (often www or bare domain). Subdomains need their own robots.txt.
  2. Read errors and warnings first: Fix syntax and semantic issues before tuning Allow/Disallow. A site-wide block or missing sitemap line is higher priority than cosmetic comments.
  3. Run crawl preview on money paths: Check /product/, /checkout/, and key landing URLs to ensure the right user-agent groups can fetch them.
  4. Ship, then re-fetch: After deployment, validate again. CDNs and edge configs sometimes cache or inject robots.txt unexpectedly.
  5. Cross-check in Search Console: Use Google’s robots.txt report and URL inspection to confirm live behavior matches your intent.

Frequently asked questions

What does a robots.txt validator check?

It validates syntax (unknown directives, missing colons), semantic issues (site-wide Disallow: /), missing sitemaps, conflicting rules, and file size limits — with actionable fix recommendations.
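
A few of these checks are simple enough to sketch in Python. The code below is not this tool's implementation: `lint_robots`, the list of known directives, and the 500 KiB size limit are assumptions chosen for illustration, and conflicting-rule detection is left out.

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}
MAX_BYTES = 500 * 1024  # assumed size limit (500 KiB)


def lint_robots(text: str) -> list[str]:
    """Return a list of human-readable issues found in robots.txt text."""
    issues: list[str] = []
    if len(text.encode("utf-8")) > MAX_BYTES:
        issues.append("error: file is larger than 500 KiB")

    agents: list[str] = []  # user-agents of the group being read
    group_has_rules = False
    has_sitemap = False

    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if ":" not in line:
            issues.append(f"error: line {number}: missing colon")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_DIRECTIVES:
            issues.append(f"warning: line {number}: unknown directive '{field}'")
        elif field == "user-agent":
            if group_has_rules:  # a new group starts after rules were seen
                agents, group_has_rules = [], False
            agents.append(value)
        elif field == "sitemap":
            has_sitemap = True
        elif field in ("allow", "disallow"):
            group_has_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                issues.append(
                    f"error: line {number}: site-wide Disallow: / for all crawlers"
                )

    if not has_sitemap:
        issues.append("warning: no Sitemap line found")
    return issues


for issue in lint_robots("User-agent: *\nDisallow: /\nDisalow /tmp\n"):
    print(issue)
```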

What is the crawl preview feature?

Enter a specific page path (e.g. /products/abc) and the tool simulates which crawlers are allowed or disallowed based on the robots.txt rules, helping you confirm important pages are not accidentally blocked.
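
A per-crawler preview of this kind can be approximated with Python's standard-library `urllib.robotparser`. The sketch below is an assumption about how such a preview might look, not this tool's code: `crawl_preview` and the sample rules are hypothetical, and the standard-library parser uses first-match ordering, so results can differ from crawlers that apply longest-match precedence.

```python
from urllib.robotparser import RobotFileParser

SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /checkout/
"""


def crawl_preview(robots_txt: str, path: str, agents: list[str]) -> dict[str, bool]:
    """Map each user-agent to True (allowed) or False (disallowed) for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


print(crawl_preview(SAMPLE, "/products/abc", ["Googlebot", "GPTBot"]))
# {'Googlebot': True, 'GPTBot': False}
```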

Does sharing generate a standalone page?

Yes. Each domain auto-generates a static report page (e.g. /toolkit/robots-validator/report/example-com), with both English and Chinese versions indexable by search engines.

Can robots.txt block AI crawlers like ChatGPT?

Yes, for crawlers that comply. The operators of GPTBot (OpenAI), ClaudeBot, and PerplexityBot document that these crawlers honor robots.txt directives, so you can block them with specific User-agent rules. As of Q1 2026, 25% of top 1,000 sites block GPTBot, up from 5% in early 2023.

Does robots.txt prevent pages from being indexed?

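No. robots.txt controls crawling, not indexing. Disallowed URLs can still appear in search results if linked from other sites. Use noindex meta tags to prevent indexing, and leave those pages crawlable: a crawler that is disallowed from fetching a page never sees its noindex tag.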

How often should I check my robots.txt file?

At minimum, check after every site update, CMS change, or CDN configuration change. Best practice: monthly review plus automated change monitoring to catch unauthorized modifications.
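
Automated change monitoring can be as simple as comparing a stored fingerprint of the file with a fresh one. The Python sketch below shows only the comparison step; the function names are hypothetical, and fetching the live file, storing the baseline, and alerting are left to whatever scheduler you already use.

```python
import hashlib


def fingerprint(robots_txt: str) -> str:
    """Hash the file after trimming trailing whitespace and line-ending noise."""
    lines = [line.rstrip() for line in robots_txt.strip().splitlines()]
    normalized = "\n".join(lines)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_text: str) -> bool:
    """Return True when the current file differs from the stored baseline."""
    return fingerprint(current_text) != previous_fingerprint


baseline = fingerprint("User-agent: *\nDisallow: /private/\n")
print(has_changed(baseline, "User-agent: *\r\nDisallow: /private/\r\n"))  # False
print(has_changed(baseline, "User-agent: *\nDisallow: /\n"))              # True
```

Normalizing before hashing avoids false alarms when a CDN or editor only changes line endings.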

What happens if robots.txt is misconfigured?

Misconfigured robots.txt can cause significant SEO damage. Real cases show up to 85% page deindexing from blocking CSS/JS files, and 261% traffic recovery after fixing server-injected robots.txt contamination.

Where should I place my robots.txt file?

Always at the root of your domain: https://yourdomain.com/robots.txt. Each subdomain (blog.example.com, shop.example.com) needs its own separate robots.txt file.
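
Because the file is tied to the scheme and hostname, the robots.txt location for any page can be derived from its URL. A minimal Python sketch, with `robots_url` as a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain and port); drop the rest.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://blog.example.com/posts/1?ref=home"))
# https://blog.example.com/robots.txt
print(robots_url("https://shop.example.com/cart"))
# https://shop.example.com/robots.txt
```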

When should I use the robots.txt validator?

Use it before site launch, after redesigns, or when adjusting SEO strategy. It quickly catches site-wide blocks, syntax errors, missing sitemaps, and more, with actionable fix recommendations.

Tips

  • Re-validate after every robots.txt change.
  • Prioritize error-level and warning-level issues.
  • Use crawl preview to confirm important paths are properly allowed.
  • Review AI crawler rules regularly to ensure content strategy alignment.

🔍 Key Takeaways

✅ What it does

robots.txt tells compliant crawlers which URLs they may fetch, helps manage crawl budget, and lets you opt out of AI training crawlers that honor it.

❌ What it doesn't do

It cannot prevent pages from being indexed if they are linked elsewhere, and it does not replace noindex tags.

⚠️ 2026 Reality

AI crawlers now account for ~40% of all crawler traffic, yet 73% of sites have no AI crawler rules.

🛡️ Best Practice

Validate after every change, review monthly, set up automated change monitoring.