What is robots.txt — and why validate it?
robots.txt is a plain-text file at the root of your site that tells compliant crawlers which URLs they may fetch. It does not hide pages from the public internet; anyone can open https://yourdomain.com/robots.txt, and it is not access control. Well-written rules keep compliant crawlers out of staging areas, faceted search traps, and internal tools while keeping important URLs discoverable.
Validation matters because a single typo can block your entire site (for example Disallow: /), or accidentally disallow CSS and JavaScript so search engines render empty pages. AI and classic search crawlers increasingly read the same file, so rules for GPTBot, Google-Extended, and traditional Googlebot should stay intentional and documented.
This tool fetches your live robots.txt over HTTPS, checks syntax and risky patterns, lists declared user-agents and sitemaps, and lets you simulate path-level allow/disallow decisions before you ship changes.
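As an illustration of the "risky patterns" idea, here is a minimal Python sketch that flags two of the problems described above: a site-wide `Disallow: /` and rules that block CSS or JavaScript. The function name `find_risky_rules` and its warning messages are hypothetical and are not this tool's implementation; the grouping logic is simplified and ignores most edge cases.

```python
def find_risky_rules(robots_txt: str) -> list[str]:
    """Return warnings for site-wide blocks and blocked CSS/JS rules."""
    warnings = []
    agents: list[str] = []   # user-agents of the group currently being read
    in_rules = False         # True once the current group has at least one rule
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # blank or malformed line; syntax checks are out of scope here
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field == "disallow":
            in_rules = True
            who = ", ".join(agents) or "(no user-agent)"
            if value == "/":
                warnings.append(f"line {number}: site-wide block for {who}")
            elif value.lower().endswith((".css", ".js", ".css$", ".js$")):
                warnings.append(f"line {number}: blocks render assets for {who}")
        elif field == "allow":
            in_rules = True
    return warnings


sample = "User-agent: *\nDisallow: /\n\nUser-agent: Googlebot\nDisallow: /*.css$\n"
for warning in find_risky_rules(sample):
    print(warning)
```

A real checker would also need to consider Allow rules that partly undo a block, so treat these warnings as prompts for review rather than verdicts.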
Glossary: core directives
- User-agent
- Names the crawler group the following rules apply to. * is the wildcard for “all bots” unless a more specific block matches.
- Disallow
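- URL path prefix that must not be fetched. Matching is prefix-based. When several rules match, RFC 9309 says the most specific (longest) match wins; some older parsers instead apply the first matching rule in file order, so order can still matter.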
- Allow
- Carves out exceptions inside a disallowed tree. Allow is part of RFC 9309 and is supported by the major search crawlers, but not every crawler implements it the same way.
- Sitemap
- Optional line pointing to your XML sitemap URL(s). It does not fix crawl issues by itself but helps discovery when crawling is allowed.
- Crawl-delay
- Non-standard hint for seconds between requests; support varies. Prefer server rate limits and CDN controls for reliable throttling.
- Google-Extended
- A Google product token rather than a separate crawler: it has no HTTP user-agent string of its own. Disallowing it opts your content out of certain Gemini/training use cases while standard Googlebot crawling continues.
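The interaction between Allow and Disallow is easier to see in code. The following Python sketch applies longest-match resolution to the rules of a single user-agent group. It assumes plain path prefixes only (no `*` or `$` wildcards), and the function name `is_allowed` is hypothetical.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Longest-match resolution for one user-agent group (plain prefixes only)."""
    best_len = -1
    allowed = True  # no matching rule means the path may be fetched
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        is_allow = directive.lower() == "allow"
        # The longer prefix wins; on a tie, Allow wins over Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/private/"), ("Allow", "/private/press/")]
print(is_allowed("/private/press/kit.pdf", rules))  # True
print(is_allowed("/private/notes.txt", rules))      # False
print(is_allowed("/blog/", rules))                  # True
```

Because the longest prefix decides the outcome, the order of the two rules in the list does not change the result here; a first-match parser could behave differently.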
Common AI and research crawlers (User-agent)
These strings appear in real robots.txt files. Compliance is voluntary: malicious bots may ignore the file, so use authentication, a WAF, and rate limits against abuse.
| Product / family | Typical User-agent | What teams use it for |
|---|---|---|
| OpenAI | GPTBot | Crawling for model training (OpenAI documents separate tokens, such as ChatGPT-User and OAI-SearchBot, for user-initiated browsing and search) |
| Anthropic | ClaudeBot | Claude / Anthropic crawler |
| Google AI | Google-Extended | Opt-out surface for AI training beyond classic indexing |
| Perplexity | PerplexityBot | Answer engine crawling |
| Common Crawl | CCBot | Open web corpus used by many research stacks |
| Meta | FacebookBot | Sharing previews and selected AI/research pipelines |
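To see how per-crawler groups play out, the sketch below uses Python's standard-library `urllib.robotparser` to check one path against several of the user-agents in the table. The robots.txt content and the helper name `access_report` are made up for illustration. Note that this parser applies rules in file order rather than by longest match, so its answers can differ from a given crawler's for files with overlapping Allow and Disallow rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block two AI crawlers, restrict everyone else lightly.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def access_report(path: str, agents: list[str]) -> dict[str, bool]:
    """Map each user-agent to whether it may fetch the given path."""
    return {agent: parser.can_fetch(agent, path) for agent in agents}


report = access_report("/blog/post", ["GPTBot", "CCBot", "ClaudeBot", "Googlebot"])
print(report)
# {'GPTBot': False, 'CCBot': False, 'ClaudeBot': True, 'Googlebot': True}
```

ClaudeBot and Googlebot have no dedicated group in this sample, so they fall back to the `*` group, which only restricts `/private/`.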
Minimal good vs. dangerous bad example
Reasonable starter pattern
    User-agent: *
    Disallow: /wp-admin/
    Disallow: /private/
    Allow: /public/
    Sitemap: https://www.example.com/sitemap_index.xml
Dangerous pattern
    User-agent: *
    Disallow: /  # Blocks every URL for every crawler that obeys robots.txt
Replace paths with your real admin, API, and checkout routes. Test after CDN or host-level redirects — some platforms serve a different robots.txt per hostname.
How to use this validator (workflow)
- Enter the canonical hostname — Use the domain customers and Google Search Console use (often www or bare domain). Subdomains need their own robots.txt.
- Read errors and warnings first — Fix syntax and semantic issues before tuning Allow/Disallow. A site-wide block or missing sitemap line is higher priority than cosmetic comments.
- Run crawl preview on money paths — Check /product/, /checkout/, and key landing URLs to ensure the right user-agent groups can fetch them.
- Ship, then re-fetch — After deployment, validate again. CDNs and edge configs sometimes cache or inject robots.txt unexpectedly.
- Cross-check in Search Console — Use Google’s robots.txt report and URL inspection to confirm live behavior matches your intent.
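The "re-fetch and check money paths" steps can be scripted as a post-deploy check. This is a rough sketch using only the Python standard library; `fetch_robots`, `blocked_paths`, and the `MONEY_PATHS` list are hypothetical names, and `urllib.robotparser` may resolve overlapping rules differently from a specific crawler.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

MONEY_PATHS = ["/product/", "/checkout/", "/"]  # assumed example paths


def fetch_robots(host: str, timeout: float = 10.0) -> str:
    """Download https://<host>/robots.txt and return it as text."""
    with urlopen(f"https://{host}/robots.txt", timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def blocked_paths(robots_txt: str, agent: str, paths: list[str]) -> list[str]:
    """Return the subset of paths that the given user-agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]
```

After a deployment, something like `blocked_paths(fetch_robots("www.example.com"), "Googlebot", MONEY_PATHS)` should return an empty list if the paths you care about are crawlable; a non-empty result is a reason to stop and inspect the live file.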
Frequently asked questions
What does a robots.txt validator check?
It validates syntax (unknown directives, missing colons), semantic issues (site-wide Disallow: /), missing sitemaps, conflicting rules, and file size limits — with actionable fix recommendations.
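For a sense of what the syntax checks involve, here is a small Python lint sketch covering unknown directives, missing colons, a missing Sitemap line, and file size. It is an assumption-laden illustration, not this validator's code: `lint`, `KNOWN_FIELDS`, and the messages are hypothetical, and the 500 KiB figure is the parsing limit Google documents for its own crawler.

```python
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}
MAX_BYTES = 500 * 1024  # size limit documented by Google; other crawlers may differ


def lint(robots_txt: str) -> list[str]:
    """Return a list of syntax and structure issues found in the file."""
    issues = []
    if len(robots_txt.encode("utf-8")) > MAX_BYTES:
        issues.append("file exceeds 500 KiB; content past the limit may be ignored")
    seen_agent = False
    has_sitemap = False
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            issues.append(f"line {number}: missing colon")
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            issues.append(f"line {number}: unknown directive '{field}'")
        elif field == "user-agent":
            seen_agent = True
        elif field == "sitemap":
            has_sitemap = True
        elif field in ("allow", "disallow") and not seen_agent:
            issues.append(f"line {number}: rule before any User-agent line")
    if not has_sitemap:
        issues.append("no Sitemap line found")
    return issues


for issue in lint("User-agent: *\nDisalow: /tmp/\nDisallow /x\n"):
    print(issue)
```

The misspelled `Disalow` is the kind of typo a crawler silently skips, which is why it is worth reporting explicitly.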
What is the crawl preview feature?
Enter a specific page path (e.g. /products/abc) and the tool simulates which crawlers are allowed or disallowed based on the robots.txt rules, helping you confirm important pages are not accidentally blocked.
Does sharing generate a standalone page?
Yes. Each domain auto-generates a static report page (e.g. /toolkit/robots-validator/report/example-com), with both English and Chinese versions indexable by search engines.
Can robots.txt block AI crawlers like ChatGPT?
Yes, for crawlers that comply. GPTBot (OpenAI), ClaudeBot, and PerplexityBot are documented by their operators as respecting robots.txt, and you can block them with specific User-agent rules. As of Q1 2026, 25% of top 1,000 sites block GPTBot, up from 5% in early 2023.
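Does robots.txt prevent pages from being indexed?
No. Robots.txt controls crawling, not indexing. Disallowed URLs can still appear in search results if linked from other sites. To prevent indexing, use a noindex meta tag or an X-Robots-Tag response header, and leave the URL crawlable so search engines can actually see that directive.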
How often should I check my robots.txt file?
At minimum after every site update, CMS change, or CDN configuration change. Best practice: monthly review plus automated change monitoring to catch unauthorized modifications.
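Automated change monitoring can be as simple as comparing a fingerprint of the live file against a stored baseline. The Python sketch below shows one way to do that with a SHA-256 hash; the names `fingerprint` and `has_changed` are hypothetical, and how you fetch the file, store the baseline, and send alerts is left out.

```python
import hashlib


def fingerprint(robots_txt: str) -> str:
    """Hash the file after normalizing line endings and trailing whitespace."""
    normalized = "\n".join(line.rstrip() for line in robots_txt.splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_text: str) -> bool:
    """True when the current file differs from the stored baseline."""
    return fingerprint(current_text) != previous_fingerprint


baseline = fingerprint("User-agent: *\nDisallow: /private/\n")
print(has_changed(baseline, "User-agent: *\r\nDisallow: /private/  \r\n"))  # False
print(has_changed(baseline, "User-agent: *\nDisallow: /\n"))                # True
```

Normalizing before hashing avoids false alarms when a CDN or editor only changes line endings, while a real rule change still produces a different fingerprint.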
What happens if robots.txt is misconfigured?
Misconfigured robots.txt can cause significant SEO damage. Real cases show up to 85% page deindexing from blocking CSS/JS files, and 261% traffic recovery after fixing server-injected robots.txt contamination.
Where should I place my robots.txt file?
Always at the root of your domain: https://yourdomain.com/robots.txt. Each subdomain (blog.example.com, shop.example.com) needs its own separate robots.txt file.
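If you need to work out which robots.txt governs a given page, the rule above translates into a few lines of Python. This sketch assumes an absolute URL as input; `robots_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to an absolute page URL."""
    parts = urlsplit(page_url)
    # Scheme, host, and port are kept; path, query, and fragment are dropped.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url("https://blog.example.com/posts/1?ref=x"))
# https://blog.example.com/robots.txt
print(robots_url("https://shop.example.com/cart"))
# https://shop.example.com/robots.txt
```

The two example URLs resolve to different files, which is the practical meaning of "each subdomain needs its own robots.txt".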