1. What is robots.txt
robots.txt is a plain-text rules file at your site root (https://yourdomain.com/robots.txt) that tells crawlers which paths they may fetch and which they may not.
Key boundary: it controls crawling, not indexing. A URL that is disallowed can still appear in search results if other sites link to it.
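Google states clearly: to keep a page out of search results, use noindex or access controls, not robots.txt alone. A noindex rule only works if the page stays crawlable, because a crawler has to fetch the page to see it. See Google Search Central: how robots.txt works and its limits.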
2. Where the file lives and how it takes effect
- The filename must be lowercase: robots.txt (case-sensitive).
- It must sit at the domain root: https://yourdomain.com/robots.txt.
- Each subdomain needs its own file (e.g. blog.example.com).
- Search engines cache the file (Google generally for up to 24 hours), so changes can take up to a day to propagate.
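As a small illustration of the location rules above, the sketch below derives the robots.txt URL that governs any given page. The helper name `robots_url_for` is hypothetical, and the page URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain and port);
    # drop the page's path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("https://yourdomain.com/shop/item?id=7"))
# https://yourdomain.com/robots.txt
print(robots_url_for("https://blog.example.com/2024/post"))
# https://blog.example.com/robots.txt
```

The second call shows why a subdomain needs its own file: its robots.txt URL is different from the main domain's.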
Minimal template (three lines to go live)
User-agent: *
Disallow:
Sitemap: https://yourdomain.com/sitemap.xml
This allows all crawlers and points to your sitemap.
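One way to sanity-check the minimal template is to feed it to Python's standard-library parser, `urllib.robotparser`. This is only a sketch: the bot name `AnyBot` and the page URL are made up for illustration, and `site_maps()` assumes Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

MINIMAL = """\
User-agent: *
Disallow:
Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(MINIMAL.splitlines())

# An empty Disallow value blocks nothing, so any path is fetchable.
allowed = parser.can_fetch("AnyBot", "https://yourdomain.com/any/page")
sitemaps = parser.site_maps()

print(allowed)   # True
print(sitemaps)  # ['https://yourdomain.com/sitemap.xml']
```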
3. Core directives (five you should know)
| Directive | Role | Example |
|---|---|---|
| User-agent | Which crawler the rules apply to | User-agent: Googlebot |
| Disallow | Paths crawlers must not fetch | Disallow: /admin/ |
| Allow | Explicit allow within a broader disallow | Allow: /admin/public/ |
| Sitemap | Declares sitemap URL(s) | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Delay between requests (limited support) | Crawl-delay: 10 |
- Case-sensitive: /Photo is not the same as /photo.
- Paths are prefix-based: Disallow: /Photo can also block /Photography/.
- Google ignores Crawl-delay; use crawl rate settings in Google Search Console where applicable.
- Wildcards * and end-anchors $ are widely supported (not part of the original spec).
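The case-sensitivity and prefix rules can be observed with Python's `urllib.robotparser`. This is a sketch using the /Photo example from the list above; the host `example.com` and the helper `blocked` are placeholders.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /Photo"])


def blocked(path: str) -> bool:
    """True if the rule set above forbids fetching this path."""
    return not parser.can_fetch("*", "https://example.com" + path)


results = {p: blocked(p) for p in ["/Photo", "/Photography/", "/photo", "/images/Photo"]}
print(results)
# {'/Photo': True, '/Photography/': True, '/photo': False, '/images/Photo': False}
```

/Photography/ is blocked because it starts with /Photo, while /photo and /images/Photo are not: matching is case-sensitive and anchored at the start of the path.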
4. Three copy-paste templates
Template A: content site (good default)
User-agent: *
Allow: /
Disallow: /search
Disallow: /wp-admin/
Sitemap: https://yourdomain.com/sitemap.xml
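Template A mixes a broad Allow with narrower Disallow lines, so it helps to see how overlaps resolve. The sketch below assumes the most-specific-match precedence described by Google and RFC 9309 (longest matching path wins, Allow wins ties) and handles plain prefixes only, no wildcards. The function name `decide` and the sample paths are illustrative.

```python
def decide(rules: list[tuple[str, str]], path: str) -> tuple[bool, str]:
    """Return (allowed, winning_rule) using most-specific-match precedence."""
    best = None  # (pattern length, is_allow, directive, pattern)
    for directive, pattern in rules:
        if pattern and path.startswith(pattern):
            candidate = (len(pattern), directive == "Allow", directive, pattern)
            # Longer pattern wins; on equal length, Allow (True) beats Disallow.
            if best is None or candidate[:2] > best[:2]:
                best = candidate
    if best is None:
        return True, "(no matching rule)"
    return best[1], f"{best[2]}: {best[3]}"


TEMPLATE_A = [("Allow", "/"), ("Disallow", "/search"), ("Disallow", "/wp-admin/")]

print(decide(TEMPLATE_A, "/blog/hello"))            # (True, 'Allow: /')
print(decide(TEMPLATE_A, "/search?q=shoes"))        # (False, 'Disallow: /search')
print(decide(TEMPLATE_A, "/wp-admin/options.php"))  # (False, 'Disallow: /wp-admin/')
print(decide(TEMPLATE_A, "/wp-admin"))              # (True, 'Allow: /')
```

The last call shows the trailing-slash effect: /wp-admin without the slash does not start with /wp-admin/, so only Allow: / matches.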
Template B: ecommerce (cut parameter churn)
User-agent: *
Allow: /
Disallow: /cart/
Disallow: /checkout/
Disallow: /*?sort=
Disallow: /*?filter=
Sitemap: https://yourdomain.com/sitemap.xml
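To see which URLs the wildcard lines in Template B would catch, here is a rough matcher that treats * as "any run of characters" and a trailing $ as "end of URL". It is a simplified sketch, not a full robots.txt implementation; the names `to_regex` and `is_blocked` and the sample URLs are made up. Because every Disallow here is longer than Allow: /, checking the Disallow patterns alone gives the same answer as longest-match precedence.

```python
import re


def to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored-at-start regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal pieces and join them with ".*" where the pattern had "*".
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


DISALLOW = ["/cart/", "/checkout/", "/*?sort=", "/*?filter="]


def is_blocked(path_and_query: str) -> bool:
    """True if any Disallow pattern matches from the start of the URL path."""
    return any(to_regex(p).match(path_and_query) for p in DISALLOW)


print(is_blocked("/cart/"))                    # True
print(is_blocked("/shoes?sort=price"))         # True
print(is_blocked("/shoes"))                    # False
print(is_blocked("/shoes?page=2&sort=price"))  # False
```

The last result is worth noting: /*?sort= only matches when sort is the first query parameter. If that matters for your site, an extra pattern such as /*&sort= would be one way to cover it.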
Template C: AI crawler policy (layer by intent)
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Googlebot
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
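How to decide: if you want visibility in AI-generated answers, you may allow retrieval-oriented bots (e.g. PerplexityBot); to protect training data, tighten rules for training-focused crawlers (e.g. GPTBot, CCBot). Note that Google-Extended is a control token rather than a separate crawler: it governs whether Google may use your content for its Gemini models and does not affect Google Search.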
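A quick way to confirm that Template C does what you intend per bot is to parse it and ask about each user agent. This sketch uses Python's `urllib.robotparser`; the article URL is a placeholder, and PerplexityBot stands in for any bot without its own group.

```python
from urllib.robotparser import RobotFileParser

TEMPLATE_C = """\
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Googlebot
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(TEMPLATE_C.splitlines())

bots = ["GPTBot", "CCBot", "Googlebot", "PerplexityBot"]
policy = {bot: parser.can_fetch(bot, "https://yourdomain.com/article") for bot in bots}
print(policy)
# {'GPTBot': False, 'CCBot': False, 'Googlebot': True, 'PerplexityBot': True}
```

A bot with its own group follows that group; any other bot falls back to the User-agent: * group.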
5. Pre-launch checks
Validate your robots.txt before you ship: catch syntax issues, conflicting rules, and accidental blocks.
- Can critical pages be crawled?
- Does the sitemap return HTTP 200?
- Are CSS/JS assets unblocked?
- After launch, watch crawl/index trends for 7 to 14 days.
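The first and third checks above can be scripted against a draft file before it ships. The sketch below is one possible approach using Python's `urllib.robotparser`; the helper `blocked_paths`, the draft rules, and the path lists are all hypothetical. Two caveats: this parser applies the first matching rule in file order, so results can differ from Google when Allow and Disallow overlap, and the sitemap HTTP 200 check needs a network request that is not shown here.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt: str, paths: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the subset of paths that the given agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, "https://yourdomain.com" + p)]


# Hypothetical draft that accidentally blocks a render-asset directory.
DRAFT = """\
User-agent: *
Disallow: /search
Disallow: /_next/
"""

critical = ["/", "/pricing", "/blog/launch-post"]
assets = ["/_next/static/app.js", "/static/site.css"]

print(blocked_paths(DRAFT, critical))  # []
print(blocked_paths(DRAFT, assets))    # ['/_next/static/app.js']
```

An empty list for critical pages and for assets is the result you want before launch.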
6. Real cases: when robots.txt goes wrong
Case 1: third-party rules drift ("slow leak" traffic loss)
Search Engine Land covered a case where CMS or vendor changes to robots rules, plus case sensitivity, caused important URLs to slip from the index over weeks or months, not overnight. Source: Search Engine Land write-up.
Case 2: polluted robots response from CDN (strong recovery after cleanup)
After server/CDN injection garbled robots.txt, impressions and indexed URLs tanked; within about three weeks of a clean file and cache purge, indexed URLs rose ~260%, impressions ~261%, and CTR moved from 0.4% to 1.2%. Source: robots.txt error fix case study.
7. Screenshots: robots.txt from well-known sites
8. Seven common mistakes and fixes
- Shipping Disallow: / to production: blocks the whole site. Fix: validate in a tool before deploy.
- Blocking render assets (e.g. /_next/): Google can't render the page. Fix: allow JS/CSS paths.
- Disallowing sitemap URLs: slows discovery. Fix: keep sitemap crawlable.
- Using robots.txt as "noindex": disallowed URLs can still be indexed via external links. Fix: use noindex or auth.
- Wrong case or trailing slash: /admin is not the same as /admin/. Fix: match the paths you intend.
- Blocking whole directories by mistake: e.g. Disallow: /blog/ stops all posts. Fix: narrow the path.
- No change monitoring: silent edits cause slow index loss. Fix: diffs or alerts.
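For the last item, change monitoring can be as simple as diffing the live file against the last known-good copy. The sketch below uses Python's `difflib`; the function names and the sample file contents are illustrative, and how you fetch and store the two versions is left open.

```python
import difflib


def rule_changes(old: str, new: str) -> list[str]:
    """Return added ('+ ') and removed ('- ') lines between two versions."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def adds_site_wide_block(changes: list[str]) -> bool:
    """True if any added line is a bare 'Disallow: /'."""
    for line in changes:
        if line.startswith("+ "):
            normalized = line[2:].strip().lower().replace(" ", "")
            if normalized == "disallow:/":
                return True
    return False


KNOWN_GOOD = "User-agent: *\nDisallow: /search\n"
LIVE = "User-agent: *\nDisallow: /\n"

changes = rule_changes(KNOWN_GOOD, LIVE)
print(changes)
print(adds_site_wide_block(changes))  # True
```

Running a check like this on a schedule, and alerting when `changes` is non-empty, is one way to catch silent edits early.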
9. FAQ
Q1: Is robots.txt legally binding?
Usually no; it's a voluntary convention. Major search crawlers respect it; bad actors may not.
Q2: Is robots.txt still relevant?
Yes, especially with AI crawlers. It's the first layer of crawl governance; pair it with noindex and access control.
Q3: What does "Blocked by robots.txt" mean?
The URL is not to be fetched per your rules. It may still appear in search if linked elsewhere.
Q4: What belongs in a minimal robots.txt?
At least User-agent rules, any needed Disallow, and a correct Sitemap line.
Q5: How do I fix "blocked by robots.txt" errors?
Find the blocking rule → adjust scope → re-test → confirm in Search Console → watch trends for 7 to 14 days.
Q6: Can robots.txt create a security hole?
Not by itself, but it can hint at URL patterns. Do not rely on it for secrecy; use authentication.
Q7: What should I put in the file?
Only crawler directives (User-agent, Disallow, Allow, Sitemap), not HTML, marketing copy, or sensitive paths as "security".
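The "only crawler directives" rule from Q7 can be checked mechanically, which would also surface the kind of injected content described in Case 2. The sketch below flags any non-empty, non-comment line that is not a known directive. The directive list is an assumption (it includes Crawl-delay from the table above), so treat hits as lines to review, not as definite errors.

```python
import re

KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}
DIRECTIVE_LINE = re.compile(r"^\s*([A-Za-z-]+)\s*:\s*(.*)$")


def suspicious_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, raw_line) for lines that are not known directives."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if not line:
            continue  # blank or comment-only line
        match = DIRECTIVE_LINE.match(line)
        if not match or match.group(1).lower() not in KNOWN_DIRECTIVES:
            problems.append((number, raw))
    return problems


SAMPLE = """\
# robots.txt for example.com
User-agent: *
Disallow: /admin/
<script>alert(1)</script>
Sitemap: https://example.com/sitemap.xml
"""

print(suspicious_lines(SAMPLE))
# [(4, '<script>alert(1)</script>')]
```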
Reference links
- Google: robots.txt scope and limits
- Moz: robots.txt syntax and best practices
- Cloudflare: robots files and bot behavior
- Yoast: in-depth robots.txt guide
- Search Engine Land: third-party rule drift case
- Case study: recovery after robots.txt errors