Google’s crawlers follow the Robots Exclusion Protocol (REP) to check which parts of a website they can crawl.
📌 What robots.txt Does #
- It tells crawlers (like Googlebot) which parts of the site they can or cannot crawl.
Example:
User-agent: *
Disallow: /private/
User-agent: Googlebot
Allow: /includes/
Sitemap: https://example.com/sitemap.xml
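You can sanity-check a rule set like this with Python's built-in robotparser module. This is only a quick sketch: the standard-library parser implements the classic REP and may not reproduce Google's exact precedence logic in every edge case, and the crawler name MyBot is just a placeholder.

```python
from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Allow: /includes/

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A generic crawler falls under the "*" group, so /private/ is off limits.
print(rp.can_fetch("MyBot", "https://example.com/private/report.html"))      # expect False
print(rp.can_fetch("MyBot", "https://example.com/public/page.html"))         # expect True

# Googlebot has its own group (groups are not combined), so the "*" rules
# do not apply to it here.
print(rp.can_fetch("Googlebot", "https://example.com/private/report.html"))  # expect True
```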
📍 File Location Rules #
- Must be placed in the root directory of the site:
  - https://example.com/robots.txt ✅
  - https://example.com/pages/robots.txt ❌
- Applies only to that specific host, protocol, and port.
  - https://example.com/robots.txt → applies to https://example.com/ only.
  - https://m.example.com/robots.txt → separate rules for the mobile subdomain.
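As a quick illustration of this scoping, the sketch below derives the robots.txt URL a crawler would fetch for any given page URL; the helper name robots_url is just for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the page's exact scheme, host, and port."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/pages/article.html"))    # https://example.com/robots.txt
print(robots_url("https://m.example.com/pages/article.html"))  # https://m.example.com/robots.txt
print(robots_url("https://example.com:8181/shop/item"))        # https://example.com:8181/robots.txt
```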
📏 Google’s Key Interpretations #
- Case-Sensitive URL
  - /Private/ ≠ /private/ (URLs are treated as exact matches).
- Allow vs Disallow Priority
  - Google applies the most specific rule to a path.
  Example:
  Disallow: /includes/
  Allow: /includes/css/
  → CSS inside /includes/css/ will still be crawled.
- Comments (#) Are Ignored
  - Use # for explanations in the file (comments don’t affect crawling).
- Google Requires Resources
  - Don’t block important .css or .js files if they are needed to render your pages properly (see the sketch after this list).
- Sitemap Line Is Optional
  - Adding Sitemap: https://example.com/sitemap.xml helps Google discover pages faster.
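Here is the sketch mentioned above for the resources point: a small pre-flight check with Python's standard robotparser that warns you when render-critical files are blocked. The /assets/ path and the resource URLs are hypothetical examples, not anything from a real site.

```python
from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /assets/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Hypothetical render-critical resources to verify.
critical_resources = [
    "https://example.com/assets/site.css",
    "https://example.com/assets/app.js",
]

for url in critical_resources:
    if not rp.can_fetch("Googlebot", url):
        print(f"WARNING: {url} is blocked, so Google may not render the page correctly")
```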
⚠️ What robots.txt Does NOT Do #
- ❌ Does not hide content from search results
  - Blocked URLs can still appear in search results if other sites link to them.
  - Use a noindex meta tag or password protection to fully hide content.
- ❌ Not a security measure → the file is public, and anyone can read your robots.txt.
💡 FSIDM Quick Tip for Students & Site Owners:
Think of robots.txt as a traffic cop — it directs crawlers, but it doesn’t lock doors. If you need real privacy, use authentication or noindex.
📌 Valid robots.txt URL Rules (Google’s View) #
The robots.txt file only applies to the exact protocol, domain/subdomain, and port it’s hosted on.
| Robots.txt Location | Valid For | Not Valid For |
| --- | --- | --- |
| https://example.com/robots.txt | https://example.com/ | https://other.example.com/, http://example.com/, https://example.com:8181/ |
| https://www.example.com/robots.txt | https://www.example.com/ | https://example.com/, https://shop.www.example.com/ |
| https://example.com/folder/robots.txt | ❌ Nothing (crawlers don’t check subdirectories) | All URLs |
| https://www.exämple.com/robots.txt | https://www.exämple.com/, https://xn--exmple-cua.com/ | https://www.example.com/ |
| ftp://example.com/robots.txt | ftp://example.com/ | https://example.com/ |
| https://212.96.82.21/robots.txt | https://212.96.82.21/ | https://example.com/ |
| https://example.com:443/robots.txt | https://example.com/, https://example.com:443/ | https://example.com:444/ |
| https://example.com:8181/robots.txt | https://example.com:8181/ | https://example.com/ |
💡 FSIDM Tip: Each subdomain needs its own robots.txt if you want to control crawling separately.
⚡ Handling HTTP Status Codes for robots.txt #
Google treats robots.txt responses differently based on status code:
| HTTP Code | Google’s Behavior |
| --- | --- |
| 2xx (Success) | Reads and applies the rules normally. |
| 3xx (Redirects) | Follows up to 5 redirect hops, then treats the file as a 404. |
| 4xx (Client Errors) | Treated as if no robots.txt exists (full crawl allowed), except 429, which is handled like a server error. |
| 5xx (Server Errors) | First 12 hours: stops crawling the site. Next 30 days: uses the cached version if available. After 30 days: if the site is reachable, crawls as if there is no robots.txt. |
| DNS/Network Errors | Treated like 5xx errors. |
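A rough sketch of how a fetcher might map these status codes to a crawl decision. The fetch_robots_policy helper and its return labels are my own shorthand, not Google's implementation, and urllib is left to handle redirects, so the 5-hop limit from the table isn't enforced here.

```python
import urllib.error
import urllib.request

def fetch_robots_policy(robots_url: str) -> str:
    try:
        with urllib.request.urlopen(robots_url, timeout=10) as resp:
            rules = resp.read(500 * 1024)   # content past 500 KiB is ignored anyway
            return "apply-rules"            # 2xx: parse `rules` and apply them normally
    except urllib.error.HTTPError as err:
        if err.code == 429 or err.code >= 500:
            return "pause-crawling"         # treated like a server error: hold off crawling
        return "crawl-all"                  # other 4xx: behave as if no robots.txt exists
    except urllib.error.URLError:
        return "pause-crawling"             # DNS/network errors: treated like 5xx

print(fetch_robots_policy("https://example.com/robots.txt"))
```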
⏱ Caching Rules #
- Google caches robots.txt for up to 24 hours.
- May keep using the cached version longer if a fresh copy can’t be fetched (for example, during server errors).
- You can request a recrawl of an updated file via the robots.txt report in Search Console.
📏 Format & Size Rules #
- Must be UTF-8 encoded plain text.
- Max size: 500 KiB (content after that is ignored).
- Invalid lines (e.g., HTML, bad encoding) are ignored.
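A small pre-upload check along these lines; the check_robots_file helper is just an illustration of the two limits above.

```python
from pathlib import Path

MAX_BYTES = 500 * 1024  # 500 KiB; rules past this limit are ignored

def check_robots_file(path: str) -> None:
    data = Path(path).read_bytes()
    if len(data) > MAX_BYTES:
        print(f"WARNING: file is {len(data)} bytes; content past 500 KiB will be ignored")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        print(f"WARNING: file is not valid UTF-8: {err}")

check_robots_file("robots.txt")
```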
💡 FSIDM Takeaway for Students & SEO Managers:
👉 Correct location, format, and encoding are just as important as the rules themselves.
👉 Always test robots.txt in Search Console after uploading to avoid indexing issues.
🛠 Robots.txt Syntax Basics #
- Format:
<field>:<value> # optional comment
- Spaces are optional but recommended for readability.
- # starts a comment (ignored by crawlers).
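A simplified reader that shows the field:value format and comment handling; it is only a sketch, not a complete robots.txt parser.

```python
def parse_lines(text: str):
    """Yield (field, value) pairs, skipping comments, blanks, and invalid lines."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # everything after '#' is a comment
        if not line or ":" not in line:
            continue                          # blank or invalid line: ignored
        field, value = line.split(":", 1)
        yield field.strip().lower(), value.strip()

sample = """\
User-agent: Googlebot   # applies to Googlebot only
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

for field, value in parse_lines(sample):
    print(field, "->", value)
```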
📌 Supported Fields (Google) #
| Field | Purpose | Example |
| --- | --- | --- |
| user-agent | Specifies the crawler the rules apply to | User-agent: Googlebot |
| disallow | Path not allowed to crawl | Disallow: /private/ |
| allow | Path allowed to crawl (overrides disallow) | Allow: /private/public-page.html |
| sitemap | Location of sitemap(s) | Sitemap: https://example.com/sitemap.xml |
❌ Fields like crawl-delay are not supported by Google.
🔍 Path Rules #
- Paths are case-sensitive. (/File.asp ≠ /file.asp)
- Paths must start with / (root relative).
- Wildcards supported:
  - * matches any sequence of characters.
  - $ matches the end of the URL.
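To see how these wildcards behave, here is a rough translation of a robots.txt path pattern into a regular expression. It is a simplification for illustration, not Google's matcher.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a robots.txt path pattern (with * and $) into an anchored regex."""
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    body = re.escape(pattern).replace(r"\*", ".*")   # '*' matches any characters
    return re.compile("^" + body + ("$" if anchored_end else ""))

fish = pattern_to_regex("/fish")
print(bool(fish.match("/fish/salmon.html")))   # True: prefix match
print(bool(fish.match("/Fish.asp")))           # False: matching is case-sensitive

php_end = pattern_to_regex("/*.php$")
print(bool(php_end.match("/folder/file.php")))   # True: ends with .php
print(bool(php_end.match("/file.php?param=1")))  # False: does not end with .php
```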
🧠 User-Agent Selection Logic #
Google picks the most specific group for the crawler:
- Example:
User-agent: googlebot-news # Group 1
User-agent: * # Group 2
User-agent: googlebot # Group 3
- Googlebot-News → Group 1
- Googlebot → Group 3
- Other Google bots → Group 2
📌 Order in file doesn’t matter. Google groups all relevant rules for a user agent internally.
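A sketch of that selection logic, assuming simple prefix matching on lowercased tokens; the select_group helper is my own simplification of Google's documented behavior, not its actual code.

```python
def select_group(crawler: str, group_tokens: list[str]) -> str:
    """Pick the longest group token that the crawler name starts with, else '*'."""
    crawler = crawler.lower()
    candidates = [t for t in group_tokens
                  if t != "*" and crawler.startswith(t.lower())]
    return max(candidates, key=len) if candidates else "*"

groups = ["googlebot-news", "*", "googlebot"]
print(select_group("Googlebot-News", groups))  # googlebot-news (Group 1)
print(select_group("Googlebot", groups))       # googlebot      (Group 3)
print(select_group("AdsBot-Google", groups))   # *              (Group 2)
```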
📂 Grouping Rules #
Multiple user-agents can share rules:
User-agent: e
User-agent: f
Disallow: /g
→ Both the e and f crawlers follow the Disallow: /g rule.
📜 Example of Correct Syntax #
# Block all bots from /private/
User-agent: *
Disallow: /private/
# Allow Googlebot access to /private/reports/
User-agent: Googlebot
Allow: /private/reports/
# Add sitemap location
Sitemap: https://example.com/sitemap.xml
💡 FSIDM Practical Tip for Students
- Test in Search Console after every change.
- Keep rules minimal → complex rules increase errors.
- Always combine Disallow + Allow smartly for sections.
🚦 URL Matching Based on Path Values in robots.txt #
Google uses the path part of a URL (after domain name) to decide if a robots.txt rule applies. It compares this path to the allow and disallow rules.
🎯 Key Wildcards Supported: #
| Wildcard | Meaning | Example Match |
| --- | --- | --- |
| * | Matches 0 or more characters | /fish* matches /fish.html, /fishheads, etc. |
| $ | Matches the end of the URL | /*.php$ matches /index.php but not /index.php?x=1 |
📌 Examples of Matching Rules #
| Rule | Matches | Doesn’t Match |
| --- | --- | --- |
| / | The root and everything below it (the whole site) | — |
| /fish | /fish, /fish.html, /fish/salmon.html, /fish.php?id=anything | /Fish.asp (case-sensitive), /catfish, /desert/fish |
| /fish/ | Anything inside the /fish/ folder, e.g. /fish/salmon.htm, /fish/?id=anything | /fish (without slash), /fish.html |
| /*.php | Any URL containing .php, e.g. /index.php, /folder/filename.php?params | /windows.PHP (case-sensitive) |
| /*.php$ | URLs ending exactly with .php, e.g. /file.php, /folder/file.php | /file.php5, /file.php?param |
| /fish*.php | URLs containing /fish followed by .php somewhere, e.g. /fish.php, /fishheads/catfish.php | /Fish.PHP (case-sensitive) |
⚖️ Order of Precedence — Which Rule Wins? #
- Google uses the most specific (longest matching) rule for a URL.
- If there are conflicting rules, Google applies the least restrictive rule (i.e., allows crawling if possible).
🔥 Real-World Examples #
| URL | Rules | Which Rule Applies? | Why? |
| --- | --- | --- | --- |
| https://example.com/page | allow: /p, disallow: / | allow: /p | /p is the longer, more specific match |
| https://example.com/folder/page | allow: /folder, disallow: /folder | allow: /folder | In a conflict between equally specific rules, Google picks the least restrictive rule |
| https://example.com/page.htm | allow: /page, disallow: /*.htm | disallow: /*.htm | The longer, more specific disallow rule applies |
| https://example.com/page.php5 | allow: /page, disallow: /*.ph | allow: /page | The least restrictive rule wins |
| https://example.com/ | allow: /$, disallow: / | allow: /$ | $ anchors the match to the exact root URL, which is more specific |
| https://example.com/page.htm | allow: /$, disallow: / | disallow: / | allow: /$ only matches the root URL, not /page.htm |
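Putting this together, here is a simplified resolver that reproduces the table above: the longest matching rule wins, and on a tie the least restrictive (allow) rule wins. It is an illustration of the precedence logic, not Google's implementation; the is_allowed helper and its rule format are my own.

```python
import re

def _matches(pattern: str, path: str) -> bool:
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "^" + re.escape(pattern).replace(r"\*", ".*") + ("$" if anchored else "")
    return re.match(regex, path) is not None

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules is a list of ("allow" | "disallow", pattern) pairs."""
    matching = [(len(p), kind) for kind, p in rules if _matches(p, path)]
    if not matching:
        return True                    # no rule matches: crawling is allowed
    longest = max(length for length, _ in matching)
    winners = {kind for length, kind in matching if length == longest}
    return "allow" in winners          # tie between allow and disallow: allow wins

print(is_allowed("/page", [("allow", "/p"), ("disallow", "/")]))                    # True
print(is_allowed("/folder/page", [("allow", "/folder"), ("disallow", "/folder")]))  # True
print(is_allowed("/page.htm", [("allow", "/page"), ("disallow", "/*.htm")]))        # False
print(is_allowed("/page.htm", [("allow", "/$"), ("disallow", "/")]))                # False
```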
💡 FSIDM Pro Tip #
When writing rules:
- Use /folder/ to block a directory’s content only.
- Use /file.ext to block specific files.
- Use wildcards * and $ wisely to cover patterns.
- Always test your robots.txt with Google Search Console to verify rule behavior!