📌 Googlebot – Core Crawler for Google Search #
🛠 Types of Googlebot #
- Googlebot Smartphone
  - Simulates a mobile user
  - Primary crawler for most sites (Google mainly indexes the mobile version of content)
- Googlebot Desktop
  - Simulates a desktop user
  - Still used, but less frequently than the mobile crawler
👉 Note:
Both obey the same robots.txt user agent token, Googlebot, so robots.txt cannot block one without also blocking the other.
🌐 How Googlebot Crawls Your Site #
- For most sites, crawls no more than once every few seconds (on average)
- Can crawl the first 15 MB of an HTML file or supported text-based file (uncompressed size)
  - CSS, JS, and images are fetched separately
- Crawls primarily from US-based IPs (Pacific Time zone)
- Discovers URLs via:
  - Internal/external links
  - Sitemaps
  - Indexed pages
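To see whether a page is anywhere near the 15 MB limit mentioned above, you can measure its uncompressed size. This is only a minimal sketch: the constant and function names are made up for illustration, and reading "15 MB" as 15 × 1024 × 1024 bytes is an assumption, since these notes do not define the unit.

```python
# Assumption: "15 MB" is read as 15 * 1024 * 1024 bytes of uncompressed content.
ASSUMED_LIMIT_BYTES = 15 * 1024 * 1024


def bytes_over_limit(body: bytes, limit: int = ASSUMED_LIMIT_BYTES) -> int:
    """Return how many uncompressed bytes fall past the limit (0 if none)."""
    return max(0, len(body) - limit)


# A tiny page is far below the limit, so nothing would be cut off.
html = "<html><body>Hello</body></html>".encode("utf-8")
print(bytes_over_limit(html))
```

Measure the decompressed body, not the gzip or Brotli payload sent over the wire, because the limit applies to the uncompressed size.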
🚫 Blocking Googlebot (Crawl vs Index) #
- Stop crawling: Use robots.txt
- Stop indexing: Use noindex (the page must stay crawlable, otherwise Googlebot never sees the tag)
- Block access completely: Password protect or restrict server access
⚠️ Blocking Googlebot affects Google Search, Images, Video, News, Discover
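Because noindex lives in the page itself, it is easy to check whether a page actually carries it. The sketch below uses Python's standard html.parser to look for a robots (or googlebot) meta tag containing noindex. The class and function names are illustrative, and it only inspects HTML, not the X-Robots-Tag response header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html: str) -> bool:
    """True if the HTML asks search engines not to index the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in parser.directives or "none" in parser.directives


print(has_noindex('<head><meta name="robots" content="noindex"></head>'))
```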
✅ Verifying Googlebot (Avoid Fake Crawlers) #
- Why? Some bots spoof Google’s user agent
- How to verify:
  - Reverse DNS lookup → Confirm hostname ends with .googlebot.com or .google.com, then run a forward DNS lookup on that hostname and confirm it returns the original IP
  - Match IP against Googlebot IP ranges
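The DNS check can be scripted. Here is a rough sketch using only Python's standard library: it does the reverse lookup, checks the hostname suffix, then does a forward lookup on that hostname to confirm it maps back to the same IP. The function names are made up for illustration, and the lookup functions are passed in as parameters so the logic can be tried without touching the network.

```python
import ipaddress
import socket
from typing import Callable, Iterable

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_dns(ip: str) -> str:
    """PTR lookup: IP address -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_dns(hostname: str) -> Iterable[str]:
    """A/AAAA lookup: hostname -> IP addresses."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def is_verified_google_ip(
    ip: str,
    reverse_lookup: Callable[[str], str] = reverse_dns,
    forward_lookup: Callable[[str], Iterable[str]] = forward_dns,
) -> bool:
    """Reverse lookup, suffix check, then forward-confirm the hostname."""
    try:
        address = ipaddress.ip_address(ip)
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except (ValueError, OSError):  # bad IP string or failed lookup
        return False
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        resolved = {ipaddress.ip_address(a) for a in forward_lookup(hostname)}
    except (ValueError, OSError):
        return False
    return address in resolved
```

Calling is_verified_google_ip(ip) with the default arguments performs real DNS queries, so in production you would probably cache the result per IP rather than look it up on every request.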
💡 Quick FSIDM Tip for Students:
If your logs show a crawling overload, or Search Console reports errors, the cause is often not an aggressive Googlebot; it may be fake bots spoofing the Googlebot user agent. Always verify before blocking.
📖 What is Google Read Aloud? #
- User Agent Name: Google-Read-Aloud
- Purpose: Reads web pages aloud using Text-to-Speech (TTS)
- When it Activates:
  - Only when a user with TTS enabled visits a page
- Where it’s Used:
  - Google Go
  - Google Read It
  - Read Aloud in Google app
  - Other Google TTS-enabled services
🔍 Crawl Frequency & Behavior #
- Not a web crawler (no link following)
- Triggered by user action (not automated crawling)
- Caching:
  - Google caches content to save bandwidth
  - But multiple requests for the same page may still occur
🚫 How to Block or Control It #
- Robots.txt → ❌ Does NOT work (because it’s user-triggered)
Block completely:
<meta name="google" content="nopagereadaloud">
Paywalled or subscription content:
Use structured data to mark restricted content:
"isAccessibleForFree": false
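For context, the isAccessibleForFree property normally sits inside a JSON-LD block on the article page. The sketch below builds one with Python's json module. The NewsArticle type, the hasPart / WebPageElement wrapper, and the .paywall CSS selector are illustrative assumptions, so swap in whatever matches your own markup.

```python
import json


def paywalled_article_jsonld(headline: str, paywall_selector: str = ".paywall") -> str:
    """Return a <script> tag marking the article (and its gated section) as not free."""
    data = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",  # assumed type for this example
        "headline": headline,
        "isAccessibleForFree": False,
        "hasPart": {
            "@type": "WebPageElement",
            "isAccessibleForFree": False,
            "cssSelector": paywall_selector,  # hypothetical class on the gated section
        },
    }
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(paywalled_article_jsonld("Members-only analysis"))
```

Python's False is serialized as the JSON literal false, which matches the property shown above.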
📜 Old vs New User Agent #
- Current: Google-Read-Aloud
- Old (deprecated): google-speakr
💡 FSIDM Pro Tip for Students:
If you run a membership site, premium articles, or gated content, always use nopagereadaloud or mark isAccessibleForFree: false in structured data. Otherwise, Google’s TTS may read it aloud to users for free.
🌐 What is APIs-Google? #
- User Agent Name: APIs-Google
- Purpose:
  - Delivers push notifications for Google APIs
  - Lets apps avoid constant polling of Google servers
- Why Ownership Verification?
  - Google ensures the developer owns the domain before allowing push notifications
  - Prevents abuse
📡 How APIs-Google Accesses Your Site #
- Uses HTTPS POST requests to send push notifications
- Retries failed requests:
  - Exponential backoff retry schedule (up to several days)
- Traffic pattern:
  - Can be steady (if updates are regular)
  - Or spiky (if resources change rapidly or retries happen often)
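To get a feel for what an exponential backoff retry schedule looks like, here is a generic sketch. Every number in it (starting delay, multiplier, cap, give-up window) is a made-up assumption for illustration. It is not Google's actual schedule, which these notes do not specify.

```python
from typing import Iterator


def backoff_delays(
    base_seconds: float = 1.0,
    factor: float = 2.0,
    max_delay_seconds: float = 3600.0,
    give_up_after_seconds: float = 3 * 24 * 3600.0,
) -> Iterator[float]:
    """Yield waits between retries, multiplying each time until the window is used up."""
    elapsed = 0.0
    delay = base_seconds
    while elapsed + delay <= give_up_after_seconds:
        yield delay
        elapsed += delay
        # Grow the wait, but never beyond the cap.
        delay = min(delay * factor, max_delay_seconds)


# Small numbers so the pattern is easy to see: waits grow, hit the cap, then stop.
print(list(backoff_delays(base_seconds=1, factor=2, max_delay_seconds=8, give_up_after_seconds=30)))
```

The takeaway for server owners: a failing endpoint keeps receiving retries, spaced further and further apart, for as long as the sender's window lasts.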
⚙️ How to Prepare Your Site #
- SSL Certificate Required (must be valid)
  - ❌ No self-signed certs
  - ❌ No certs from untrusted authorities
  - ❌ No revoked certificates
- Respond quickly to notifications (within seconds) to avoid repeated retries
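One common way to respond within seconds is to acknowledge the POST immediately and do the slow work later. The sketch below shows that pattern with Python's built-in http.server. All names are illustrative, it assumes TLS is terminated in front of it (for example by a reverse proxy holding the valid certificate), and it is not production-ready.

```python
import queue
from http.server import BaseHTTPRequestHandler, HTTPServer

# Notifications wait here until a background worker processes them.
pending: "queue.Queue[tuple[str, bytes]]" = queue.Queue()


def accept_notification(path: str, body: bytes) -> int:
    """Queue the notification and return the HTTP status to send back."""
    pending.put((path, body))
    return 200


class NotificationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        status = accept_notification(self.path, body)
        # Reply right away with an empty body; no slow work happens here.
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(host: str = "127.0.0.1", port: int = 8080) -> HTTPServer:
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer((host, port), NotificationHandler)
```

A separate worker thread or job runner would then drain pending, so slow processing never delays the acknowledgement.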
🚫 How to Block APIs-Google #
- Unregister notifications:
  - Contact whoever registered the push endpoint and disable it
Robots.txt:
User-agent: APIs-Google
Disallow: /
- (Note: APIs-Google ignores rules written for Googlebot and obeys only groups that name the APIs-Google token)
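You can sanity-check which user agents a robots.txt group applies to with Python's urllib.robotparser. This is only an approximation: Python's parser is not the one Google uses, and the helper name and example URL are made up. It still shows the idea that a group naming APIs-Google leaves Googlebot untouched.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: APIs-Google
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


url = "https://example.com/push-endpoint"  # placeholder URL
print("APIs-Google allowed:", is_allowed(ROBOTS_TXT, "APIs-Google", url))
print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```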
✅ Verifying APIs-Google Requests #
- Spoof check:
  - Look up source IP of request claiming APIs-Google
  - Reverse DNS lookup → Must resolve to googlebot.com or google.com
💡 FSIDM Tip for Developers:
If APIs-Google traffic is hitting your site too often, it’s usually due to:
- A misconfigured app not acknowledging notifications
- A push endpoint responding too slowly
👉 Fixing those saves bandwidth and keeps Google from hammering your server.
📌 What is Feedfetcher? #
- User Agent Name: Feedfetcher-Google
- Purpose:
  - Retrieves RSS & Atom feeds for Google News and PubSubHubbub
  - Only podcast feeds may appear in Google Search results
- Trigger: Activated only when a user adds a feed to a service/app (not an automated crawl)
⚡ How Feedfetcher Works #
- Requests feeds on behalf of the user
- Ignores robots.txt (because it acts like a direct user request, not a crawler)
- Stores & periodically refreshes feeds for efficiency
- Often shares the same retrieved feed with multiple users to save bandwidth
📈 Frequency of Retrieval #
- Usually ≤ 1 request per hour per feed
- Popular/frequently updated feeds may refresh more often
- Network delays can cause short bursts of requests
🚫 Blocking Feedfetcher #
Since robots.txt doesn’t work:
- Return an error HTTP status code for Feedfetcher-Google requests, such as:
  - 404 Not Found
  - 410 Gone
- If hosted on a platform (like Blogger, WordPress, etc.), configure restrictions through the platform settings
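If you control the server, the status-code approach can be as small as a user agent check. Below is a minimal WSGI sketch, assuming a hypothetical feed_app that serves the feed. It answers 410 Gone whenever the User-Agent header contains Feedfetcher-Google (matched case-insensitively). Keep in mind that user agent strings can be spoofed, so this only handles well-behaved requests.

```python
FEED_BODY = b"<?xml version='1.0'?><rss version='2.0'><channel></channel></rss>"


def feed_app(environ, start_response):
    """Serve the feed normally, but tell Feedfetcher the feed is gone."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if "feedfetcher-google" in user_agent.lower():
        start_response("410 Gone", [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Gone\n"]
    start_response("200 OK", [("Content-Type", "application/rss+xml; charset=utf-8")])
    return [FEED_BODY]
```

To try it locally, wsgiref.simple_server.make_server("127.0.0.1", 8000, feed_app).serve_forever() will serve it with the standard library alone.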
🔍 Why It Might Fetch “Odd” or “Secret” URLs #
- A user manually typed or bookmarked the feed URL
- A user shared the link
- Feedfetcher does not guess URLs—it only fetches the one explicitly provided
📊 Technical Details #
- Distributed across multiple machines → requests may come from different Google IPs
- IP ranges: Listed in user-triggered-fetchers-google.json
- Fetches exact URL only (does not follow links like Googlebot)
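Checking a request IP against a published range file takes only a few lines with Python's ipaddress module. The sketch below assumes the JSON has a top-level prefixes list whose entries carry an ipv4Prefix or ipv6Prefix key. The sample data uses reserved documentation ranges as placeholders, not real Google IPs, so load the actual file in practice.

```python
import ipaddress
import json

# Placeholder data in the assumed file layout; these are NOT Google's ranges.
SAMPLE_RANGES_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(document: str):
    """Turn the JSON document into a list of ip_network objects."""
    networks = []
    for entry in json.loads(document).get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def ip_in_ranges(ip: str, networks) -> bool:
    """True if the address falls inside any listed network of the same IP version."""
    address = ipaddress.ip_address(ip)
    return any(address.version == net.version and address in net for net in networks)


networks = load_networks(SAMPLE_RANGES_JSON)
print(ip_in_ranges("192.0.2.77", networks))
print(ip_in_ranges("198.51.100.1", networks))
```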
💡 FSIDM Tip for Students/Marketers:
If you run a podcast, news site, or blog:
- Ensure your RSS/Atom feeds are valid & clean (follow spec)
- Feedfetcher helps deliver your content faster to apps & news surfaces
- Don’t block it unless you have a business reason (e.g., premium content)