Google's crawlers and fetchers fall into three main types:
1. Common Crawlers #
- Standard bots like Googlebot for Search, Google Images, etc.
- Always respect robots.txt.
- Automatic crawling.
2. Special-Case Crawlers #
- Used for specific products with agreements in place.
- Example: AdsBot (for checking ad landing pages).
- May bypass User-agent: * rules with permission.
3. User-Triggered Fetchers #
- Triggered by a user action (not automatic).
- Example: Google Site Verifier (checks ownership).
- Fetch happens on demand.
2️⃣ Technical Properties #
Distributed Crawling #
- Google crawls from many IPs worldwide (mostly US).
- May crawl from other countries if US IPs are blocked.
Protocols Supported #
- HTTP/1.1 (default)
- HTTP/2 (faster, saves resources; opt-out with HTTP 421)
- FTP / FTPS (rare use)
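The HTTP/2 opt-out works by answering Google's HTTP/2 crawl requests with status 421 (Misdirected Request), after which Googlebot falls back to HTTP/1.1. A minimal sketch of that decision; the handler shape is an assumption for illustration, not a real API:

```python
def crawl_protocol_status(negotiated_protocol: str, allow_h2: bool = True) -> int:
    """Pick the status code for a crawl request based on the negotiated protocol.

    Responding 421 (Misdirected Request) to HTTP/2 requests tells Googlebot
    to retry this site over HTTP/1.1. (Hypothetical handler shape.)
    """
    if negotiated_protocol == "HTTP/2" and not allow_h2:
        return 421  # opt out of HTTP/2 crawling
    return 200
```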
Compression Supported #
- gzip
- deflate
- Brotli (br)
(Advertised by the crawler in the Accept-Encoding request header.)
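On the server side, the response encoding is picked from whatever the crawler offers in Accept-Encoding. A sketch of that negotiation; the function and the Brotli-first preference order are illustrative assumptions:

```python
def pick_encoding(accept_encoding, preference=("br", "gzip", "deflate")):
    """Choose a Content-Encoding from an Accept-Encoding request header."""
    # Split "gzip, deflate;q=0.5, br" into the bare encoding names offered.
    offered = {token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",") if token.strip()}
    # Return the first supported encoding in our preference order.
    for enc in preference:
        if enc in offered:
            return enc
    return None  # fall back to an uncompressed response

print(pick_encoding("gzip, deflate, br"))  # -> br
```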
3️⃣ Crawl Rate & Host Load #
- Goal: Crawl maximum pages without overloading servers.
- If overloaded → temporarily return 503/429 to signal Googlebot to slow down (the legacy Search Console crawl-rate limiter has been retired).
- Incorrect HTTP status codes can affect crawl behavior.
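One concrete way to handle overload: answer with 503 plus a Retry-After header while the server is under pressure; Google treats 503/429 as a signal to back off and retry later. The load metric and handler shape below are hypothetical:

```python
def overload_response(server_load, threshold=0.9, retry_after_s=120):
    """Return (status, headers) for a crawl request under load.

    `server_load` is a hypothetical 0..1 utilization metric; when it
    exceeds `threshold`, a temporary 503 asks the crawler to back off.
    """
    if server_load > threshold:
        return 503, {"Retry-After": str(retry_after_s)}
    return 200, {}

print(overload_response(0.95))  # -> (503, {'Retry-After': '120'})
```

Serve 503 only transiently; long-running 503s can cause pages to drop out of the index.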
4️⃣ HTTP Caching Support #
Google crawlers support caching using:
- ETag & If-None-Match (preferred)
- Last-Modified & If-Modified-Since
💡 Tip:
- Use ETag (no date format issues).
- Correct Last-Modified format: Fri, 04 Sep 1998 19:15:56 GMT
- Optionally set Cache-Control: max-age=<seconds> to hint when to recrawl.
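Put together, the caching tip looks like a conditional-GET handler: derive an ETag from the body and answer a matching If-None-Match with an empty 304. A sketch under an assumed handler shape:

```python
import hashlib

def conditional_response(body, if_none_match=None):
    """Return (status, headers, body) for an ETag-based conditional GET.

    A 304 with no body lets Google reuse its cached copy instead of
    re-downloading unchanged content.
    """
    etag = '"%s"' % hashlib.sha256(body).hexdigest()[:16]
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag, "Cache-Control": "max-age=3600"}, body
```

On the first crawl the handler serves 200 with the ETag; on a recrawl the crawler echoes it back in If-None-Match and gets a cheap 304.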
5️⃣ Key Best Practices #
✅ Use a correctly configured robots.txt to control crawling.
✅ Implement ETag or Last-Modified for efficient recrawls.
✅ Ensure server handles HTTP/2 (unless opting out).
✅ Compress responses (gzip, br) to save resources.
✅ Monitor crawl activity in Search Console → Crawl Stats.
📌 Google’s Common Crawlers (Reference Table) #
| Crawler Name | User Agent (Example) | Robots.txt Token | Affected Products |
| --- | --- | --- | --- |
| Googlebot Smartphone | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X…) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | Googlebot | Google Search (Mobile), Discover, Images, Video, News |
| Googlebot Desktop | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) | Googlebot | Google Search (Desktop), Discover, Images, Video, News |
| Googlebot Image | Googlebot-Image/1.0 | Googlebot-Image | Google Images, Search features with images/logos/favicons |
| Googlebot Video | Googlebot-Video/1.0 | Googlebot-Video | Video features in Google Search, video indexing |
| Googlebot News | Uses Googlebot UA strings | Googlebot-News | Google News, news.google.com, Google News App |
| Google StoreBot | Mozilla/5.0 (X11; Linux x86_64; Storebot-Google/1.0) Chrome/W.X.Y.Z Safari/537.36 | Storebot-Google | Google Shopping (Shopping tab, Shopping surfaces) |
| Google-InspectionTool | Mozilla/5.0 (compatible; Google-InspectionTool/1.0;) | Google-InspectionTool | Search Console tools (URL Inspection, Rich Result Test) |
| GoogleOther | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X…) (compatible; GoogleOther) | GoogleOther | Generic fetcher for internal research; not used for Search |
| GoogleOther-Image | GoogleOther-Image/1.0 | GoogleOther-Image | Fetching publicly accessible images (non-Search) |
| GoogleOther-Video | GoogleOther-Video/1.0 | GoogleOther-Video | Fetching publicly accessible videos (non-Search) |
| Google-CloudVertexBot | Contains Google-CloudVertexBot in UA | Google-CloudVertexBot | Vertex AI Agents (site-owner requested crawls) |
| Google-Extended | Uses existing Google UA; token used for permissions | Google-Extended | Controls whether site content can be used to train Gemini models |
Key Notes:
- Chrome/W.X.Y.Z is a placeholder for the Chrome version; match it with a wildcard, not an exact number.
- All Googlebot variants obey robots.txt unless otherwise agreed (special cases).
- Google-Extended does not affect Search rankings; only AI model training permissions.
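The robots.txt tokens in the table are what User-agent lines match against. For instance, Python's stdlib `urllib.robotparser` can sanity-check a policy that blocks Gemini training (Google-Extended) while leaving Search crawling (Googlebot) open; the rules below are a hypothetical example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: refuse AI-training use, keep Search crawling open.
rules = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://example.com/page"))        # True
print(rp.can_fetch("Google-Extended", "https://example.com/page"))  # False
```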
📌 Google’s Special-Case Crawlers #
| Crawler Name | User Agent (Example) | Robots.txt Token | Notes / Products Affected |
| --- | --- | --- | --- |
| APIs-Google | APIs-Google (+https://developers.google.com/webmasters/APIs-Google.html) | APIs-Google | Push notification delivery via Google APIs (Ignores *) |
| AdsBot Mobile Web | Mozilla/5.0 (… Mobile Safari/537.36) (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html) | AdsBot-Google-Mobile | Google Ads ad quality checks for mobile pages (Ignores *) |
| AdsBot | AdsBot-Google (+http://www.google.com/adsbot.html) | AdsBot-Google | Google Ads ad quality checks (Ignores *) |
| AdSense | Mediapartners-Google | Mediapartners-Google | Google AdSense crawler to deliver relevant ads (Ignores *) |
| Google-Safety | Google-Safety | (Ignores robots.txt) | Malware/abuse discovery for links on Google properties |
| (Retired) AdsBot Mobile Web (iPhone) | Mozilla/5.0 (iPhone; CPU iPhone OS…) (compatible; AdsBot-Google-Mobile…) | AdsBot-Google-Mobile | Used for iPhone ad quality checks (retired) |
| (Retired) Duplex on the Web | Mozilla/5.0 (Linux; Android 11; Pixel 2; DuplexWeb-Google/1.0) | DuplexWeb-Google | Supported “Duplex on the Web” service (retired) |
| (Retired) Google Favicon | Mozilla/5.0 (X11; Linux x86_64) … Google Favicon | Googlebot-Image | Favicon fetching (retired; handled by Googlebot-Image) |
| (Retired) Mobile Apps Android | AdsBot-Google-Mobile-Apps | AdsBot-Google-Mobile-Apps | Checked Android app page ad quality (retired) |
| (Retired) Web Light | Mozilla/5.0 (… googleweblight) Chrome/… Mobile Safari/… | googleweblight | Served lightweight pages under slow network (retired) |
Key Points for FSIDM Students:
- Special-case crawlers may ignore robots.txt (unlike common crawlers).
- They operate from different IP ranges (special-crawlers.json) and have rate-limited-proxy-* hostnames.
- Mostly tied to Google Ads, AdSense, APIs, and safety/security checks.
- Retired crawlers are useful to know for log analysis and historical SEO audits.
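For the log analysis mentioned above, a claimed Google crawler IP can be verified with a reverse-DNS lookup plus a forward-confirming lookup: genuine hosts resolve under googlebot.com, google.com, or googleusercontent.com. A sketch; the full check needs network access, and the suffix list reflects Google's published verification domains:

```python
import socket

GOOGLE_HOST_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")

def looks_like_google_host(hostname):
    """True if a reverse-DNS hostname matches Google's crawler domains,
    e.g. crawl-66-249-66-1.googlebot.com or rate-limited-proxy-*.google.com."""
    return hostname.lower().endswith(GOOGLE_HOST_SUFFIXES)

def verify_google_crawler(ip):
    """Reverse lookup, then forward-confirm the name back to the same IP."""
    try:
        hostname, _aliases, _ips = socket.gethostbyaddr(ip)
        if not looks_like_google_host(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False
```

The forward-confirmation step matters: anyone can fake a User-Agent string, and even reverse DNS alone can be spoofed by whoever controls the IP's PTR record.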
📌 Google User-Triggered Fetchers #
| Fetcher Name | User Agent (Example) | Purpose / Product |
| --- | --- | --- |
| Feedfetcher | FeedFetcher-Google; (+http://www.google.com/feedfetcher.html) | Crawls RSS/Atom feeds for Google News & PubSubHubbub |
| Google Publisher Center | GoogleProducer; (+https://developers.google.com/search/docs/crawling-indexing/google-producer) | Fetches publisher-supplied feeds for Google News landing pages |
| Google Read Aloud | Mobile: Mozilla/5.0 (Linux; Android 10; K) … (compatible; Google-Read-Aloud; +https://support.google.com/webmasters/answer/1061943) Desktop: Mozilla/5.0 (X11; Linux x86_64) … (compatible; Google-Read-Aloud; +https://support.google.com/webmasters/answer/1061943) (Former: google-speakr) | On user request, fetches and reads webpages aloud using TTS |
| Google Site Verifier | Mozilla/5.0 (compatible; Google-Site-Verification/1.0) | Fetches Search Console verification tokens |
Key Notes for FSIDM Students #
- These fetchers are triggered by a user’s action (not automated bulk crawling).
- They generally ignore robots.txt because the fetch was explicitly requested by a user.
- Operate from user-triggered-fetchers.json IP ranges with hostnames like:
- ***.gae.googleusercontent.com (Google App Engine)
- google-proxy-***.google.com (Google proxy servers)
- Common in server logs during site verification, feed submission, or Google services use.
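When scanning server logs, the two hostname shapes listed above can be matched with a simple pattern. The regex below is an illustrative assumption derived from those shapes, not an official allowlist:

```python
import re

# Match *.gae.googleusercontent.com and google-proxy-*.google.com hostnames.
FETCHER_HOST_RE = re.compile(
    r"^(?:[\w.-]+\.gae\.googleusercontent\.com"
    r"|google-proxy-[\w.-]+\.google\.com)$",
    re.IGNORECASE,
)

def is_user_triggered_fetcher_host(hostname):
    """True if a reverse-DNS hostname fits the user-triggered fetcher shapes."""
    return bool(FETCHER_HOST_RE.match(hostname))
```

For anything security-sensitive, cross-check candidate IPs against the published user-triggered-fetchers.json ranges rather than trusting hostnames alone.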