Gary Illyes and Martin Splitt from Google released a podcast episode about Googlebot, in which they explained that it is not one independent entity but hundreds of crawlers for various products and services, most of which are not covered in public documentation.

- What is Googlebot
- The crawling infrastructure has its own name
- Hundreds of crawlers that SEO experts don’t know about
- The difference between crawlers and fetchers
What is Googlebot
Gary clarified that “Googlebot” is a historical name that originated in the early days, when Google had only one crawler. That is no longer the case: Google operates many crawlers for different products, but the name “Googlebot” has stuck, even though it no longer refers to a single thing.
He further explained that Googlebot is not the crawling infrastructure itself, nor a single system. In fact, Googlebot is a client that interacts with a larger internal crawling infrastructure.
Martin Split asked:
“How should we introduce Googlebot? What does our crawling infrastructure look like?”
Gary replied:
“I mean, the name ‘Googlebot’ is a misnomer. It’s something that used to be normal in the early 2000s, because back then we probably had one crawler – because we had one product. But soon another product appeared, I think it was AdWords. And then we started having new crawlers, new products, and more crawlers, and so on.
But the name ‘Googlebot,’ the way it stuck, is in fact a misnomer. Generally, when we talked about our crawling infrastructure as a whole, we used to call it ‘Googlebot,’ but that was very far from the truth, because Googlebot is just one element that interacts with our crawling infrastructure.”
The crawling infrastructure has its own name
Gary further explained that the crawling infrastructure inside Google has an internal name, but he declined to say what it is.
He continued:
“Googlebot is not our crawling infrastructure. Our crawling infrastructure doesn’t have an external name. It has an internal name. It doesn’t matter what it is. Let’s call it Jack. I don’t know what to call it properly – it’s software as a service, if you will. SaaS. Right? And Jack has API endpoints, so to speak. You can call these APIs to make a request to get data from the internet.
When you call these APIs, you also need to specify some parameters, such as how long you are willing to wait for a response, which User-Agent you want to send, which robots.txt rules you must follow, and all such parameters.
And we usually set default parameters for most of these things – not all of them, but most – so that you don’t have to specify them all. It simplifies the calls, I suppose, because you don’t have to spell everything out. But in general, it’s just an API call to the cloud or to some data center, and they will fulfill the request for you as a developer or a product.
This product, so to speak, even though it’s internal to the company, has been around for a long, long time. … But in fact, it has always done the same thing: you tell it to ‘get something from the internet’ – without destroying the internet itself. And it does so if the site’s restrictions allow it. That’s all. In short, in one sentence, that’s what it is.”
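Gary’s description of “Jack” suggests an interface roughly like the following. This is a purely illustrative sketch: Google has not published this API, so every name, parameter, and default below is invented.

```python
from dataclasses import dataclass

# Hypothetical sketch of the internal crawl service Gary nicknames "Jack".
# All names and defaults here are invented for illustration only.

@dataclass
class FetchRequest:
    url: str
    user_agent: str = "Googlebot"   # which User-Agent string to send
    timeout_seconds: float = 30.0   # how long to wait for a response
    robots_policy: str = "default"  # which robots.txt rules to honor

def fetch(request: FetchRequest) -> str:
    """Pretend API endpoint: returns the fetched document body."""
    # A real service would route the request to a data center, check
    # robots.txt, apply rate limits, and return the actual response.
    return f"<html>contents of {request.url}</html>"

# Because most parameters have defaults, callers usually name only the URL:
page = fetch(FetchRequest(url="https://example.com/"))
```

The defaults mirror Gary’s point that callers don’t have to spell out every parameter on each request.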
Hundreds of crawlers that SEO experts don’t know about
Not all Googlebot crawlers are documented – there are many that SEO experts don’t know about. Gary said that many of Google’s internal teams use the crawling infrastructure for their own purposes. He mentioned that there are potentially dozens or hundreds of internal crawlers, but only the main ones appear in the official documentation.
Small or low-load crawlers are often not documented due to practical limitations, but if a crawler becomes large enough, it can be reviewed and documented.
Continuing the topic of multiple clients (crawlers), Gary added:
“…We try to document most of them, but Google is a big company, and there are many teams that need to get data from the internet. So there are many crawlers, many named crawlers, which means we would need to document dozens, if not hundreds, of different crawlers, special crawlers, or ways of obtaining data.”
Gary explained that documenting hundreds of crawlers is not easy.
“This is almost impossible on a simple HTML page. So we try to draw a line and say that if a crawler is very small – that is, it doesn’t fetch much data from the internet – then we won’t document it, because space on the crawler documentation site, developers.google.com, is very valuable.
Maybe we’ll approach this issue differently, but so far only the main crawlers and special crawlers or ways of obtaining data are being documented — because, frankly, there just isn’t enough space.”
The difference between crawlers and fetchers
Gary said that there are crawlers and fetchers that fall into the Googlebot category, but in fact they are different things.
Gary explained the difference:
“The simplest explanation is that crawlers work in batches, while fetchers work on a single URL. That is, you give a fetcher one URL, and it downloads it. You can’t give it a list of URLs to download.
Crawlers, on the other hand, usually process a constant stream of URLs that they crawl and collect from the internet for your team.
We also have an internal rule: fetchers must be user-controlled to some extent. Basically, there’s a person on the other side who is waiting for the fetcher’s response to know what’s going on.
With crawlers, it’s more like: do it whenever you have free time.”
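The distinction Gary draws, single-URL fetchers that a user waits on versus crawlers that work through a stream of URLs, can be sketched like this (hypothetical code; the interfaces are invented, not Google’s):

```python
def fetch(url: str) -> str:
    """Fetcher: one URL in, one document out; a user waits for the result."""
    return f"document at {url}"

def crawl(url_stream):
    """Crawler: consumes a continuous stream of URLs as capacity allows."""
    for url in url_stream:
        yield fetch(url)  # processed in the background, no user waiting

# A fetcher call is synchronous and single-shot:
doc = fetch("https://example.com/robots.txt")

# A crawler works through whatever the stream supplies:
docs = list(crawl(["https://example.com/a", "https://example.com/b"]))
```

The generator in `crawl` reflects the “do it if you have free time” model: URLs are consumed lazily rather than on a user’s schedule.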
Martin and Gary point out that many crawlers and fetchers are used internally and are not documented.
Gary explained that he has a tool that triggers an alert if any crawler or fetcher exceeds a certain threshold of crawls or downloads per day. He then contacts the team responsible for that crawler or fetcher to find out what it is doing and why, and to check whether something is happening by accident. If it is a crawler that actively collects many URLs, he decides whether it is worth documenting so that the web ecosystem can learn of its existence.
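The monitoring Gary describes amounts to a simple per-crawler volume check. A minimal sketch, with all crawler names, counts, and the threshold invented for illustration:

```python
# Hypothetical daily fetch counts per crawler (all values invented).
daily_fetches = {
    "Googlebot": 5_000_000,
    "SomeInternalFetcher": 120,
    "NewTeamCrawler": 900_000,
}

ALERT_THRESHOLD = 500_000  # invented cutoff: fetches per day

def crawlers_to_review(counts, threshold=ALERT_THRESHOLD):
    """Return crawlers whose daily volume warrants following up with the owning team."""
    return [name for name, count in counts.items() if count > threshold]

# High-volume crawlers are flagged; the team is then asked what it does
# and why, and whether it should be publicly documented.
flagged = crawlers_to_review(daily_fetches)
```

Here `NewTeamCrawler` would be flagged alongside Googlebot, while the tiny fetcher stays below the radar, matching the “document only the large ones” policy described above.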
You can listen to the entire podcast about crawlers here (English):
