Aug 21

GoogleBot is the name of Google's indexing robot. It is designed to run on hundreds of machines at the same time, each with a different IP address. It needs to: it has 3 billion documents to update regularly, and millions of new ones to discover…
Within the “GoogleBot family” there are two kinds of robots:

  • Fresh Crawler, whose IP address starts with 64.68.82., is the robot that indexes the new pages found by Google; once this robot has visited them, the pages appear in Google for only a few days.
  • Deep Crawler (or Full Crawler), whose IP address starts with 216.239.46., is the robot that carries out a massive indexing of all the documents known to Google, generally over about a week, just after the Google Dance.
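
The two address prefixes above are enough to tell the crawlers apart in a list of visitor IPs. Here is a minimal sketch in Python; the function and table names are made up for illustration, and the prefixes are simply the ones quoted in this post, so they may not match the ranges Google actually uses at any given time.

```python
# Prefixes as quoted in the post; treat them as illustrative, not authoritative.
CRAWLER_PREFIXES = {
    "64.68.82.": "Fresh Crawler",
    "216.239.46.": "Deep Crawler",
}


def classify_googlebot(ip: str) -> str:
    """Return the crawler family for an IP address, or 'unknown'."""
    for prefix, name in CRAWLER_PREFIXES.items():
        if ip.startswith(prefix):
            return name
    return "unknown"


if __name__ == "__main__":
    for address in ("64.68.82.10", "216.239.46.7", "192.0.2.1"):
        print(address, "->", classify_googlebot(address))
```

The trailing dot in each prefix matters: without it, an address such as 64.68.820.1 would be matched by mistake.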



Fresh Crawler indexes only HTML and plain-text documents (MIME types text/html and text/plain), while Deep Crawler also indexes other types of documents (PDF, PostScript, Word, Excel, PowerPoint…).
Deep Crawler aims to index massively every site it visits. It is hard to describe the algorithm by which it visits pages, because that depends on several site-related factors and on the number of robots used to index the site. The main criteria influencing how often and how many times a page is visited are its PageRank and how frequently the webmaster updates it. The distance (in number of links) from the home page may also play a part.
To avoid overloading your server, GoogleBot spaces out its visits over time. It also respects the robots exclusion protocol, so it begins any indexing by requesting the robots.txt file (if you do not have one, this generates 404 errors, so it is better to provide one, even if it stays empty).
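
To see how a robots.txt file is interpreted by a well-behaved robot, here is a small sketch using Python's standard `urllib.robotparser` module. The rules and the example.com URLs are invented for the illustration; this shows how a generic parser reads the file, not how GoogleBot itself is implemented.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules: keep Googlebot out of /private/ only.
EXAMPLE_RULES = """\
User-agent: Googlebot
Disallow: /private/
"""

rules = build_parser(EXAMPLE_RULES)
empty = build_parser("")  # an empty robots.txt forbids nothing

if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/page.html"):
        print(url, "allowed:", rules.can_fetch("Googlebot", url))
    print("empty file allows everything:",
          empty.can_fetch("Googlebot", "http://example.com/private/page.html"))
```

An empty file therefore behaves like "everything is allowed", while still sparing your logs the 404 errors.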

To find out whether GoogleBot has visited your site, just consult your log files (the record of the requests made to your site, stored on your server). If you do not have access to these files, or do not know how to use them, you can use RobotStats, a free open-source application written in PHP and MySQL that lets you analyze Google's visits to your site in detail. From version 2.0, it will soon be possible to manage as many robots as you want!
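
If you do have access to your logs, a few lines of script are enough for a first look. The sketch below assumes the server writes the common "combined" log format and that the robot identifies itself with "Googlebot" in its User-Agent; the sample lines are invented, and this is not how RobotStats works internally.

```python
import re
from collections import Counter

# Combined log format:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count requests per path whose User-Agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "googlebot" in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '64.68.82.10 - - [01/Jan/2000:10:00:00 +0000] "GET /robots.txt HTTP/1.0" '
    '404 209 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '64.68.82.10 - - [01/Jan/2000:10:00:05 +0000] "GET /index.html HTTP/1.0" '
    '200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.1 - - [01/Jan/2000:10:01:00 +0000] "GET /about.html HTTP/1.1" '
    '200 2048 "http://example.com/" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

if __name__ == "__main__":
    for path, count in googlebot_hits(SAMPLE_LOG).items():
        print(path, count)
```

Keep in mind that a User-Agent can be forged, so checking the IP address as well gives a more reliable picture.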

To make your site easier to index, avoid at all costs passing session identifiers in your URLs. In that case GoogleBot can never finish indexing the site, since it gets a new identifier on each visit (so it “thinks” it has found a new page).
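
To illustrate the problem, the sketch below removes session parameters from a URL, which shows that two addresses differing only by their session identifier are really the same page. The parameter names (PHPSESSID, sid, sessionid) are assumptions chosen for the example; your application may use others.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of session parameters; adapt to your own application.
SESSION_PARAMS = {"phpsessid", "sid", "sessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL without any session-identifier query parameter."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    first = "http://example.com/page.php?id=3&PHPSESSID=abc123"
    second = "http://example.com/page.php?id=3&PHPSESSID=zzz999"
    # Both visits point at the same page once the identifier is removed.
    print(strip_session_id(first) == strip_session_id(second))
```

A robot does not do this cleanup for you: it sees two different URLs, hence the advice to keep the identifier out of the address altogether (in a cookie, for example).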
For dynamic pages, using URL rewriting is strongly recommended.
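
The idea of URL rewriting is that visitors and robots see a static-looking address, which the server translates into the dynamic one behind the scenes. The sketch below shows that translation in Python; the /article/42.html pattern and the article.php script are hypothetical, and on a real site this is usually done in the web server's configuration rather than in application code.

```python
import re

# Hypothetical rule: /article/<number>.html is served by article.php?id=<number>
REWRITE_RULES = [
    (re.compile(r"^/article/(\d+)\.html$"), r"/article.php?id=\1"),
]


def rewrite(path: str) -> str:
    """Translate a public, static-looking path into the internal dynamic one."""
    for pattern, target in REWRITE_RULES:
        if pattern.match(path):
            return pattern.sub(target, path)
    return path  # no rule applies: serve the path unchanged


if __name__ == "__main__":
    for public_path in ("/article/42.html", "/contact.html"):
        print(public_path, "->", rewrite(public_path))
```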
Finally, make sure your site stays reachable; otherwise, if GoogleBot visits during an outage, it may “take offense” and not come back…
