Aug 20

Getting dynamic websites indexed by search engines is one of the main concerns of today's webmasters. Dynamic pages were for a long time a complete blocker for search engines, although the situation has eased for some time now. This article covers URL rewriting, which is certainly the best way to get a dynamic site well indexed.


Dynamic websites are sometimes an obstacle for search engines, although the situation is improving. Some pages still stop robots completely, and a specific procedure is then needed for the site to be indexed by Google and the other engines. Among the solutions available to the person in charge of search engine optimization, URL rewriting seems to be the best and, in any case, the most effective.

Note: this article is based on the chapter “the referencing of dynamic Web sites” of the book “Google, tricks of pros”, published by Micro Application. The content has been revised to fit this article.

Dynamic sites usually generate pages with long, complex URLs because of the variables they contain. A single file can dynamically produce hundreds or thousands of pages. Once URL rewriting is in place on a site, those pages are reachable through “clean” URLs, both for visitors and for search engine robots.

For example, a page that used to be reachable at the address:

http://www.4engr.com/articles/article.php?id=12&page=2&rubrique=5

will, after URL rewriting, be reachable at an address such as:

http://www.4engr.com/articles/article-12-2-5.html

These “clean” URLs (free of special characters such as “?” or “&”) make dynamic sites easier to index, and therefore easier to rank in search engines.

Beyond this undeniable advantage, URL rewriting also improves the security of the site by hiding the names of the variables passed in the URL. If the extension of the “clean” URL is neutral (for example .html or .htm), it even hides the language used on the server (PHP in our example).

The principle of URL rewriting is therefore to set up a mechanism on the server so that it can interpret the new URL format. In our example, when a visitor requests the page http://www.4engr.com/articles/article-12-2-5.html, the server must return exactly the same thing as if the visitor had requested the page http://www.4engr.com/articles/article.php?id=12&page=2&rubrique=5.

The correspondence between the two URL formats is described by “rewriting rules”. Each rule describes one URL format. In the example above, the rewriting rule tells the server to take the first number as the article number, the second as the page number and the third as the section (rubrique) number.

The best-known URL rewriting technique is the one available on Apache servers, generally used with PHP. Unless stated otherwise, all the examples in this article therefore use PHP and the Apache server.

The first thing to do is obviously to make sure that the server hosting your site supports URL rewriting. That depends first of all on the type of server. This article does not aim to review every type of server, but here is a summary of the URL rewriting options on the most common web servers:

Web server | URL rewriting support | Details
Apache | Handled by the mod_rewrite module, a standard Apache module starting from version 1.3.27. | The mod_rewrite module must be enabled: the Apache configuration file (httpd.conf) must contain the line "LoadModule rewrite_module libexec/mod_rewrite.so" as well as the line "AddModule mod_rewrite.c".
IIS (Microsoft), ASP | Rewriting is possible through ISAPI filters sold by various companies (paid). | The way rewriting rules are configured is specific to each component.
IIS (Microsoft), ASPX (.NET) | On all supported servers, functions such as RewriteURL() handle URL rewriting. Ready-to-compile code that uses these capabilities is provided by Microsoft or by open source projects such as those on codeproject.com. | No standard method was designed for defining the rewriting rules, but a common practice is to configure them directly in web.config (the configuration file of the ASP.Net application, found in particular at the root of the site), which is a standardized XML file.

If your site is hosted on a dedicated server, you have access to the server configuration yourself. On an Apache server you can therefore edit the configuration file to enable URL rewriting. Remember to restart Apache after changing the configuration file.

That is not all. If your site is on a shared server, there is no guarantee that your host has enabled URL rewriting, mainly for security reasons.

If your site is on a free host, there is little chance that URL rewriting is available. I strongly advise you to invest in paid hosting (and in a domain name): the benefits for search engine optimization are numerous.

If you are looking for a good host, trust me and choose Sivit, which has hosted WebRankInfo for years. Need I add that URL rewriting is available there in every case?

To check whether Apache's mod_rewrite module is enabled, follow these steps:

  1. Create a directory named test at the root of your site (it will then be reachable at http://www.4engr.com/test/).
  2. In this directory, create an HTML file named test.html (http://www.4engr.com/test/test.html) containing only the following lines:
    <html>
    <head> <title>Test</title> </head>
    <body> OK! </body>
    </html>
  3. In this directory, create a file named .htaccess containing the following lines (their meaning is explained further on):
    Options +FollowSymlinks
    RewriteEngine on
    RewriteRule ^inconnu\.html$ /test/test.html [L]
  4. Upload this directory and its two files to your site, then go to http://www.4engr.com/test/inconnu.html

You might expect the browser to display an error message saying that no file named inconnu.html exists at that location on your site (404 error). If that is what happens, your host probably does not allow URL rewriting: contact them to ask.

If you are with a free host that does not support it, that is a very good reason to take the plunge and enjoy all the advantages of professional hosting (which is now affordable for everyone).

Otherwise, you should see the text “OK!”, which means that when you asked for the file inconnu.html (which does not physically exist on the server), the server returned the content of the file test.html (which does exist). That is the very principle of URL rewriting, and proof that your server handles it correctly: in our case it “rewrote” “inconnu.html” into “test.html”. QED.

There can be a third outcome (one I do not wish on you): your site is completely blocked, no page can be displayed, and you get a message saying “Error 500”. In that case do not panic: simply remove the .htaccess file, which is incompatible with your host.
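The three possible outcomes of this test can be told apart from the HTTP status code of the response. The sketch below is only an illustration in Python, not part of the original procedure; the function names `interpret` and `check` and the returned messages are hypothetical.

```python
import urllib.error
import urllib.request


def interpret(status, body=""):
    """Map the result of requesting /test/inconnu.html to a diagnosis."""
    if status == 200 and "OK!" in body:
        return "rewriting works"
    if status == 404:
        return "rewriting is probably not enabled: ask your host"
    if status == 500:
        return "server error: remove the .htaccess file"
    return "inconclusive"


def check(url):
    """Fetch the test URL and interpret the answer (needs network access)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
            return interpret(response.status, body)
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx and 5xx answers.
        return interpret(error.code)
```

Calling `check("http://www.4engr.com/test/inconnu.html")` after uploading the test directory would then report which of the three cases applies.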

Defining the URL patterns

Let us return to our example of a site with a database of articles, and list the types of URL it uses. Here are some examples:

  • article.php?id=12&rubrique=5
  • article.php?id=12&page=2&rubrique=5

To keep things readable, I have listed only URLs for the same article here, but in practice, when you list the types of URL on your site, you may come across, for example:

  • article.php?id=182&rubrique=15
  • article.php?id=36&page=5&rubrique=3

The principle of URL rewriting is to identify URL patterns from what the URLs have in common. In our example, articles are reachable through two types of URL (id + rubrique, or id + page + rubrique), depending on whether the page number is specified.
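On a larger site, this inventory can be automated. The following Python sketch is only an illustration (the function name `group_by_pattern` is hypothetical): it groups a list of URLs by script name and by the set of variable names they carry, so that each group corresponds to one URL pattern.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def group_by_pattern(urls):
    """Group URLs by (script path, sorted variable names)."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        names = tuple(sorted(name for name, _ in parse_qsl(parts.query)))
        groups[(parts.path, names)].append(url)
    return dict(groups)


urls = [
    "article.php?id=12&rubrique=5",
    "article.php?id=12&page=2&rubrique=5",
    "article.php?id=182&rubrique=15",
    "article.php?id=36&page=5&rubrique=3",
]
for pattern, members in group_by_pattern(urls).items():
    print(pattern, len(members))
```

With the four example URLs, two groups come out: one for id + rubrique and one for id + page + rubrique.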

Once you have identified these URL patterns, you must choose a new URL format (the “clean” URL). Usually a file name with the .html (or .htm) extension is used, but you can choose whatever you like: it has no effect on how Google handles the pages. Whatever extension you choose, the page remains a standard HTML page.

The file name is made up of a prefix and/or a suffix and the values of the variables (digits or letters). Use this step to think carefully about search engine optimization, because you can put keywords in the URLs of your pages: they are more meaningful to visitors and are probably taken into account by search engines.

Here are the new URL formats I chose for each URL in the preceding examples:

  • article-12-5.html
  • article-12-2-5.html
  • article-182-15.html
  • article-36-5-3.html

To separate the parts of the URL you need a separator (in this article I use only hyphens). For search engine optimization it is more effective to choose a character that Google treats as a word separator. Your URLs can then contain keywords that Google will recognize without difficulty.

The following characters can be used as separators:

  • Hyphen: -
  • Comma: ,
  • Period: .
  • Slash: /
  • Vertical bar (pipe): |

I advise against using the following characters:

  • Underscore: _
  • Hash sign: #
  • Ampersand: &
  • At sign: @
  • Question mark: ?
  • Dollar sign: $
  • Accented characters and spaces

The hyphen and the comma are the simplest; the slash can cause problems with directories, and the vertical bar is not well known to visitors. The underscore causes problems with Google.
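If keywords taken from a title are to appear in the URL, these recommendations can be applied automatically. Here is a minimal sketch in Python, given purely as an illustration (the function name `slugify` is hypothetical, and a PHP site would need the equivalent in PHP): accents are stripped, and every run of other characters, including underscores and spaces, becomes a single hyphen.

```python
import re
import unicodedata


def slugify(text):
    """Turn free text into a hyphen-separated, URL-friendly keyword string."""
    # Decompose accented letters, then drop the accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Any run of characters other than letters and digits becomes one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    return slug.strip("-")


print(slugify("URL rewriting_and Google"))  # url-rewriting-and-google
```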

I have thus defined two URL formats for our article section. Let us formalize them by removing the article, section and page numbers and replacing them with what they stand for:

  • article-ARTICLE-RUBRIQUE.html
  • article-ARTICLE-PAGE-RUBRIQUE.html

Of course, ARTICLE, RUBRIQUE and PAGE stand for numbers here.
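These two formats are easy to produce from the values of the variables. The function below is a hypothetical Python helper (the name `clean_url` is an assumption; a PHP site would write the equivalent in PHP) that builds the clean file name, with or without a page number.

```python
def clean_url(article, rubrique, page=None):
    """Build article-ARTICLE-RUBRIQUE.html or article-ARTICLE-PAGE-RUBRIQUE.html."""
    if page is None:
        numbers = [article, rubrique]
    else:
        numbers = [article, page, rubrique]
    # int() rejects anything that is not a number before it reaches the URL.
    return "article-" + "-".join(str(int(n)) for n in numbers) + ".html"


print(clean_url(12, 5))          # article-12-5.html
print(clean_url(36, 3, page=5))  # article-36-5-3.html
```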

Writing the rewriting rules

Now that the URL patterns have been determined, what remains is to write the rewriting rules that tell the server how to interpret each of them.

Let us go straight to the solution, which I will then comment on. Here is the content of the .htaccess file located in our http://www.notre-site.com/articles/ directory:

#--------------------------------------------------
# Directory: /articles/
#--------------------------------------------------
# The server must follow symbolic links:
Options +FollowSymlinks
# Enable the URL rewriting module:
RewriteEngine on
#--------------------------------------------------
# URL rewriting rules:
#--------------------------------------------------
# Article without a page number:
RewriteRule ^article-([0-9]+)-([0-9]+)\.html$ /articles/article.php?id=$1&rubrique=$2 [L]

# Article with a page number:
RewriteRule ^article-([0-9]+)-([0-9]+)-([0-9]+)\.html$ /articles/article.php?id=$1&page=$2&rubrique=$3 [L]
Note: a rewriting rule must be written on a single line, with no line break.

Lines starting with the hash sign (#) are comments. Feel free to add some to make your files easier to understand: these lines are completely ignored by the URL rewriting module.

Each .htaccess file is specific to a directory; I have made a habit of noting at the top of the file where the directory is located on the site. Each directory of your site will therefore need its own .htaccess file.

The first two directives (Options +FollowSymlinks and RewriteEngine on) should appear only once per file, before any rewriting rule.

  • The Options +FollowSymlinks directive is optional but can be useful in some configurations.
  • The RewriteEngine on directive states that you want to use the URL rewriting module. If you have a problem with a rewriting rule you have just added, you can disable URL rewriting in a few seconds while you work out what is wrong: simply write RewriteEngine off instead of RewriteEngine on.

The rest of the file is a series of rewriting rules. Each rule is written on a single line (except for complex rules) and has the following format:

RewriteRule URL_TO_REWRITE REWRITTEN_URL

Explanations:

  • RewriteRule is a keyword specific to the mod_rewrite module; it indicates that the line defines a rewriting rule.
  • Next comes the URL to be rewritten, i.e. the “clean” URL, which has no physical existence on the server.
  • Finally comes the rewritten URL, i.e. the URL as it will be called internally on the server.
  • These 3 elements must be written on a single line, separated from each other by one or more spaces.

The format of the URL to be rewritten relies on regular expressions, whose basics you need to learn in order to define rewriting rules. Do not worry: in most cases it is very simple.

Here is the list of the elements used in the rewriting rules:

Element | Explanation
^ | Marks the beginning of the URL to be rewritten. This character is optional, but using it is more rigorous.
article | This part of the URL is not used directly. I could have written art instead of article. This prefix can be used to tell different URL patterns apart, and it helps the visitor understand what the page is about.
() | Parentheses enclose a variable whose value is reused in the 3rd part of the line.
[0-9]+ | Indicates that the variable is made up of one or more digits.
\.html | The extension of the clean URL. The backslash makes the period a literal character rather than the regular-expression wildcard.
$ | Marks the end of the URL to be rewritten. This character is optional, but using it is more rigorous.
/articles/ | This part is sometimes optional (it depends on the server configuration). In general it is enough to give the location of the file relative to the directory containing the .htaccess file (so this element can be left out). On some shared hosts (OVH and Sivit, to name only two), you must give the full path to the file from the root of the site, as in our example.
article.php | Name of the file the server must use to display the page. It is the name of a file that physically exists and contains the script (PHP in our example) that handles the dynamic page.
? | Mandatory character that precedes the series of variables passed in the rewritten URL.
id=$1 | Indicates that the variable named id takes the value found in the first pair of parentheses.
& | Character used to separate 2 variables in the rewritten URL.
rubrique=$2 | Indicates that the variable named rubrique takes the value found in the second pair of parentheses.
[L] | Flag (option) meaning “Last”, telling the rewriting module that it must stop. More precisely, if the URL requested by the visitor matches the pattern defined by this rule, the URL rewriting module should not examine the other rules in the rest of the .htaccess file. It is not always required, but it does no harm!
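To see how these elements fit together, the two rules of the article section can be simulated outside the server. The sketch below uses Python only as an illustration; it is a simplification of what mod_rewrite does, and the names `RULES` and `rewrite` are hypothetical. Rules are tried in order and, as with the [L] flag, processing stops at the first rule that matches.

```python
import re

# Each entry mirrors one RewriteRule: (pattern, target).
# \1, \2, \3 play the role of $1, $2, $3 in the .htaccess file.
RULES = [
    (re.compile(r"^article-([0-9]+)-([0-9]+)\.html$"),
     r"/articles/article.php?id=\1&rubrique=\2"),
    (re.compile(r"^article-([0-9]+)-([0-9]+)-([0-9]+)\.html$"),
     r"/articles/article.php?id=\1&page=\2&rubrique=\3"),
]


def rewrite(path):
    """Return the internal URL for a clean URL, or the path unchanged."""
    for pattern, target in RULES:
        match = pattern.match(path)
        if match:
            # Equivalent of [L]: stop at the first matching rule.
            return match.expand(target)
    return path


print(rewrite("article-12-5.html"))    # /articles/article.php?id=12&rubrique=5
print(rewrite("article-12-2-5.html"))  # /articles/article.php?id=12&page=2&rubrique=5
```

A path that matches no rule, such as contact.html, is returned untouched, which corresponds to the server handling the request normally.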

This example of rewriting rules is already enough to handle our article section, but more complex rules exist.

 

If you want to know more about the syntax available for writing these rules:

  • ([0-9]{1,2}) allows one or two digits
  • ([0-9]*) allows any digits, as many times as you like (including none)
  • ([-a-z]*) allows any lowercase letters and hyphens, as many times as you like
  • etc.

For more detail, I suggest consulting a tutorial on Perl regular expressions.
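The effect of these quantifiers is easy to try out before putting a rule on the server. The sketch below uses Python's re module as a stand-in, on the assumption that these basic constructs behave the same way as in the Perl-compatible expressions used by mod_rewrite; the helper name `matches` is hypothetical.

```python
import re


def matches(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


checks = [
    (r"[0-9]{1,2}", "42"),        # one or two digits
    (r"[0-9]{1,2}", "123"),       # three digits: too long
    (r"[0-9]*", ""),              # * also accepts nothing at all
    (r"[0-9]+", ""),              # + needs at least one digit
    (r"[-a-z]*", "url-rewriting"),
]
for pattern, text in checks:
    print(repr(pattern), repr(text), matches(pattern, text))
```

The difference between * and + matters in practice: with ([0-9]*), a URL such as article--5.html would still match the rule and pass an empty id to the script.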

Updating all internal links

Now that the URL patterns are defined and the rewriting rules written, it remains to check that every link across the site uses the right URL format.

The rewriting rules in the .htaccess file are not enough on their own to move your whole site to the new format with clean URLs! It is up to you to change the way links are written, in static and dynamic pages alike.

Of course, you can skip this step if you build URL rewriting in when the site is created, since you will have generated links in the right format from the start.
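For static pages, this conversion can be partly automated with a search and replace. The sketch below is a Python illustration only (the names `OLD_LINK` and `update_links` are hypothetical); it assumes that the variables always appear in the order id, page, rubrique, and it accepts both & and &amp; as separators.

```python
import re

# Old-style link: article.php?id=N[&page=N]&rubrique=N
OLD_LINK = re.compile(
    r"article\.php\?id=([0-9]+)"
    r"(?:&(?:amp;)?page=([0-9]+))?"
    r"&(?:amp;)?rubrique=([0-9]+)"
)


def _to_clean(match):
    article, page, rubrique = match.groups()
    numbers = [article, rubrique] if page is None else [article, page, rubrique]
    return "article-" + "-".join(numbers) + ".html"


def update_links(html):
    """Replace every old-style article link found in the HTML text."""
    return OLD_LINK.sub(_to_clean, html)


print(update_links('<a href="article.php?id=12&amp;page=2&amp;rubrique=5">Page 2</a>'))
# <a href="article-12-2-5.html">Page 2</a>
```

Links whose variables come in another order are left untouched by this sketch, so the result should still be checked afterwards.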

Uploading and testing

It is time to test! Upload all the modified files, including the .htaccess file, then check in your browser that the rewriting works.

To return to our example, compare what you get when you go to:

http://www.4engr.com/articles/article-12-2-5.html

and to:

http://www.4engr.com/articles/article.php?id=12&page=2&rubrique=5

You should get exactly the same page.

If the site is completely blocked (for example with a 500 error), remember that you only need to remove the .htaccess file (or undo the latest changes) for everything to return to normal.

I advise you to use a link checker on your site (for example Xenu's Link Sleuth, a program that installs on Windows). This type of software behaves like Googlebot: it crawls your pages by following every link it finds. If it finds no dead link (a link leading to a page that cannot be found), then you have made no mistake in your rewriting rules or in your internal links. Otherwise, fix them accordingly.
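The way such a checker works can be sketched in a few lines. The Python example below is an illustration, not a replacement for a real tool: instead of downloading pages it crawls a dictionary that maps each URL to its HTML (a hypothetical stand-in for the site), following links from a start page and reporting those that lead nowhere. The names `LinkCollector` and `find_dead_links` are assumptions.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_dead_links(pages, start):
    """Crawl from `start`; return (page, link) pairs whose target is missing."""
    seen, queue, dead = {start}, [start], []
    while queue:
        current = queue.pop(0)
        collector = LinkCollector()
        collector.feed(pages[current])
        for link in collector.links:
            if link not in pages:
                dead.append((current, link))
            elif link not in seen:
                seen.add(link)
                queue.append(link)
    return dead


site = {
    "index.html": '<a href="article-12-5.html">A</a> <a href="article-99-1.html">B</a>',
    "article-12-5.html": '<a href="index.html">Home</a>',
}
print(find_dead_links(site, "index.html"))  # [('index.html', 'article-99-1.html')]
```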

Once URL rewriting is in place across your site, two steps remain to finish optimizing it for search engines:

  1. Create links to all the pages.
  2. Optimize the code of each dynamic page.

1. Creating links to all the pages

Dynamic sites very often contain a large number of pages. URL rewriting allows good indexing, but it is not enough on its own for all the pages of your site to be indexed. You also need to make sure that search engine robots can reach your pages by following the links on your site.

  • If you have a section containing articles (news, for example), provide an archive area with links to all the articles, organized chronologically.
  • If you have a forum with thousands of discussions, check that all the links used to move from page to page use the right URL format. Here too you can provide an archive section with links to all the forums and all their discussions, spread over as many pages as necessary (limit yourself to about a hundred links per page, possibly a few more).
  • If you have a product catalogue, you have certainly classified the products into categories, on one or more levels. Present them as a directory that lets visitors browse the whole catalogue through ordinary <a href> links. This directory can be complemented by an internal search engine, which visitors often appreciate and which is of course compatible with your product catalogue.
  • A suitable “site map” page can also be created for the same purpose.
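Spreading archive links over several pages comes down to cutting the full list into slices. Here is a minimal Python sketch, given as an illustration (the function name `paginate` is hypothetical, and the default of 100 simply reflects the rough limit suggested above):

```python
def paginate(links, per_page=100):
    """Split a list of links into archive pages of at most `per_page` links."""
    if per_page < 1:
        raise ValueError("per_page must be at least 1")
    return [links[i:i + per_page] for i in range(0, len(links), per_page)]


# 250 hypothetical article URLs spread over archive pages.
links = [f"article-{n}-1.html" for n in range(1, 251)]
pages = paginate(links)
print([len(page) for page in pages])  # [100, 100, 50]
```

Each archive page should itself be linked from the previous one (or from an index), so that robots can reach every slice.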

2. Optimizing the code of each dynamic page

A dynamic page is nothing more than an HTML page generated on demand by a script. Such a page is usually based on a template that carries the design of the rest of the site and contains certain areas whose content is generated by querying a database.

The HTML code of the page must be optimized for search engines in the same way as for a static page.

I advise you to provide the following areas on all these pages, filled in with content that is unique to each page:

  • The title (<title> tag).
  • The information about the page, useful for example for directories and, to a lesser extent, for search engines (meta tags: <meta description> and <meta keywords>).
  • A heading in the text content, at the top of the page (<h1> tag).
  • A descriptive text that summarizes the page, placed as high as possible on the page (<p> tag).
  • Links to related pages (<a> tag).
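As an illustration of this checklist, the Python sketch below fills each of these areas from a single database record. The record fields (`title`, `summary`, `keywords`, `related`) and the function name `render_page` are hypothetical; a PHP template would follow the same idea.

```python
from html import escape


def render_page(record):
    """Build a page whose title, meta tags, h1, summary and links are unique."""
    title = escape(record["title"])
    summary = escape(record["summary"])
    keywords = escape(", ".join(record["keywords"]))
    lines = [
        "<html>",
        "<head>",
        f"<title>{title}</title>",
        f'<meta name="description" content="{summary}">',
        f'<meta name="keywords" content="{keywords}">',
        "</head>",
        "<body>",
        f"<h1>{title}</h1>",
        f"<p>{summary}</p>",
    ]
    for url, label in record["related"]:
        lines.append(f'<a href="{escape(url)}">{escape(label)}</a>')
    lines += ["</body>", "</html>"]
    return "\n".join(lines)
```

Escaping every value with html.escape keeps characters such as & or quotes in a title from breaking the generated markup.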

If you follow these instructions, you will quickly have a site whose thousands of pages are all indexed and optimized for search engines!

Conclusion

Dynamic sites are increasingly common today because they offer great flexibility of management and make it possible to exploit large quantities of information. Even though their use has become widespread, they are still very often badly designed in terms of search engine optimization, since they often combine several blocking factors (session identifiers, complex URLs).

Setting up URL rewriting can be long, complex and technical work, but it produces results out of all proportion to those of static sites. Once properly in place, URL rewriting (combined with dynamic optimization of the pages) very often lets a site rank on Google and the other major search engines for thousands of phrases, rather than a few dozen as is usually the case with static sites.
