If you have your own domain, this is a simple test to see if you might have a canonicalization problem…
Go to Edward Lewis’s great header checking tool at his SEO website. Put in your main website address as you think of it. For example, the website address that I give out is http://www.1918.com. Change the User Agent to “Googlebot”, then click Check Headers.
What you want to get back is something like:
1. REQUESTING: http://www.1918.com
GET / HTTP/1.1
Connection: Keep-Alive
Keep-Alive: 300
Accept:*/*
Host: www.1918.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
SERVER RESPONSE: 200 OK
What that tells me is that my web server is responding exactly how I want it to.
But what about the other ways people may try to get to my site? Without the www, with or without the trailing slash, with or without the filename index.php?
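If you would rather not type every variation by hand, here is a minimal sketch that builds the list for you. The function name homepage_variants and its parameters are my own inventions for illustration, and it assumes a plain http:// site whose index file is index.php; adjust both to your setup.

```python
def homepage_variants(canonical_host, index_file="index.php"):
    """Return the homepage URL variations worth testing.

    canonical_host is the host you give out, e.g. "www.1918.com".
    index_file is an assumption: use index.html, default.aspx, etc.
    if that is what your site serves.
    """
    # Strip a leading "www." so both host forms can be rebuilt.
    if canonical_host.startswith("www."):
        bare = canonical_host[4:]
    else:
        bare = canonical_host

    variants = []
    for host in ("www." + bare, bare):
        # No slash, trailing slash, and explicit index file.
        for path in ("", "/", "/" + index_file):
            variants.append(f"http://{host}{path}")
    return variants


if __name__ == "__main__":
    for url in homepage_variants("www.1918.com"):
        print(url)
```

Each URL it prints is one you can paste into the header checker.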
Let’s check in order of common problems:
Without the www in front of the domain:
1. REQUESTING: http://1918.com
GET / HTTP/1.1
Connection: Keep-Alive
Keep-Alive: 300
Accept:*/*
Host: 1918.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
SERVER RESPONSE: 301 MOVED PERMANENTLY
Date: Tue, 07 Sep 2010 02:04:30 GMT
Server: Apache/2.2
Location: http://www.1918.com/
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 190
Keep-Alive: timeout=2, max=10
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
Redirecting to http://www.1918.com/ ...
2. REQUESTING: http://www.1918.com/
GET / HTTP/1.1
Connection: Keep-Alive
Keep-Alive: 300
Accept:*/*
Host: www.1918.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
SERVER RESPONSE: 200 OK
Perfect. Any request that comes in without the www gets re-routed to my one canonical URL. The reason is that I don’t want links to be split between two different URLs that serve identical pages.
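The same check can be scripted if you don't have the header tool handy. This is only a rough sketch using Python's standard http.client module, not how the tool itself works. The names fetch_once and redirect_chain are mine, the User-Agent string is copied from the output above, and the five-hop limit is an arbitrary assumption.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# User-Agent string as it appears in the header tool's output.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def fetch_once(url, user_agent=GOOGLEBOT_UA):
    """Send one GET without following redirects; return (status, location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"  # an empty path is requested as "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": user_agent})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def redirect_chain(url, fetch=fetch_once, max_hops=5):
    """Follow redirects by hand and return [(url, status), ...]."""
    chain = []
    for _ in range(max_hops):
        status, location = fetch(url)
        chain.append((url, status))
        if status in (301, 302, 303, 307, 308) and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        else:
            break
    return chain


if __name__ == "__main__":
    for hop_url, hop_status in redirect_chain("http://1918.com"):
        print(hop_status, hop_url)
```

The fetch parameter is there so the chain logic can be tried out with a stand-in function instead of real network requests.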
Now let’s check whether the trailing slash (http://www.1918.com/) causes any problem. It doesn’t: the result is the same as the first test.
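The reason the trailing slash makes no difference on the homepage is visible in the first test: the address was entered without a slash, yet the request line was still GET /. A small sketch of that, using a helper name (request_target) that I made up for illustration:

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path a client puts on the request line for this URL."""
    path = urlsplit(url).path
    # HTTP clients send "/" when the URL has no path at all.
    return path or "/"


if __name__ == "__main__":
    print(request_target("http://www.1918.com"))
    print(request_target("http://www.1918.com/"))
```

Both calls give the same path, so the server cannot tell the two addresses apart. That only holds for the bare domain; on deeper paths, /about and /about/ are different requests and are worth testing separately.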
Finally, with index.php appended to the main url:
1. REQUESTING: http://www.1918.com/index.php
GET /index.php HTTP/1.1
Connection: Keep-Alive
Keep-Alive: 300
Accept:*/*
Host: www.1918.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
SERVER RESPONSE: 301 MOVED PERMANENTLY
Date: Tue, 07 Sep 2010 02:19:18 GMT
Server: Apache/2.2
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Pingback: http://www.1918.com/xmlrpc.php
X-Powered-By: W3 Total Cache/0.9.1.2
Set-Cookie: PHPSESSID=4c36183d09e3acaecef7c7e1ca7a13a7; path=/
Vary: Accept-Encoding,User-Agent
Location: http://www.1918.com/
Content-Encoding: gzip
Content-Length: 20
Keep-Alive: timeout=2, max=10
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
Redirecting to http://www.1918.com/ ...
2. REQUESTING: http://www.1918.com/
GET / HTTP/1.1
Connection: Keep-Alive
Keep-Alive: 300
Accept:*/*
Host: www.1918.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
SERVER RESPONSE: 200 OK
Again, exactly what I want to happen. For the same reason as last time, I want any link to my homepage to point at one page, not one of its cousins, so I make sure all doors lead back to the main entry.
If you ran this test and your server answered with a 200 OK for more than one of these variations, instead of redirecting them all to a single address, you may have a dreaded canonicalization problem! I’ve talked about one quick way to fix canonicalization, but if you need more help, let me know.
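To turn the results into a yes-or-no answer, something like the sketch below could work. It takes the hops you saw for one variation as a list of (url, status) pairs. The function names are hypothetical, and treating anything other than a 301 as a problem is my assumption, based on the 301 MOVED PERMANENTLY responses in the tests above.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Lower-case the host and treat an empty path as "/"."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}{parts.path or '/'}"


def variant_looks_wrong(chain, canonical_url):
    """Return True if this variation did not end up cleanly at the canonical URL.

    chain is a list of (url, status) pairs, first request first.
    """
    *hops, (final_url, final_status) = chain
    if final_status != 200:
        return True  # the last stop did not serve a page at all
    if normalize(final_url) != normalize(canonical_url):
        return True  # the server answered at a non-canonical address
    # Assumption: every redirect on the way should be a permanent 301.
    return any(status != 301 for _, status in hops)


if __name__ == "__main__":
    no_www = [("http://1918.com", 301), ("http://www.1918.com/", 200)]
    print(variant_looks_wrong(no_www, "http://www.1918.com"))
```

A variation that answers 200 OK at its own address, such as the no-www host serving the page directly, comes back True.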
solopixelJason says
Excellent explanation, Phil. I’ve been going round in circles, as the two tests I was using gave conflicting results, and neither let me see as much detail or drill down on the exact issue as I can now, thanks to this tutorial.
Using this I can now see that www redirects to no-www OK (yippee), but there is definitely a duplicate using the trailing slash on the home page only. Everywhere else seems OK, as I can see for myself now: all moved permanently! To redirect index.html I’m using:
# redirect html pages to the root domain
RewriteRule ^index\.html$ / [R=301,L]
RewriteRule ^(.*)/index\.html$ /$1/ [R=301,L]
So not sure if this is wrong
Anyway I’ll keep plugging away bound to sort it sooner or later but thanks for all this again, ha ha did I say thanks yet 🙂