
Duplicate content has long been a problem on the web. Not just a problem for content producers – who create great content only to see sites like Mahalo scrape it and republish it to generate income for themselves.
Sometimes it’s a problem for a normal company. At McClatchy newspapers, they used to publish the same story into multiple sections because it fit in both “local” and “business”. Bingo – duplicate content.
On February 12, 2009 Google and Yahoo announced support for a rel=canonical tag.
That tag allows a web page to tell the search engine spiders, “I’m not the real page for this content, I’m just a cousin, the real page is located at…”
Using the newspaper as an example, they would designate one of those story versions as the canonical version, and the other page would have a rel=canonical link element in its head pointing at it.
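A minimal sketch of what that looks like (the URLs here are hypothetical, just to illustrate the newspaper scenario): the duplicate "business" copy of the story carries a single link element in its head pointing at the "local" version.

```html
<!-- In the <head> of the duplicate page (the "business" section copy). -->
<!-- Both URLs are hypothetical, for illustration only. -->
<link rel="canonical" href="http://www.example.com/local/story-123.html" />
```

The engines then treat the "local" URL as the page that should appear in search results.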
As good as that was, it still left a big gap that people asked about the minute the original announcement was made: what about cross-domain tagging?
Some cross-host canonical tagging works. I can say that www.1918.com/foo.html is the canonical page even though backup.1918.com/foo.html has the exact same article. Intra-domain canonicalization works now. What doesn’t work is true cross-domain rel=canonical tagging.
That may be changing soon. During his April 13th Webmaster Help video, Matt Cutts says “…we are very far along the path of being able to look at rel=canonical and apply it across domains.”
If you syndicate content, make sure you include the canonical tag pointing back at the original source.
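As a sketch of that advice (the syndicating domain and path here are hypothetical), the copy living on the partner site would point back at your original:

```html
<!-- In the <head> of the syndicated copy on the partner's site. -->
<!-- syndicator.com and the article path are hypothetical examples. -->
<link rel="canonical" href="http://www.originalsite.com/articles/my-article.html" />
```

Even before cross-domain support fully lands, including the tag costs nothing and positions the original URL to get the credit once the engines honor it.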
Photo by: Sam
That is actually a really cool idea, and something I never really considered. It would be a pain if someone started linking to an off-site version of your article and wasn’t helping drive traffic to your original domain. I wonder how Google would factor this into your site’s reputation, though. Would it still give the off-site article the same relevance? Would it combine them both for the original site? How much can the two pages differ before Google decides they’re not duplicates at all and then dings you for trying to drive traffic to unrelated pages, if this tag does factor into page ranking? Or am I thinking about this way too hard?
Joshua – Google says that the pages need to be close (but not exact).
Also, if page A is the canonical version for page B, then Google will not whack you for duplicate content, because you were smart enough to use the rel=canonical tag.
So in effect, it does “combine them” for you.
Also, another situation where a lot of people don’t understand canonical tags is with duplicate versions of a website:
1. http://www.website.com
2. website.com
3. http://www.website.com/index.html
4. website.com/index.html
All are different in the eyes of a search engine, and a canonical tag is a great way to tell the engines which one of these is the real one (even if the webmaster can’t set up a redirect themselves on the server side).
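A minimal sketch of the fix, using the example domain from the list above: put the same canonical link in the head of every variant, all pointing at whichever version you pick as the real one.

```html
<!-- In the <head> of all four URL variants (with and without www,
     with and without /index.html), all declaring one preferred URL. -->
<link rel="canonical" href="http://www.website.com/" />
```

That way the engines consolidate the duplicates onto the single URL you chose, instead of splitting link equity across four addresses.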
Great post!