How To Not Make Duplicate Content In Your WordPress Blog
By Barry Welford
Article Date: 2009-06-04
The best way to ensure a web page ranks well in Google keyword searches is to make sure it is the only one on the web that includes the content on the page. In this way you avoid several web pages all having a somewhat equal possibility of being judged relevant for the particular keyword search. This increases the chance that this unique page will outrank other quite independent web pages that cover the same topic. That's the theory and it seems to work out well in practice.
WordPress is great software for producing blogs, but out of the box the WordPress content management system produces a series of pages that all contain the same content. Just see the concerns expressed in this WebmasterWorld thread about WordPress And Google: Avoiding Duplicate Content Issues, where several coding suggestions were offered to avoid the problems. More recently, David Bradley has suggested that something called the canonical link element can be the solution to Avoiding Duplicate Content Penalties.
We should quickly add that this is not a weakness of WordPress alone, since many other CMSs suffer from similar problems. It is a well-known problem and you can find an excellent article on how to Avoid Duplicate Content on Wordpress Websites, which gives the appropriate steps to take. The most important step of all is to have the right robots.txt file.
I wondered how well people were grappling with this duplicate content problem and decided to check out some of Technorati's Blogger Central / top 100 blogs. In particular, I thought a check of their robots.txt files would give an indication of whether they had tried to solve the problem. Here is what I found in the robots.txt files of the eight most popular blogs.
- The Huffington Post
    # All robots will spider the domain
    User-agent: *
    Disallow:

    # Disallow directory /backstage/
    User-agent: *
    Disallow: /backstage/
- TechCrunch
    User-agent: *
    Disallow: /*/feed/
    Disallow: /*/trackback/
- Engadget
(empty)
- Boing Boing
    User-agent: *
    Disallow: /cgi-bin
- Mashable!
    User-agent: *
    Disallow: /feed
    Disallow: /*/feed/
    Disallow: /*/trackback/
    Disallow: /adcentric
    Disallow: /adinterax
    Disallow: /atlas
    Disallow: /doubleclick
    Disallow: /eyereturn
    Disallow: /eyewonder
    Disallow: /klipmart
    Disallow: /pointroll
    Disallow: /smartadserver
    Disallow: /unicast
    Disallow: /viewpoint
    Disallow: /LiveSearchSiteAuth.xml
    Disallow: /mashableadvertising2.xml
    Disallow: /rpc_relay.html
    Disallow: /browser.html
    Disallow: /canvas.html

    User-agent: Fasterfox
    Disallow: /
- Lifehacker
    User-Agent: Googlebot
    Disallow: /index.xml$
    Disallow: /excerpts.xml$
    Allow: /sitemap.xml$
    Disallow: /*view=rss$
    Disallow: /*?view=rss$
    Disallow: /*format=rss$
    Disallow: /*?format=rss$
    Disallow: /*?mailto=true
- Ars Technica
    User-agent: *
    Disallow: /kurt/
    Disallow: /errors/
- Stuff White People Like
    User-agent: IRLbot
    Crawl-delay: 3600

    User-agent: *
    Disallow: /next/

    # har har
    User-agent: *
    Disallow: /activate/

    User-agent: *
    Disallow: /signup/

    User-agent: *
    Disallow:
As you may notice, the very top blogs seem to have a singular disregard for this issue, with minimal robots.txt files. Further down the list, it would seem that some of these top blogs do realize the importance of limiting what the search engine robots crawl and index.
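Several of the files listed above (TechCrunch, Mashable, Lifehacker) rely on the pattern characters * and $, which Google's crawler honours as an extension to the original robots.txt convention. The following is a minimal Python sketch of how such a pattern can be tested against a URL path. The function name rule_matches and the sample paths are hypothetical and used only for illustration; real crawlers also weigh Allow against Disallow, which this sketch does not attempt.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Assumed semantics (Google-style): '*' matches any run of characters,
    a trailing '$' anchors the end of the URL, and otherwise the pattern
    is treated as a prefix of the path.
    """
    if not pattern:
        # An empty "Disallow:" value blocks nothing.
        return False
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with a wildcard.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start of the path, giving prefix behaviour.
    return re.match(regex, path) is not None


# Hypothetical paths, for illustration only.
print(rule_matches("/*/feed/", "/2009/06/some-post/feed/"))  # True
print(rule_matches("/*/feed/", "/feed/"))                    # False
print(rule_matches("/index.xml$", "/index.xml?view=rss"))    # False
```

The second result suggests why Mashable lists both /feed and /*/feed/: the wildcard form needs at least one path segment before /feed/, so the site-wide feed has to be blocked by its own rule.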
The impetus for exploring this issue came after noticing an additional complication that results if you put An Elegant Face On Your WordPress Blog by using Multiple WordPress Loops.
This results in many extra web pages that humans would likely not see but search engine spiders would certainly crawl. These occur if you go to the Home Page and then check through the Previous Pages successively. This required an extra line in the standard robots.txt file so the current robots.txt file for this blog appears as follows:
    User-agent: *
    Disallow: /page/
    Disallow: /wp-login.php
    Disallow: /wp-admin/
    Disallow: /wp-register.php
    Disallow: /wp-login.php?action=lostpassword
    Disallow: /index.php?paged
    Disallow: /?m
    Disallow: /test/
    Disallow: /feed/
    Disallow: /?feed=comments-rss2
    Disallow: /?feed=atom
    Disallow: /?s=
    Disallow: /index.php?s
    Disallow: /wp-trackback
    Disallow: /xmlrpc
    Disallow: /?feed=rss2&p
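One way to sanity-check a file like this before uploading it is to run a few sample paths through Python's standard urllib.robotparser module. The sketch below uses only a subset of the rules above, namely the plain path rules, since the standard-library parser does simple prefix matching; the helper name allowed_paths and the example post path are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules listed above (plain path prefixes only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /page/
Disallow: /wp-login.php
Disallow: /wp-admin/
Disallow: /feed/
"""


def allowed_paths(robots_txt, paths, agent="Googlebot"):
    """Map each path to True if the given agent may crawl it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}


# "/2009/06/example-post/" is a made-up permalink for illustration.
results = allowed_paths(
    ROBOTS_TXT,
    ["/", "/page/2/", "/feed/", "/wp-admin/", "/2009/06/example-post/"],
)
for path, allowed in results.items():
    print(f"{path}: {'crawlable' if allowed else 'blocked'}")
```

With these rules the home page and the individual post should come back as crawlable, while the paged archive, the feed and the admin area should come back as blocked.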
Conclusion
Getting the robots.txt file correct is one of the easiest ways of increasing the visibility of your blog pages in search engine keyword searches. Leaving two essentially similar web pages means that the two divide up the 'relevance' that a single web page would have. That can mean something approaching a 50% reduction in potential keyword ranking for each of them. Perhaps the top blogs can ignore such improvements but most of us should not. Check out what the spiders may crawl by doing an evaluation of your website with Xenu Link Sleuth. We should carefully consider our robots.txt files and make sure they are doing an effective job. Is yours?
About the Author:
Barry Welford, President of SMM Internet Marketing Consultants, works with business owners and senior management on Internet Marketing strategy and action plans to grow their companies. He is a moderator at the Cre8asite Forums and writes on Business and the Internet in four blogs: Senior Money Memos, BPWrap, StayGoLinks and The Other Bloke's Blog.