Tony Murphy
Welcome!
Hello and welcome. My name is Tony Murphy and this blog is focused on those building and improving a Web Business using Wordpress.
Practical Advice On Building And Promoting A Web Business

This post describes how to use a combination of robots.txt and robots meta tags to make sure that your Wordpress blog does not leak pagerank or authority.

Search engine robots want your Wordpress blog content. They are programmed to crawl your site, look at everything and report back to the Master Indexer with their findings. The Master Indexer then makes sure that your content can be found. However, there are some things that robots, in their relentless content-crunching march, should not have access to. For example, the indexing of duplicate content on your blog can lead to the dilution of your blog's authority.

This post addresses this and other related problems by outlining how you can ensure that all your relevant content is crawled and indexed, while access to non-relevant areas of your blog is restricted.

Robots are dumb; they just follow the Master Indexer's orders. When they reach your site they are programmed to crawl everything. However, with the right tools you can re-program them to crawl only the parts of your site that should be indexed.

Why Must The Robots Be Controlled?

We all want search engines like Google to index our fantastic and unique blog pages, and drive hordes of targeted traffic to them. However, there are some pages and directories that are not focused on content, and we don’t want those crawled or indexed.

What Parts Of Our Blog Do We Not Want Crawled By The Search Engines?

The main parts of our blog that we don’t want crawled are:

  • the Wordpress installation directories, and
  • any potential duplicate content like archives

What Is The Benefit Of Stopping Robots Seeing The Wordpress Directories?

There are three main benefits to stopping your directories being indexed. These are:

  • people are less likely to stumble across what we have in our Wordpress installation, e.g. which plugins we use, through a search result
  • the “theme” of our site won’t be diluted by the indexing of spurious non-relevant files
  • Googlebot has less work to do to index our site

How Do We Stop The Robots Crawling Our Directories?

The robots.txt file, which resides in the top-level directory of our site, contains rules that well-behaved robots such as Googlebot obey. Think of these rules as Jedi Mind Tricks that the major search engine robots do not resist. A detailed description of robots.txt is beyond the scope of this post, but you can find some excellent info on creating a robots.txt file here: How To Use And Create A Robots.txt File

What Should We Include In A Wordpress Robots.txt File?

The following is a simple example that can be used as the basis for your Wordpress Robots.txt file:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Allow: /wp-content/uploads

What this says is:

  • allow all robots to crawl my Wordpress blog, but
  • don’t allow the wp-admin directory to be crawled
  • don’t allow the wp-includes directory to be crawled
  • don’t allow the plugins directory to be crawled
  • don’t allow the cache directory to be crawled
  • don’t allow the themes directory to be crawled
  • do allow any content uploads, such as images, to be crawled
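
One way to sanity-check these rules before uploading them is to run them through Python's standard urllib.robotparser module. The sketch below is only an illustration: the sample paths are made up, and Python's parser applies rules in file order using simple prefix matching, which may differ from how a particular search engine resolves conflicting rules.

```python
from urllib.robotparser import RobotFileParser

# The simple robots.txt from this post, held in a string for testing.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Allow: /wp-content/uploads
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical paths, chosen only to exercise the rules above.
SAMPLE_PATHS = [
    "/wp-admin/options.php",
    "/wp-content/themes/default/style.css",
    "/wp-content/uploads/photo.jpg",
    "/2008/05/my-first-post/",
]

for path in SAMPLE_PATHS:
    verdict = "crawl" if parser.can_fetch("*", path) else "skip"
    print(f"{verdict:5}  {path}")
```

The two Wordpress paths should come back as "skip", while the upload and the post should come back as "crawl".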

Robots.txt files can be more complex, but this is a good start. A more detailed example follows, but I would suggest that you do not implement it unless you understand exactly what it is doing:

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /*?*
Disallow: /*?
Allow: /wp-content/uploads

You can use the robots.txt tool provided by Google Webmaster Tools to create and check your robots.txt file.
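
The detailed example relies on the * wildcard, so a plain prefix check is not enough to see what it blocks. The sketch below is a rough approximation of the matching rules documented for Google-style crawlers: a * matches any run of characters, a trailing $ anchors the end of the URL, and the longest matching rule wins, with Allow winning ties. The function names and sample paths are illustrative assumptions, and this is no substitute for testing the real file in Webmaster Tools.

```python
import re

# The detailed robots.txt from this post as (kind, path) pairs.
RULES = [
    ("disallow", "/cgi-bin"),
    ("disallow", "/wp-admin"),
    ("disallow", "/wp-includes"),
    ("disallow", "/wp-content/plugins"),
    ("disallow", "/wp-content/cache"),
    ("disallow", "/wp-content/themes"),
    ("disallow", "/trackback"),
    ("disallow", "/feed"),
    ("disallow", "/comments"),
    ("disallow", "/category/*/*"),
    ("disallow", "*/trackback"),
    ("disallow", "*/feed"),
    ("disallow", "*/comments"),
    ("disallow", "/*?*"),
    ("disallow", "/*?"),
    ("allow", "/wp-content/uploads"),
]


def rule_to_regex(path):
    """Turn a robots.txt path pattern into a regex anchored at the start."""
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    pattern = ".*".join(re.escape(piece) for piece in path.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_allowed(url_path, rules):
    """Longest matching rule wins; Allow beats Disallow on a tie."""
    best = None  # (pattern length, is_allow)
    for kind, path in rules:
        if rule_to_regex(path).match(url_path):
            candidate = (len(path), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the path may be crawled.
    return True if best is None else best[1]


# Hypothetical paths, chosen only to exercise the rules above.
for sample in [
    "/my-first-post/",
    "/my-first-post/feed/",
    "/category/news/page/2/",
    "/?s=robots",
    "/wp-content/uploads/photo.jpg",
]:
    print("crawl" if is_allowed(sample, RULES) else "skip ", sample)
```

Running your own URLs through a check like this makes it easier to spot a rule that blocks more than you intended, such as the query-string rules catching every URL that contains a question mark.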

Why Do We Want To Stop Wordpress Archives Being Crawled?

First Things First - There is no Google duplicate content penalty. However, the Google algorithms will decide which version of a particular piece of content is going to rank best. That can look like a duplicate content penalty, because no-one outside Google knows how the algorithms are coded and they change over time. Authority sites are the best place to put your unique content, as they tend to rank better than smaller and newer sites. This is because over time authority sites (when done well) build up trust.

Now that's out of the way, let's look at why you might not want your archives crawled and indexed. If you have a piece of good and unique content and it appears in multiple places on your site (date, category, tag and author archives), then the robots will crawl each of these “duplicates” and report back to the Master Indexer. The problem is that this will give the Master Indexer a headache, as it will have to decide which of these duplicates is going to rank higher. This is compounded by the fact that each of these duplicates may get links from different blogs and sites. The net result is that the authority of your main content post is diluted, and you don’t want that to happen. If your site is fresh and new this is not going to make much of a difference, but as your site grows it will. Besides, it's always good to start any new endeavour with the right approach and develop good habits.

How Do We Stop Robots Crawling Our Wordpress Archives?

To control this duplicate content issue with archives, we can use meta robots tags to tell the robots which archives to index and which to ignore. There is no easy way to add the necessary meta robots tags by hand, but there is a very good plugin that will do the job for you. The Meta Robots Wordpress Plugin allows you to control which archives are crawled and indexed. This plugin makes it easy to:

  • stop robots from indexing your Wordpress login, register and admin pages
  • disable author-based archives
  • disable date-based archives
  • nofollow the category listings on single posts and pages
  • nofollow outbound links on your front page
  • nofollow tag links

It also has a number of other useful features. This is one plugin that I always use.
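
For reference, a robots meta tag is a single line in a page's head section, such as <meta name="robots" content="noindex,follow">, which asks robots not to index that page but still to follow its links. A small Python sketch like the one below can confirm which tag an archive page is actually serving. The sample HTML is a made-up stand-in, not output copied from the plugin.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Split "noindex,follow" into its individual directives.
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.append(directive)


def robots_directives(html):
    """Return the list of robots directives found in an HTML page."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# Hypothetical archive page, written by hand for this example.
SAMPLE_ARCHIVE_PAGE = """<html><head>
<title>Archive for May</title>
<meta name="robots" content="noindex,follow">
</head><body><p>Posts from May</p></body></html>"""

found = robots_directives(SAMPLE_ARCHIVE_PAGE)
print(found)
print("kept out of the index" if "noindex" in found else "indexable")
```

To check a live page, you would pass in the HTML you get from viewing the source of one of your own archive pages instead of the sample string.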

Summary

So we can see from this post that a combination of robots.txt and robots meta tags can be used to ensure that your Wordpress blog does not leak any authority or pagerank. Do please comment if you have any questions or would like to start a discussion.

11 Responses to “Bloggers Guide To Using Robots.txt And Robots Meta Tags To Optimize Wordpress Indexing”

  1. Ken Says:

    Tony, you have a great resource here! Thanks for the info!

  2. Michelle Adams Says:

    Hi Tony,
    This robot business has always seemed so confusing to me…you’ve just spelt it out in ‘plain english’, thank you!

    Now I know I must go fix it, GULP. I hope that plug-in is easy to work with.:)

    Thanks again for an informative post.

  3. Jayblogger Says:

    Hi Tony,

    Another great post. I have you set up in my feed reader now!

    Thanks for the meta robots plugin link - That makes life easier.

    Jason

  4. David Rogers Says:

    Hello
    I’ve been on wordpress 18 months not knowing this so thanks for the advice. If you use the plugin, do you still need the robots file?

  5. Maria | Never the Same River Twice Says:

    Thanks. Even I (a very nontechnical person) can follow this one!

  6. Nassorn Says:

    Hi Tony,
    Your contributions always value to me. I become your subscriber.
    Thanks.

  7. Tony Murphy Says:

    Hey, thanks for all the great feedback

    Michelle - the plugin is easy to work with

    David - the plugin does a lot of the work for you, but if you don’t want your directories indexed you will need the robots.txt file. However, there is a robots.txt section in the plugin settings menu. You can use this to enter your robots.txt info rather than uploading one. It will then create the robots.txt file for you. Before you do this you should use the Google Webmaster Tools to check that your robots.txt file works and allows access to all your pages etc.

    Nassorn - sounds great

    cheers
    Tony

  8. Donna Miller Says:

    Great article Tony, I didn’t know about this plugin, so thanks for the resource.

    I think people who don’t already know about the difference between the robots.txt file and the robots meta tag may be a bit confused though because you don’t mention the meta tags until the very end. But I take it the plugin works with both? That would be great to only have one place to go to change all the robot commands.

    Thanks again for a great resource!

  9. Tony Murphy Says:

    Donna - good point, I may expand this guide to cover more info. The plugin does allow you to create a robots.txt file as well as use the robots tags to “sculpt your site's pagerank.” However, it's important that you test your robots.txt file using the Webmaster Tools before adding it to your Wordpress installation.

  10. Lance Nelson Says:

    Hi Tony, very useful info here, thank you.

    Lance

  11. Using Robots.txt To Avoid Duplicate Content Penalties | Writing The Perfect Article Says:

    [...] to Tony Murphy, Search engine robots want your Wordpress blog content. They are programmed to crawl your site, [...]
