Setting a unique user agent to help control spidering

January 23, 2006 in Google Mini, GSA, Spidering | Comments (1)

When your Google Mini or Google Search Appliance spiders websites, it identifies itself with a ‘User Agent’. Practically everything that reads documents on the web has a User Agent. For instance, when I use Internet Explorer to view a web page, it sends this:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

And Firefox sends this:

Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
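To see where this string travels, here is a minimal Python sketch that builds (but does not send) an HTTP request carrying a chosen User Agent. The helper name build_request and the example.com URL are made up for illustration; only the standard library is used.

```python
import urllib.request

# The Firefox string quoted above, reused as sample data.
FIREFOX_UA = ("Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8) "
              "Gecko/20051111 Firefox/1.5")


def build_request(url, user_agent):
    """Build an HTTP request object that would announce user_agent.

    Nothing is sent over the network here; the request is only constructed.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("http://www.example.com/", FIREFOX_UA)
    # urllib stores header names capitalised as "User-agent".
    print(req.get_header("User-agent"))
```

The web server receives this value in the User-Agent request header, which is what makes it possible to write rules for one particular crawler.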

When you first get a Mini or Search Appliance, it sends the following by default:

gsa-crawler (Enterprise; [code];

The [code] changes depending on your appliance – the Mini and GSA I use have different codes. The string also carries the administrator’s e-mail address, so that you can be contacted if the spidering is going too fast. The bit we’re interested in here is the ‘gsa-crawler’.

You can reset this to something more relevant to your project:

In the Mini:
View/Edit the collection, go to Crawler Parameters, then type in the new ‘User Agent Name’.

In the Google Search Appliance:
Go to Crawl and Index, then HTTP Headers, then type the new name in ‘Enter User Agent Name’.

When you’ve set it to something unique, you can use the robots.txt file of your website to control where it spiders.

So, say you have set your User Agent to ‘gsadeveloper-spider’. The appliance will now send:

gsadeveloper-spider (Enterprise; [code];

And in your robots.txt file you can exclude areas depending on this name. So on my main work site, Web Positioning Centre, I could exclude the Labs area by putting the following in the robots.txt file:

User-agent: gsadeveloper-spider
Disallow: /labs/

This will only block documents in the labs directory, and it will only affect my GSA spidering as ‘gsadeveloper-spider’.
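One way to sanity-check a rule like this before the appliance crawls is Python's standard urllib.robotparser module. This is only a sketch: it shows how a generic robots.txt parser reads the rule, not how the Mini or GSA evaluates it internally, and the example.com URLs and the helper name is_allowed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines from the example above.
ROBOTS_TXT = """\
User-agent: gsadeveloper-spider
Disallow: /labs/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    urls = (
        "http://www.example.com/labs/test.html",  # hypothetical page in /labs/
        "http://www.example.com/index.html",      # hypothetical page elsewhere
    )
    for agent in ("gsadeveloper-spider", "gsa-crawler", "Googlebot"):
        for url in urls:
            print(agent, url, is_allowed(agent, url))
```

Only the combination of ‘gsadeveloper-spider’ and a URL under /labs/ should come back as not allowed; the default ‘gsa-crawler’ name and other crawlers have no rule that applies to them.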

This can be useful if you want to block off areas of sites for particular projects. For instance, if you’re working on a large site and you have the lowest-level Google Mini, you may want to stop irrelevant pages from being spidered and using up some of the 100,000 document limit. But it only blocks them for you, not for any other GSAs spidering the site, and not for the big search engines like Google, Yahoo and MSN Search.

When blocking an area, you can make sure its documents were blocked either by searching for words in those documents or by checking whether they were excluded in the crawling report. In the Mini you get to this through:

View/Edit the collection, go to System Status, then Serving Status, then ‘Browse the hierarchy of URLs seen during crawl’. This will show you a list of the websites crawled. Click the number under ‘Excluded URLs’ for the site where you are blocking content. This will give you a list of URLs which were blocked; pages blocked using robots.txt will have “Crawled with empty body: Disallowed by robots.” next to them.

This message only appears for pages which have direct links to them from a crawled part of the site. The list will not show you other pages in the same section which can only be reached through the blocked area, as the crawler won’t have gone into that section to find the links to them.

Using ‘disallow’ in robots.txt stops anything in that directory being read, and it can be used to disallow single documents as well. You can find out more on the robotstxt site, which has an explanation of the standard and example code.
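As a rough illustration of the single-document case, the sketch below blocks one file instead of a whole directory and checks the result with Python's standard urllib.robotparser. The path /labs/draft.html, the example.com URLs and the helper name blocked_for are hypothetical, and a generic parser may not match the appliance's behaviour in every detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: block a single document rather than the whole directory.
SINGLE_DOC_RULES = """\
User-agent: gsadeveloper-spider
Disallow: /labs/draft.html
"""


def blocked_for(user_agent, url, robots_txt=SINGLE_DOC_RULES):
    """Return True if robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    agent = "gsadeveloper-spider"
    for path in ("/labs/draft.html", "/labs/index.html"):
        url = "http://www.example.com" + path
        print(path, "blocked" if blocked_for(agent, url) else "allowed")
```

Disallow values are matched as path prefixes, so a directory rule such as /labs/ covers everything beneath it, while a full file path covers just that document (and anything else whose path starts with the same characters).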

Comments (1)

  1. Comment by Lalith — April 14, 2011 @ 3:22 pm

    Thanks for the information!
    I am trying to feed pdfs to GSA, the error I am getting is “Crawled with empty body: Conversion error.”

    Can you help me resolving this issue?
