Archive for the ‘Spidering’ Category

Ignoring specific content on a page

July 28, 2006 in Google Mini,GSA,Spidering | Comments (2)

If you want your Google Mini or Search Appliance to ignore part of your page, you can use some special tags to stop the content being indexed (and therefore brought back in the search results.)

Surround the content you want ignored with the following tags:

<!-- googleoff: index --> <!-- googleon: index -->

So if you have

<!-- googleoff: index --> I like bees <!-- googleon: index -->

On your page and you search for ‘bees’, it won’t come up, even if the page has been spidered. The only people who will find out about your love of buzzing insects will be those who have found the page through other means.
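To preview roughly what text is left for indexing, here is a minimal Python sketch that strips the tagged regions out of a page. It assumes a simple, non-nested model of the tags. The function name `strip_unindexed` and the sample page are made up for illustration, and the appliance's own parser may well behave differently.

```python
import re

# Matches everything from a "googleoff: index" comment up to the next
# "googleon: index" comment. Assumption: regions are not nested.
GOOGLEOFF_INDEX = re.compile(
    r"<!--\s*googleoff:\s*index\s*-->.*?<!--\s*googleon:\s*index\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_unindexed(html: str) -> str:
    """Return the HTML with googleoff/googleon index regions removed."""
    return GOOGLEOFF_INDEX.sub("", html)


# Hypothetical page: one indexable sentence and one excluded sentence.
page = (
    "<p>Honey for sale</p> "
    "<!-- googleoff: index --> I like bees <!-- googleon: index -->"
)
indexable = strip_unindexed(page)
print(indexable)
```

Searching the returned text for a word gives a rough idea of whether a search for that word could match the page.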

This can be useful for excluding parts of your page that the appliance might find confusing. For instance, ‘H’ wants to exclude his breadcrumb trail.

Crawling status in your Google Mini

May 4, 2006 in Google Mini,Spidering | Comments (0)

When you’re spidering (or ‘crawling’) using your Google Mini, you get a little box telling you its status – or at least you do on the old/v1 Mini; I haven’t tried a new Mini yet.

Within the box you get a list of:
Total Inflight URLs – these are links it has found, but not read the pages of yet.

Total Crawled URLs – pages that have been read

Locally Crawled URLs – this may come up when you’re re-spidering. It lists pages which have not changed since the last spidering, so the Mini just reads its internal copy rather than the live pages, saving time and bandwidth.

Excluded URLs – these are pages that have been excluded from being read for some reason, either via a robots.txt exclusion, or because of one of the rules you have set in the ‘Configure Crawl’ section.

All of these labels are links; if you click on them you get a list of the documents in each section. Clicking on ‘Total Crawled URLs’ gives you a list of pages normally labelled ‘New Document’; within ‘Locally Crawled URLs’ they will be labelled ‘Cached Version.’

What makes a ‘New Document’?

By default, the Google Mini will look at the ‘Last-Modified’ header of the page. If it is more recent than the last spidered version of the page that it has stored, it will read and index the live page.

If your pages are straight HTML, then Last-Modified will be when they were created on the server – so when they were uploaded by FTP in most cases. However, if your pages are dynamic, for instance PHP or ASP pages, and use information from a database or even include files, then their Last-Modified date will usually be the moment they were served up to the Mini, or to anyone visiting your site. This means the Mini will read all of the pages, even if they haven’t changed since it last spidered them, because the date from the web server tells it each page has recently changed. This does not hurt your site in any way, but it means you will use up more time and bandwidth in your spidering, as the Mini could be reading pages which haven’t changed since they were last spidered.
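The comparison described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the Mini's actual code: the function name `is_new_document`, the stored crawl time and the two header values are all assumptions made up for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_new_document(last_modified_header: str, last_crawled: datetime) -> bool:
    """True if the Last-Modified header is more recent than the stored crawl time."""
    modified = parsedate_to_datetime(last_modified_header)
    return modified > last_crawled


# Hypothetical time of the previous crawl.
last_crawled = datetime(2006, 5, 1, 12, 0, 0, tzinfo=timezone.utc)

# A static page uploaded before the last crawl, and a dynamic page whose
# header is simply the moment it was served.
static_page = "Mon, 03 Apr 2006 09:30:00 GMT"
dynamic_page = "Thu, 04 May 2006 10:15:00 GMT"

print(is_new_document(static_page, last_crawled))
print(is_new_document(dynamic_page, last_crawled))
```

The dynamic page always looks newer than the stored copy, which is why it gets re-read on every crawl.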

Finding your page’s ‘Last-Modified’ date

If you use Firefox, you can install the Web Developer Toolbar for it. When you are looking at a page, click on ‘Information’ then ‘View Response Headers’ and you’ll see the ‘Last-Modified’ date being reported to your browser and any passing web spiders.

Avoiding session IDs when spidering

April 20, 2006 in Google Mini,GSA,Spidering | Comments (4)

Many web sites use sessions to keep track of visitors as they browse around. If you do not have cookies turned on in your browser, the session ID may be sent through the URL instead so the site can still track you. This is very useful if it’s storing your shopping basket information, but it can have drawbacks.

Unfortunately sessions in the URL can upset spidering. A Google Search Appliance or Mini will generally open several ‘connections’ to a web site when it is spidering; this is like having several independent people browsing the site at the same time. Each of these connections receives a different session ID, which makes the URLs look different to the spiders. This in turn means each connection may spider the same pages that have already been covered. Also, if the session times out it may be replaced by a new session when the next page is spidered, which means that again the spider will re-read pages it has already found. This is because this:


And this:


Look like different pages, even though they may turn out to have the same content. To avoid this happening, you can stop the spider reading pages which have session IDs in the URL. You can avoid the most common session IDs by adding these lines to the ‘Do Not Crawl URLs with the Following Patterns:’ section of ‘URLs to Crawl’:


The web sites you are spidering may still contain session IDs. It is worth checking with the site owner whether this is going to be a problem, and keeping an eye on the ‘Inflight URLs’ shown in ‘System Status’ – ‘Crawl Status’ when spidering a site for the first time. If the same URLs are turning up a lot, you may have a session problem. You’ll need to stop spidering the site and work out which bit of the URL you need to ignore, then you can add it to the do not crawl list like the examples above.
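If you have a list of inflight URLs and want to check whether session IDs are the cause of the repeats, something like this Python sketch can help. The URLs and the parameter names in `SESSION_PARAMS` are made up for illustration, so swap in whatever your own site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: some commonly seen session parameter names (lower case).
SESSION_PARAMS = {"phpsessid", "sid", "sessionid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Two hypothetical URLs as two crawler connections might see them.
first = "http://www.example.com/basket.php?item=42&PHPSESSID=abc123"
second = "http://www.example.com/basket.php?item=42&PHPSESSID=def456"

print(first == second)
print(strip_session_id(first) == strip_session_id(second))
```

If many URLs collapse to the same address once the session parameter is removed, that parameter is the bit to add to the do not crawl list.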

How do I spider a protected area with a Google Mini?

January 30, 2006 in Google Mini,Q&A,Spidering | Comments (8)

If you have a protected area of your website, but still want to spider it and have the results available to searchers (e.g. if you have an extranet that you want to offer searching on), then you can give a Google Mini a username and password so it can get in to the area and spider it.

For instance, you have used .htaccess and .htpasswd files to protect a directory on your website called ‘clientarea’, and you need to use the username ‘myuser’ and password ‘mypass’ to access it.

In the Mini’s Admin area, go in to ‘Configure Crawl’. If there isn’t a direct link to the protected area from another area you are spidering, add its URL to ‘Start Crawling from the Following URLs.’ Note: when I tried this, I had to add a direct link to a particular file, rather than just to the directory. So in this example we can add in:

Click ‘Save URLs to Crawl’ and go to the ‘Crawler Access’ area.

In the area labelled ‘Users and Passwords for Crawling’ you need to put the URL of the protected area, and a username and password to access it. If you also need to set a domain for the protected area, fill in that box as well.

For URLs Matching Pattern:
Use this user: myuser
With Password: mypass (same in Confirm Password)

Click on ‘Save Crawler Access Configuration’ and you’re ready to go. It will remove the password stars from the boxes, but will remember the password.

Next time you crawl your sites, it will access the protected area as if it was a user giving over the set username and password. In the search results it will show Titles, URLs and a snippet of the page as usual, but when a searcher clicks on the link they will need a correct username and password to access the area.
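For the curious, this is roughly what ‘giving over’ a username and password looks like on the wire. The Python sketch below assumes the directory is protected with HTTP Basic authentication, which is the usual setup with .htaccess and .htpasswd files. The Mini's own internals are not documented here, and the function name `basic_auth_header` is made up for the example.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic 'Authorization' header."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# The example username and password from this post.
header = basic_auth_header("myuser", "mypass")
print("Authorization:", header)
```

Note that Base64 is an encoding, not encryption, so anyone who can see the request can read the password unless the site is served over HTTPS.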

Warning: searchers will be able to click on the ‘Cached’ link and view the contents of the page. To stop the Google Mini caching the page, put the following code in the <head> area of the page in question:

<meta name="robots" content="noarchive" />

This will stop the Mini, and any other search engine you allow access to the pages, from storing a cached version of the page. They will still show part of the page in the snippet as part of the results, but searchers won’t receive a ‘cached’ link to click on.
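To check that the tag really made it on to a page, a small Python sketch like the one below can list the robots directives found in the HTML. The class and function names and the sample page are made up for the example, and it only looks at meta tags named ‘robots’.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical protected page carrying the noarchive tag.
protected_page = (
    "<html><head>"
    '<meta name="robots" content="noarchive" />'
    "</head><body>Client area</body></html>"
)
print(robots_directives(protected_page))
```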

How to spider hidden content

January 25, 2006 in Google Mini,GSA,Q&A,Spidering | Comments (2)

This answers a question I’ve been asked in a couple of different ways…

Q: I have a page that uses many javascript popup windows for news articles, how can I index them?
Q: On a webserver I have a directory full of html pages. These pages are NOT listed anywhere. Can the google mini return search results where these html files will also be included?

If you have some pages that are usually off-limits to spiders, you can make sure your Search Appliance or Mini spiders them in a couple of ways:

1. Put the exact URL of each page to be spidered in the list of places to be spidered in the crawling admin – if you have many pages, this will become a maintenance problem.

2. Use a sitemap page which is not indexed by the appliance – the easy way to do it.

To do 2, make a list of all the pages you want spidered that are being missed out because there is not a direct route to them for the spider – i.e. Javascript is getting in the way, or the only route to them is blocked via robots.txt or something similar. This does not need to be a fancy page, it’s just a list for the spider to see and no people will ever need to see it.

In the HTML of this page, between the <head> and </head> tags, put the following line:

<meta name="robots" content="noindex, follow" />

This means the spider will read the page and follow all links on it, but the page itself will not be indexed. If it isn’t indexed, it can’t be shown in the search results, so no-one can find it.

Now give the GSA or Mini the address of this sitemap page to crawl. Any time you add new pages to your site that are not getting spidered, you can add their address to the sitemap page.
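If the list of hidden pages is long, the sitemap page can be generated rather than typed by hand. Here is a minimal Python sketch under that assumption. The function name `build_sitemap_page`, the page title and the example URLs are all made up, and the output is deliberately plain since only the spider will read it.

```python
from html import escape


def build_sitemap_page(urls):
    """Return a bare HTML page linking to every URL, marked noindex, follow."""
    links = "\n".join(
        f'<a href="{escape(url, quote=True)}">{escape(url)}</a><br />'
        for url in urls
    )
    return (
        "<html>\n<head>\n"
        '<meta name="robots" content="noindex, follow" />\n'
        "<title>Spider sitemap</title>\n"
        "</head>\n<body>\n"
        f"{links}\n"
        "</body>\n</html>\n"
    )


# Hypothetical pages that are only reachable through Javascript popups.
hidden_pages = [
    "http://www.example.com/news/popup1.html",
    "http://www.example.com/news/popup2.html",
]
sitemap_html = build_sitemap_page(hidden_pages)
print(sitemap_html)
```

Save the output as a page on your site and give its address to the appliance as a start URL.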

NB: Google and the other big search engines also follow this ‘robots’ meta tag command. However, you will need a link to the sitemap page from another part of your site for them to spider the pages and find them interesting enough to keep in their index. So if you use it to expose pages to the public search engines, you will need your sitemap to look prettier, as people might click through a link on your website to the page.

Setting a unique user agent to help control spidering

January 23, 2006 in Google Mini,GSA,Spidering | Comments (1)

When your Google Mini or Google Search Appliance spiders websites, it has a ‘User Agent.’ Practically everything that reads documents on the web has a User Agent; for instance, when I use Internet Explorer to view a web page, it sends this:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

And Firefox sends this:

Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

When you first get a Mini or Search Appliance, it sends the following by default:

gsa-crawler (Enterprise; [code];

The [code] changes depending on your appliance – the Mini and GSA I use have different codes. The e-mail address is set to the admin person so you can be contacted if the spidering is going too fast. The bit we’re interested in here is the ‘gsa-crawler’.

You can reset this to something more relevant to your project:

In the Mini:
View/Edit the collection, go to Crawler Parameters, type in the new ‘User Agent Name’

In the Google Search Appliance:
Crawl and Index, then HTTP Headers, type in new name in ‘Enter User Agent Name’

When you’ve set it to something unique, you can use the robots.txt file of your website to control where it spiders.

So, say you have set your User Agent to ‘gsadeveloper-spider’, it will now be sending through:

gsadeveloper-spider (Enterprise; [code];

And in your robots.txt file you can exclude areas depending on this name. So on my main work site, Web Positioning Centre I could set the robots.txt file to exclude the Labs area by setting the following in the file:

User-agent: gsadeveloper-spider
Disallow: /labs/

This will only block documents in the labs directory, and it will only affect my GSA spidering as ‘gsadeveloper-spider’.
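You can test a rule like this before crawling with Python's standard `urllib.robotparser` module. This is a sketch of how a well-behaved crawler should read the file, not a guarantee of how the appliance reads it. The helper name `allowed` and the example paths are made up.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the example above.
robots_txt = """\
User-agent: gsadeveloper-spider
Disallow: /labs/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(user_agent: str, path: str) -> bool:
    """True if the given user agent may fetch the given path."""
    return parser.can_fetch(user_agent, path)


# Hypothetical paths: one inside the labs directory, one outside it.
print(allowed("gsadeveloper-spider", "/labs/index.html"))
print(allowed("gsadeveloper-spider", "/index.html"))
print(allowed("Googlebot", "/labs/index.html"))
```

Only the first check should come back as blocked. The other two show that the rest of the site, and other spiders, are unaffected.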

This can be useful if you want to block areas of sites off for projects. For instance, if you’re working on a large site and you have the lowest level Google Mini, you may want to block irrelevant pages from being spidered and using up some of the 100,000 document limit. But it only blocks them for you, not for any other GSAs spidering the site, and not for the big search engines like Google, Yahoo and MSN Search.

When blocking an area, you can make sure its documents were blocked either by searching for words in one of those documents or by checking whether they were excluded in the crawling report. In the Mini you get to this through:

View/Edit the collection, go to System Status, then Serving Status, Browse the hierarchy of URLs seen during crawl. This will show you a list of the websites crawled. Click the number under ‘Excluded URLs’ for the site where you are blocking content. This will give you a list of URLs which were blocked; pages blocked using robots.txt will have “Crawled with empty body: Disallowed by robots.” next to them. This only happens for pages which have direct links to them from a crawled part of the site. It will not show you other pages in the same section which can only be reached through the blocked area, as the spider won’t have gone in to that section to find the links to them.

Using ‘disallow’ in robots.txt stops anything in that directory being read, and can be used to disallow single documents as well. You can find out more on the robotstxt site, which has an explanation of the standard and example code.

What the Google Mini will spider

January 10, 2006 in Google Mini,Spidering | Comments (2)

Google have a helpful list of the file formats the Google Mini will spider.

It’s worth checking what you want to spider before you consider the other factors in why you are buying a search appliance. Checking through the various file formats I have, I was surprised to see the Mini supports .wps files written by Microsoft Works for DOS. It’s not a difficult format to read, but it is an old format now – Works v2 being copyright 1988 if my memory serves me correctly. Personally I have a ton of old Works files and it’s nice to know something will still understand them. I told Google Desktop they were txt files with an odd extension, but that can have dubious results as some of the file is binary.

How does the GSA / Google Mini count documents?

January 8, 2006 in Google Mini,GSA,Q&A,Spidering | Comments (0)

The Google Mini has a document limit of 100,000 documents, and the Google Search Appliance between 500,000 and 2 million, although you can put several GSAs together to raise that higher.

Each ‘document’ is a page or file which has a unique address, so

d:\Project Weevils\documents\Squared burrows.doc

are all documents. To a developer, the last two URLs may look like the same page showing different content. To the GSA, that doesn’t matter – they are unique URLs, so they are different documents and will be counted separately.

So, with the Mini you get to index 100,000 individual addresses. If you have a dynamic site which shows all its content via variables on one address, then each page served dynamically on a different variable is counted as a separate document.
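As a rough way to estimate how much of the limit a site will use, you can count the unique addresses in a list of URLs. This Python sketch assumes a plain string comparison of the full URL. The appliance may normalise addresses in ways not covered here, and the URLs are made up.

```python
def count_documents(urls):
    """Count unique addresses; each distinct URL counts as one document."""
    return len(set(urls))


# Hypothetical crawl list: one dynamic page served under two variables,
# one of them seen twice, plus a static page.
crawled = [
    "http://www.example.com/page.php?id=1",
    "http://www.example.com/page.php?id=2",
    "http://www.example.com/page.php?id=2",  # repeat of the same address
    "http://www.example.com/about.html",
]
print(count_documents(crawled))  # 3
```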

A problem with this is that if you are using multiple variables in your URLs and your page / CMS can show the same content under different URLs, you can have the same content spidered several times under different addresses – this is one variety of ‘spider trap.’ If this might happen to your site, you should try to prevent it by giving a spiderable path to one set of the content and ignoring the others using regular expressions. I’ll try to write more on that soon.
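One way to spot this kind of spider trap before it eats into your document limit is to fetch a sample of pages yourself and group the addresses by a hash of their content. The Python sketch below assumes you already have the page bodies in a dictionary. The function name `find_duplicate_content`, the URLs and the page text are all made up for the example.

```python
import hashlib
from collections import defaultdict


def find_duplicate_content(pages):
    """Group URLs by a hash of their body; return groups with more than one URL."""
    urls_by_hash = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        urls_by_hash[digest].append(url)
    return [sorted(urls) for urls in urls_by_hash.values() if len(urls) > 1]


# Hypothetical sample: the same article reachable under two addresses.
sample_pages = {
    "http://www.example.com/page.php?id=7&sort=asc": "Weevil burrow report",
    "http://www.example.com/page.php?sort=asc&id=7": "Weevil burrow report",
    "http://www.example.com/page.php?id=8": "Another article",
}
for group in find_duplicate_content(sample_pages):
    print(group)
```

Each group printed is a set of addresses serving identical content, which makes it easier to see which URL variables to exclude.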