Archive for January, 2006

How do I spider a protected area with a Google Mini?

January 30, 2006 in Google Mini,Q&A,Spidering | Comments (8)

If you have a protected area of your website, but still want to spider it and have the results available to searchers (e.g. if you have an extranet that you want to offer searching on), you can give the Google Mini a username and password so it can access the area and spider it.

For instance, say you have used .htaccess and .htpasswd files to protect a directory on your website called ‘clientarea’, and you need the username ‘myuser’ and password ‘mypass’ to access it.
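
For reference, the web server side of that set-up might look something like this – a minimal sketch of an .htaccess file for the ‘clientarea’ directory, with the path to the .htpasswd file made up for the example:

# Protect the 'clientarea' directory with HTTP Basic authentication
# (the AuthUserFile path is made up for this example)
AuthType Basic
AuthName "Client Area"
AuthUserFile /home/sites/webpositioningcentre/.htpasswd
Require valid-user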

In the Mini’s Admin area, go to ‘Configure Crawl’. If there isn’t a direct link to the protected area from another part of the site you are spidering, add the URL of the protected area to ‘Start Crawling from the Following URLs.’ Note: when I tried this, I had to add a direct link to a particular file, rather than just to the directory. So in this example we can add in:

http://webpositioningcentre.co.uk/clientarea/index.php

Click ‘Save URLs to Crawl’ and go to the ‘Crawler Access’ area.

In the area labelled ‘Users and Passwords for Crawling’ you need to put the URL of the protected area, and a username and password to access it. If you also need to set a domain for the protected area, fill in that box as well.

For URLs Matching Pattern: http://webpositioningcentre.co.uk/clientarea/
Use this user: myuser
With Password: mypass (same in Confirm Password)

Click on ‘Save Crawler Access Configuration’ and you’re ready to go. It will remove the password stars from the boxes, but will remember the password.

Next time you crawl your sites, the Mini will access the protected area as if it were a user supplying the set username and password. In the search results it will show titles, URLs and a snippet of the page as usual, but when a searcher clicks on a link they will need a correct username and password to access the area.
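
If you want to check the credentials before the next crawl, here is a rough sketch (in Python, using the example URL, username and password above) of the same HTTP Basic authentication the crawler will be performing against the .htaccess protection:

import base64
import urllib.request

# Example URL and credentials from above – swap in your own
url = "http://webpositioningcentre.co.uk/clientarea/index.php"
credentials = base64.b64encode(b"myuser:mypass").decode("ascii")

request = urllib.request.Request(url)
request.add_header("Authorization", "Basic " + credentials)

# Prints 200 if the username and password are accepted;
# urlopen raises an HTTPError if the protected area rejects them (401)
with urllib.request.urlopen(request) as response:
    print(response.status)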

Warning: searchers will be able to click on the ‘Cached’ link and view the contents of the page. To stop the Google Mini caching the page, put the following code in the <head> area of the page in question:

<meta name="robots" content="noarchive" />

This will stop the Mini, and any other search engine you allow to access the pages, from storing a cached version of the page. They will still show part of the page in the snippet as part of the results, but searchers won’t get a ‘Cached’ link to click on.

How do I access the XML from the Google Mini / GSA?

January 26, 2006 in Google Mini,GSA,Q&A,XML API | Comments (8)

As well as the standard web interface, the Google Mini and Google Search Appliance have an XML interface which gives you a results set back in XML.

To access the XML, you use a scripting language to make an HTTP GET request to a particular URL:

For XML without a DTD:
http://www.miniaddress.com/search?q=searchphrase&output=xml_no_dtd&client=collectionname&site=collectionname

Where ‘www.miniaddress.com’ is the address of your search appliance (this can also be an IP address), ‘collectionname’ is the name of your collection, and ‘searchphrase’ is what you are searching for.

If you want the DTD, change output=xml_no_dtd to output=xml

You can set lots of flags in the URL to do things like change the start number of the results set, or change the encoding of the results coming back to UTF-8 or Latin-1. You can look up the various flags in the GSA XML reference.
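
As a rough sketch, here is one way to make that request and read the results in Python, using the example appliance address and collection name above. The result element names (R, U and T) are from the GSA XML reference, so check them against your own appliance’s output:

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Example appliance address, collection and search phrase from above
params = urllib.parse.urlencode({
    "q": "searchphrase",
    "output": "xml_no_dtd",
    "client": "collectionname",
    "site": "collectionname",
})
url = "http://www.miniaddress.com/search?" + params

# Fetch the XML results
with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# Each <R> element is one result; print its URL (<U>) and title (<T>)
for result in tree.iter("R"):
    print(result.findtext("U"), "-", result.findtext("T"))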

How to spider hidden content

January 25, 2006 in Google Mini,GSA,Q&A,Spidering | Comments (2)

This answers a question I’ve been asked in a couple of different ways…

Q: I have a page that uses a lot of JavaScript popup windows for news articles – how can I index them?
Q: On a webserver I have a directory full of HTML pages. These pages are NOT listed anywhere. Can the Google Mini return search results where these HTML files are also included?

If you have some pages that are usually off-limits to spiders, you can make sure your Search Appliance or Mini spiders them in a couple of ways:

1. Put the exact URL of each page in the list of places to spider in the crawling admin – if you have many pages, this will become a maintenance problem.

2. Use a sitemap page which is not indexed by the appliance – the easy way to do it.

To do 2, make a list of all the pages you want spidered that are being missed because there is no direct route to them for the spider – e.g. JavaScript is getting in the way, or the only route to them is blocked via robots.txt or something similar. This does not need to be a fancy page; it’s just a list for the spider to see, and no people will ever need to look at it.

In the HTML of this page, between the <head> tags, put the following line:

<meta name="robots" content="noindex, follow" />

This means the spider will read the page and follow all links on it, but the page itself will not be indexed. If it isn’t indexed, it can’t be shown in the search results, so no-one can find it.
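
A bare-bones sitemap page might look something like this (the page names are made up for the example):

<html>
<head>
<title>Spider sitemap</title>
<meta name="robots" content="noindex, follow" />
</head>
<body>
<a href="/news/popup-article-1.html">News article 1</a>
<a href="/news/popup-article-2.html">News article 2</a>
<a href="/hidden/report-2005.html">2005 report</a>
</body>
</html>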

Now give the GSA or Mini the address of this sitemap page to crawl. Any time you add new pages to your site that are not getting spidered, you can add their address to the sitemap page.

NB: Google and the other big search engines also follow this ‘robots’ meta tag. However, they will need a link to the sitemap page from another part of your site before they will spider it and find the pages interesting enough to keep in their index. So if you use this technique to expose pages to the public search engines, your sitemap will need to look prettier, as people might click through to it from a link on your website.

Setting a unique user agent to help control spidering

January 23, 2006 in Google Mini,GSA,Spidering | Comments (1)

When your Google Mini or Google Search Appliance spiders websites, it has a ‘User Agent.’ Practically everything that reads documents on the web has a User Agent. For instance, when I use Internet Explorer to view a web page, it sends this:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

And Firefox sends this:

Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

When you first get a Mini or Search Appliance, it sends the following by default:

gsa-crawler (Enterprise; [code]; gsa-admin@test.com)

The [code] changes depending on your appliance – the Mini and GSA I use have different codes. The e-mail address is set to that of the admin person, so you can be contacted if the spidering is going too fast. The bit we’re interested in here is the ‘gsa-crawler’.

You can reset this to something more relevant to your project:

In the Mini:
View/Edit the collection, go to Crawler Parameters, type in the new ‘User Agent Name’

In the Google Search Appliance:
Crawl and Index, then HTTP Headers, type in new name in ‘Enter User Agent Name’

When you’ve set it to something unique, you can use the robots.txt file of your website to control where it spiders.

So, say you have set your User Agent to ‘gsadeveloper-spider’, it will now be sending through:

gsadeveloper-spider (Enterprise; [code]; gsa-admin@test.com)

And in your robots.txt file you can exclude areas depending on this name. So on my main work site, Web Positioning Centre, I could set the robots.txt file to exclude the Labs area by putting the following in the file:

User-agent: gsadeveloper-spider
Disallow: /labs/

This will only block documents in the labs directory, and it will only affect my GSA spidering as ‘gsadeveloper-spider’.

This can be useful if you want to block off areas of sites for particular projects. For instance, if you’re working on a large site and you have the lowest level Google Mini, you may want to block irrelevant pages from being spidered and using up some of the 100,000 document limit. But it only blocks them for you, not for any other GSAs spidering the site, and not for the big search engines like Google, Yahoo and MSN Search.

When you block an area, you can make sure the pages really were excluded either by searching for words in those documents or by checking whether they were excluded in the crawling report. In the Mini you get to this through:

View/Edit the collection, go to System Status, then Serving Status, then ‘Browse the hierarchy of URLs seen during crawl.’ This will show you a list of the websites crawled. Click the number under ‘Excluded URLs’ for the site where you are blocking content. This will give you a list of URLs which were blocked; pages blocked using robots.txt will have “Crawled with empty body: Disallowed by robots.” next to them. Note that this only happens for pages which have direct links to them from a crawled part of the site – it will not show you other pages in the same section which can only be reached through the blocked area, as the spider won’t have gone into that section to find the links to them.

Using ‘Disallow’ in robots.txt stops anything in that directory being read, and it can be used to disallow single documents as well. You can find out more on the robotstxt site, which has an explanation of the standard and example code.
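
For instance, to block the unique user agent from a single document as well as the labs directory (the file name here is made up for the example):

User-agent: gsadeveloper-spider
Disallow: /labs/
Disallow: /drafts/old-report.html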

More Google Minis to fill gap to GSA

January 17, 2006 in Google Mini | Comments (1)

Google have released new Google Minis which will index 200,000 and 300,000 documents, priced at roughly double and triple the price of the standard 100,000 document Mini.

You can buy them in America, but not in the UK yet.

What the Google Mini will spider

January 10, 2006 in Google Mini,Spidering | Comments (2)

Google have a helpful list of the file formats the Google Mini will spider.

It’s worth checking what you want to spider before you consider the other factors in why you are buying a search appliance. Checking through the various file formats I have, I was surprised to see the Mini supports .wps files written by Microsoft Works for DOS. It’s not a difficult format to read, but it is an old format now – Works v2 being copyright 1988 if my memory serves me correctly. Personally I have a ton of old Works files and it’s nice to know something will still understand them. I told Google Desktop they were txt files with an odd extension, but that can have dubious results as some of the file is binary.

How does the GSA / Google Mini count documents?

January 8, 2006 in Google Mini,GSA,Q&A,Spidering | Comments (0)

The Google Mini has a document limit of 100,000 documents, the Google Search Appliance between 500,000 and 2 million, although you can put several GSAs together to raise that higher.

Each ‘document’ is a page or file which has a unique address, so

d:\Project Weevils\documents\Squared burrows.doc
http://intranet/
http://webpositioningcentre.co.uk/index.php?article=5435345
and http://webpositioningcentre.co.uk/index.php?article=2947

are all documents. To a developer, the last two URLs may look like the same page showing different content; to the GSA, that doesn’t matter – they are unique URLs, so they are different documents and will be counted separately.

So, with the Mini you get to index 100,000 individual addresses. If you have a dynamic site which shows all its content via variables on one address, then each page served dynamically on a different variable is counted as a separate document.

A problem with this is that if you are using multiple variables in your URLs and your page / CMS can show the same content under different URLs, the same content can be spidered several times under different addresses – this is one variety of ‘spider trap.’ If this might happen to your site, you should try to prevent it by giving the spider a path to one set of the content and excluding the others using regular expressions. I’ll try to write more on that soon.
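
As a small illustration of the counting (a Python sketch with made-up URLs – the print view parameter is hypothetical):

import urllib.parse

# Two addresses a CMS might generate for the same article
a = "http://webpositioningcentre.co.uk/index.php?article=2947&print=1"
b = "http://webpositioningcentre.co.uk/index.php?print=1&article=2947"

# To the appliance these count as two documents, because the address strings differ...
print(a == b)  # False

# ...even though the query variables, and so the content served, are the same
qa = urllib.parse.parse_qs(urllib.parse.urlparse(a).query)
qb = urllib.parse.parse_qs(urllib.parse.urlparse(b).query)
print(qa == qb)  # True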

What ports does a Google Mini have?

in Google Mini,GSA,Q&A | Comments (0)

Although the Google Mini is a server, and is usually used through its web based interface, it has the following ports available if you want to plug in to it directly:

  • Power (standard kettle lead)
  • PS/2 ports for keyboard and mouse
  • 2 x USB ports
  • Printer
  • VGA
  • 9 pin serial
  • 2 x ethernet ports

Google Mini back


Google Mini, special delivery

January 7, 2006 in Google Mini | Comments (0)

Earlier in the week I picked up a new Google Mini, delivered to our client, to take over to Nathan for checking and hosting.

Google Mini on back seat

Everything seems to be OK with it. The admin interface is a little different from its larger brother, the Google Search Appliance, but after a bit of sniffing around, all the bits I expected are in there. I’d presumed they’d have a practically identical interface to make it easier on the software development team, but apparently not.