Archive for the ‘Q&A’ Category

Can you put new pages or applications on a Google Mini?

March 9, 2006 in Google Mini,Q&A | Comments (1)

I’ve had these questions myself, and been asked them by a few people: “How do I put my website pages on a Google Mini?” and “Can I install my own application on a Mini, e.g. my web app or webstats?”

The answer to both questions is: no, you cannot put your own web pages on the Mini (nor on the Google Search Appliance) and you can’t install your own applications either. They have a web-based interface where you can change the look of their own pages somewhat, but you can’t do anything beyond that. There’s no way of uploading your own pages onto the box, and there’s no way of installing your own applications.

Basically, these things are plug-and-play boxes that don’t like being fiddled with. Although that’s probably rather different from what most webmasters are used to, it does allow Google to be confident about how the box will work, and to avoid the support costs of dealing with people who install leaky web apps on the server and then complain when it eventually crashes.

What the Google Mini is good at, and what it is not good at

March 7, 2006 in Google Mini,Q&A | Comments (4)

I’ve been talking to various people recently, both by e-mail and at the Mini Google Group, who want to use a Mini for things it really isn’t suited for, so hopefully this will be helpful:

What the Google Mini is good at

It is very good at searching through lots of unstructured information, and finding the document (page) that best matches what you are searching for.

So, it is good for searching all your old sales documents on your intranet to find the reference you made to ‘Singing badgers’.

Or it’s good at searching all the articles on your website, for all the ‘orange-spotted tapirs’ that you have written medical items on.

What the Google Mini is not good at

It is not good at comparing pages for specific pieces of information, or for sorting by anything except relevancy or date the page was made.

For instance, if you have a large database with hundreds of products in, and you want to compare the prices of three or four of them, the Mini (and GSA) cannot do this. You need to change your shop so you can run comparisons using your standard database information.

Although you can put price or other information into the Mini search results by having special meta tags on your product pages, it cannot sort the search results by anything except the standard relevancy, or the date the page was last changed and spidered. You cannot force it into sorting by price, or any other detail. If you want this, you need to update the search on your current database so that the SQL query sorts by the price field.
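As a rough illustration of sorting on the database side, here is a minimal sketch using Python’s built-in sqlite3 module. The ‘products’ table, its columns and the sample rows are made up for the example; your shop’s schema and database engine will almost certainly differ.

```python
import sqlite3


def search_products_by_price(conn, term):
    """Return (name, price) rows whose name contains `term`, cheapest first."""
    cursor = conn.execute(
        "SELECT name, price FROM products WHERE name LIKE ? ORDER BY price ASC",
        (f"%{term}%",),
    )
    return cursor.fetchall()


# Hypothetical in-memory shop database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Badger plush", 12.50), ("Badger mug", 6.00), ("Tapir poster", 9.99)],
)

results = search_products_by_price(conn, "Badger")
```

The sorting happens in the ORDER BY clause, so the shop’s own search can offer ‘sort by price’ without involving the Mini at all.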

If your shop search is working very slowly, try looking at:

  • Setting up indexes in the database to cover the most searched-on fields
  • Upgrading your database (e.g. if you’re using Access, look at moving to MySQL or Microsoft SQL Server)
  • Talking to your host about either moving to a higher grade of server, so the database runs more quickly, or upgrading to a separate database server that is good enough to handle the load your website is putting on it
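The first suggestion in the list above might look something like this. It is only a sketch using Python’s built-in sqlite3 module; the ‘products’ table and the index name are hypothetical, and the exact syntax and query planner output vary between database engines.

```python
import sqlite3

# Hypothetical shop table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

# Index the column that shop searches look up most often.
conn.execute("CREATE INDEX idx_products_name ON products (name)")

# Ask SQLite how it would run a lookup on the indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT price FROM products WHERE name = ?",
    ("Badger mug",),
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
```

If the index is being used, its name appears in the query plan text; if it does not appear, the database is still scanning the whole table for that search.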

The Mini is a good product, but it’s made for a particular set of circumstances. It would be a waste of money to buy it to do something it’s really not built to do, when you could use the same money elsewhere to get a proper solution.

How do I spider a protected area with a Google Mini?

January 30, 2006 in Google Mini,Q&A,Spidering | Comments (8)

If you have a protected area of your website, but still want to spider it and have the results available to searchers (e.g. if you have an extranet that you want to offer searching on), then you can give a Google Mini a username and password so that it can access the area and spider it.

For instance, suppose you have used .htaccess and .htpasswd files to protect a directory on your website called ‘clientarea’, and you need to use the username ‘myuser’ and password ‘mypass’ to access it.

In the Mini’s Admin area, go into ‘Configure Crawl’. If there isn’t a direct link to the protected area from another area you are spidering, add its URL to ‘Start Crawling from the Following URLs.’ Note: when I tried this, I had to add a direct link to a particular file, rather than just to the directory. So in this example we can add in:

http://webpositioningcentre.co.uk/clientarea/index.php

Click ‘Save URLs to Crawl’ and go to the ‘Crawler Access’ area.

In the area labelled ‘Users and Passwords for Crawling’ you need to put the URL of the protected area, and a username and password to access it. If you also need to set a domain for the protected area, fill in that box as well.

For URLs Matching Pattern: http://webpositioningcentre.co.uk/clientarea/
Use this user: myuser
With Password: mypass (same in Confirm Password)

Click on ‘Save Crawler Access Configuration’ and you’re ready to go. It will remove the password stars from the boxes, but will remember the password.

Next time you crawl your sites, the Mini will access the protected area as if it were a user supplying the username and password you set. In the search results it will show titles, URLs and a snippet of the page as usual, but when a searcher clicks on the link they will need a correct username and password to access the area.

Warning: searchers will be able to click on the ‘Cached’ link and view the contents of the page. To stop the Google Mini caching the page, put the following code in the <head> area of the page in question:
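If you want to check the username and password before handing them to the Mini, a small script can request the page the same way a crawler would. This is a minimal sketch that assumes the directory is protected with HTTP Basic authentication (the usual .htaccess and .htpasswd setup); it is not how the Mini itself is implemented, and the function name is my own.

```python
import base64
import urllib.request


def build_authenticated_request(url, username, password):
    """Build a GET request carrying HTTP Basic credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


request = build_authenticated_request(
    "http://webpositioningcentre.co.uk/clientarea/index.php", "myuser", "mypass"
)
# urllib.request.urlopen(request) would then fetch the page; it is left out
# here so the sketch runs without network access.
```

If the credentials are wrong, urlopen raises an HTTPError with a 401 status, which is a quick way to spot a typo before starting a crawl.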

<meta name="robots" content="noarchive" />

This will stop the Mini, and any other search engine you allow access to the pages, from storing a cached version of the page. They will still show part of the page in the snippet as part of the results, but searchers won’t receive a ‘Cached’ link to click on.
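To check that the tag really is present on your protected pages, something like the following sketch can help. It uses Python’s standard html.parser module; the class and function names are my own, and the sample page is made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # A single tag may carry several comma-separated directives.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noarchive" /></head><body></body></html>'
directives = robots_directives(page)
```

A page is safe from caching when ‘noarchive’ is in the returned set.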

How do I access the XML from the Google Mini / GSA?

January 26, 2006 in Google Mini,GSA,Q&A,XML API | Comments (8)

As well as the standard web interface, the Google Mini and Google Search Appliance have an XML interface which gives you a results set back in XML.

To access the XML, you make an HTTP GET request from a scripting language to a particular URL:

For XML without a DTD:
http://www.miniaddress.com/search?q=searchphrase&output=xml_no_dtd&client=collectionname&site=collectionname

Where ‘www.miniaddress.com’ is the address of your search appliance (this can also be an IP address), ‘collectionname’ is the name of your collection, and ‘searchphrase’ is what you are searching for.

If you want the DTD, change output=xml_no_dtd to output=xml

You can set lots of flags in the URL to do things like change the start number of the results set, or change the encoding of the results coming back to UTF-8 or Latin-1. You can look up the various flags in the GSA XML reference.
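As an example of the scripting side, here is a minimal Python sketch that builds the URL described above and fetches the raw XML. The function names and the ‘extra_flags’ argument are my own; following the example URL, the collection name is passed as both ‘client’ and ‘site’. Check the GSA XML reference for the flags your appliance actually supports.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(appliance, collection, phrase, include_dtd=False, extra_flags=None):
    """Build the XML search URL for a Mini or GSA."""
    params = {
        "q": phrase,
        "output": "xml" if include_dtd else "xml_no_dtd",
        "client": collection,
        "site": collection,
    }
    # Any further flags from the XML reference can be passed in as a dict.
    params.update(extra_flags or {})
    return f"http://{appliance}/search?{urlencode(params)}"


def fetch_results_xml(url):
    """Fetch the raw XML bytes for a search URL (needs a reachable appliance)."""
    with urlopen(url) as response:
        return response.read()


url = build_search_url("www.miniaddress.com", "collectionname", "singing badgers")
```

Using urlencode means spaces and other special characters in the search phrase are escaped for you, rather than being pasted into the URL by hand.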

How to spider hidden content

January 25, 2006 in Google Mini,GSA,Q&A,Spidering | Comments (2)

This answers a question I’ve been asked in a couple of different ways…

Q: I have a page that uses many javascript popup windows for news articles, how can I index them?
Q: On a webserver I have a directory full of html pages. These pages are NOT listed anywhere. Can the google mini return search results where these html files will also be included?

If you have some pages that are usually off-limits to spiders, you can make sure your Search Appliance or Mini spiders them in a couple of ways:

1. Put the exact URL of each page to be spidered in the list of places to be spidered in the crawling admin – if you have many pages, this will become a maintenance problem.

2. Use a sitemap page which is not indexed by the appliance – the easy way to do it.

To do 2, make a page listing all the pages you want spidered that are being missed because the spider has no direct route to them, e.g. because JavaScript is getting in the way, or the only route to them is blocked via robots.txt or something similar. This does not need to be a fancy page; it’s just a list for the spider to see, and no people will ever need to see it.

In the HTML of this page, between the <head> and </head> tags, put the following line:

<meta name="robots" content="noindex, follow" />

This means the spider will read the page and follow all links on it, but the page itself will not be indexed. If it isn’t indexed, it can’t be shown in the search results, so no-one can find it.

Now give the GSA or Mini the address of this sitemap page to crawl. Any time you add new pages to your site that are not getting spidered, you can add their address to the sitemap page.
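If the list of hidden pages is long, you could generate the sitemap page with a script instead of editing it by hand. This is a minimal Python sketch; the function name and the example URLs are hypothetical.

```python
from html import escape


def build_sitemap_page(urls):
    """Return a plain HTML page linking to every URL, marked noindex, follow."""
    links = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in urls
    )
    return (
        "<html>\n<head>\n"
        '<meta name="robots" content="noindex, follow" />\n'
        "<title>Crawler sitemap</title>\n"
        "</head>\n<body>\n<ul>\n"
        f"{links}\n"
        "</ul>\n</body>\n</html>\n"
    )


# Hypothetical pages that the spider cannot reach by following normal links.
hidden_pages = [
    "http://example.com/news/popup-article-1.html",
    "http://example.com/unlisted/page-2.html",
]
sitemap_html = build_sitemap_page(hidden_pages)
```

Save the output as a page on your site and give its address to the GSA or Mini as a start URL. Re-run the script whenever you add more unlinked pages.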

NB: Google and the other big search engines also follow this ‘robots’ meta tag command. However, for them to spider the pages and find them interesting enough to keep in their index, you will need a link to the sitemap page from another part of your site. So if you use it to expose pages to the public search engines, your sitemap will need to look prettier, as people might click through to it from a link on your website.

How does the GSA / Google Mini count documents?

January 8, 2006 in Google Mini,GSA,Q&A,Spidering | Comments (0)

The Google Mini has a limit of 100,000 documents, and the Google Search Appliance between 500,000 and 2 million, although you can put several GSAs together to raise that limit higher.

Each ‘document’ is a page or file which has a unique address, so

d:\Project Weevils\documents\Squared burrows.doc
http://intranet/
http://webpositioningcentre.co.uk/index.php?article=5435345
and http://webpositioningcentre.co.uk/index.php?article=2947

are all documents. To a developer, the last two URLs may look like the same page showing different content. To the GSA, that doesn’t matter: they are unique URLs, so they are different documents and will be counted separately.

So, with the Mini you get to index 100,000 individual addresses. If you have a dynamic site which shows all its content via variables on one address, then each page served dynamically on a different variable is counted as a separate document.

A problem with this is that if you are using multiple variables in your URLs and your page / CMS can show the same content under different URLs, you can have the same content spidered several times under different addresses. This is one variety of ‘spider trap.’ If this might happen to your site, you should try to prevent it by giving the spider a path to one set of the content and excluding the others using regular expressions. I’ll try to write more on that soon.
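To get a feel for whether your site has this problem, you could compare the content served under each URL. The sketch below is only an illustration in Python: the URLs and page bodies are made up, and it only catches pages that are byte-for-byte identical, whereas real duplicates often differ slightly in their markup.

```python
import hashlib
from collections import defaultdict


def find_duplicate_content(pages):
    """Group URLs whose page bodies are identical.

    `pages` maps each URL to the content served at that URL.
    """
    urls_by_digest = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        urls_by_digest[digest].append(url)
    return [sorted(urls) for urls in urls_by_digest.values() if len(urls) > 1]


# Hypothetical crawl results: three unique addresses, two showing the same article.
pages = {
    "http://example.com/index.php?article=2947": "Squared burrows",
    "http://example.com/index.php?article=2947&view=print": "Squared burrows",
    "http://example.com/index.php?article=5435345": "Orange-spotted tapirs",
}

document_count = len(pages)  # each unique address counts as one document
duplicates = find_duplicate_content(pages)
```

Here the appliance would count three documents, even though two of the addresses show the same content. Those are the URL patterns worth excluding from the crawl.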

What ports does a Google Mini have?

in Google Mini,GSA,Q&A | Comments (0)

Although the Google Mini is a server, and is usually used through its web-based interface, it has the following ports available if you want to plug into it directly:

  • Power (standard kettle lead)
  • PS/2 ports for keyboard and mouse
  • 2 x USB ports
  • Printer
  • VGA
  • 9 pin serial
  • 2 x ethernet ports

Google Mini back
