The license has expired. You are in the grace period. The software will stop crawling, indexing, and serving in 14 days. Please contact Google to extend your license.
The GSA will keep working normally for two weeks, giving you a chance to get a new license from Google Enterprise to continue its service, or to finalise plans for moving away from it if you’re not going to get a new license.
Please note: the Google Mini does not have a license that runs out. Once the warranty period is over, it’ll keep running as long as the hardware holds up.
If you’re not sure what age Mini you have, there’s a test code on that page you can use to check your Mini. The MID series have ‘MID’ in their user agent when spidering, which might also help you check.
The M2 series, which has been on sale since last summer, and the Google Search Appliance are not vulnerable to the problem.
If your Mini is public facing, you should patch it straight away. If you only use the XML feed and show results through other code, it’s up to you whether you patch it, as you’re less at risk of someone using it for nefarious means.
The Mini / Search Appliance can read the XML, but it takes it in as straight text, so any searching you do will look at node names, attributes and content, rather than just content.
The best I can suggest is you have some scripting to run an XSL transform on your XML to turn it in to a small (or indeed large) site of web pages, then spider those with the appliance.
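As a sketch of that approach in Python (the `<items>`/`<item>`/`<title>`/`<body>` element names are assumptions here, not your real export format, so swap in whatever your XML actually uses), the transform-to-pages step might look like:

```python
# Sketch: turn a database-export XML file into flat HTML pages the
# appliance can spider. Adjust the element names to match your export.
import xml.etree.ElementTree as ET
from pathlib import Path

def xml_to_pages(xml_text, out_dir):
    """Write one simple HTML page per <item> and return the filenames."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = []
    for i, item in enumerate(ET.fromstring(xml_text).findall("item")):
        title = item.findtext("title", default="Untitled")
        body = item.findtext("body", default="")
        html = (
            "<html><head><title>%s</title></head>"
            "<body><h1>%s</h1><p>%s</p></body></html>" % (title, title, body)
        )
        name = "item-%d.html" % i
        (out / name).write_text(html)
        pages.append(name)
    return pages
```

Once the generated pages are served over HTTP, you can point the appliance’s crawl settings at that directory like any other part of your site.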
The XSLT controlling the look of the web frontend on the box automatically replaces a missing title with the URL of the page (with the http:// taken off the start). With the XML API you can decide to replace it with anything you like, but this behaviour is certainly preferred by the clients I’ve had to set it up for. Best of all would be for all pages to have a title, but some could well slip through testing (when there is testing), so it’s best to be prepared for it.
Note: This can take a few minutes to have an effect on the frontend, so if you don’t see a change to your search page immediately, just hang on and try again in a few minutes.
If you were just trying to turn on the menu of subcollections, that’s all you need to do. However, if you’re trying to get a new subcollection to show, you need to force the menu to update: switch the menu off and save, then re-tick the box and save again. You don’t need to wait for the frontend to refresh between the two saves.
Oddly enough, it’s hosting the site for the awards, but nominations are handled separately, so there shouldn’t be any shenanigans going on.
I worked on p2b last summer, writing sets of code to query the GSA, among various other tasks. Hopefully it will win its category of ‘Best community site’. If it does, I’ll be there to see it, as I’m presenting an award from the freelancers’ networking group I run, the Brighton Farm.
/export/hda3/4.3.105.M.6/local/conf/frontends/default_Frontend/domain_filter (No such file or directory)
This error came up when trying to use the XML API to get the results back. I tried various things and eventually contacted the host to make sure their setup wasn’t blocking anything. In the end I realised the Collection and Front End settings are case sensitive: I had mine in all-lowercase as I’d been told, whereas the Mini was actually set up with a capitalised first letter. Once I’d matched what they were in the Mini, the error stopped. You can get the same error with the Google Search Appliance.
Simple, but so easy to get caught out by it!
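To show where the case matters, here’s a minimal Python sketch that builds the query URL. The hostname and the collection/frontend names are placeholders; the `site`, `client` and `output` parameters follow the appliance’s standard search protocol.

```python
# Sketch of an XML API query, showing where the case-sensitive values
# go: 'site' is the collection name, 'client' is the frontend name.
from urllib.parse import urlencode

def build_search_url(host, terms, collection, frontend):
    """Build a Mini/GSA XML search URL; collection and frontend must
    match the admin console exactly, including capitalisation."""
    params = {
        "q": terms,
        "site": collection,   # e.g. "Default_collection", not "default_collection"
        "client": frontend,   # e.g. "Default_frontend"
        "output": "xml_no_dtd",
    }
    return "http://%s/search?%s" % (host, urlencode(params))

print(build_search_url("mini.example.com", "bees",
                       "Default_collection", "Default_frontend"))
```

If the `site` or `client` value doesn’t match what the admin console shows, character for character, you get the domain_filter error above rather than anything helpfully descriptive.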
It’s a search engine that covers educational courses for 14-19 year olds in Brighton and Hove (the area of the UK I live in). I’ve written up what I did for the project on my freelancing site.
We’ve used custom meta tags on the schools’ websites to allow an advanced search, and also to feed information into a database powering the ‘Pathways’ system, which leads to customised searching of the schools’ courses.
I’d like to thank the team I worked with on the site, who made it an enjoyable project. Here’s hoping a lot of students find it useful for finding the courses they should be doing next to help their future career.
Surround the content you want ignored with the following tags:
<!-- googleoff: index --> <!-- googleon: index -->
So if you have
<!-- googleoff: index --> I like bees <!-- googleon: index -->
on your page and you search for ‘bees’, it won’t come up, even if the page has been spidered. The only people who will find out about your love of buzzing insects will be those who have found the page through other means.
This can be useful for excluding parts of your page that the appliance might find confusing, for instance ‘H’ wants to exclude his breadcrumb trail.
You can put a little code of your own before or after your meta tag’s name to make it unique to your project. This is like ‘namespaces’ in programming, where you keep your variables separate from anything that might conflict with them and overwrite them with different data. For instance, the Dublin Core project puts ‘DC.’ in front of its names, so you know what standard it relates to. So instead of…
<meta name="Publisher" content="Web Positioning Centre" />
you have:
<meta name="DC.Publisher" content="Web Positioning Centre" />
This lets you know they are working within Dublin Core standards, and it’s unlikely any page is already using a tag called ‘DC.Publisher’, whereas it could be using ‘Publisher’ on its own.
If you’re setting up your own meta tags for use with a Mini or GSA, do not use full stops (‘.’) to separate your code from the general name. When you pull back the results, full stops are used to separate the different tags you want to bring back with the ‘getfields’ flag.
So if you wanted to bring back the information in ‘DC.Publisher’ with the rest of the search results data, it would actually try to bring back information from a meta tag named ‘DC’ and another tag named ‘Publisher’.
To avoid this happening, use something else to separate your namespace code (your ‘DC’) from the rest of the name. It’s a good idea not to use anything that needs to be ‘URL escaped’, which pretty much limits you to the following: $-_+!*'() – personally I tend to use a hyphen, ‘-’, as it’s quite readable and unlikely to cause problems in programming, unlike ‘$’ or ‘()’.
Your meta information with your namespace code could look something like this:
<meta name="gsad-site" content="Spidertest" />
<meta name="gsad-author" content="Web Positioning Centre" />
<meta name="gsad-image" content="http://www.spidertest.com/images/wpc-logo.gif" />
Then you can get back these fields in your search results by using:
&getfields=gsad-site.gsad-author.gsad-image
In the XML of your results, you will get these additional fields:
<MT N="gsad-site" V="Spidertest"/>
<MT N="gsad-author" V="Web Positioning Centre"/>
<MT N="gsad-image" V="http://www.spidertest.com/images/wpc-logo.gif"/>
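Pulling those MT fields back out in your own code is straightforward. Here’s a Python sketch; the sample XML is a cut-down stand-in for a real response, with the MT elements sitting inside each R result record as they do in the results format.

```python
# Sketch: pull the extra meta fields out of the results XML.
import xml.etree.ElementTree as ET

def meta_fields(result_xml):
    """Return a list of {name: value} dicts, one per result record."""
    results = []
    for r in ET.fromstring(result_xml).iter("R"):
        results.append({mt.get("N"): mt.get("V") for mt in r.findall("MT")})
    return results

sample = """<GSP><RES><R>
  <MT N="gsad-site" V="Spidertest"/>
  <MT N="gsad-author" V="Web Positioning Centre"/>
</R></RES></GSP>"""
print(meta_fields(sample))
# -> [{'gsad-site': 'Spidertest', 'gsad-author': 'Web Positioning Centre'}]
```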
Within the box you get a list of:
Total Inflight URLs – these are links it has found, but not read the pages of yet.
Total Crawled URLs – pages that have been read
Locally Crawled URLs – this may come up when you’re re-spidering. It lists pages which have not changed since the last spidering, so the Mini just reads its internal copy rather than the live page, saving time and bandwidth.
Excluded URLs – these are pages that have been excluded from being read for some reason, either via a robots.txt exclusion, or because of one of the rules you have set in the ‘Configure Crawl’ section.
All of these labels are links; if you click on them you get a list of the documents in each section. Clicking on ‘Total Crawled URLs’ gives you a list of pages normally labelled ‘New Document’; within ‘Locally Crawled URLs’ they will be labelled ‘Cached Version’.
What makes a ‘New Document’?
By default, the Google Mini will look at the ‘Last-Modified’ header of the page. If it is more recent than the last spidered version of the page that it has stored, it will read and index the live page.
If your pages are straight HTML, then Last-Modified will be when they were created on the server – so when they were uploaded by FTP, in most cases. However, if your pages are dynamic, for instance PHP or ASP pages that pull in information from a database or include files, then their Last-Modified date will usually be the moment they were served up, whether to the Mini or to anyone visiting your site. This means the Mini will read all of the pages even if they haven’t changed since it last spidered them, because the web server is reporting each one as recently changed. This does not hurt your site in any way, but it means you will use up more time and bandwidth in your spidering, as it could be reading pages which haven’t changed since they were last spidered.
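If you control the dynamic pages, one way around this is to send a Last-Modified header yourself, based on when the underlying content actually changed rather than when the page was served. A minimal Python sketch, assuming the page is built from a known set of source files (the file list is a placeholder for whatever your page really depends on):

```python
# Sketch: give a dynamic page a meaningful Last-Modified value by
# using the newest change date of the files it is built from, instead
# of letting the server default to "now".
import os
from email.utils import formatdate

def last_modified_header(source_files):
    """Return a Last-Modified header value based on the newest mtime."""
    newest = max(os.path.getmtime(f) for f in source_files)
    return formatdate(newest, usegmt=True)
```

Your framework of choice would then attach that value to the response, so the spider can tell a genuinely changed page from one that is merely freshly served.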
Finding your pages’ Last-Modified date
If you use Firefox, you can install the Web Developer Toolbar for it. When you are looking at a page, click on ‘Information’ then ‘View Response Headers’ and you’ll see the ‘Last-Modified’ date being reported to your browser and any passing web spiders.
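If you’d rather check from a script than from the browser, a quick way (sketched in Python; the URL is whatever page you want to test) is to make a HEAD request and read the header back:

```python
# Sketch: check the Last-Modified date a server reports -- the same
# header the Web Developer Toolbar shows you.
from urllib.request import Request, urlopen

def report_last_modified(url):
    """Fetch only the headers and return the Last-Modified value, if any."""
    with urlopen(Request(url, method="HEAD")) as resp:
        return resp.headers.get("Last-Modified")
```

If this returns the current time on every call for a page you know hasn’t changed, you’ve found the dynamic-page behaviour described above.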
Unfortunately, sessions in the URL can upset spidering. A Google Search Appliance or Mini will generally open several ‘connections’ to a web site when it is spidering – like having several independent people browsing the site at the same time. Each of these connections receives a different session ID, which makes the URLs look different to the spider. This in turn means each connection may spider pages that have already been covered. Also, if the session times out it may be replaced by a new session when the next page is spidered, which again makes the spider re-read pages it has already found. This is because this:
/cars.php?phpsessid=6541231687865446
And this:
/cars.php?phpsessid=6541231AQ7J865KLP
look like different pages, even though they may turn out to have the same content. To avoid this happening, you can stop the spider reading pages which have session IDs in the URL. You can catch the most common session IDs by adding these lines to the ‘Do Not Crawl URLs with the Following Patterns:’ section of ‘URLs to Crawl’:
contains:PHPSESSID
contains:ASPSESSIONID
contains:CFSESSION
contains:session-id
The web sites you are spidering may still contain other session IDs, so it is worth checking with the site owner whether this is going to be a problem, and keeping an eye on the ‘Inflight URLs’ shown in ‘System Status’ – ‘Crawl Status’ when spidering a site for the first time. If the same URLs keep turning up, you may have a session problem. You’ll need to stop spidering the site and work out which bit of the URL you need to ignore, then add it to the do-not-crawl list like the examples above.
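If you want to check your own logs for the problem, a small Python sketch can normalise URLs by stripping the session parameters; the parameter names mirror the crawl patterns above, and you would extend the set with any site-specific IDs you discover.

```python
# Sketch: spot URLs that differ only by a session ID by stripping the
# known session parameters before comparing.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"phpsessid", "aspsessionid", "cfsession", "session-id"}

def strip_session(url):
    """Return the URL with known session-ID parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_session("/cars.php?phpsessid=6541231687865446"))  # /cars.php
print(strip_session("/cars.php?phpsessid=6541231AQ7J865KLP"))  # /cars.php
```

Two log entries that normalise to the same URL are the same page as far as the spider should be concerned; if thousands of entries collapse down to a handful of pages, sessions are eating your crawl.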
The Mini can only show one design of results on its own, i.e. through the normal results page it serves, based on the XSL you can set up in the ‘Configure Serving -> Output Format’ section of the admin area. However, as long as you have a scripting language on your web server (e.g. PHP, ASP, ColdFusion, Perl) you can use the XML interface to get the results back, then change the way they look in one of two ways: parse the XML in your own code and build the markup yourself, or run the XML through an XSL transform of your own. These give you the flexibility of being able to choose a method that best suits you or your developers, and a lot of control over the look and feel of search results.
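As a rough illustration of parsing the results XML in your own code and re-skinning it (Python here; the R/U/T element names follow the appliance’s results XML, and the sample response is a cut-down stand-in for a real one):

```python
# Sketch: parse the results XML in your scripting language and emit
# whatever markup you like instead of the box's own results page.
import xml.etree.ElementTree as ET
from html import escape

def results_to_html(result_xml):
    """Turn each <R> result record into a simple linked list item."""
    items = []
    for r in ET.fromstring(result_xml).iter("R"):
        url = r.findtext("U", default="")
        title = r.findtext("T", default=url)
        items.append('<li><a href="%s">%s</a></li>'
                     % (escape(url, quote=True), escape(title)))
    return "<ul>%s</ul>" % "".join(items)

sample = "<GSP><RES><R><U>http://example.com/</U><T>Example</T></R></RES></GSP>"
print(results_to_html(sample))
```

The same loop is where you would pull out snippets, meta fields or anything else the XML carries, then drop the lot into your site’s normal templates.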
The answer to both questions is no: you cannot put your own web pages on the Mini (nor on the Google Search Appliance), and you cannot install your own applications either. They have a web-based interface where you can change the look of their own pages somewhat, but beyond that there’s no way of uploading your own pages on to the box or installing your own software.
Basically, these things are plug-and-play boxes: they don’t like being fiddled with. Although that’s probably rather different from what most webmasters are used to, it does allow Google to be confident about how the box will work, and avoids support costs covering people who install leaky web apps on the server and then complain when it eventually crashes.
It is very good at searching through lots of unstructured information, and finding the document (page) that best matches what you are searching for.
So, it is good for searching all your old sales documents on your intranet, to find the reference you made to ‘Singing badgers’
Or it’s good at searching all the articles on your website for the medical items you have written on ‘orange-spotted tapirs’.
It is not good at comparing pages for specific pieces of information, or for sorting by anything except relevancy or the date the page was made.
For instance, if you have a large database with hundreds of products in it, and you want to compare the prices of three or four of them, the Mini (and GSA) cannot do this. You need to change your shop so you can run comparisons using your standard database information.
Although you can put price or other information into the Mini search results by having special meta tags on your product pages, it cannot sort the search results by anything except the standard relevancy, or by the date the page was last changed and spidered. You cannot force it to sort by price or any other detail. If you want this, you need to update the search on your current database so it sorts by the price field in the SQL query.
If your shop search is working very slowly, try looking at:
The Mini is a good product, but it’s made for a particular set of circumstances. It would be a waste of money to buy it to do something it’s really not built to do, when you could spend the same money elsewhere on a proper solution.