You can’t spider XML with a Google Mini (so far)

January 16, 2007 in Google Mini, GSA | Comments (1)

A question I’ve seen come up a lot, and which my earlier post doesn’t answer directly, is whether the Google Mini or Search Appliance can spider raw XML. Unfortunately, no, it cannot.

The Mini / Search Appliance can read the XML, but it takes it in as straight text, so any search you run will match node names and attributes as well as content, rather than content alone.
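
To make the difference concrete, here is a rough sketch in Python. It is only an illustration using a simple set of words, not a model of how the appliance actually indexes anything, and the sample document and function names are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE = '<book lang="en"><title>Dune</title><author>Frank Herbert</author></book>'


def words_in_raw_text(xml_text):
    """Words seen when the XML is treated as straight text (tags and attributes included)."""
    return set(re.findall(r"[a-z]+", xml_text.lower()))


def words_in_content(xml_text):
    """Words seen when only the text content of the elements is considered."""
    root = ET.fromstring(xml_text)
    return set(re.findall(r"[a-z]+", " ".join(root.itertext()).lower()))


raw = words_in_raw_text(SAMPLE)
content = words_in_content(SAMPLE)

# "title" and "lang" are markup, not content, yet they appear in the raw word set.
print(sorted(raw - content))
```

A search for "title" would hit this document when it is read as straight text, even though no book in it has "title" anywhere in its content.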

The best I can suggest is to write a script that runs an XSL transform on your XML to turn it into a small (or indeed large) site of web pages, then spider those pages with the appliance.
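
As a minimal sketch of that idea, the Python below turns each record in an XML file into its own HTML page and writes an index page linking to them all, which gives the appliance somewhere to start crawling. Note that it does not use XSLT: it swaps in Python’s standard-library ElementTree to do an equivalent transform, since the standard library has no XSLT processor. The function name `xml_to_pages`, the `item` record tag and the `pageN.html` file names are all assumptions for illustration; adjust them to your own data.

```python
import xml.etree.ElementTree as ET
from html import escape
from pathlib import Path


def xml_to_pages(xml_text, out_dir, record_tag="item"):
    """Write one HTML page per <record_tag> element, plus an index.html linking to them.

    Returns the number of pages written (not counting the index).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    root = ET.fromstring(xml_text)
    links = []
    for n, record in enumerate(root.iter(record_tag), start=1):
        # Keep only text content; element names and attributes are dropped.
        parts = [t.strip() for t in record.itertext() if t.strip()]
        # Assumption: the first piece of text in a record makes a usable title.
        title = escape(parts[0]) if parts else f"Record {n}"
        body = "\n".join(f"<p>{escape(p)}</p>" for p in parts)
        name = f"page{n}.html"
        (out / name).write_text(
            f"<html><head><title>{title}</title></head>\n"
            f"<body>\n{body}\n</body></html>\n",
            encoding="utf-8",
        )
        links.append(f'<li><a href="{name}">{title}</a></li>')
    # The index page is the crawl starting point for the generated site.
    (out / "index.html").write_text(
        "<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n",
        encoding="utf-8",
    )
    return len(links)
```

You would call it with something like `xml_to_pages(Path("data.xml").read_text(encoding="utf-8"), "site")`, serve the resulting `site` directory from a web server, and point the appliance at its `index.html`. If you already have an XSL stylesheet, an XSLT processor will do the same job with the output under your control.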

  1. Comment by Jason Grovert — February 4, 2008 @ 6:57 pm

    We create a sitemap.xml file which contains all the pages on our site. We have ~7000 pages so far. Is there a way we can configure Google Mini to load a page (that we create using the sitemap.xml file) that has all 7000 links, and the Google Mini goes and crawls just those 7000 pages? (And nothing else that it finds in those 7000 pages)?

