Comment on Avoiding session IDs when spidering by Paul
Fri, 14 Mar 2014 15:00:46 +0000
Hi Manu, I haven't used the latest version of the GSA software, but in the versions I have used, the GSA never sets the sessionId itself. It has always been created by the server being spidered.

If the sessionId that has been assigned to the spider is not automatically put into the links on the page it is crawling, your server may be giving the GSA a new sessionId every time it spiders one of those links, because the request would not carry the sessionId that was originally assigned to the spider.

So what I’m saying is: the way your pages are coded is most likely the reason you’re seeing lots of different sessionIds. It’s not the GSA creating new sessionIds deliberately.

It would be a lot easier if spiders would keep the same sessionId across a site, but then it’s up to us as developers to make our sites work that way, or just not set sessionIds where they’re not necessary. As spiders can’t tell what is changed on the site in reaction to their session, they are always going to prefer to spider without having a session set, as that’s the most likely state for a searcher to turn up at the site in, whether they use a GSA or other search engine to find the page.

All that said, it’s still very annoying to have to code around!
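
As a rough illustration of the "lots of different sessionIds" problem, here is a minimal Python sketch that strips a session parameter from a URL so that two links differing only in their session collapse to the same address. The parameter name `sessionId` is taken from the discussion above; the helper name `strip_session_id` and the example URLs are made up for illustration, and your site may carry the session somewhere else entirely (a path segment or a cookie, for instance).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session parameter name(s); compared case-insensitively.
SESSION_PARAMS = {"sessionid"}


def strip_session_id(url):
    """Return the URL with any session query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # Rebuild the URL with only the non-session parameters.
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Two hypothetical links that differ only in their session.
a = strip_session_id("http://example.com/page?id=3&sessionId=abc123")
b = strip_session_id("http://example.com/page?id=3&sessionId=zzz999")
print(a == b)
```

Something along these lines could be applied wherever links are generated for visitors that do not need a session, which is one way of "not setting sessionIds where they’re not necessary".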

Comment on Avoiding session IDs when spidering by Manu Garg
Fri, 14 Mar 2014 04:23:16 +0000
Thanks for the info, this was very helpful. But my question is: how does the GSA create and manage sessions when it crawls URLs? In other words, the stuff explained above deals with the sessions in the destination URLs. I've seen that the GSA also creates a new sessionId for each URL, even though all the URLs fall under a specific set of patterns. Isn't that an overhead on the spider as well? Wouldn't it be more convenient for the search engine to crawl all the URLs with the same sessionId (dotcomsid) for a single request, irrespective of the number of URLs to be crawled?

Comment on How do I access the XML from the Google Mini / GSA? by Sebastian Felix Schwarz
Thu, 30 Aug 2012 14:48:51 +0000
Hi, I have a problem getting the FULL XML from my GSA!
Putting my query in the browser address field returns the right XML with all items. Using cURL returns only the META content WITHOUT the RES node.


What happened? I also tried file_get_contents() … no success.
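
One way to narrow a problem like this down is to check programmatically whether a saved response contains the RES node at all. The following is a minimal Python sketch, not a fix: the element names (`GSP`, `RES`, `R`, `U`, `T`) follow the GSA XML results format as I understand it, while the sample documents and the helper name `count_results` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def count_results(xml_text):
    """Return the number of <R> entries inside <RES>, or 0 if <RES> is absent."""
    root = ET.fromstring(xml_text)
    res = root.find("RES")
    if res is None:
        return 0
    return len(res.findall("R"))


# Hypothetical response that includes results.
FULL = (
    '<GSP VER="3.2"><Q>testing</Q>'
    '<RES SN="1" EN="2"><M>2</M>'
    '<R N="1"><U>http://example.com/a</U><T>Page A</T></R>'
    '<R N="2"><U>http://example.com/b</U><T>Page B</T></R>'
    '</RES></GSP>'
)

# Hypothetical response with only the query metadata and no RES node.
META_ONLY = '<GSP VER="3.2"><Q>testing</Q></GSP>'

print(count_results(FULL), count_results(META_ONLY))
```

Running the same check on the browser response and on the cURL response would at least confirm whether the two requests really are returning different documents.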

Comment on How do I access the XML from the Google Mini / GSA? by Paul
Tue, 17 Jul 2012 10:07:24 +0000
Hi Peter, I've updated the link to the new home for the GSA XML documentation.

I do wish Google would put in re-directs for this stuff. You’d think they’d understand the need for that, what with being a search engine and suggesting other people do it.

Comment on How do I access the XML from the Google Mini / GSA? by Peter Knaggs
Mon, 16 Jul 2012 09:09:05 +0000
Both GSA XML reference links are broken.

Comment on Google Mini: Searching Subcollections from the frontend by Steve
Tue, 29 Nov 2011 04:22:11 +0000
This was exactly what I was looking for. I'm not very tech savvy but I managed this. Thanks.

Comment on When your GSA license runs out by Dave Watts
Wed, 01 Jun 2011 14:42:05 +0000
@ltman: it costs whatever you paid to purchase the license in the first place, more or less. The Google Search Appliance is licensed, not purchased.

Comment on Can you put new pages or applications on a Google Mini? by Stoett
Tue, 03 May 2011 18:09:46 +0000
I am saddened by the "No" answer, but thank you for posting this. At least now I have found the answer to my question.

Comment on Custom meta tags in search results and full stops by Eric
Mon, 02 May 2011 23:51:06 +0000
Paul,

I was wondering, do you know if you can specify an inmeta search… something like so:


and also specify that all pages that are missing the organization meta tag be included in the search?


Comment on Setting a unique user agent to help control spidering by Lalith
Thu, 14 Apr 2011 15:22:45 +0000
Hi
Thanks for the information!
I am trying to feed pdfs to GSA, the error I am getting is “Crawled with empty body: Conversion error.”

Can you help me resolve this issue?

Comment on When your GSA license runs out by Buzz
Tue, 08 Jul 2008 15:15:39 +0000
Well the GSA mini license _does_ run out … just not anytime soon

“License valid until: March 07, 9009”
“The license will expire in 2556939 days.”

Just in time for the 9010 Apocalypse, all hail our robot overlords … etc.
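
For anyone curious whether the two license lines agree with each other, a quick date subtraction in Python checks the arithmetic. This sketch assumes the figure was read on the day the comment was posted (8 July 2008); the helper name `days_until` is made up for illustration.

```python
from datetime import date


def days_until(expiry, today):
    """Number of whole days from `today` to `expiry`."""
    return (expiry - today).days


# Expiry date quoted above; "today" is assumed to be the comment's posting date.
remaining = days_until(date(9009, 3, 7), date(2008, 7, 8))
print(remaining)  # 2556939
```

Under that assumption the count matches the "2556939 days" quoted above.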

Comment on When your GSA license runs out by Itman
Tue, 15 Apr 2008 14:53:26 +0000
Hi,
does it cost 30 thousand dollars to renew the license?

Comment on You can't spider XML with a Google Mini (so far) by Jason Grovert
Mon, 04 Feb 2008 18:57:47 +0000
We create a sitemap.xml file which contains all the pages on our site. We have ~7000 pages so far. Is there a way we can configure Google Mini to load a page (that we create using the sitemap.xml file) that has all 7000 links, and the Google Mini goes and crawls just those 7000 pages? (And nothing else that it finds in those 7000 pages)?
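
The "page built from sitemap.xml" part of this question can be sketched in Python as below. It only covers generating a flat HTML page of links from a sitemap in the standard sitemaps.org format; whether the Mini can be told to crawl those links and nothing beyond them is a configuration question this sketch does not answer. The function names and the sample sitemap are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def links_page(urls):
    """Build a bare HTML page with one link per URL."""
    items = "\n".join(
        '<li><a href="{0}">{1}</a></li>'.format(escape(u, quote=True), escape(u))
        for u in urls
    )
    return "<html><body><ul>\n" + items + "\n</ul></body></html>"


# Hypothetical two-entry sitemap standing in for the real ~7000-entry file.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://example.com/</loc></url>"
    "<url><loc>http://example.com/about?x=1&amp;y=2</loc></url>"
    "</urlset>"
)

urls = sitemap_urls(SAMPLE)
page = links_page(urls)
print(len(urls))
```

The generated page could then be saved somewhere the crawler can reach and used as its starting URL.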

Comment on Relevance in Mini and GSA searches by Joel
Thu, 10 Jan 2008 16:25:14 +0000
The open source nutch-IICE project is currently similar to the Google GSA. You can take a look at it.

Comment on How do I access the XML from the Google Mini / GSA? by hazarth
Fri, 21 Dec 2007 10:42:00 +0000
I implemented GSA search functionality in my applications.
When searching for “testing”, no output comes back.