Comments on: Avoiding session IDs when spidering

By: Paul

Paul — Fri, 14 Mar 2014 15:00:46 +0000

Hi Manu, I haven’t used the latest version of the GSA software, but in the versions I have used, the GSA never sets the sessionId itself; it has always been created by the server being spidered.

If the sessionId assigned to the spider is not automatically written into the links on the page it is crawling, the GSA will request each of those links without a sessionId in the URL. Your server may then be giving it a new sessionId on every request, because it has no way to tell that the request comes from the same spider.
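To illustrate, here is a minimal Python sketch of that behaviour. It is a simulation, not GSA or real server code: `serve`, `crawl`, the `sessionId` parameter name, and the example.com URLs are all made up for the illustration, and it assumes the server tracks sessions through the URL only.

```python
import itertools
from urllib.parse import urlparse, parse_qs

_counter = itertools.count(1)

def serve(url):
    """Hypothetical server: reuse the sessionId found in the URL, else issue a new one."""
    query = parse_qs(urlparse(url).query)
    if "sessionId" in query:
        return query["sessionId"][0]
    return f"S{next(_counter)}"

def crawl(urls):
    """Hypothetical spider: request each URL exactly as it appears in the page."""
    return [serve(u) for u in urls]

# Links as they appear when the page does NOT copy the session into them.
plain_links = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
# The same links when the page DOES copy the spider's session into them.
rewritten_links = [u + "?sessionId=S0" for u in plain_links]

new_ids = crawl(plain_links)       # a fresh sessionId per request
kept_ids = crawl(rewritten_links)  # the same sessionId every time

print(new_ids)
print(kept_ids)
```

With the plain links, every request comes back with a different sessionId; with the rewritten links, the spider keeps the one it was given.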

So what I’m saying is that the way your pages are coded is most likely the reason you’re seeing lots of different sessionIds; the GSA isn’t creating new sessionIds deliberately.

It would be a lot easier if spiders kept the same sessionId across a site, but it’s up to us as developers to make our sites work that way, or just not set sessionIds where they’re not necessary. As spiders can’t tell what changes on the site in reaction to their session, they are always going to prefer to spider without a session set, as that’s the most likely state for a searcher to arrive at the site in, whether they use a GSA or another search engine to find the page.

All that said, it’s still very annoying to have to code around!

By: Manu Garg

Manu Garg — Fri, 14 Mar 2014 04:23:16 +0000

Thanks for the info, this was very helpful. But my question is how the GSA creates and manages sessions when it crawls URLs. In other words, the explanation above deals with sessions in the destination URLs. I’ve seen that the GSA also creates a new sessionId for each URL, even though all the URLs fall under a specific set of patterns. Isn’t that an overhead for the spider as well? Wouldn’t it be more convenient for the search engine to crawl all the URLs with the same sessionId (dotcomsid) for a single request, irrespective of the number of URLs to be crawled?

By: Paul

Paul — Thu, 15 Feb 2007 11:07:04 +0000

If you don’t exclude the session ID, it will spider the pages. I suggest you set the host load to 1, which might help it keep to the same session ID, as it then effectively opens only one connection to spider the site rather than 4 (the default host load).

You can’t get it to ignore part of the URL; it only understands inclusion and exclusion based on parameters.

If you have to have a session ID, then you may need to look at changing the CMS or whatever runs the site so it will always feed the same session ID to the Mini when it is spidering.
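One way to sketch that idea in Python, purely as an illustration: the site picks a fixed session ID whenever the request looks like it comes from the crawler. The function name, the fixed ID, and the user-agent marker are all assumptions made for this example; check what user-agent string your own appliance actually sends before relying on anything like this.

```python
import uuid

# Assumption: the crawler identifies itself with this substring in its User-Agent.
CRAWLER_MARKER = "gsa-crawler"
# Hypothetical fixed session ID handed to the crawler on every request.
CRAWLER_SESSION_ID = "crawler-fixed-session"

def session_id_for(user_agent, existing_id=None):
    """Return the session ID the site should use for this request."""
    if CRAWLER_MARKER in (user_agent or "").lower():
        # Always the same ID, so every crawled URL carries an identical session.
        return CRAWLER_SESSION_ID
    # Ordinary visitors keep their session, or get a fresh random one.
    return existing_id or uuid.uuid4().hex
```

Because the crawler always gets the same ID, the URLs it sees stay stable between visits, while normal visitors are unaffected.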

Basically, spiders hate session IDs, so if you have to have one, you’re always going to be in a bit of trouble. Hmm… you could set up a page of links to every page on your site, all with the same session ID appended to them, then set that page as the place the Mini starts spidering. That might then allow it access to the whole site without getting too confused. Unfortunately this is just guesswork from me; usually I deal with people who can get rid of the sessions, or places where we exclude them entirely.
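If you want to try the page-of-links idea, a small script could generate it. This is only a sketch of that untested suggestion: `with_session`, `seed_page`, the `sessionId` parameter name, and the example URLs are hypothetical, and it assumes your site accepts a session ID passed in the query string.

```python
from html import escape
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def with_session(url, session_id, param="sessionId"):
    """Return url with param set to session_id, replacing any existing value."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, session_id))
    return urlunparse(parts._replace(query=urlencode(query)))

def seed_page(urls, session_id):
    """Build a plain HTML page linking to every URL with the same session ID."""
    items = "\n".join(
        f'<li><a href="{escape(with_session(u, session_id))}">{escape(u)}</a></li>'
        for u in urls
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"

if __name__ == "__main__":
    pages = ["http://example.com/", "http://example.com/products?cat=2"]
    print(seed_page(pages, "abc123"))
```

You would then point the Mini at the generated page as its start URL. Whether it keeps that session ID on links it finds deeper in the site still depends on how your pages write their own links.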

By: Ty C.

Ty C. — Wed, 14 Feb 2007 23:37:55 +0000

What if the website requires the session ID? Is there a way to tell the GSA to ignore a specific query string parameter but still index the page? Otherwise the entire site will be ignored, won’t it?