How does the GSA / Google Mini count documents?

January 8, 2006 in Google Mini, GSA, Q&A, Spidering

The Google Mini has a limit of 100,000 documents. The Google Search Appliance has a limit of between 500,000 and 2 million, although you can combine several GSAs to raise that limit higher.

Each ‘document’ is a page or file which has a unique address, so

d:\Project Weevils\documents\Squared burrows.doc

are all documents. To a developer, the last two URLs may look like the same page showing different content. To the GSA, that doesn’t matter: they are unique URLs, so they are different documents and will be counted separately.

So, with the Mini you get to index 100,000 individual addresses. If you have a dynamic site which shows all its content via variables on one address, then each page served dynamically with a different variable value is counted as a separate document.
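
As a rough illustration of that counting rule, here is a minimal Python sketch that treats every distinct address as one document. The example.com URLs and the `count_documents` helper are made up for the example, and it compares addresses as plain strings; how the appliance itself normalises URLs before counting is not covered here.

```python
# Sketch only: each distinct address string counts as one document.
MINI_LIMIT = 100_000  # document limit for the Google Mini, as stated above


def count_documents(urls):
    """Return the number of distinct addresses in an iterable of URLs."""
    return len(set(urls))


# Hypothetical crawl list for a dynamic site served from one script.
crawled = [
    "http://example.com/page.php?id=1",
    "http://example.com/page.php?id=2",  # same script, different variable
    "http://example.com/page.php?id=1",  # repeat of an address already seen
    "http://example.com/about.html",
]

used = count_documents(crawled)
remaining = MINI_LIMIT - used
print(f"{used} documents counted, {remaining} left under the limit")
```

The two `page.php` addresses count separately even though one script serves both, while the repeated address is only counted once.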

One problem with this: if you use multiple variables in your URLs and your page or CMS can show the same content under different URLs, the same content can be spidered several times under different addresses, and each copy counts against the limit. This is one variety of ‘spider trap.’ If this might happen on your site, you should try to prevent it by giving the spider a path to one set of the content and ignoring the others using regular expressions. I’ll try to write more on that soon.
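
To make the idea concrete, here is a small Python sketch of filtering out duplicate addresses with a regular expression. The variable names (`print`, `sort`, `sessionid`), the example.com URLs and the `should_crawl` helper are all assumptions for illustration, and the pattern uses Python's `re` syntax, which may not match the pattern syntax the appliance accepts.

```python
import re

# Hypothetical variables that change how a page is presented, not what it
# contains. Any URL carrying one of them is treated as a duplicate.
IGNORE = re.compile(r"[?&](print|sort|sessionid)=")


def should_crawl(url):
    """Return True if the URL has none of the ignored variables."""
    return IGNORE.search(url) is None


urls = [
    "http://example.com/article.php?id=7",
    "http://example.com/article.php?id=7&print=1",    # printer view of id=7
    "http://example.com/article.php?id=7&sort=date",  # re-sorted view of id=7
    "http://example.com/article.php?id=8",
]

kept = [u for u in urls if should_crawl(u)]
for u in kept:
    print(u)
```

Only the plain `id=7` and `id=8` addresses survive, so each article is counted once rather than once per variation.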
