How do I spider a protected area with a Google Mini?

January 30, 2006 in Google Mini, Q&A, Spidering | Comments (8)

If you have a protected area on your website but still want to spider it and make the results available to searchers (e.g. an extranet you want to offer searching on), you can give the Google Mini a username and password so it can access the area and spider it.

For instance, say you have used .htaccess and .htpasswd files to protect a directory on your website called ‘clientarea’, and the username ‘myuser’ and password ‘mypass’ are needed to access it.
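
If you haven’t already set the protection up, here is a minimal sketch of what the .htaccess file might look like, assuming Apache Basic authentication (the file path is just a placeholder for this example):

# clientarea/.htaccess -- password-protect this directory
AuthType Basic
AuthName "Client Area"
# full server path to the password file
AuthUserFile /home/example/.htpasswd
Require valid-user

The password file itself is created with Apache’s htpasswd tool, e.g. ‘htpasswd -c /home/example/.htpasswd myuser’, which prompts you for the password (‘mypass’ in this example).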

In the Mini’s Admin area, go into ‘Configure Crawl’. If there isn’t a direct link to the protected area from another area you are spidering, add its URL to ‘Start Crawling from the Following URLs’. Note: when I tried this, I had to add a direct link to a particular file, rather than just to the directory, so in this example we add:

http://webpositioningcentre.co.uk/clientarea/index.php

Click ‘Save URLs to Crawl’ and go to the ‘Crawler Access’ area.

In the area labelled ‘Users and Passwords for Crawling’, enter the URL of the protected area and a username and password to access it. If you also need to set a domain for the protected area, fill in that box as well.

For URLs Matching Pattern: http://webpositioningcentre.co.uk/clientarea/
Use this user: myuser
With Password: mypass (same in Confirm Password)

Click on ‘Save Crawler Access Configuration’ and you’re ready to go. The form will clear the password stars from the boxes, but the password is remembered.

Next time you crawl your sites, the Mini will access the protected area as if it were a user handing over the username and password you set. In the search results it will show titles, URLs and a snippet of the page as usual, but when a searcher clicks on the link they will need a valid username and password to access the area.
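
Behind the scenes this is plain HTTP Basic authentication, assuming that is what your .htaccess protection uses: each request the Mini makes for the protected area carries an Authorization header, roughly like this (the value is just ‘myuser:mypass’ base64-encoded):

GET /clientarea/index.php HTTP/1.1
Host: webpositioningcentre.co.uk
Authorization: Basic bXl1c2VyOm15cGFzcw==

If you want to check the crawler will get through before running a crawl, you can make the same sort of request yourself, e.g. with ‘curl -u myuser:mypass http://webpositioningcentre.co.uk/clientarea/index.php’.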

Warning: searchers will be able to click on the ‘Cached’ link and view the contents of the page. To stop the Google Mini caching the page, put the following code in the <head> area of the page in question:

<meta name="robots" content="noarchive" />

This will stop the Mini, and any other search engine you allow access to the pages, from storing a cached version of the page. They will still show part of the page as the snippet in the results, but searchers won’t get a ‘Cached’ link to click on.
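
For instance, the top of a protected page might look like this (the title is just an example):

<head>
<title>Client Area</title>
<meta name="robots" content="noarchive" />
</head>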

Comments (8)


  1. Comment by Jason Divis — August 23, 2006 @ 3:28 pm

    Is this the same process that I would go through if one has to log in to our site (ASP-based)? I have specified the user name and password that I use to log in to our site, and the pattern of the URL to where our login page is, but it doesn’t look like the Mini is able to gain access to our secure site. It only crawls the 10 pages on the non-secure side, and is redirected when it tries to crawl the secure pages…

  2. Comment by Paul — August 23, 2006 @ 4:06 pm

    I think the process I’ve listed only works with standard user/password combinations as set up by the web server (e.g. Apache or IIS). To get through your security, I think you’ll need to look at the HTTP Headers bit of ‘Crawl and Index’ (Mini v2) or ‘Configure Crawl’ -> ‘Crawler Parameters’ (Mini v1). You can use this either to send a cookie through with the spider, or to send some special bit of information which matches some code on your pages and will let it in without needing the standard cookie set when people log in.

    I do need to look at this some more and write up an example; I just haven’t had time recently as I’ve been busy on client work.

  3. Comment by Jason Divis — August 23, 2006 @ 5:41 pm

    Thanks for pointing me in the right direction. If I can get to the header before you, I’ll have to post my solution here.

  4. Comment by Anon — December 20, 2006 @ 5:25 pm

    By entering a user/password for the crawler to use, is the password encrypted in any way when stored?

  5. Comment by Paul — December 20, 2006 @ 5:49 pm

    When you put the password in, it goes into a ‘password’ type field, so it’s shown as *s. Once the password is stored, it is not shown in the form again (i.e. when checking or editing the settings), so you can’t get it out that way. The username is shown.

    There’s no way of getting at the information on the hard disk unless you take the Mini apart and plug the hard disk into another system, so I have no idea how it stores the passwords.

    Depending on the authentication system, it may need to send the passwords to the site as clear text, as it won’t be able to send them encrypted if the site is expecting to receive them unencrypted. So if it does store the passwords encrypted, it will not be in a secure one-way system.

  6. Comment by Fatimah — May 24, 2007 @ 2:18 pm

    Did anyone find a solution to this?

  7. Comment by shauna durham — July 5, 2007 @ 8:39 pm

    I have set up the username and password for the crawl and it is working. However, because it is logging in via our login page, instead of the name of the page showing (which is what I want), it shows “Member login…” for all the protected areas of our site. Not very user friendly. Does anyone know how to fix this problem?

  8. Comment by Ed — October 17, 2007 @ 3:50 pm

    I want the password-protected results to show up in the results, with an icon showing it is Members Only content. Is this possible with the Google Mini?

    thanks
