
Microsoft Exchange

Preparing the Exchange server

To crawl Exchange you normally have to make a few changes on the Exchange server.

  • WebDAV. The ES uses WebDAV to connect to Exchange, so WebDAV has to be installed. This is the same access method Entourage on Mac uses. A good guide on how to set it up is available here: Accessing Exchange 2007 from your Apple Macintosh.
  • The ES needs the "Receive As" right on all mailboxes it is to crawl. You can either set this up for each user, or set the permission on the mailbox store to affect all users (see the example below).
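
On Exchange 2007 the store-wide permission can be granted from the Exchange Management Shell. The following is only a sketch, assuming a hypothetical crawler account named EXAMPLE\sd-crawler; substitute your own domain and account name:

Get-MailboxDatabase | Add-ADPermission -User "EXAMPLE\sd-crawler" -AccessRights ExtendedRight -ExtendedRights Receive-As

This grants "Receive As" on every mailbox database on the server, so the crawler can open all mailboxes without any per-user setup.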

Setting up the ES

Go to "Add manually" in the Collections/Resources menu and select the Exchange connector.

[Screenshot: Add Exchange collection]

  • Select or add a new username for the crawler to use when accessing the Exchange server.
  • Set the Exchange server address. Normally this is the hostname or IP address of the Exchange server.
  • Select which users' mailboxes to crawl. This list is populated from your Active Directory, and thus requires a working user system setup.

[Screenshot: Add Exchange collection, next step]

Excluding files and folders from a crawl

None of the Searchdaimon ES crawlers have a specific function for excluding resources from being crawled. This is by design. We feel it would be bad practice to set this up from the Searchdaimon ES administrator interface: if you ever changed search engines you would have to set it all up again, and it is much easier to audit security when all access information is kept locally on the original server.

To exclude resources from being crawled, instead use a dedicated user for the Searchdaimon ES and add a deny ACL for that user on the resource.

This may be a little more cumbersome than having a disallow list in the Searchdaimon ES administrator interface, but it is less prone to bugs and makes it easier to set up other search engines, backup systems and business intelligence systems should the need arise later.

Examples

For most systems

Use a dedicated user for the Searchdaimon ES. Add a deny ACL for that user on whatever you want to exclude, as sketched below.
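
On a Linux file server with POSIX ACLs, for example, a named-user entry with no permissions denies that user even where group or world permissions would allow access. A minimal sketch, assuming a hypothetical crawler account sd-crawler and an excluded directory /srv/data/private:

setfacl -R -m u:sd-crawler:--- /srv/data/private

You can verify the result with getfacl /srv/data/private.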

For SMB file shares

Use a dedicated user for the Searchdaimon ES. On folders and files you don't want to index, select Properties in Windows Explorer and add a deny entry for the Searchdaimon ES user on the Security tab. The same can be done from the command line, as shown below.
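
From an elevated command prompt, icacls can set the same deny entry. A sketch, assuming a hypothetical share folder D:\Shares\Private and crawler account EXAMPLE\sd-crawler:

icacls "D:\Shares\Private" /deny EXAMPLE\sd-crawler:(OI)(CI)R

The (OI)(CI) inheritance flags apply the deny to all files and subfolders, and denying read (R) is enough to keep the crawler out.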

For web servers

Create a robots.txt file and add a Disallow rule for the sdbot user agent.

Example:
User-agent: sdbot
Disallow: /folder/

More information about the robots.txt standard is available at http://www.robotstxt.org/robotstxt.html.

For folders on web servers

To prevent directory listings from appearing in the search results you may want to block indexing of folders. To do this, use a meta tag like this:

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

With the Apache web server the best way is to create .htaccess and HEADER.html files.

Add the following to the .htaccess file:

Options +Indexes
IndexOptions +FancyIndexing +SuppressHTMLPreamble +SuppressColumnSorting
HeaderName HEADER.html

Example of HEADER.html:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
  <title>Index</title>
  <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</head>
<body>

This will create directory listings that are easy to browse for both people and crawlers. More information on these directives is available in the Apache mod_autoindex manual at http://httpd.apache.org/docs/2.2/mod/mod_autoindex.html.

