=head1 NAME

perl.apache.org Site Indexing and Search Setup

=head1 Description

This document explains how to set up swish-e and how to index and
search the perl.apache.org site.  It also explains how to set up the
search options.

=head1 Setting up search options

To set up the search options, which allow searching only specific
subsections of the site, modify the source file in I<src/search> and
run:

  % cd src/search
  % ./make.pl

then commit the modified file and the autogenerated files.  The docs
inside I<make.pl> provide the rest of the details.

=head1 Setting up swish-e

=over

=item 1

Install the dev version of swish-e.  Currently we use SWISH-E
2.1-dev-25.

=item 2

Make sure that swish-e is in the PATH, so the apps will be able to
find it.

=back

=head1 Indexing

Normally build the site:

  % bin/build -f

(C<-d> to build pdfs), which among other things creates the dir
I<dst_html>.  Now run:

  % bin/makeindex

This script is already adapted for the production machine of
perl.apache.org.  If you are doing it elsewhere, you need to set an
environment variable to the base URL of the site:

  export MODPERL_SITE='http://perl.apache.org'

or

  export MODPERL_SITE='http://localhost:4000/dst_html'

tcsh:

  setenv MODPERL_SITE http://perl.apache.org

This is used as the base for spidering, and is also used to determine
the sections of the site (for limiting the search to those sections,
see below).

Now you can manually spider the site if you didn't use the script
already.  Index the site:

  % cd dst_html/search
  % swish-e -S prog -c swish.conf

You should see something like:

  Indexing Data Source: "External-Program"
  Indexing "./spider.pl"
  ./spider.pl: Reading parameters from 'default'
  Summary for: http://localhost/modperl-site/
  Duplicates:      5,357  (281.9/sec)
  Off-site links:  1,851  (97.4/sec)
  Total Bytes: 8,107,112  (426690.1/sec)
  Total Docs:        351  (18.5/sec)
  Unique URLs:       419  (22.1/sec)
  Removing very common words...
  no words removed.
  Writing main index...
  Sorting words ...
  Sorting 10599 words alphabetically
  Writing header ...
  Writing index entries ...
    Writing word text: Complete
    Writing word hash: Complete
    Writing word data: Complete
  10599 unique words indexed.
  5 properties sorted.
  351 files indexed.  8107112 total bytes.  307356 total words.
  Elapsed time: 00:00:20 CPU time: 00:00:02
  Indexing done!

Now you can search...

=head1 Searching

=over

=item 1

Go to the search page: ..../search/search.html

=item 2

Search.

If something doesn't work, check the I<error_log> file on the server
the search CGI script is running on.  The most common error is that
the swish-e binary cannot be found by the CGI script.  Remember that
CGI may be running under a different username and therefore may not
have the same PATH env variable.

=back

=head1 Swish-e related adjustments to the templates

=over

=item *

Since we want to index only the real content, the templates wrap it
in a pair of HTML comment markers: only the content between the
markers will be indexed.

=item *

Since we want to be able to search any sub-section of the site, the
search form includes the hidden variable C<sbm> (mnemonics: 'search
by meta').  For example, setting it to a section name will search all
the documents under that section's directory.

The correct values for the C<sbm> variable are set in the templates
when the site is created.  The main search page, I<search.html>, has
multiple checkboxes for the C<sbm> variable, so you can limit
searches to only the selected sections.

The C<$ENV{MODPERL_SITE}> mentioned earlier is matched against the
C<$uri> of each hit to extract only the wanted subsets of the hits:

  $uri =~ m!$ENV{MODPERL_SITE}/([^/]+)/.+$!;

where C<$1> is used as the section name.  So it's just using the
initial directory name for the section.  (See the sketch after this
list.)

=back
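To illustrate, here is a minimal, self-contained sketch (not the
actual search script) of how that regex reduces each hit's URI to its
top-level section name and filters the hits against the user's
C<sbm> selections.  The sample URIs, section names, and variable
names are made up for the example:

  use strict;
  use warnings;

  my $site = $ENV{MODPERL_SITE} || 'http://perl.apache.org';

  # Sections the user checked on the search form (the sbm values).
  my %wanted = map { $_ => 1 } qw(docs products);

  # In the real script these URIs come back from swish-e.
  my @hits = (
      "$site/docs/2.0/user/intro/overview.html",
      "$site/products/apache-modules.html",
      "$site/outstanding/success_stories/stories.html",
  );

  for my $uri (@hits) {
      # The initial directory name is the section name.
      next unless $uri =~ m!^\Q$site\E/([^/]+)/.+$!;
      my $section = $1;
      print "$uri\n" if $wanted{$section};
  }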
=head1 How does indexing work

Swish is run with a config file, in a mode that says to use an
external program to fetch documents.  That external program is called
I<spider.pl> (part of the swish-e distribution).

I<spider.pl> uses a config file, by default I<SwishSpiderConfig.pl>.
This file builds an array of hashes (in this case a single hash in
the array).  This hash is the config.

Part of the config are call-back functions that spider.pl will call
while spidering.  One says to skip image files.  Another one is a bit
more tricky: it splits a document into sections, creates new
"sub-pages" that are complete HTML pages, and calls the function in
spider.pl that sends those off to swish for indexing.  (That callback
then returns false to tell swish not to index the original document,
since its sections have already been indexed.)

That's about it.

One trick: for debugging you can run the spider without indexing:

  ./spider.pl > bigfile.out

Another trick: you can send SIGHUP to I<spider.pl> while indexing and
it will stop spidering, but let swish index what's been read so far.
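To make the callback mechanism concrete, here is a sketch of a
I<SwishSpiderConfig.pl>-style config.  It assumes the C<test_url> and
C<filter_content> callback signatures of the swish-e 2.1-dev
spider.pl; C<split_into_sections()> is a naive stand-in for the real
section splitter, and C<output_doc()> is a local substitute for
spider.pl's own output routine:

  # @servers is the variable spider.pl reads from its config file.
  @servers = (
      {
          base_url => $ENV{MODPERL_SITE} || 'http://localhost/modperl-site/',

          # Callback 1: skip image files.  spider.pl passes a URI
          # object; returning false means "do not fetch this URL".
          test_url => sub {
              my $uri = shift;
              return $uri->path !~ /\.(?:gif|jpe?g|png)$/i;
          },

          # Callback 2: split the fetched page into per-section
          # "sub-pages", emit each one as a complete document, then
          # return false so the original page itself is not indexed.
          filter_content => sub {
              my ( $uri, $server, $response, $content_ref ) = @_;
              my $n = 0;
              for my $section ( split_into_sections($$content_ref) ) {
                  output_doc( "$uri#section" . ++$n, $section );
              }
              return 0;
          },
      },
  );

  # Naive splitter: treats each <h2>-led chunk as a section; the
  # real code rebuilds every chunk into a complete HTML page.
  sub split_into_sections {
      my $html = shift;
      return grep { /\S/ } split /(?=<h2)/i, $html;
  }

  # Emit one document in the -S prog protocol that swish-e reads:
  # headers, a blank line, then the content itself.
  sub output_doc {
      my ( $path, $content ) = @_;
      print "Path-Name: $path\n",
            "Content-Length: ", length($content), "\n\n",
            $content;
  }

  1;  # config files must return a true value

=cut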