Lucene

Overview
Crawling a website
Creating an index from the command line
Indexing XML documents
Extract text from a PDF document

Overview

There are two URLs for the search screen, relative to your publication: search-live/lucene searches the live area, and search-authoring/lucene searches the authoring area of your publication.

If you want to customize the layout of the search screen for your publication, place a stylesheet at lenya/xslt/search/search-and-results.xsl relative to your publication root.

Lucene indices are stored within the work/search/index/$AREA/index directory of your publication. The work/search/htdocs_dump/$AREA directory holds content from crawling (see below).

The search pipelines are defined within global-sitemap.xmap and lucene.xmap.

Crawling a website

Crawl a website by running:


ant -f build/lenya/webapp/lenya/bin/crawl_and_index.xml -Dcrawler.xconf=build/lenya/webapp/lenya/pubs/default/config/search/crawler-live.xconf crawl

Note that there is a search.properties file in build/lenya/webapp/lenya/bin that you may have to change. The file passed as crawler.xconf needs to contain the following elements:


<crawler>
  <user-agent>lenya</user-agent>

  <base-url href="http://lenya.apache.org/index.html"/>
  <scope-url href="http://lenya.apache.org/"/>

  <uri-list src="work/search/lucene/uris.txt"/>
  <htdocs-dump-dir src="work/search/lucene/htdocs_dump/lenya.apache.org"/>

  <!-- <robots src="robots.txt" domain="lenya.apache.org"/> -->
</crawler>

user-agent is the HTTP user agent that the crawler identifies itself with.
base-url is the start URL for the crawler.
scope-url limits the crawl to that site or subdirectory.
uri-list refers to a file that will contain all URLs found during the crawl.
htdocs-dump-dir specifies the directory that will contain the crawled site.
robots specifies an optional robots file that follows the Robot Exclusion Standard.
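
As a rough illustration of how these elements fit together, the following Python sketch reads a crawler configuration with the standard library and applies a simple prefix test for scope-url. The helper names load_crawler_config and in_scope are hypothetical, and the prefix test is only an assumption about how the scope is applied; the actual crawler may decide scope differently.

```python
import xml.etree.ElementTree as ET

# Sample configuration mirroring the elements described above.
CRAWLER_XCONF = """\
<crawler>
  <user-agent>lenya</user-agent>
  <base-url href="http://lenya.apache.org/index.html"/>
  <scope-url href="http://lenya.apache.org/"/>
  <uri-list src="work/search/lucene/uris.txt"/>
  <htdocs-dump-dir src="work/search/lucene/htdocs_dump/lenya.apache.org"/>
</crawler>
"""


def load_crawler_config(xml_text):
    """Return the crawler settings as a plain dictionary (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        "user_agent": root.findtext("user-agent"),
        "base_url": root.find("base-url").get("href"),
        "scope_url": root.find("scope-url").get("href"),
        "uri_list": root.find("uri-list").get("src"),
        "htdocs_dump_dir": root.find("htdocs-dump-dir").get("src"),
    }


def in_scope(url, scope_url):
    """Assumption: a URL is in scope when it starts with the scope-url value."""
    return url.startswith(scope_url)


config = load_crawler_config(CRAWLER_XCONF)
print(config["user_agent"], config["base_url"])
print(in_scope("http://lenya.apache.org/docs/index.html", config["scope_url"]))
print(in_scope("http://www.apache.org/", config["scope_url"]))
```

Under this assumption, a page below http://lenya.apache.org/ is crawled, while a link that leaves the site is skipped.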

If you want to fine-tune the crawling (and do not have access to the remote server to put a robots.txt there), you can specify exclusions in a local robots.txt file:


# lenya.apache.org

User-agent: *
Disallow: /there_seems_to_be_a_bug_within_websphinx_Robot_Exclusion.html

#Disallow:

User-agent: lenya
Disallow: /do/not/crawl/this/page.html
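
To preview how rules like these are typically interpreted, the following sketch feeds the same file to Python's standard urllib.robotparser. This is not the parser the crawler uses, so the crawler's results may differ (the sample file itself hints at a suspected exclusion bug); treat the sketch only as a way to sanity-check a robots.txt before a crawl.

```python
from urllib.robotparser import RobotFileParser

# The local robots.txt from the example above.
ROBOTS_TXT = """\
# lenya.apache.org

User-agent: *
Disallow: /there_seems_to_be_a_bug_within_websphinx_Robot_Exclusion.html

#Disallow:

User-agent: lenya
Disallow: /do/not/crawl/this/page.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether the "lenya" user agent may fetch two URLs.
excluded_page_allowed = parser.can_fetch(
    "lenya", "http://lenya.apache.org/do/not/crawl/this/page.html"
)
index_page_allowed = parser.can_fetch(
    "lenya", "http://lenya.apache.org/index.html"
)
print(excluded_page_allowed, index_page_allowed)
```

With this parser, the page listed under the lenya user agent is reported as not fetchable, while index.html remains fetchable.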

Creating an index from the command line


ant -f build/lenya/webapp/lenya/bin/crawl_and_index.xml -Dlucene.xconf=build/lenya/webapp/lenya/pubs/default/config/search/lucene-live.xconf index

Note that there is a search.properties file in build/lenya/webapp/lenya/bin that you may have to change. lucene-live.xconf has the following elements:


<lucene>
  <update-index type="new"/>
  <!--
  <update-index type="incremental"/>
  -->

  <index-dir src="../../work/search/lucene/index/index"/>
  <htdocs-dump-dir src="../../work/search/lucene/htdocs_dump"/>

  <indexer class="org.apache.lenya.lucene.index.DefaultIndexer"/>
</lucene>

Indexing XML documents

To index XML documents, configure the org.apache.lenya.lucene.index.ConfigurableIndexer (see the indexer element above).

With namespaces:


<?xml version="1.0"?>

<luc:document xmlns:luc="http://apache.org/cocoon/lenya/lucene/1.0">
  <luc:field name="currwfstate" type="Text" xpath="/wf:history/wf:version[last()]/@state">
    <namespace prefix="wf">http://apache.org/cocoon/lenya/workflow/1.0</namespace>
  </luc:field>
</luc:document>
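
To show what the namespaced XPath expression above selects, here is a small Python sketch using the standard library. The workflow history document is invented for illustration. ElementTree supports only a subset of XPath, so the final @state step is emulated with get(); this is not how the indexer itself evaluates the expression.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping, matching the namespace element in the field definition.
NS = {"wf": "http://apache.org/cocoon/lenya/workflow/1.0"}

# Hypothetical workflow history document, invented for illustration.
HISTORY = """\
<wf:history xmlns:wf="http://apache.org/cocoon/lenya/workflow/1.0">
  <wf:version state="authoring"/>
  <wf:version state="review"/>
  <wf:version state="live"/>
</wf:history>
"""

root = ET.fromstring(HISTORY)

# Equivalent of /wf:history/wf:version[last()]/@state, relative to the root element.
last_version = root.find("wf:version[last()]", NS)
current_state = last_version.get("state")
print(current_state)
```

For this sample document, the value that would end up in the currwfstate field is the state of the last version element.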

Concatenating element values and setting default values for elements that do not exist:


<?xml version="1.0"?>

<luc:document xmlns:luc="http://apache.org/cocoon/lenya/lucene/1.0">
  <luc:field name="title" type="Text" xpath="/article/head/title"/>
  <luc:field name="subtitle" type="Text" xpath="/article/head/subtitle"/>
  <luc:field name="lead" type="UnStored" xpath="/article/head/abstract"/>
  <luc:field name="contents" type="UnStored" xpath="/"/>
  <luc:field name="author" type="UnStored">
    <namespace prefix="lenya">http://apache.org/cocoon/lenya/page-envelope/1.0</namespace>
    <namespace prefix="dc">http://purl.org/dc/elements/1.1/</namespace>
    <xpath>/*/lenya:meta/dc:contributor</xpath>
  </luc:field>
  <luc:field name="date" type="Text">
    <namespace prefix="lenya">http://apache.org/cocoon/lenya/page-envelope/1.0</namespace>
    <xpath default="1969">/*/lenya:meta/year</xpath><text>.</text><xpath default="02">/*/lenya:meta/month</xpath><text>.</text><xpath default="16">/*/lenya:meta/day</xpath>
  </luc:field>
</luc:document>
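
The concatenation used for the date field can be sketched in Python as follows. The source document and the helper build_field_value are hypothetical, and the paths are written relative to the root element because ElementTree does not accept absolute paths; the sketch only illustrates the idea of joining XPath results, literal text, and defaults, not the indexer's actual implementation.

```python
import xml.etree.ElementTree as ET

NS = {"lenya": "http://apache.org/cocoon/lenya/page-envelope/1.0"}

# Hypothetical source document; the month element is deliberately missing.
ARTICLE = """\
<article xmlns:lenya="http://apache.org/cocoon/lenya/page-envelope/1.0">
  <lenya:meta>
    <year>2004</year>
    <day>07</day>
  </lenya:meta>
</article>
"""

# Each part is either ("xpath", path, default) or ("text", literal),
# mirroring the xpath and text children of the date field.
DATE_PARTS = [
    ("xpath", "lenya:meta/year", "1969"),
    ("text", "."),
    ("xpath", "lenya:meta/month", "02"),
    ("text", "."),
    ("xpath", "lenya:meta/day", "16"),
]


def build_field_value(root, parts, namespaces):
    """Join XPath results and literal text, using defaults for missing elements."""
    pieces = []
    for part in parts:
        if part[0] == "text":
            pieces.append(part[1])
        else:
            _, path, default = part
            # findtext returns the default when no element matches the path.
            pieces.append(root.findtext(path, default=default, namespaces=namespaces))
    return "".join(pieces)


date_value = build_field_value(ET.fromstring(ARTICLE), DATE_PARTS, NS)
print(date_value)
```

Because the sample document has no month element, the default 02 is used for that part, while year and day come from the document.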

Extract text from a PDF document


ant -f build/lenya/webapp/lenya/bin/crawl_and_index.xml -Dhtdocs.dump.dir=build/lenya/webapp/lenya/pubs/default/work/search/lucene/htdocs_dump xpdf

Also see the targets pdfbox and pdfadobe.