org.apache.nutch.tools
Class PruneIndexTool

java.lang.Object
  extended by org.apache.nutch.tools.PruneIndexTool
All Implemented Interfaces:
Runnable

public class PruneIndexTool
extends Object
implements Runnable

This tool prunes unwanted content from existing Nutch indexes. The main method accepts a list of segment directories (containing indexes). Any document that matches one or more queries from a list of Lucene queries is pruned from these indexes. The queries are read from a file, which is defined in the standard config file or explicitly overridden on the command line. Segments should already be indexed; any segments that are missing indexes will be skipped.

NOTE 1: Queries are expressed in Lucene's QueryParser syntax, so knowledge of the available Lucene document fields is required. This can be obtained by reading the sources of the index-basic and index-more plugins, or by using tools like Luke. During query parsing a WhitespaceAnalyzer is used; this choice was made to minimize the side effects of the Analyzer on the final set of query terms. You can use the Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.
If an additional level of control is required, an instance of PruneIndexTool.PruneChecker can be provided to check each document before it is deleted. The results of all checkers are logically AND-ed, which means that any checker in the chain can veto the deletion of the current document. Two example checker implementations are provided: PrintFieldsChecker prints the values of selected index fields, and StoreUrlsChecker stores the URLs of deleted documents to a file. Either can be activated by providing the respective command-line option.

The typical command-line usage is as follows:

PruneIndexTool index_dir -dryrun -queries queries.txt -showfields url,title
This command will only print out the url and title fields of matching documents; because of -dryrun, the index is not modified.
PruneIndexTool index_dir -queries queries.txt
This command will actually remove all matching entries, according to the queries read from the queries.txt file.

NOTE 2: This tool removes matching documents ONLY from segment indexes (or from a merged index). In particular it does NOT remove the pages and links from WebDB. This means that unwanted URLs may pop up again when new segments are created. To prevent this, use your own URLFilter, or PruneDBTool (under construction...).

NOTE 3: This tool uses a low-level Lucene interface to collect all matching documents. For large indexes and broad queries this may result in high memory consumption. If you encounter an OutOfMemoryError, try to narrow down your queries, or increase the heap size.

Author:
Andrzej Bialecki <ab@getopt.org>

Nested Class Summary
static class PruneIndexTool.PrintFieldsChecker
          This checker's main function is just to print out selected field values from each document, just before it is deleted.
static interface PruneIndexTool.PruneChecker
          This interface can be used to implement additional checking on matching documents.
static class PruneIndexTool.StoreUrlsChecker
          This checker's main function is just to store the URL of each document to be deleted in a text file.
 
Field Summary
static org.apache.commons.logging.Log LOG
           
static int LOG_STEP
          Log progress once every LOG_STEP processed documents.
 
Constructor Summary
PruneIndexTool(File[] indexDirs, Query[] queries, PruneIndexTool.PruneChecker[] checkers, boolean unlock, boolean dryrun)
          Create an instance of the tool, and open all input indexes.
 
Method Summary
static void main(String[] args)
           
static Query[] parseQueries(InputStream is)
          Read a list of Lucene queries from the stream (UTF-8 encoding is assumed).
 void run()
          For each query, find all matching documents and delete them from all input indexes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.apache.commons.logging.Log LOG

LOG_STEP

public static int LOG_STEP
Log progress once every LOG_STEP processed documents.

Constructor Detail

PruneIndexTool

public PruneIndexTool(File[] indexDirs,
                      Query[] queries,
                      PruneIndexTool.PruneChecker[] checkers,
                      boolean unlock,
                      boolean dryrun)
               throws Exception
Create an instance of the tool, and open all input indexes.

Parameters:
indexDirs - directories with input indexes. At least one valid index must exist, otherwise an Exception is thrown.
queries - pruning queries. Each query will be processed in turn, and the length of the array must be at least one, otherwise an Exception is thrown.
checkers - if not null, they will be used to perform additional checks on matching documents. For each matching document, each checker's PruneIndexTool.PruneChecker.isPrunable(Query, IndexReader, int) method will be called in turn; a return value of true means that the document should be deleted. A logical AND is performed on the results returned by all checkers (which means that if one of them returns false, the document will not be deleted).
unlock - if true, and if any of the input indexes is locked, forcibly unlock it. Use with care, only when you are sure that other processes don't modify the index at the same time.
dryrun - if set to true, don't change the index, just show what would be done. If false, perform all actions, changing indexes as needed. Note: dryrun doesn't prevent PruneCheckers from performing changes or causing any other side-effects.
Throws:
Exception
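
The AND-ing of checker results described for the checkers parameter can be illustrated with a minimal, self-contained sketch. It does not use the Nutch or Lucene classes: DocChecker is a hypothetical, simplified stand-in for PruneIndexTool.PruneChecker (whose real method is isPrunable(Query, IndexReader, int)), and CheckerChainSketch and allApprove are names invented for this example. The sketch stops at the first veto; whether the tool itself calls the remaining checkers after a veto is not stated here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for PruneIndexTool.PruneChecker (assumption: the
// real method takes a Query, an IndexReader and a document number).
interface DocChecker {
    boolean isPrunable(int docId);
}

final class CheckerChainSketch {

    // Returns true only if every checker approves deleting the document.
    // A null checker list means there is nobody to veto the deletion.
    static boolean allApprove(List<DocChecker> checkers, int docId) {
        if (checkers == null) {
            return true;
        }
        for (DocChecker checker : checkers) {
            if (!checker.isPrunable(docId)) {
                return false; // a single "false" vetoes the deletion
            }
        }
        return true;
    }

    public static void main(String[] args) {
        DocChecker always = docId -> true;
        DocChecker evenOnly = docId -> docId % 2 == 0;
        List<DocChecker> chain = Arrays.asList(always, evenOnly);

        System.out.println(allApprove(chain, 4)); // prints "true"
        System.out.println(allApprove(chain, 5)); // prints "false" (vetoed by evenOnly)
    }
}
```

Because checkers may have side effects even when dryrun is true, a real checker should be written so that calling it is safe in a dry run.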
Method Detail

run

public void run()
For each query, find all matching documents and delete them from all input indexes. Optionally, an additional check can be performed by using PruneIndexTool.PruneChecker implementations.

Specified by:
run in interface Runnable

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

parseQueries

public static Query[] parseQueries(InputStream is)
                            throws Exception
Read a list of Lucene queries from the stream (UTF-8 encoding is assumed). There should be a single Lucene query per line. Blank lines and comments starting with '#' are allowed.

NOTE: you may wish to use the Query.main(String[]) method to translate queries from Nutch format to Lucene format.

Parameters:
is - InputStream to read from
Returns:
array of Lucene queries
Throws:
Exception
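
The query file format accepted by parseQueries (UTF-8, one query per line, blank lines and '#' comments allowed) can be sketched as follows. This is an illustration only, not the tool's implementation: QueryFileSketch and readQueryLines are hypothetical names, the sketch returns the raw query strings instead of parsed Lucene Query objects, and it assumes that a comment is a line whose first non-blank character is '#'. The sample queries are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class QueryFileSketch {

    // Collects the non-blank, non-comment lines of a UTF-8 stream.
    // In the real tool each such line would then be parsed as a Lucene query.
    static List<String> readQueryLines(InputStream is) {
        List<String> queries = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                // Assumption: a comment is a line starting with '#'.
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                queries.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Made-up file content; field names and terms are illustrative.
        String file = "# pruning queries\n"
                + "\n"
                + "url:example\n"
                + "title:unwanted\n";
        InputStream in = new ByteArrayInputStream(
                file.getBytes(StandardCharsets.UTF_8));
        System.out.println(readQueryLines(in)); // prints "[url:example, title:unwanted]"
    }
}
```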


Copyright © 2006 The Apache Software Foundation