java.lang.Object org.apache.nutch.tools.PruneIndexTool
public class PruneIndexTool
This tool prunes existing Nutch indexes of unwanted content. The main method accepts a list of segment directories (containing indexes). These indexes will be pruned of any content that matches one or more queries from a list of Lucene queries read from a file (defined in the standard config file, or explicitly overridden from the command line). Segments should already be indexed; any segments that are missing indexes will be skipped.
NOTE 1: Queries are expressed in Lucene's QueryParser syntax, so knowledge of the available Lucene document fields is required. This can be obtained by reading the sources of the index-basic and index-more plugins, or by using tools like Luke. During query parsing a WhitespaceAnalyzer is used - this choice has been made to minimize side effects of the Analyzer on the final set of query terms. You can use the Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.
If an additional level of control is required, an instance of PruneIndexTool.PruneChecker can be provided to check each document before it is deleted. The results of all checkers are logically AND-ed, which means that any checker in the chain can veto the deletion of the current document. Two example checker implementations are provided - PrintFieldsChecker prints the values of selected index fields, and StoreUrlsChecker stores the URLs of deleted documents to a file. Either of them can be activated by providing the respective command-line options.
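The AND-ed checker chain described above can be sketched in plain Java. This is only an illustration: `DocChecker`, `allAgree` and the URL-based signature are hypothetical stand-ins, since the real `PruneIndexTool.PruneChecker.isPrunable` takes `(Query, IndexReader, int)`. The sketch also assumes that every checker is called even after one has vetoed; the documentation above does not say whether the tool short-circuits.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the checker chain; NOT the real Nutch/Lucene API. */
final class CheckerChainSketch {

    /** Hypothetical checker; the real isPrunable takes (Query, IndexReader, int). */
    interface DocChecker {
        boolean isPrunable(String url);
    }

    /**
     * Returns true only if every checker allows the deletion.
     * A null or empty chain allows it (assumption of this sketch).
     */
    static boolean allAgree(List<DocChecker> checkers, String url) {
        if (checkers == null) {
            return true;
        }
        boolean prunable = true;
        for (DocChecker checker : checkers) {
            // "&=" always evaluates the right-hand side, so checkers with
            // side effects (printing, recording URLs) still run after a veto.
            prunable &= checker.isPrunable(url);
        }
        return prunable;
    }

    public static void main(String[] args) {
        List<String> recorded = new ArrayList<>();
        List<DocChecker> chain = new ArrayList<>();
        // Vetoes deletion of anything under /keep/ (illustrative rule).
        chain.add(url -> !url.contains("/keep/"));
        // Records every URL it is asked about and never vetoes.
        chain.add(url -> { recorded.add(url); return true; });

        System.out.println(allAgree(chain, "http://example.com/spam"));
        System.out.println(allAgree(chain, "http://example.com/keep/a"));
        System.out.println(recorded);
    }
}
```

A single `false` from any checker is enough to keep the document, which is the veto behaviour the text describes.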
The typical command-line usage is as follows:
PruneIndexTool index_dir -dryrun -queries queries.txt -showfields url,title
This command will just print out fields of matching documents.
PruneIndexTool index_dir -queries queries.txt
This command will actually remove all matching entries, according to the queries read from the queries.txt file.
NOTE 2: This tool removes matching documents ONLY from segment indexes (or from a merged index). In particular it does NOT remove the pages and links from WebDB. This means that unwanted URLs may pop up again when new segments are created. To prevent this, use your own URLFilter, or PruneDBTool (under construction...).
NOTE 3: This tool uses a low-level Lucene interface to collect all matching documents. For large indexes and broad queries this may result in high memory consumption. If you encounter OutOfMemory exceptions, try to narrow down your queries, or increase the heap size.
Nested Class Summary | |
---|---|
static class | PruneIndexTool.PrintFieldsChecker: This checker's main function is just to print out selected field values from each document, just before they are deleted. |
static interface | PruneIndexTool.PruneChecker: This interface can be used to implement additional checking on matching documents. |
static class | PruneIndexTool.StoreUrlsChecker: This checker's main function is just to store the URLs of each document to be deleted in a text file. |
Field Summary | |
---|---|
static org.apache.commons.logging.Log | LOG |
static int | LOG_STEP: Log the progress every LOG_STEP number of processed documents. |
Constructor Summary | |
---|---|
PruneIndexTool(File[] indexDirs, Query[] queries, PruneIndexTool.PruneChecker[] checkers, boolean unlock, boolean dryrun) | Create an instance of the tool, and open all input indexes. |
Method Summary | |
---|---|
static void | main(String[] args) |
static Query[] | parseQueries(InputStream is): Read a list of Lucene queries from the stream (UTF-8 encoding is assumed). |
void | run(): For each query, find all matching documents and delete them from all input indexes. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final org.apache.commons.logging.Log LOG
public static int LOG_STEP
Constructor Detail |
---|
public PruneIndexTool(File[] indexDirs, Query[] queries, PruneIndexTool.PruneChecker[] checkers, boolean unlock, boolean dryrun) throws Exception
Create an instance of the tool, and open all input indexes.
Parameters:
indexDirs - directories with input indexes. At least one valid index must exist, otherwise an Exception is thrown.
queries - pruning queries. Each query will be processed in turn, and the length of the array must be at least one, otherwise an Exception is thrown.
checkers - if not null, they will be used to perform additional checks on matching documents - each checker's method PruneIndexTool.PruneChecker.isPrunable(Query, IndexReader, int) will be called in turn, for each matching document, and if it returns true this means that the document should be deleted. A logical AND is performed on the results returned by all checkers (which means that if one of them returns false, the document will not be deleted).
unlock - if true, and if any of the input indexes is locked, forcibly unlock it. Use with care, only when you are sure that other processes don't modify the index at the same time.
dryrun - if set to true, don't change the index, just show what would be done. If false, perform all actions, changing indexes as needed. Note: dryrun doesn't prevent PruneCheckers from performing changes or causing any other side-effects.
Throws:
Exception
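The dryrun semantics described for the constructor can be illustrated with a small self-contained sketch. Everything here is hypothetical: the index is modelled as a set of document numbers, the checker as an `IntPredicate`, and `prune` is an invented helper rather than anything in Nutch or Lucene. The point it tries to show is that a dry run leaves the index untouched while checkers are still invoked, so their side effects still happen.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntPredicate;

/** Toy model of a dry-run aware pruning loop; NOT the real PruneIndexTool. */
final class DryRunSketch {

    /**
     * Returns the number of documents that were deleted, or that would
     * have been deleted when dryrun is true.
     */
    static int prune(Set<Integer> index, List<Integer> matches,
                     IntPredicate checker, boolean dryrun) {
        int count = 0;
        for (Integer doc : matches) {
            if (!index.contains(doc)) {
                continue; // document is not (or no longer) in the index
            }
            // The checker runs in both modes, so any side effect it has
            // (printing, writing a file) is not suppressed by dryrun.
            if (!checker.test(doc)) {
                continue; // vetoed
            }
            if (!dryrun) {
                index.remove(doc); // only a real run changes the index
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Set<Integer> index = new TreeSet<>(Arrays.asList(1, 2, 3, 4));
        List<Integer> matches = Arrays.asList(2, 3);

        int wouldDelete = prune(index, matches, doc -> true, true);
        System.out.println("dry run: " + wouldDelete + " matches, index = " + index);

        int deleted = prune(index, matches, doc -> true, false);
        System.out.println("real run: " + deleted + " deleted, index = " + index);
    }
}
```

Running the dry run first and the real run second mirrors the two command lines shown in the class description.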
Method Detail |
---|
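public void run()
For each query, find all matching documents and delete them from all input indexes. Optionally, an additional check can be performed by using PruneIndexTool.PruneChecker implementations.
Specified by:
run in interface Runnable
public static void main(String[] args) throws Exception
Throws:
Exception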
public static Query[] parseQueries(InputStream is) throws Exception
Read a list of Lucene queries from the stream (UTF-8 encoding is assumed).
NOTE: you may wish to use the Query.main(String[]) method to translate queries from Nutch format to Lucene format.
Parameters:
is - InputStream to read from
Throws:
Exception
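As a rough illustration of the input side of parseQueries, the sketch below reads query strings from a stream decoded as UTF-8. It is not the Nutch implementation and does not call Lucene: `readQueryLines` is a hypothetical helper, and the one-query-per-line layout and the skipping of blank lines are assumptions of this sketch. The sample query strings are illustrative only, using the url and title field names that appear in the -showfields example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Reads raw query strings from a stream; parsing into Query objects is omitted. */
final class QueryLinesSketch {

    /** Hypothetical helper: one query string per non-blank line, decoded as UTF-8. */
    static List<String> readQueryLines(InputStream is) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // assumption: blank lines carry no query
                }
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        String sample = "url:spam\n\n  title:casino  \n";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (String query : readQueryLines(in)) {
            System.out.println(query);
        }
    }
}
```

In the real tool each string would then be handed to Lucene's QueryParser with a WhitespaceAnalyzer, as NOTE 1 in the class description explains.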