org.apache.nutch.indexer
Class IndexUtil
java.lang.Object
org.apache.nutch.indexer.IndexUtil
public class IndexUtil
extends Object
Utility to create an indexed document from a webpage.
Constructor Summary
IndexUtil(org.apache.hadoop.conf.Configuration conf)
Method Summary
NutchDocument index(String key, WebPage page)
Index a WebPage, adding the following fields:
id: default uniqueKey for the NutchDocument.
digest: identifies a page (like a unique ID) and is used to remove duplicates during the dedup procedure.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
IndexUtil
public IndexUtil(org.apache.hadoop.conf.Configuration conf)
index
public NutchDocument index(String key,
WebPage page)
- Indexes a WebPage and adds the following fields:
- id: the default uniqueKey for the NutchDocument.
- digest: identifies the page (like a unique ID) and is used to remove duplicates during the dedup procedure. It is calculated using MD5Signature or TextProfileSignature.
- batchId: the identifier of the batch the page belongs to.
- boost: used to calculate the document (field) score, which queries submitted to the underlying indexing library can use to find the best results. It is part of the scoring algorithms; see scoring.link, scoring.opic, scoring.tld, etc.
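As a rough illustration of how a digest field can drive deduplication, the sketch below hashes page content with MD5 and keeps only the first key seen for each digest. It uses the JDK's `MessageDigest` directly and is not Nutch's MD5Signature or TextProfileSignature implementation; the class `DigestDedupSketch` and the methods `md5Hex` and `dedup` are hypothetical names introduced for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: shows digest-based deduplication.
class DigestDedupSketch {

    // Returns the MD5 hash of the content as a lowercase hex string.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Maps each distinct digest to the first page key that produced it;
    // later pages with identical content are treated as duplicates and dropped.
    static Map<String, String> dedup(Map<String, String> contentByKey) {
        Map<String, String> firstKeyByDigest = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : contentByKey.entrySet()) {
            firstKeyByDigest.putIfAbsent(md5Hex(e.getValue()), e.getKey());
        }
        return firstKeyByDigest;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("com.example.www:http/a", "same body");
        pages.put("com.example.www:http/b", "same body");
        pages.put("com.example.www:http/c", "other body");
        // Two distinct digests remain; the page under ".../b" is a duplicate.
        System.out.println(dedup(pages).values());
    }
}
```

An exact hash such as MD5 only catches byte-identical content; a profile-based signature is the usual choice when near-duplicates should also collapse.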
- Parameters:
key - The key of the page (reversed URL).
page - The WebPage.
- Returns:
- The indexed document, or null if skipped by index filters.
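The key parameter is described above as a reversed URL. The sketch below shows one way such a key could be built, so that pages from the same domain sort next to each other. The exact layout Nutch uses is not stated on this page; the layout here (reversed host, then scheme, then optional port, then path and query) is an assumption for illustration, and `ReversedUrlKeySketch` and `reverseUrl` are hypothetical names, not Nutch API.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: builds a reversed-host key from a URL.
class ReversedUrlKeySketch {

    // Assumed layout: <reversed host>:<scheme>[:<port>]<path>[?<query>]
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        // Reverse the dot-separated host labels: www.example.com -> com.example.www
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // prints com.example.www:http/a/b.html?x=1
        System.out.println(reverseUrl("http://www.example.com/a/b.html?x=1"));
    }
}
```

Because index may return null when index filters skip the page, callers that pass such a key should check the result before using the returned document.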
Copyright © 2013 The Apache Software Foundation