org.apache.nutch.indexer
Class IndexUtil

java.lang.Object
  extended by org.apache.nutch.indexer.IndexUtil

public class IndexUtil
extends Object

Utility to create an indexed document from a webpage.


Constructor Summary
IndexUtil(org.apache.hadoop.conf.Configuration conf)
           
 
Method Summary
 NutchDocument index(String key, WebPage page)
          Indexes a WebPage into a NutchDocument, adding the id, digest, batchId, and boost fields.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

IndexUtil

public IndexUtil(org.apache.hadoop.conf.Configuration conf)
Method Detail

index

public NutchDocument index(String key,
                           WebPage page)
Indexes a WebPage and adds the following fields to the resulting document:
  1. id: the default uniqueKey for the NutchDocument.
  2. digest: identifies pages (much like a unique ID) and is used to remove duplicates during the dedup procedure. It is calculated using MD5Signature or TextProfileSignature.
  3. batchId: the identifier of the batch the page belongs to.
  4. boost: used to calculate the document (field) score, which queries submitted to the underlying indexing library can use to find the best results. It is part of the scoring algorithms; see scoring.link, scoring.opic, scoring.tld, etc.
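
To illustrate why a content digest is useful for the dedup procedure, here is a minimal, self-contained sketch. It does not use Nutch's MD5Signature or TextProfileSignature classes; the class DigestSketch and its methods are hypothetical names, and hashing the UTF-8 bytes of the page text is an assumption made only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestSketch {

    // Hypothetical helper (not a Nutch API): hex-encoded MD5 of the page text.
    public static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                // Mask to an unsigned value so each byte prints as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Two pages count as duplicates here when their digests match.
    public static boolean isDuplicate(String contentA, String contentB) {
        return md5Hex(contentA).equals(md5Hex(contentB));
    }

    public static void main(String[] args) {
        String pageA = "Same body text";
        String pageB = "Same body text";
        String pageC = "Different body text";
        System.out.println(isDuplicate(pageA, pageB));
        System.out.println(isDuplicate(pageA, pageC));
    }
}
```

An exact hash such as MD5 only matches byte-identical content; a profile-based signature is the usual choice when near-duplicates should also collapse to one digest.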

Parameters:
key - The key of the page (the reversed URL).
page - The WebPage.
Returns:
The indexed document, or null if skipped by index filters.
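
The key parameter is a reversed URL. The sketch below shows one way such a key could be built using only the JDK. The class ReversedKeySketch and its reverseUrl method are hypothetical, and the key layout (reversed host, then scheme, optional port, path, and query) is an assumption about the format; Nutch 2.x is understood to provide its own helper for this (TableUtil.reverseUrl), which should be preferred in real code.

```java
import java.net.URI;

public class ReversedKeySketch {

    // Hypothetical helper (not a Nutch API). Assumed key layout:
    //   <reversed host>:<scheme>[:<port>]<path>[?<query>]
    public static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }

        // Reverse the dot-separated host labels: bar.foo.com -> com.foo.bar
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }

        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseUrl("http://bar.foo.com:8983/to/index.html?a=b"));
    }
}
```

Reversing the host keeps pages from the same domain adjacent when keys are sorted, which suits row-oriented stores. Because index may return null when index filters skip a page, callers should check the result before using it.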


Copyright © 2013 The Apache Software Foundation