IndexUtil (apache-nutch 2.2.1 API)

org.apache.nutch.indexer
Class IndexUtil

java.lang.Object
  org.apache.nutch.indexer.IndexUtil

Utility to create an indexed document from a webpage.

Constructor Summary
`IndexUtil(org.apache.hadoop.conf.Configuration conf)`

Method Summary
`NutchDocument`	`index(String key, WebPage page)` Indexes a `WebPage`, adding the following fields: `id`: the default uniqueKey for the `NutchDocument`. `digest`: identifies a page (like a unique ID) and is used to remove duplicates during the dedup procedure.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

public IndexUtil(org.apache.hadoop.conf.Configuration conf)

Method Detail

public NutchDocument index(String key,
                           WebPage page)

Indexes a WebPage, adding the following fields:

id: The default uniqueKey for the NutchDocument.
digest: Identifies a page (like a unique ID) and is used to remove duplicates during the dedup procedure. It is calculated using MD5Signature or TextProfileSignature.
batchId: The identifier of the batch that the page belongs to.
boost: Used to calculate the document (field) score, which queries submitted to the underlying indexing library can use to find the best results. It is part of the scoring algorithms. See scoring.link, scoring.opic, scoring.tld, etc.
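
To make the role of the digest field concrete, the following is a minimal, self-contained sketch of digest-based duplicate detection. It uses only the JDK and does not call Nutch. The class name DigestSketch, the helper md5Hex, and the sample URLs are hypothetical, and hashing the raw content bytes is an assumption made for illustration; it is not claimed to match what MD5Signature or TextProfileSignature actually feed into the hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Nutch API.
final class DigestSketch {

    // Returns the MD5 hash of the given bytes as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample data: {url, page content}.
        String[][] pages = {
            {"http://a.example/", "same body"},
            {"http://b.example/", "same body"},
            {"http://c.example/", "other body"},
        };

        // Maps each digest to the first URL seen with that digest.
        Map<String, String> firstSeen = new HashMap<>();
        for (String[] page : pages) {
            String digest = md5Hex(page[1].getBytes(StandardCharsets.UTF_8));
            String earlier = firstSeen.putIfAbsent(digest, page[0]);
            if (earlier != null) {
                System.out.println(page[0] + " duplicates " + earlier);
            }
        }
    }
}
```

Pages with identical content produce identical digests, so a dedup step can keep one document per digest value and discard the rest.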

Parameters:
key - The key of the page (reversed URL).
page - The WebPage.
Returns:
The indexed document, or null if skipped by index filters.
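
The key parameter is described as a reversed URL. The sketch below shows one way such a key could be built with the JDK alone: the host labels are reversed so that pages from the same domain sort next to each other, then the scheme, optional port, path, and query are appended. The class ReversedUrlKeySketch and its reverseUrl method are hypothetical, and the exact layout (including the order of scheme and port) is an assumption for illustration. Nutch 2.x has its own helper class for this, org.apache.nutch.util.TableUtil, which should be used in practice and whose output may differ from this sketch.

```java
import java.net.URI;

// Hypothetical illustration only; not the Nutch implementation.
final class ReversedUrlKeySketch {

    // Builds a key such as "com.foo.bar:http/to/index.html?a=b"
    // from "http://bar.foo.com/to/index.html?a=b".
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }

        // Reverse the dot-separated host labels: bar.foo.com -> com.foo.bar
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }

        // Append the scheme, then the port if one was given (order assumed).
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }

        // Append the path and, if present, the query string.
        String path = uri.getRawPath();
        key.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // prints "com.foo.bar:http/to/index.html?a=b"
        System.out.println(reverseUrl("http://bar.foo.com/to/index.html?a=b"));
    }
}
```

A caller would pass a key of this kind together with the stored WebPage to index(String, WebPage), and should check the result for null, since index filters may skip the page.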