URLMetaScoringFilter (apache-nutch 1.8 API)

java.lang.Object
- org.apache.hadoop.conf.Configured
- - org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter

All Implemented Interfaces:

org.apache.hadoop.conf.Configurable, Pluggable, ScoringFilter
```
public class URLMetaScoringFilter
extends org.apache.hadoop.conf.Configured
implements ScoringFilter
```
For documentation, see URLMetaIndexingFilter.

See Also:
URLMetaIndexingFilter

Field Summary
- Fields inherited from interface org.apache.nutch.scoring.ScoringFilter
  X_POINT_ID

Constructor Summary

Constructors
Constructor and Description

URLMetaScoringFilter()

Method Summary

Methods
Modifier and Type	Method and Description
`CrawlDatum`	`distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)` Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object.
`float`	`generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort)` Boilerplate
`org.apache.hadoop.conf.Configuration`	`getConf()` Boilerplate
`float`	`indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)` Boilerplate
`void`	`initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum)` Boilerplate
`void`	`injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum)` Boilerplate
`void`	`passScoreAfterParsing(org.apache.hadoop.io.Text url, Content content, Parse parse)` Takes the metadata, which was lumped inside the content, and replicates it within your parse data.
`void`	`passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content)` Takes the metadata, specified in your "urlmeta.tags" property, from the datum object and injects it into the content.
`void`	`setConf(org.apache.hadoop.conf.Configuration conf)` Handles conf assignment and reads the tag list from the "urlmeta.tags" property.
`void`	`updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)` Boilerplate

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - URLMetaScoringFilter
```
public URLMetaScoringFilter()
```
- Method Detail
  - distributeScoreToOutlinks
```
public CrawlDatum distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl,
                                   ParseData parseData,
                                   Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets,
                                   CrawlDatum adjust,
                                   int allCount)
                                     throws ScoringFilterException
```
    Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object. Any that exist are propagated into the metadata of each entry in your 'targets' Collection (the outlinks).
    
    Specified by:
    
    distributeScoreToOutlinks in interface ScoringFilter
    
    Parameters:
    fromUrl - url of the source page
    parseData - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
    targets - <url, CrawlDatum> pairs. NOTE: filters can modify this in-place, all changes will be persisted.
    adjust - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to CrawlDatum.STATUS_LINKED.
    allCount - number of all collected outlinks from the source page
    
    Returns:
    if needed, implementations may return an instance of CrawlDatum, with status CrawlDatum.STATUS_LINKED, which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed.
    
    Throws:
    
    ScoringFilterException
    See Also:
    ScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text, org.apache.nutch.parse.ParseData, java.util.Collection<java.util.Map.Entry<org.apache.hadoop.io.Text, org.apache.nutch.crawl.CrawlDatum>>, org.apache.nutch.crawl.CrawlDatum, int)
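
    The propagation described above can be sketched without any Nutch dependency. This is a minimal illustration, not the plugin's code: plain `Map<String, String>` objects stand in for the metadata of ParseData and CrawlDatum, and the class name `OutlinkTagSketch` and method `propagate` are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in; maps replace ParseData and CrawlDatum metadata. */
class OutlinkTagSketch {

    /** Copies every configured tag found in the parse metadata into each target's metadata. */
    static void propagate(List<String> tags,
                          Map<String, String> parseMeta,
                          Collection<Map.Entry<String, Map<String, String>>> targets) {
        if (tags == null || parseMeta == null || targets == null) {
            return; // nothing to propagate
        }
        for (Map.Entry<String, Map<String, String>> target : targets) {
            for (String tag : tags) {
                String value = parseMeta.get(tag);
                if (value == null) {
                    continue; // tag not present on the source page
                }
                target.getValue().put(tag, value); // modified in place
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        parseMeta.put("category", "news");

        List<Map.Entry<String, Map<String, String>>> targets = new ArrayList<>();
        targets.add(new AbstractMap.SimpleEntry<String, Map<String, String>>(
                "http://example.com/a", new HashMap<String, String>()));

        propagate(Arrays.asList("category", "language"), parseMeta, targets);
        System.out.println(targets.get(0).getValue()); // {category=news}
    }
}
```

    The targets are modified in place, which matches the note on the targets parameter that in-place changes are persisted. The sketch does not model the adjust argument or the return value.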
  - passScoreBeforeParsing
```
public void passScoreBeforeParsing(org.apache.hadoop.io.Text url,
                          CrawlDatum datum,
                          Content content)
```
    Takes the metadata, specified in your "urlmeta.tags" property, from the datum object and injects it into the content. It is later transferred to the parseData object.
    
    Specified by:
    
    passScoreBeforeParsing in interface ScoringFilter
    
    Parameters:
    url - url of the page
    datum - source datum. NOTE: modifications to this value are not persisted.
    content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
    See Also:
    ScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text, org.apache.nutch.crawl.CrawlDatum, org.apache.nutch.protocol.Content), passScoreAfterParsing(org.apache.hadoop.io.Text, org.apache.nutch.protocol.Content, org.apache.nutch.parse.Parse)
  - passScoreAfterParsing
```
public void passScoreAfterParsing(org.apache.hadoop.io.Text url,
                         Content content,
                         Parse parse)
```
    Takes the metadata, which was lumped inside the content, and replicates it within your parse data.
    
    Specified by:
    
    passScoreAfterParsing in interface ScoringFilter
    
    Parameters:
    url - page url
    content - original content. NOTE: modifications to this value are not persisted.
    parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
    See Also:
    passScoreBeforeParsing(org.apache.hadoop.io.Text, org.apache.nutch.crawl.CrawlDatum, org.apache.nutch.protocol.Content), ScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text, org.apache.nutch.protocol.Content, org.apache.nutch.parse.Parse)
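
    The two-step handoff performed by passScoreBeforeParsing and passScoreAfterParsing (datum to content, then content to parse data) can be illustrated as follows. This is a sketch under stated assumptions: the three maps stand in for the metadata of CrawlDatum, Content and Parse, and `MetaHandoffSketch` and `copyTags` are hypothetical names that do not exist in Nutch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in; maps replace CrawlDatum, Content and Parse metadata. */
class MetaHandoffSketch {

    /** Copies only the configured tags that are present in 'from' into 'to'. */
    static void copyTags(List<String> tags, Map<String, String> from, Map<String, String> to) {
        if (tags == null || from == null || to == null) {
            return;
        }
        for (String tag : tags) {
            String value = from.get(tag);
            if (value != null) {
                to.put(tag, value);
            }
        }
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("category");

        Map<String, String> datumMeta = new HashMap<>();
        datumMeta.put("category", "news");
        datumMeta.put("internal", "x"); // not listed in tags, so never copied

        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();

        copyTags(tags, datumMeta, contentMeta); // step modelled on passScoreBeforeParsing
        copyTags(tags, contentMeta, parseMeta); // step modelled on passScoreAfterParsing

        System.out.println(parseMeta); // {category=news}
    }
}
```

    The second hop matters because, per the parameter notes, modifications to content are not persisted after parsing, whereas the parse instance is the target that carries the values forward.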
  - generatorSortValue
```
public float generatorSortValue(org.apache.hadoop.io.Text url,
                       CrawlDatum datum,
                       float initSort)
                         throws ScoringFilterException
```
    Boilerplate
    
    Specified by:
    
    generatorSortValue in interface ScoringFilter
    
    Parameters:
    url - url of the page
    datum - page's datum, should not be modified
    initSort - initial sort value, or a value from previous filters in chain
    
    Throws:
    
    ScoringFilterException
  - indexerScore
```
public float indexerScore(org.apache.hadoop.io.Text url,
                 NutchDocument doc,
                 CrawlDatum dbDatum,
                 CrawlDatum fetchDatum,
                 Parse parse,
                 Inlinks inlinks,
                 float initScore)
                   throws ScoringFilterException
```
    Boilerplate
    
    Specified by:
    
    indexerScore in interface ScoringFilter
    
    Parameters:
    url - url of the page
    doc - Lucene document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
    dbDatum - current page from CrawlDb. NOTE: changes made to this instance are not persisted.
    fetchDatum - datum from FetcherOutput (containing among others the fetching status)
    parse - parsing result. NOTE: changes made to this instance are not persisted.
    inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
    initScore - initial boost value for the Lucene document.
    
    Returns:
    boost value for the Lucene document. This value is passed as an argument to the next scoring filter in chain. NOTE: implementations may also express other scoring strategies by modifying Lucene document directly.
    
    Throws:
    
    ScoringFilterException
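
    The Returns note says the boost value is passed to the next scoring filter in the chain. A simplified model of that chaining is sketched below. The `ScoreStep` interface and `runChain` method are hypothetical and reduce the real signature to a URL and a score; treating a "Boilerplate" implementation as a pass-through that returns its input unchanged is an assumption, not something this page states.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified model of chained indexerScore calls. */
class IndexerScoreChainSketch {

    /** One filter's scoring step, reduced to url and score for illustration. */
    interface ScoreStep {
        float apply(String url, float initScore);
    }

    /** Feeds each step's return value to the next step as its initial score. */
    static float runChain(List<ScoreStep> steps, String url, float initScore) {
        float score = initScore;
        for (ScoreStep step : steps) {
            score = step.apply(url, score);
        }
        return score;
    }

    public static void main(String[] args) {
        ScoreStep passThrough = (url, s) -> s;     // assumed "boilerplate" behaviour
        ScoreStep doubling = (url, s) -> s * 2.0f; // some other filter in the chain

        float result = runChain(Arrays.asList(passThrough, doubling),
                "http://example.com/", 1.5f);
        System.out.println(result); // 3.0
    }
}
```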
  - initialScore
```
public void initialScore(org.apache.hadoop.io.Text url,
                CrawlDatum datum)
                  throws ScoringFilterException
```
    Boilerplate
    
    Specified by:
    
    initialScore in interface ScoringFilter
    
    Parameters:
    url - url of the page
    datum - new datum. Filters will modify it in-place.
    
    Throws:
    
    ScoringFilterException
  - injectedScore
```
public void injectedScore(org.apache.hadoop.io.Text url,
                 CrawlDatum datum)
                   throws ScoringFilterException
```
    Boilerplate
    
    Specified by:
    
    injectedScore in interface ScoringFilter
    
    Parameters:
    url - url of the page
    datum - new datum. Filters will modify it in-place.
    
    Throws:
    
    ScoringFilterException
  - updateDbScore
```
public void updateDbScore(org.apache.hadoop.io.Text url,
                 CrawlDatum old,
                 CrawlDatum datum,
                 List<CrawlDatum> inlinked)
                   throws ScoringFilterException
```
    Boilerplate
    
    Specified by:
    
    updateDbScore in interface ScoringFilter
    
    Parameters:
    url - url of the page
    old - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values - the datum parameter may contain values that are no longer valid, if other updates occurred between generation and this update.
    datum - the new datum, with the original score saved at the time when fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
    inlinked - (partial) list of CrawlDatum-s (with their scores) from links pointing to this page, found in the current update batch.
    
    Throws:
    
    ScoringFilterException
  - setConf
```
public void setConf(org.apache.hadoop.conf.Configuration conf)
```
    Handles conf assignment and reads the tag list from the "urlmeta.tags" property.
    
    Specified by:
    
    setConf in interface org.apache.hadoop.conf.Configurable
    
    Overrides:
    
    setConf in class org.apache.hadoop.conf.Configured
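
    Reading the tag list from the "urlmeta.tags" property might look like the sketch below. It assumes the property holds a comma-delimited list (for example "tag1,tag2,tag3") and uses `java.util.Properties` as a stand-in for the Hadoop Configuration; `UrlMetaTagsSketch` and `readTags` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical stand-in; Properties replaces the Hadoop Configuration. */
class UrlMetaTagsSketch {

    static final String CONF_PROPERTY = "urlmeta.tags";

    /** Splits the property on commas, trimming whitespace and dropping empty entries. */
    static List<String> readTags(Properties conf) {
        List<String> tags = new ArrayList<>();
        String raw = conf.getProperty(CONF_PROPERTY);
        if (raw == null) {
            return tags; // property not set: no tags to track
        }
        for (String part : raw.split(",")) {
            String tag = part.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(CONF_PROPERTY, "category, language");
        System.out.println(readTags(conf)); // [category, language]
    }
}
```

    Trimming whitespace here is a choice made for the sketch; whether the real configuration tolerates padded tags is not stated on this page.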
  - getConf
```
public org.apache.hadoop.conf.Configuration getConf()
```
    Boilerplate
    
    Specified by:
    
    getConf in interface org.apache.hadoop.conf.Configurable
    
    Overrides:
    
    getConf in class org.apache.hadoop.conf.Configured
