org.apache.nutch.collection
Class Subcollection

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.collection.Subcollection
All Implemented Interfaces:
Configurable, URLFilter, Pluggable

public class Subcollection
extends Configured
implements URLFilter

A Subcollection represents a subset of the index. You define URL patterns that indicate whether a particular page (URL) is part of the Subcollection.


Field Summary
static String TAG_BLACKLIST
           
static String TAG_COLLECTION
           
static String TAG_COLLECTIONS
           
static String TAG_ID
           
static String TAG_KEY
           
static String TAG_NAME
           
static String TAG_WHITELIST
           
 
Fields inherited from interface org.apache.nutch.net.URLFilter
X_POINT_ID
 
Constructor Summary
Subcollection(Configuration conf)
           
Subcollection(String id, String name, Configuration conf)
          Creates a Subcollection with the given id and name.
Subcollection(String id, String name, String key, Configuration conf)
          Creates a Subcollection with the given id, name and key.
 
Method Summary
 String filter(String urlString)
          Simple "indexOf" filter for matching patterns.
 String getBlackListString()
          Returns blacklist String
 String getId()
           
 String getKey()
           
 String getName()
           
 ArrayList getWhiteList()
          Returns whitelist
 String getWhiteListString()
          Returns whitelist String
 void initialize(Element collection)
          Initializes the Subcollection from a DOM element.
protected  void parseList(ArrayList list, String text)
          Creates a list of patterns from a chunk of text. Patterns are separated by newlines.
 void setBlackList(String list)
          Set contents of blacklist from String
 void setWhiteList(ArrayList whiteList)
           
 void setWhiteList(String list)
          Set contents of whitelist from String
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

TAG_COLLECTIONS

public static final String TAG_COLLECTIONS
See Also:
Constant Field Values

TAG_COLLECTION

public static final String TAG_COLLECTION
See Also:
Constant Field Values

TAG_WHITELIST

public static final String TAG_WHITELIST
See Also:
Constant Field Values

TAG_BLACKLIST

public static final String TAG_BLACKLIST
See Also:
Constant Field Values

TAG_NAME

public static final String TAG_NAME
See Also:
Constant Field Values

TAG_KEY

public static final String TAG_KEY
See Also:
Constant Field Values

TAG_ID

public static final String TAG_ID
See Also:
Constant Field Values
Constructor Detail

Subcollection

public Subcollection(String id,
                     String name,
                     Configuration conf)
Creates a Subcollection with the given id and name.

Parameters:
id - id of the Subcollection
name - name of the Subcollection
conf - the Configuration to use

Subcollection

public Subcollection(String id,
                     String name,
                     String key,
                     Configuration conf)
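Creates a Subcollection with the given id, name and key.

Parameters:
id - id of the Subcollection
name - name of the Subcollection
key - key of the Subcollection
conf - the Configuration to use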

Subcollection

public Subcollection(Configuration conf)
Method Detail

getName

public String getName()
Returns:
the name

getKey

public String getKey()
Returns:
the key

getId

public String getId()
Returns:
the id

getWhiteList

public ArrayList getWhiteList()
Returns whitelist

Returns:
Whitelist entries

getWhiteListString

public String getWhiteListString()
Returns whitelist String

Returns:
Whitelist String

getBlackListString

public String getBlackListString()
Returns blacklist String

Returns:
Blacklist String

setWhiteList

public void setWhiteList(ArrayList whiteList)
Parameters:
whiteList - The whiteList to set.

filter

public String filter(String urlString)
Simple "indexOf" filter for matching patterns.
  The rules are evaluated in the following order:
  1. if a blacklist pattern matches, the URL is rejected
  2. if a whitelist pattern matches, the URL is allowed
  3. otherwise the URL is rejected
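
The evaluation order above can be sketched in a few lines of standalone Java. This is an illustration only, not the Nutch source: the class name IndexOfFilterSketch, the list-based signature, and the convention of returning the URL when it is allowed and null when it is rejected are all assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch of the blacklist-then-whitelist evaluation order.
// The class and method signature are hypothetical, not part of Nutch.
final class IndexOfFilterSketch {

    // Returns the URL when it is allowed and null when it is rejected
    // (the null-for-rejected convention is an assumption of this sketch).
    static String filter(String url, List<String> whiteList, List<String> blackList) {
        // Rule 1: any blacklist pattern found in the URL rejects it.
        for (String pattern : blackList) {
            if (url.indexOf(pattern) != -1) {
                return null;
            }
        }
        // Rule 2: any whitelist pattern found in the URL allows it.
        for (String pattern : whiteList) {
            if (url.indexOf(pattern) != -1) {
                return url;
            }
        }
        // Rule 3: no pattern matched, so the URL is rejected.
        return null;
    }

    public static void main(String[] args) {
        List<String> white = Arrays.asList("example.org/docs");
        List<String> black = Arrays.asList("/private/");
        System.out.println(filter("http://example.org/docs/a.html", white, black));
        System.out.println(filter("http://example.org/docs/private/b.html", white, black));
        System.out.println(filter("http://other.example/index.html", white, black));
    }
}
```

Because the blacklist is checked first, a URL that matches both lists is rejected, and a URL that matches neither list is rejected as well.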
 

Specified by:
filter in interface URLFilter
See Also:
URLFilter.filter(java.lang.String)

initialize

public void initialize(Element collection)
Initializes the Subcollection from a DOM element.

Parameters:
collection - the DOM element describing the Subcollection
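
As a rough illustration of reading values from a DOM element, the sketch below parses a small XML fragment with the JDK's own DOM parser and extracts the text of named child elements. It does not call Nutch. The element names used here (subcollection, id, name, whitelist, blacklist) are assumptions made for the example, since this page does not list the values of the TAG_* constants, and the helper class DomInitSketch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; element names in main() are assumptions, not Nutch constants.
final class DomInitSketch {

    // Parses an XML string and returns its root element.
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the trimmed text of the first descendant element with the given
    // tag name, or an empty string when no such element exists.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        Element collection = parse(
                "<subcollection>"
                + "<id>docs</id>"
                + "<name>Documentation</name>"
                + "<whitelist>http://example.org/docs</whitelist>"
                + "<blacklist>/private/</blacklist>"
                + "</subcollection>");
        System.out.println(childText(collection, "id"));
        System.out.println(childText(collection, "whitelist"));
        System.out.println(childText(collection, "blacklist"));
    }
}
```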

parseList

protected void parseList(ArrayList list,
                         String text)
Creates a list of patterns from a chunk of text. Patterns are separated by newlines.

Parameters:
list - the list that receives the parsed patterns
text - the text containing the patterns, one per line
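
Splitting newline-separated text into a pattern list might look like the following standalone sketch. It is not the Nutch implementation: the class name PatternListSketch is hypothetical, and trimming each line, skipping blank lines, and appending to the list without clearing it first are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the documented idea of parseList.
final class PatternListSketch {

    // Splits text on line breaks and appends each non-empty, trimmed line to list.
    static void parseList(List<String> list, String text) {
        for (String line : text.split("\\R")) {
            String pattern = line.trim();
            if (!pattern.isEmpty()) {
                list.add(pattern);
            }
        }
    }

    public static void main(String[] args) {
        List<String> patterns = new ArrayList<>();
        parseList(patterns, "http://example.org/docs\n  /archive/  \n\n.pdf");
        System.out.println(patterns);
    }
}
```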

setBlackList

public void setBlackList(String list)
Set contents of blacklist from String

Parameters:
list - the blacklist contents

setWhiteList

public void setWhiteList(String list)
Set contents of whitelist from String

Parameters:
list - the whitelist contents


Copyright © 2012 The Apache Software Foundation