public class RegexURLNormalizer extends Configured implements URLNormalizer
This class reads the urlnormalizer.regex.file property, which should be set to the name of an XML file containing the patterns and substitutions to apply to encountered URLs.
This class also supports different rules depending on the scope. See the Javadoc of URLNormalizers for more details.
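As an illustration of the rule file described above, the following is a minimal sketch that reads pattern and substitution pairs from an XML string using only the JDK. The element names (regex-normalize, regex, pattern, substitution) are an assumption about the file layout, and RuleFileSketch, its Rule type and readRules are hypothetical names, not part of this class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical stand-alone reader for a rule file; not the Nutch implementation. */
final class RuleFileSketch {

  /** One pattern/substitution pair (hypothetical stand-in for the nested Rule type). */
  static final class Rule {
    final Pattern pattern;
    final String substitution;

    Rule(Pattern pattern, String substitution) {
      this.pattern = pattern;
      this.substitution = substitution;
    }
  }

  // Assumed file layout: one <regex> element per rule, each holding
  // a <pattern> and a <substitution>.
  static final String SAMPLE_XML =
      "<regex-normalize>"
          + "<regex><pattern>;jsessionid=[0-9A-Za-z]+</pattern>"
          + "<substitution></substitution></regex>"
          + "<regex><pattern>#.*$</pattern>"
          + "<substitution></substitution></regex>"
          + "</regex-normalize>";

  /** Parses the XML text and compiles every rule, in document order. */
  static List<Rule> readRules(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      NodeList regexes = doc.getDocumentElement().getElementsByTagName("regex");
      List<Rule> rules = new ArrayList<>();
      for (int i = 0; i < regexes.getLength(); i++) {
        Element regex = (Element) regexes.item(i);
        String pattern = regex.getElementsByTagName("pattern").item(0).getTextContent();
        String substitution =
            regex.getElementsByTagName("substitution").item(0).getTextContent();
        rules.add(new Rule(Pattern.compile(pattern), substitution));
      }
      return rules;
    } catch (Exception e) {
      // Wrap parser and regex errors so callers of this sketch need no checked exceptions.
      throw new IllegalStateException("could not read rule file", e);
    }
  }

  public static void main(String[] args) {
    for (Rule rule : readRules(SAMPLE_XML)) {
      System.out.println(rule.pattern.pattern() + " -> '" + rule.substitution + "'");
    }
  }
}
```

Rules are kept in document order here, so a later rule sees the output of an earlier one when they are applied in sequence.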
Constructor | Description |
---|---|
RegexURLNormalizer() | The default constructor, called from getNormalizer() in UrlNormalizerFactory through normalizerClass.newInstance(). |
RegexURLNormalizer(Configuration conf) | |
RegexURLNormalizer(Configuration conf, String filename) | Constructor that takes the rule file name directly, so the name is not looked up in the configuration files. |
Modifier and Type | Method | Description |
---|---|---|
HashMap<String,List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> | getScopedRules() | |
static void | main(String[] args) | Prints the patterns and substitutions that are in the configuration file. |
String | normalize(String urlString, String scope) | |
String | regexNormalize(String urlString, String scope) | Performs the replacements by iterating through all the regex patterns. |
void | setConf(Configuration conf) | |
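Methods inherited from class Configured: getConf
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Configurable: getConf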
public RegexURLNormalizer()
public RegexURLNormalizer(Configuration conf)
public RegexURLNormalizer(Configuration conf, String filename) throws IOException, PatternSyntaxException
Throws: IOException, PatternSyntaxException
public HashMap<String,List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> getScopedRules()
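The return type above maps a scope name to a list of rules. A minimal sketch of how such a map could drive per-scope normalization follows. The fallback to a "default" entry for unknown scopes is an assumption made for illustration, and ScopedRulesSketch, its Rule type and addRule are hypothetical names, not part of this class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of per-scope rule lookup; not the Nutch implementation. */
final class ScopedRulesSketch {

  /** Hypothetical stand-in for the nested Rule type named in the signature. */
  static final class Rule {
    final Pattern pattern;
    final String substitution;

    Rule(String regex, String substitution) {
      this.pattern = Pattern.compile(regex);
      this.substitution = substitution;
    }
  }

  // Assumed name of the scope used when no scope-specific rules exist.
  static final String DEFAULT_SCOPE = "default";

  private final Map<String, List<Rule>> scopedRules = new HashMap<>();

  /** Appends a rule to the list registered for the given scope. */
  void addRule(String scope, String regex, String substitution) {
    scopedRules.computeIfAbsent(scope, k -> new ArrayList<>())
        .add(new Rule(regex, substitution));
  }

  /** Applies the rules of the scope, or the default rules if the scope is unknown. */
  String normalize(String url, String scope) {
    List<Rule> rules = scopedRules.get(scope);
    if (rules == null) {
      rules = scopedRules.getOrDefault(DEFAULT_SCOPE, new ArrayList<>());
    }
    String result = url;
    for (Rule rule : rules) {
      result = rule.pattern.matcher(result).replaceAll(rule.substitution);
    }
    return result;
  }

  public static void main(String[] args) {
    ScopedRulesSketch sketch = new ScopedRulesSketch();
    sketch.addRule(DEFAULT_SCOPE, "#.*$", "");
    sketch.addRule("outlink", "/index\\.html$", "/");
    // "outlink" has its own rules; "fetcher" has none and falls back to the default list.
    System.out.println(sketch.normalize("http://example.com/index.html", "outlink"));
    System.out.println(sketch.normalize("http://example.com/index.html#a", "fetcher"));
  }
}
```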
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
Overrides: setConf in class Configured
public String regexNormalize(String urlString, String scope)
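The summary describes this method as doing the replacements by iterating through all the regex patterns. A minimal sketch of that idea, using only java.util.regex, is shown below. RegexPassSketch, its Rule type and applyAll are hypothetical names, the sample rules are invented for illustration, and the real method may differ in details such as scope handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of a sequential regex replacement pass; not the Nutch implementation. */
final class RegexPassSketch {

  /** One compiled pattern with its replacement text (hypothetical type). */
  static final class Rule {
    final Pattern pattern;
    final String substitution;

    Rule(String regex, String substitution) {
      this.pattern = Pattern.compile(regex);
      this.substitution = substitution;
    }
  }

  /** Applies every rule in list order; each rule sees the output of the previous one. */
  static String applyAll(String url, List<Rule> rules) {
    String result = url;
    for (Rule rule : rules) {
      result = rule.pattern.matcher(result).replaceAll(rule.substitution);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Rule> rules = new ArrayList<>();
    rules.add(new Rule(";jsessionid=[0-9A-Za-z]+", "")); // drop a session id
    rules.add(new Rule("#.*$", ""));                     // drop the fragment
    // prints "http://example.com/a?x=1"
    System.out.println(applyAll("http://example.com/a;jsessionid=ABC123?x=1#top", rules));
  }
}
```

Because the rules run in sequence, their order in the list matters: a rule can match text that only exists after an earlier substitution.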
public String normalize(String urlString, String scope) throws MalformedURLException
Specified by: normalize in interface URLNormalizer
Throws: MalformedURLException
public static void main(String[] args) throws PatternSyntaxException, IOException
Throws: PatternSyntaxException, IOException
Copyright © 2015 The Apache Software Foundation