org.apache.nutch.parse
Class OutlinkExtractor
java.lang.Object
org.apache.nutch.parse.OutlinkExtractor
public class OutlinkExtractor
- extends Object
Extractor to extract Outlinks / URLs from plain text using regular expressions.
- Since:
- 0.7
- Version:
- 1.0
- Author:
- Stephan Strittmatter - http://www.sybit.de
- See Also:
- Comparison of different regexp-Implementations, Overview about Java Regexp APIs
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
OutlinkExtractor
public OutlinkExtractor()
getOutlinks
public static Outlink[] getOutlinks(String plainText,
Configuration conf)
- Extracts Outlinks from the given plain text.
Applying this method to input that is not plain text can result in extremely
long runtimes in pathological cases (PostScript is a known example).
- Parameters:
plainText - the plain text from which URLs should be extracted.
- Returns:
- Array of Outlinks found in plainText
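The description above says only that URLs are pulled out of plain text with regular expressions; the pattern the class actually uses is not shown here. The following is a minimal sketch of the general technique using java.util.regex. The class name PlainTextUrlSketch, the method extractUrls, the pattern, and the MAX_CHARS cap are all assumptions made for illustration and are not part of the Nutch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT the Nutch implementation.
class PlainTextUrlSketch {

    // Assumed, deliberately simple pattern: a scheme followed by a run of
    // characters that are not whitespace, quotes, or angle brackets.
    // It has no nested quantifiers, so matching stays linear in the input.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:https?|ftp)://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    // Assumed safety cap: very large inputs are truncated before matching,
    // as one possible guard against long runtimes on unexpected input.
    private static final int MAX_CHARS = 1_000_000;

    static List<String> extractUrls(String plainText) {
        List<String> urls = new ArrayList<>();
        if (plainText == null) {
            return urls;
        }
        String text = plainText.length() > MAX_CHARS
                ? plainText.substring(0, MAX_CHARS)
                : plainText;
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group()); // the full matched URL
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See http://nutch.apache.org/ and ftp://example.org/file.txt for details";
        for (String url : extractUrls(text)) {
            System.out.println(url);
        }
    }
}
```

A pattern this simple will also capture trailing punctuation that directly follows a URL, so a real extractor would likely need extra cleanup.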
getOutlinks
public static Outlink[] getOutlinks(String plainText,
String anchor,
Configuration conf)
- Extracts Outlinks from the given plain text and adds anchor to the extracted Outlinks.
- Parameters:
plainText - the plain text from which URLs should be extracted.
anchor - the anchor of the URL
- Returns:
- Array of Outlinks found in plainText
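To illustrate the anchor variant, the sketch below attaches the same anchor string to every URL found in the text, which is one plausible reading of "adds anchor to the extracted Outlinks". The names AnchoredLinkSketch, Link, and extractLinks, as well as the pattern, are hypothetical; Link is a stand-in for a URL/anchor pair and is not the Nutch Outlink class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT the Nutch implementation.
class AnchoredLinkSketch {

    // Hypothetical stand-in for an outlink: a target URL plus anchor text.
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // Assumed simple URL pattern, chosen for the example.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:https?|ftp)://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    // Finds every URL in the text and pairs each one with the given anchor.
    static Link[] extractLinks(String plainText, String anchor) {
        List<Link> links = new ArrayList<>();
        if (plainText != null) {
            Matcher matcher = URL_PATTERN.matcher(plainText);
            while (matcher.find()) {
                links.add(new Link(matcher.group(), anchor));
            }
        }
        return links.toArray(new Link[0]);
    }

    public static void main(String[] args) {
        Link[] links = extractLinks(
                "Mirrors: http://example.org/a http://example.org/b", "mirror list");
        for (Link link : links) {
            System.out.println(link.toUrl + " [" + link.anchor + "]");
        }
    }
}
```

Returning an array mirrors the shape of the documented signature; an empty array rather than null is used here for input without URLs, which is a design choice of the sketch.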
Copyright © 2011 The Apache Software Foundation