FetchSchedule
.
String
can be decoded in reverse and the
first character is represented by a terminal node.
String
can be decoded and the last character is
represented by a terminal node.
ArcRecordReader
class provides a record reader which
reads records from arc files.
CircularDependencyException
will be thrown if a circular
dependency is detected.
MimeType
name by removing out the actual MimeType
,
from a string of the form:
Configuration
for Nutch.
Configuration
from supplied properties.
Text
object for the key.
RegexRule
.
BytesWritable
object for the key
DomainSuffix
objects
Note: this class is a singleton.
Extension
is a kind of listener descriptor that is
installed on a concrete ExtensionPoint
that acts as a kind of
publisher.
ExtensionPoint
provides meta information about an extension
point.
FetchSchedule
implementation.
AnchorIndexingFilter
filter object which supports boolean
configuration settings for the deduplication of anchors.
BasicIndexingFilter
filter object which supports boolean
configurable value for length of characters permitted within the
title @see indexer.max.title.length
in nutch-default.xml
Indexer
for indexing within the Nutch
index.
RelTagIndexingFilter
filter object.
Outlink
's
MimeTypes.forName(String)
method.
WebPage.getScore()
.
Configurable
s
DomainSuffix
object for the extension, if
extension is a top level domain returned object will be an
instance of TopLevelDomain
Configuration
object
Configuration
object
Configuration
object
Configuration
object
Configuration
object
DomainSuffix
corresponding to the
last public part of the hostname
DomainSuffix
corresponding to the
last public part of the hostname
WebPage
Many datastores need to setup the mapreduce job by specifying the fields
needed.
WebPage
Many datastores need to setup the mapreduce job by specifying the fields
needed.
WebPage
Many datastores need to setup the mapreduce job by specifying the fields
needed.
WebPage
Many datastores need to setup the mapreduce job by specifying the fields
needed.
WebPage
Many datastores need to setup the mapreduce job by specifying the fields
needed.
WebPage
Many datastores need to setup the mapreduce job by specifying the fields
needed.
robotsMeta
to appropriate
values, based on any META tags found under the given
node
.
robotsMeta
to appropriate
values, based on any META tags found under the given
node
.
MimeTypes.getMimeType(String)
method.
MimeTypes.getMimeType(File)
method.
node
, and creates appropriate Outlink
records for each (relative to the supplied base
URL), and adds them to the outlinks
ArrayList
.
Outlink
from given plain text.
Outlink
from given plain text and adds anchor
to the extracted Outlink
s
node
, and creates appropriate Outlink
records for each (relative to the supplied base
URL), and adds them to the outlinks
ArrayList
.
Configuration
object
Parser
instance with the specified
extId
, representing its extension ID.
Parser
s for a given content type.
Plugin
class.
null
.
Protocol
implementation for a url.
Content
for a fetchlist entry.
RecordReader
for reading the arc file.
url
with a configured HTTP client and gets the
response.
StringBuilder
and a DOM Node
,
and will append all the content text found beneath the DOM node to
the StringBuilder
.
getText(sb, node, false)
.
getText(sb, node, false)
.
StringBuffer
and a DOM Node
,
and will append the content text found beneath the first
title
node to the StringBuffer
.
StringBuffer
and a DOM Node
,
and will append the content text found beneath the first
title
node to the StringBuffer
.
IndexingFilter
implementing plugins.
sizeLimit
bytes, if necessary.
false
if the robots.txt
file
prohibits us from accessing the given url
, or
true
otherwise.
false
if the robots.txt
file
prohibits us from accessing the given path
, or
true
otherwise.
false
if the robots.txt
file
prohibits us from accessing the given url
, or
true
otherwise.
IndexingFilter
that adds a
lang
(language) field to the document.
s
padded with leading spaces so
that its length is length
.
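- longestMatch(String) -
Method in class org.apache.nutch.util.PrefixStringMatcher
- Returns the longest prefix of input that is matched, or null if no match exists.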
- longestMatch(String) -
Method in class org.apache.nutch.util.SuffixStringMatcher
- Returns the longest suffix of input that is matched, or null if no match exists.
- longestMatch(String) -
Method in class org.apache.nutch.util.TrieStringMatcher
- Returns the longest substring of input that is matched by a pattern in the trie, or null if no match exists.
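The contract of the suffix variant of longestMatch(String) can be illustrated with a minimal sketch. It uses a plain linear scan over a list rather than a trie, so it is not the Nutch implementation; the class name SuffixMatchSketch and the method name longestSuffixMatch are hypothetical and chosen only for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: shows the "longest suffix or null" contract.
class SuffixMatchSketch {

    /** Returns the longest entry of suffixes that input ends with, or null if none matches. */
    static String longestSuffixMatch(List<String> suffixes, String input) {
        String best = null;
        for (String suffix : suffixes) {
            // Keep the candidate only if it matches and is longer than the current best.
            if (input.endsWith(suffix) && (best == null || suffix.length() > best.length())) {
                best = suffix;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> suffixes = Arrays.asList(".com", ".co.uk", ".uk");
        System.out.println(longestSuffixMatch(suffixes, "example.co.uk")); // prints .co.uk
        System.out.println(longestSuffixMatch(suffixes, "example.org"));   // prints null
    }
}
```

A trie keyed on reversed strings would answer the same question in time proportional to the input length instead of the number of suffixes, which is presumably why the library uses one.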
Parser
.
TrieStringMatcher.TrieNode
visited, given that you are at
node
, and the next character in the input is
the idx
'th character of s
.
String
is matched by a
prefix in the trie
String
is matched by a
suffix in the trie
String
is matched by a
pattern in the trie
MissingDependencyException
will be thrown if a plugin
dependency cannot be found.
Node
on the stack and pushes all of its
children onto the stack, allowing us to walk the node tree without the
use of recursion.
Node
tree from the root node.
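The stack-based traversal described in the fragments above (take a Node off the stack, push its children, repeat) can be sketched with the standard org.w3c.dom API. This is an illustration under stated assumptions, not the NodeWalker source: the class name StackWalkSketch and both method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch: walks a DOM tree in document order without recursion.
class StackWalkSketch {

    static List<String> elementNamesInDocumentOrder(Node root) {
        List<String> names = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node current = stack.pop();
            if (current.getNodeType() == Node.ELEMENT_NODE) {
                names.add(current.getNodeName());
            }
            // Push children right-to-left so the leftmost child is popped first.
            NodeList children = current.getChildNodes();
            for (int i = children.getLength() - 1; i >= 0; i--) {
                stack.push(children.item(i));
            }
        }
        return names;
    }

    /** Parses an XML string; wraps checked exceptions to keep the example short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b><c/></b><d/></a>");
        System.out.println(elementNamesInDocumentOrder(doc)); // prints [a, b, c, d]
    }
}
```

Skipping a subtree, as skipChildren() is described as doing, would amount to not pushing (or popping off) the children of the node just visited.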
Configuration
s that include Nutch-specific
resources.
NutchDocument
is the unit of indexing.
Job
for Nutch jobs.
JobConf
for Nutch jobs.
Plugin
System.
http
,
httpclient
)
Outlink
s
/ URLs from plain text using Regular Expressions.
parse-plugins.xml
file and returns the
#ParsePluginList
defined by it.
Parser
s
until a successful parse is performed and a Parse
object is
returned.
ParseFilter
implementing plugins.
$NUTCH_HOME/conf/parse-plugins.xml
file.
Protocol
implementation.
Parser
plugins.
Parser
s to obtain
Parse
objects.
PluginClassLoader
contains only classes of the runtime
libraries set up in the plugin manifest file and exported libraries of
plugins that are required plugins.
PluginDescriptor
provides access to all meta information of
a nutch-plugin, as well as to the internationalizable resources and the plugin's
own classloader.
PluginManifestParser
parser just parses the manifest file
in all plugin directories.
PluginRuntimeException
will be thrown when an exception occurs in
plugin management.
String
s against a set
of prefixes.
PrefixStringMatcher
which will match
String
s with any prefix in the supplied array.
PrefixStringMatcher
which will match
String
s with any prefix in the supplied
Collection
.
ProtocolException
instead.
Protocol
plugins.
Java Regex implementation
.
URL filter
based on
regular expressions.
IndexingFilter
that adds tag
field(s) to the document.
false
.
s
padded with trailing spaces so
that its length is length
.
robots.txt
files.
FetcherJob
when processing
redirect URLs.
GeneratorJob
.
InjectorJob
.
Outlink
instances.
URLPartitioner
.
ScoringFilter
implementing plugins.
baseHref
.
Configuration
object
Configuration
object
Configuration
object used to configure this
IndexingFilter
.
Configuration
object
Configuration
object
Configuration
object for this Parser
.
Configuration
object
fetchInterval
and fetchTime
on a
successfully fetched page.
fetchInterval
and fetchTime
on a
successfully fetched page.
noCache
to true
.
noFollow
to true
.
noIndex
to true
.
refresh
to the supplied value.
refreshHref
.
refreshTime
.
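- shortestMatch(String) -
Method in class org.apache.nutch.util.PrefixStringMatcher
- Returns the shortest prefix of input that is matched, or null if no match exists.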
- shortestMatch(String) -
Method in class org.apache.nutch.util.SuffixStringMatcher
- Returns the shortest suffix of input that is matched, or null if no match exists.
- shortestMatch(String) -
Method in class org.apache.nutch.util.TrieStringMatcher
- Returns the shortest substring of input that is matched by a pattern in the trie, or null if no match exists.
- shouldFetch(String, WebPage, long) -
Method in class org.apache.nutch.crawl.AbstractFetchSchedule
- This method provides information whether the page is suitable for
selection in the current fetchlist.
- shouldFetch(String, WebPage, long) -
Method in interface org.apache.nutch.crawl.FetchSchedule
- This method provides information whether the page is suitable for
selection in the current fetchlist.
- shouldProcess(Utf8, Utf8) -
Static method in class org.apache.nutch.util.NutchJob
-
- shutDown() -
Method in class org.apache.nutch.plugin.Plugin
- Shutdown the plugin.
- Signature - Class in org.apache.nutch.crawl
-
- Signature() -
Constructor for class org.apache.nutch.crawl.Signature
-
- SIGNATURE_KEY -
Static variable in interface org.apache.nutch.metadata.Nutch
-
- SignatureComparator - Class in org.apache.nutch.crawl
-
- SignatureComparator() -
Constructor for class org.apache.nutch.crawl.SignatureComparator
-
- SignatureFactory - Class in org.apache.nutch.crawl
- Factory class that instantiates a Signature implementation according to the
current Configuration.
- size() -
Method in class org.apache.nutch.metadata.Metadata
- Returns the number of metadata names in this metadata.
- SIZEOF_BOOLEAN -
Static variable in class org.apache.nutch.util.Bytes
- Size of boolean in bytes
- SIZEOF_BYTE -
Static variable in class org.apache.nutch.util.Bytes
- Size of byte in bytes
- SIZEOF_CHAR -
Static variable in class org.apache.nutch.util.Bytes
- Size of char in bytes
- SIZEOF_DOUBLE -
Static variable in class org.apache.nutch.util.Bytes
- Size of double in bytes
- SIZEOF_FLOAT -
Static variable in class org.apache.nutch.util.Bytes
- Size of float in bytes
- SIZEOF_INT -
Static variable in class org.apache.nutch.util.Bytes
- Size of int in bytes
- SIZEOF_LONG -
Static variable in class org.apache.nutch.util.Bytes
- Size of long in bytes
- SIZEOF_SHORT -
Static variable in class org.apache.nutch.util.Bytes
- Size of short in bytes
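For orientation, the widths these SIZEOF_* constants describe can be reproduced from the JDK's own BYTES fields. The sketch below is an assumption-laden illustration, not the Nutch Bytes class: the class name PrimitiveSizesSketch and the helper longToBytes are hypothetical, and the one-byte value for boolean is an assumption about how a boolean is serialized, since the JDK defines no Boolean.BYTES.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: primitive widths in bytes, taken from the JDK wrapper classes.
class PrimitiveSizesSketch {
    static final int SIZEOF_BOOLEAN = 1;              // assumption: one byte per serialized boolean
    static final int SIZEOF_BYTE = Byte.BYTES;        // 1
    static final int SIZEOF_CHAR = Character.BYTES;   // 2
    static final int SIZEOF_DOUBLE = Double.BYTES;    // 8
    static final int SIZEOF_FLOAT = Float.BYTES;      // 4
    static final int SIZEOF_INT = Integer.BYTES;      // 4
    static final int SIZEOF_LONG = Long.BYTES;        // 8
    static final int SIZEOF_SHORT = Short.BYTES;      // 2

    /** Serializes a long into a big-endian array of exactly SIZEOF_LONG bytes. */
    static byte[] longToBytes(long value) {
        return ByteBuffer.allocate(SIZEOF_LONG).putLong(value).array();
    }

    public static void main(String[] args) {
        System.out.println(SIZEOF_INT + " " + SIZEOF_LONG + " " + longToBytes(42L).length); // prints 4 8 8
    }
}
```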
- skip(DataInput) -
Static method in class org.apache.nutch.parse.Outlink
- Skips over one Outlink in the input.
- SKIP_TRUNCATED -
Static variable in class org.apache.nutch.parse.ParserJob
-
- skipChildren() -
Method in class org.apache.nutch.util.NodeWalker
- Skips over and removes from the node stack the children of the last
node.
- skippedEntity(String) -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Receive notification of a skipped entity.
- SOLR_PREFIX -
Static variable in interface org.apache.nutch.indexer.solr.SolrConstants
-
- SolrConstants - Interface in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates - Class in org.apache.nutch.indexer.solr
- Utility class for deleting duplicate documents from a solr index.
- SolrDeleteDuplicates() -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- SolrDeleteDuplicates.SolrInputFormat - Class in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates.SolrInputFormat() -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputFormat
-
- SolrDeleteDuplicates.SolrInputSplit - Class in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates.SolrInputSplit() -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputSplit
-
- SolrDeleteDuplicates.SolrInputSplit(int, int) -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputSplit
-
- SolrDeleteDuplicates.SolrRecord - Class in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates.SolrRecord() -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- SolrDeleteDuplicates.SolrRecord(String, float, long) -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- SolrDeleteDuplicates.SolrRecordReader - Class in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates.SolrRecordReader(SolrDocumentList, int) -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecordReader
-
- SolrIndexerJob - Class in org.apache.nutch.indexer.solr
-
- SolrIndexerJob() -
Constructor for class org.apache.nutch.indexer.solr.SolrIndexerJob
-
- SolrMappingReader - Class in org.apache.nutch.indexer.solr
-
- SolrMappingReader(Configuration) -
Constructor for class org.apache.nutch.indexer.solr.SolrMappingReader
-
- SolrWriter - Class in org.apache.nutch.indexer.solr
-
- SolrWriter() -
Constructor for class org.apache.nutch.indexer.solr.SolrWriter
-
- sortByValue() -
Method in class org.apache.nutch.util.Histogram
-
- sortInverseByValue() -
Method in class org.apache.nutch.util.Histogram
-
- SOURCE -
Static variable in interface org.apache.nutch.metadata.DublinCore
- A reference to a resource from which the present resource is derived.
- SpellCheckedMetadata - Class in org.apache.nutch.metadata
- A decorator to Metadata that adds spellchecking capabilities to property
names.
- SpellCheckedMetadata() -
Constructor for class org.apache.nutch.metadata.SpellCheckedMetadata
-
- split(byte[], byte[], int) -
Static method in class org.apache.nutch.util.Bytes
- Split passed range.
- splitEnd -
Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- splitLen -
Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- splitStart -
Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- start() -
Method in class org.apache.nutch.api.NutchServer
-
- startCDATA() -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Report the start of a CDATA section.
- startDocument() -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Receive notification of the beginning of a document.
- startDTD(String, String, String) -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Report the start of DTD declarations, if any.
- started -
Static variable in class org.apache.nutch.api.NutchApp
-
- startElement(String, String, String, Attributes) -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Receive notification of the beginning of an element.
- startEntity(String) -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Report the beginning of an entity.
- startPrefixMapping(String, String) -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Begin the scope of a prefix-URI Namespace mapping.
- startsWith(byte[], byte[]) -
Static method in class org.apache.nutch.util.Bytes
- Return true if the byte array on the right is a prefix of the byte array
on the left.
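The "right is a prefix of left" wording is easy to misread, so a minimal sketch of that check may help. It is an illustration only, not the Nutch source: the class name BytePrefixSketch is hypothetical, and returning false for null arguments is an assumption this sketch makes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a byte-array prefix test.
class BytePrefixSketch {

    /** Returns true if every byte of right equals the byte at the same index of left. */
    static boolean startsWith(byte[] left, byte[] right) {
        // Assumption of this sketch: null inputs and an over-long prefix never match.
        if (left == null || right == null || right.length > left.length) {
            return false;
        }
        for (int i = 0; i < right.length; i++) {
            if (left[i] != right[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] key = "http://example.com/".getBytes(StandardCharsets.UTF_8);
        byte[] scheme = "http://".getBytes(StandardCharsets.UTF_8);
        System.out.println(startsWith(key, scheme)); // prints true
        System.out.println(startsWith(scheme, key)); // prints false
    }
}
```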
- startUp() -
Method in class org.apache.nutch.plugin.Plugin
- Invoked when the plugin starts up.
- STAT_COUNTERS -
Static variable in interface org.apache.nutch.metadata.Nutch
- Counters.
- STAT_JOBS -
Static variable in interface org.apache.nutch.metadata.Nutch
- Jobs.
- STAT_MESSAGE -
Static variable in interface org.apache.nutch.metadata.Nutch
- Status / result message.
- STAT_PHASE -
Static variable in interface org.apache.nutch.metadata.Nutch
- Phase of processing.
- STAT_PROGRESS -
Static variable in interface org.apache.nutch.metadata.Nutch
- Progress (float).
- state -
Variable in class org.apache.nutch.api.JobStatus
-
- status -
Variable in class org.apache.nutch.util.NutchTool
-
- STATUS_BLOCKED -
Static variable in class org.apache.nutch.protocol.ProtocolStatusUtils
-
- STATUS_FAILED -
Static variable in class org.apache.nutch.protocol.ProtocolStatusUtils
-
- STATUS_FETCHED -
Static variable in class org.apache.nutch.crawl.CrawlStatus
- Page was successfully fetched.
- STATUS_GONE -
Static variable in class org.apache.nutch.crawl.CrawlStatus
- Page no longer exists.
- STATUS_GONE -
Static variable in class org.apache.nutch.protocol.ProtocolStatusUtils
-
- STATUS_MODIFIED -
Static variable in interface org.apache.nutch.crawl.FetchSchedule
- Page is known to have been modified since our last visit.
- STATUS_NOTFETCHING -
Static variable in class org.apache.nutch.protocol.ProtocolStatusUtils
-
- STATUS_NOTFOUND -
Static variable in class org.apache.nutch.protocol.ProtocolStatusUtils
-
- STATUS_NOTMODIFIED -
Static variable in class org.apache.nutch.crawl.CrawlStatus
- Fetching successful - page is not modified.
- STATUS_NOTMODIFIED -
Static variable in interface org.apache.nutch.crawl.FetchSchedule
- Page is known to remain unmodified since our last visit.
- STATUS_NOTMODIFIED -
Static variable in class org.apache.nutch.protocol.ProtocolStatusUtils
-
- STATUS_REDIR_EXCEEDED -
Static variable in class org.apache.nutch.protocol.ProtocolStatusUtils
-
- STATUS_REDIR_PERM -
Static variable in class org.apache.nutch.crawl.CrawlStatus
- Page permanently redirects to other page.
- STATUS_REDIR_TEMP -
Static variable in class org.apache.nutch.crawl.CrawlStatus
- Page temporarily redirects to other page.
- STATUS_RETRY -
Static variable in class org.apache.nutch.crawl.CrawlStatus
- Fetching unsuccessful, needs to be retried (transient errors).
- STATUS_RETRY -
Static variable in class org.apache.nutch.protocol.ProtocolStatusUtils
-
- STATUS_ROBOTS_DENIED -
Static variable in class org.apache.nutch.protocol.ProtocolStatusUtils
-
- STATUS_SUCCESS -
Static variable in class org.apache.nutch.parse.ParseStatusUtils
-
- STATUS_SUCCESS -
Static variable in class org.apache.nutch.protocol.ProtocolStatusUtils
-
- STATUS_UNFETCHED -
Static variable in class org.apache.nutch.crawl.CrawlStatus
- Page was not fetched yet.
- STATUS_UNKNOWN -
Static variable in interface org.apache.nutch.crawl.FetchSchedule
- It is unknown whether page was changed since our last visit.
- STATUS_WOULDBLOCK -
Static variable in class org.apache.nutch.protocol.ProtocolStatusUtils
-
- stop(String, String) -
Method in class org.apache.nutch.api.impl.RAMJobManager
-
- stop(String, String) -
Method in interface org.apache.nutch.api.JobManager
-
- stop(boolean) -
Method in class org.apache.nutch.api.NutchServer
-
- stopJob() -
Method in class org.apache.nutch.crawl.Crawler
-
- stopJob() -
Method in class org.apache.nutch.util.NutchTool
- Stop the job with the possibility to resume.
- StorageUtils - Class in org.apache.nutch.storage
- Entry point to Gora store/mapreduce functionality.
- StorageUtils() -
Constructor for class org.apache.nutch.storage.StorageUtils
-
- store -
Variable in class org.apache.nutch.indexer.IndexerJob.IndexerMapper
-
- StringUtil - Class in org.apache.nutch.util
- A collection of String processing utility methods.
- StringUtil() -
Constructor for class org.apache.nutch.util.StringUtil
-
- stripNonCharCodepoints(String) -
Static method in class org.apache.nutch.indexer.solr.SolrWriter
-
- Subcollection - Class in org.apache.nutch.collection
- SubCollection represents a subset of the index; you can define URL patterns that
indicate whether a particular page (URL) is part of the SubCollection.
- Subcollection(String, String, Configuration) -
Constructor for class org.apache.nutch.collection.Subcollection
- public Constructor
- Subcollection(Configuration) -
Constructor for class org.apache.nutch.collection.Subcollection
-
- SubcollectionIndexingFilter - Class in org.apache.nutch.indexer.subcollection
-
- SubcollectionIndexingFilter() -
Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- SubcollectionIndexingFilter(Configuration) -
Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- SUBJECT -
Static variable in interface org.apache.nutch.metadata.DublinCore
- The topic of the content of the resource.
- SUCCESS -
Static variable in interface org.apache.nutch.parse.ParseStatusCodes
- Parsing succeeded.
- SUCCESS -
Static variable in interface org.apache.nutch.protocol.ProtocolStatusCodes
- Content was retrieved without errors.
- SUCCESS_OK -
Static variable in interface org.apache.nutch.parse.ParseStatusCodes
-
- SUCCESS_REDIRECT -
Static variable in interface org.apache.nutch.parse.ParseStatusCodes
- Parsed content contains a directive to redirect to another URL.
- SuffixStringMatcher - Class in org.apache.nutch.util
- A class for efficiently matching Strings against a set of suffixes.
- SuffixStringMatcher(String[]) -
Constructor for class org.apache.nutch.util.SuffixStringMatcher
- Creates a new SuffixStringMatcher which will match Strings with any suffix in the supplied array.
- SuffixStringMatcher(Collection) -
Constructor for class org.apache.nutch.util.SuffixStringMatcher
- Creates a new SuffixStringMatcher which will match Strings with any suffix in the supplied Collection.
- SuffixURLFilter - Class in org.apache.nutch.urlfilter.suffix
- Filters URLs based on a file of URL suffixes.
- SuffixURLFilter() -
Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- SuffixURLFilter(Reader) -
Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- SWFParser - Class in org.apache.nutch.parse.swf
- Parser for Flash SWF files.
- SWFParser() -
Constructor for class org.apache.nutch.parse.swf.SWFParser
-
Bytes.toBytes(boolean)
Bytes.SIZEOF_SHORT
bytes
long.
StringUtil.toHexString(byte[], String, int)
, where
sep = null; lineLen = Integer.MAX_VALUE
.
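The fragment above refers to a hex conversion with no separator and no line wrapping. A minimal sketch of that behaviour follows; it is not the StringUtil source, the class name HexSketch is hypothetical, and the lowercase digits and the null-in, null-out handling are assumptions of this example.

```java
// Hypothetical sketch: hex-encode a byte array with no separator and no line breaks.
class HexSketch {

    static String toHexString(byte[] data) {
        if (data == null) {
            return null; // assumption: null input yields null
        }
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            // Emit the high nibble, then the low nibble, as lowercase hex digits.
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexString(new byte[] {0x0f, (byte) 0xa0})); // prints 0fa0
    }
}
```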
sizeLimit
bytes, if necessary.
URLFilter
implementing plugins.