- CACHE - Static variable in class org.apache.nutch.protocol.RobotRulesParser
-
- CACHING_FORBIDDEN_ALL - Static variable in interface org.apache.nutch.metadata.Nutch
-
Don't show either original forbidden content or summaries.
- CACHING_FORBIDDEN_CONTENT - Static variable in interface org.apache.nutch.metadata.Nutch
-
Don't show original forbidden content, but show summaries.
- CACHING_FORBIDDEN_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
Sites may request that search engines don't provide access to cached documents.
- CACHING_FORBIDDEN_NONE - Static variable in interface org.apache.nutch.metadata.Nutch
-
Show both original forbidden content and summaries (default).
- calculate(Content, Parse) - Method in class org.apache.nutch.crawl.MD5Signature
-
- calculate(Content, Parse) - Method in class org.apache.nutch.crawl.Signature
-
- calculate(Content, Parse) - Method in class org.apache.nutch.crawl.TextProfileSignature
-
- calculateLastFetchTime(CrawlDatum) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method returns the last fetch time of the CrawlDatum.
- calculateLastFetchTime(CrawlDatum) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
Calculates last fetch time of the given CrawlDatum.
- CCIndexingFilter - Class in org.creativecommons.nutch
-
Adds basic searchable fields to a document.
- CCIndexingFilter() - Constructor for class org.creativecommons.nutch.CCIndexingFilter
-
- CCParseFilter - Class in org.creativecommons.nutch
-
Adds metadata identifying the Creative Commons license used, if any.
- CCParseFilter() - Constructor for class org.creativecommons.nutch.CCParseFilter
-
- CCParseFilter.Walker - Class in org.creativecommons.nutch
-
Walks DOM tree, looking for RDF in comments and licenses in anchors.
- cdata(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of cdata.
- CHAR_ENCODING_FOR_CONVERSION - Static variable in interface org.apache.nutch.metadata.Nutch
-
- characters(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of character data.
- charactersRaw(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
If available, when the disable-output-escaping attribute is used,
output raw text without escaping.
- CHARSET_UTF8 - Static variable in class org.apache.nutch.parse.feed.FeedParser
-
- CHECK_BLOCKING - Static variable in interface org.apache.nutch.protocol.Protocol
-
Property name.
- CHECK_ROBOTS - Static variable in interface org.apache.nutch.protocol.Protocol
-
Property name.
- checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
- checkOutputSpecs(FileSystem, JobConf) - Method in class org.apache.nutch.fetcher.FetcherOutputFormat
-
- checkOutputSpecs(FileSystem, JobConf) - Method in class org.apache.nutch.parse.ParseOutputFormat
-
- checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
- childLen - Variable in class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
-
- children - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
-
- childrenList - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
-
- chooseRepr(String, String, boolean) - Static method in class org.apache.nutch.util.URLUtil
-
Given two urls, a src and a destination of a redirect, it returns the
representative url.
- CircularDependencyException - Exception in org.apache.nutch.plugin
-
CircularDependencyException will be thrown if a circular dependency is detected.
- CircularDependencyException(Throwable) - Constructor for exception org.apache.nutch.plugin.CircularDependencyException
-
- CircularDependencyException(String) - Constructor for exception org.apache.nutch.plugin.CircularDependencyException
-
- cleanField(String) - Static method in class org.apache.nutch.util.StringUtil
-
Simple character substitution which cleans all � chars from a given String.
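A minimal sketch of this kind of substitution, assuming the character being removed is the Unicode replacement character U+FFFD (rendered as � above). The class name `CleanFieldSketch` is illustrative only and this is not the Nutch implementation.

```java
// Illustrative sketch only; not the actual StringUtil code.
final class CleanFieldSketch {

    // Removes every U+FFFD (replacement character) from the input string.
    // Assumption: U+FFFD is the character the description refers to.
    static String cleanField(String value) {
        return value.replace("\uFFFD", "");
    }

    public static void main(String[] args) {
        // The replacement character between "caf" and "e" is dropped.
        System.out.println(cleanField("caf\uFFFDe"));
    }
}
```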
- CleaningJob - Class in org.apache.nutch.indexer
-
The class scans CrawlDB looking for entries with status DB_GONE (404) or
DB_DUPLICATE and sends delete requests to indexers for those documents.
- CleaningJob() - Constructor for class org.apache.nutch.indexer.CleaningJob
-
- CleaningJob.DBFilter - Class in org.apache.nutch.indexer
-
- CleaningJob.DBFilter() - Constructor for class org.apache.nutch.indexer.CleaningJob.DBFilter
-
- CleaningJob.DeleterReducer - Class in org.apache.nutch.indexer
-
- CleaningJob.DeleterReducer() - Constructor for class org.apache.nutch.indexer.CleaningJob.DeleterReducer
-
- cleanMimeType(String) - Static method in class org.apache.nutch.util.MimeUtil
-
Cleans a MimeType name by extracting the actual MimeType from a string of the form:
- clear() - Method in class org.apache.nutch.crawl.Inlinks
-
- clear() - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- clear() - Method in class org.apache.nutch.metadata.Metadata
-
Remove all mappings from metadata.
- clearClues() - Method in class org.apache.nutch.util.EncodingDetector
-
Clears all clues.
- Client - Class in org.apache.nutch.protocol.ftp
-
Encapsulates the functionality Nutch needs to get a directory listing and
retrieve files from an FTP server.
- Client() - Constructor for class org.apache.nutch.protocol.ftp.Client
-
Public default constructor
- clone() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- clone() - Method in class org.apache.nutch.indexer.NutchField
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbFilter
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbMerger.Merger
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- close(Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatCombiner
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
-
- close() - Method in class org.apache.nutch.crawl.CrawlDbReducer
-
- close() - Method in class org.apache.nutch.crawl.DeduplicationJob.DBFilter
-
- close() - Method in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
-
- close() - Method in class org.apache.nutch.crawl.DeduplicationJob.StatusUpdateReducer
-
- close() - Method in class org.apache.nutch.crawl.Generator.Selector
-
- close() - Method in class org.apache.nutch.crawl.Injector.InjectMapper
-
- close() - Method in class org.apache.nutch.crawl.Injector.InjectReducer
-
- close() - Method in class org.apache.nutch.crawl.LinkDb
-
- close() - Method in class org.apache.nutch.crawl.LinkDbFilter
-
- close() - Method in class org.apache.nutch.crawl.LinkDbMerger
-
- close() - Method in class org.apache.nutch.crawl.LinkDbReader
-
- close() - Method in class org.apache.nutch.crawl.URLPartitioner
-
- close() - Method in class org.apache.nutch.fetcher.Fetcher
-
- close() - Method in class org.apache.nutch.fetcher.OldFetcher
-
- close() - Method in class org.apache.nutch.indexer.CleaningJob.DBFilter
-
- close() - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
-
- close() - Method in class org.apache.nutch.indexer.IndexerMapReduce
-
- close() - Method in interface org.apache.nutch.indexer.IndexWriter
-
- close() - Method in class org.apache.nutch.indexer.IndexWriters
-
- close() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- close() - Method in class org.apache.nutch.parse.ParseSegment
-
- close() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
-
- close() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
-
- close() - Method in class org.apache.nutch.scoring.webgraph.LinkRank
-
- close() - Method in class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
- close() - Method in class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
- close() - Method in class org.apache.nutch.scoring.webgraph.Loops.Looper
-
- close() - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper
-
- close() - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter
-
- close() - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
- close() - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
-
- close() - Method in class org.apache.nutch.segment.SegmentMerger
-
- close() - Method in class org.apache.nutch.segment.SegmentReader
-
- close() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Closes the record reader resources.
- close() - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
- closeReaders(SequenceFile.Reader[]) - Static method in class org.apache.nutch.util.FSUtils
-
Closes a group of SequenceFile readers.
- closeReaders(MapFile.Reader[]) - Static method in class org.apache.nutch.util.FSUtils
-
Closes a group of MapFile readers.
- CollectionManager - Class in org.apache.nutch.collection
-
- CollectionManager(Configuration) - Constructor for class org.apache.nutch.collection.CollectionManager
-
- CollectionManager() - Constructor for class org.apache.nutch.collection.CollectionManager
-
Used for testing
- CommandRunner - Class in org.apache.nutch.util
-
- CommandRunner() - Constructor for class org.apache.nutch.util.CommandRunner
-
- comment(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report an XML comment anywhere in the document.
- commit() - Method in interface org.apache.nutch.indexer.IndexWriter
-
- commit() - Method in class org.apache.nutch.indexer.IndexWriters
-
- commit() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- COMMIT_INDEX - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
Deprecated.
- COMMIT_SIZE - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
- compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.CrawlDatum.Comparator
-
- compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.Generator.DecreasingFloatComparator
-
Compares two FloatWritables in decreasing order.
- compare(WritableComparable, WritableComparable) - Method in class org.apache.nutch.crawl.Generator.HashComparator
-
- compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.Generator.HashComparator
-
- compare(Object, Object) - Method in class org.apache.nutch.crawl.SignatureComparator
-
- compareTo(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Sort by decreasing score.
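As a rough illustration of a natural ordering that sorts by decreasing score, the sketch below uses a hypothetical `ScoredEntry` class rather than CrawlDatum itself. It compares on score alone; any tie-breaking the real class may do is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a scored record; not the Nutch CrawlDatum class.
final class ScoredEntry implements Comparable<ScoredEntry> {
    final String url;
    final float score;

    ScoredEntry(String url, float score) {
        this.url = url;
        this.score = score;
    }

    // Higher scores sort first: compare "that" against "this".
    @Override
    public int compareTo(ScoredEntry that) {
        return Float.compare(that.score, this.score);
    }

    // Returns the URLs of the given entries, ordered by decreasing score.
    static List<String> sortedUrls(List<ScoredEntry> entries) {
        List<ScoredEntry> copy = new ArrayList<>(entries);
        Collections.sort(copy);
        List<String> urls = new ArrayList<>();
        for (ScoredEntry e : copy) {
            urls.add(e.url);
        }
        return urls;
    }
}
```

Swapping the arguments to `Float.compare` is what reverses the usual ascending order.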
- compareTo(TrieStringMatcher.TrieNode) - Method in class org.apache.nutch.util.TrieStringMatcher.TrieNode
-
- conf - Variable in class org.apache.nutch.crawl.Signature
-
- conf - Variable in class org.apache.nutch.plugin.Plugin
-
- conf - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbFilter
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbMerger.Merger
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatCombiner
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.CrawlDbReducer
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.DeduplicationJob.DBFilter
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.DeduplicationJob.StatusUpdateReducer
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.Generator.Selector
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.Injector.InjectMapper
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.Injector.InjectReducer
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.LinkDb
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.LinkDbFilter
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.LinkDbMerger
-
- configure(JobConf) - Method in class org.apache.nutch.crawl.URLPartitioner
-
- configure(JobConf) - Method in class org.apache.nutch.fetcher.Fetcher
-
- configure(JobConf) - Method in class org.apache.nutch.fetcher.OldFetcher
-
- configure(JobConf) - Method in class org.apache.nutch.indexer.CleaningJob.DBFilter
-
- configure(JobConf) - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
-
- configure(JobConf) - Method in class org.apache.nutch.indexer.IndexerMapReduce
-
- configure(JobConf) - Method in class org.apache.nutch.parse.ParseSegment
-
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
-
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
-
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
Configures the job.
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
Configures the job.
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.Loops.Looper
-
Configures the job.
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper
-
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter
-
Configures the job, sets the flag for type of content and the topN number
if any.
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
- configure(JobConf) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
-
Configures the OutlinkDb job.
- configure(JobConf) - Method in class org.apache.nutch.segment.SegmentMerger
-
- configure(JobConf) - Method in class org.apache.nutch.segment.SegmentReader
-
- configure(JobConf) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
Configures the job.
- configure(JobConf) - Method in class org.apache.nutch.tools.FreeGenerator.FG
-
- containsKey(Writable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- containsValue(Writable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- Content - Class in org.apache.nutch.protocol
-
- Content() - Constructor for class org.apache.nutch.protocol.Content
-
- Content(String, String, byte[], String, Metadata, Configuration) - Constructor for class org.apache.nutch.protocol.Content
-
- CONTENT_DISPOSITION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_ENCODING - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_LANGUAGE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_LENGTH - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_LOCATION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_MD5 - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- CONTENT_REDIR - Static variable in class org.apache.nutch.fetcher.Fetcher
-
- CONTENT_REDIR - Static variable in class org.apache.nutch.fetcher.OldFetcher
-
- CONTENT_TYPE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- ContentAsTextInputFormat - Class in org.apache.nutch.segment
-
An input format that takes Nutch Content objects and converts them to text
while converting newline endings to spaces.
- ContentAsTextInputFormat() - Constructor for class org.apache.nutch.segment.ContentAsTextInputFormat
-
- CONTRIBUTOR - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An entity responsible for making contributions to the content of the
resource.
- COVERAGE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
The extent or scope of the content of the resource.
- CrawlDatum - Class in org.apache.nutch.crawl
-
- CrawlDatum() - Constructor for class org.apache.nutch.crawl.CrawlDatum
-
- CrawlDatum(int, int) - Constructor for class org.apache.nutch.crawl.CrawlDatum
-
- CrawlDatum(int, int, float) - Constructor for class org.apache.nutch.crawl.CrawlDatum
-
- CrawlDatum.Comparator - Class in org.apache.nutch.crawl
-
A Comparator optimized for CrawlDatum.
- CrawlDatum.Comparator() - Constructor for class org.apache.nutch.crawl.CrawlDatum.Comparator
-
- CrawlDb - Class in org.apache.nutch.crawl
-
This class takes the output of the fetcher and updates the
crawldb accordingly.
- CrawlDb() - Constructor for class org.apache.nutch.crawl.CrawlDb
-
- CrawlDb(Configuration) - Constructor for class org.apache.nutch.crawl.CrawlDb
-
- CRAWLDB_ADDITIONS_ALLOWED - Static variable in class org.apache.nutch.crawl.CrawlDb
-
- CRAWLDB_PURGE_404 - Static variable in class org.apache.nutch.crawl.CrawlDb
-
- CrawlDbFilter - Class in org.apache.nutch.crawl
-
This class provides a way to separate the URL normalization
and filtering steps from the rest of CrawlDb manipulation code.
- CrawlDbFilter() - Constructor for class org.apache.nutch.crawl.CrawlDbFilter
-
- CrawlDbMerger - Class in org.apache.nutch.crawl
-
This tool merges several CrawlDb-s into one, optionally filtering
URLs through the current URLFilters, to skip prohibited
pages.
- CrawlDbMerger() - Constructor for class org.apache.nutch.crawl.CrawlDbMerger
-
- CrawlDbMerger(Configuration) - Constructor for class org.apache.nutch.crawl.CrawlDbMerger
-
- CrawlDbMerger.Merger - Class in org.apache.nutch.crawl
-
- CrawlDbMerger.Merger() - Constructor for class org.apache.nutch.crawl.CrawlDbMerger.Merger
-
- CrawlDbReader - Class in org.apache.nutch.crawl
-
Read utility for the CrawlDB.
- CrawlDbReader() - Constructor for class org.apache.nutch.crawl.CrawlDbReader
-
- CrawlDbReader.CrawlDatumCsvOutputFormat - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDatumCsvOutputFormat() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat
-
- CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter(DataOutputStream) - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
-
- CrawlDbReader.CrawlDbDumpMapper - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbDumpMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
-
- CrawlDbReader.CrawlDbStatCombiner - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbStatCombiner() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatCombiner
-
- CrawlDbReader.CrawlDbStatMapper - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbStatMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
-
- CrawlDbReader.CrawlDbStatReducer - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbStatReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
-
- CrawlDbReader.CrawlDbTopNMapper - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbTopNMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
-
- CrawlDbReader.CrawlDbTopNReducer - Class in org.apache.nutch.crawl
-
- CrawlDbReader.CrawlDbTopNReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
-
- CrawlDbReducer - Class in org.apache.nutch.crawl
-
Merge new page entries with existing entries.
- CrawlDbReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReducer
-
- create() - Static method in class org.apache.nutch.util.NutchConfiguration
-
Create a Configuration for Nutch.
- create(boolean, Properties) - Static method in class org.apache.nutch.util.NutchConfiguration
-
Create a Configuration from supplied properties.
- createJob(Configuration, Path) - Static method in class org.apache.nutch.crawl.CrawlDb
-
- createKey() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Creates a new instance of the Text object for the key.
- createLockFile(FileSystem, Path, boolean) - Static method in class org.apache.nutch.util.LockUtil
-
Create a lock file.
- createMergeJob(Configuration, Path, boolean, boolean) - Static method in class org.apache.nutch.crawl.CrawlDbMerger
-
- createMergeJob(Configuration, Path, boolean, boolean) - Static method in class org.apache.nutch.crawl.LinkDbMerger
-
- createParseResult(String, Parse) - Static method in class org.apache.nutch.parse.ParseResult
-
Convenience method for obtaining ParseResult from a single Parse output.
- createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
- createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
-
- createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
-
- createSegments(Path, Path) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
Creates the job that converts arc files to segments.
- createSocket(String, int, InetAddress, int) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
- createSocket(String, int, InetAddress, int, HttpConnectionParams) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
Attempts to get a new socket connection to the given host within the given
time limit.
- createSocket(String, int) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
- createSocket(Socket, String, int, boolean) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
- createSubCollection(String, String) - Method in class org.apache.nutch.collection.CollectionManager
-
Create a new subcollection.
- createValue() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Creates a new instance of the BytesWritable object for the value.
- createWebGraph(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.scoring.webgraph.WebGraph
-
Creates the three different WebGraph databases, Outlinks, Inlinks, and
Node.
- CreativeCommons - Interface in org.apache.nutch.metadata
-
A collection of Creative Commons property names.
- CREATOR - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An entity primarily responsible for making the content of the resource.
- CURRENT_NAME - Static variable in class org.apache.nutch.crawl.CrawlDb
-
- CURRENT_NAME - Static variable in class org.apache.nutch.crawl.LinkDb
-
- DATE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A date associated with an event in the life cycle of the resource.
- dateFormatStr - Static variable in class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
- datum - Variable in class org.apache.nutch.crawl.Generator.SelectorEntry
-
- debug - Variable in class org.apache.nutch.tools.proxy.AbstractTestbedHandler
-
- DEC_RATE - Variable in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- DeduplicationJob - Class in org.apache.nutch.crawl
-
Generic deduplicator which groups fetched URLs with the same digest and marks
all of them as duplicate except the one with the highest score (based on the
score in the crawldb, which is not necessarily the same as the score
indexed).
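The grouping described here can be sketched with plain collections. The `Page` class, its fields, and the tie rule (the first page seen wins on equal scores) are assumptions made for illustration; the actual job runs as MapReduce over the crawldb and may break ties differently.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the Nutch DeduplicationJob code.
final class DedupSketch {

    // Hypothetical record: a URL, its content digest, and its crawldb score.
    static final class Page {
        final String url;
        final String digest;
        final float score;
        boolean duplicate;

        Page(String url, String digest, float score) {
            this.url = url;
            this.digest = digest;
            this.score = score;
        }
    }

    // Groups pages by digest and flags all but the best-scored page in each
    // group as duplicates. On equal scores the earlier page is kept.
    static void markDuplicates(List<Page> pages) {
        Map<String, Page> bestByDigest = new HashMap<>();
        for (Page p : pages) {
            Page best = bestByDigest.get(p.digest);
            if (best == null) {
                bestByDigest.put(p.digest, p);
            } else if (p.score > best.score) {
                best.duplicate = true;          // previous best loses
                bestByDigest.put(p.digest, p);
            } else {
                p.duplicate = true;             // current page loses
            }
        }
    }
}
```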
- DeduplicationJob() - Constructor for class org.apache.nutch.crawl.DeduplicationJob
-
- DeduplicationJob.DBFilter - Class in org.apache.nutch.crawl
-
- DeduplicationJob.DBFilter() - Constructor for class org.apache.nutch.crawl.DeduplicationJob.DBFilter
-
- DeduplicationJob.DedupReducer - Class in org.apache.nutch.crawl
-
- DeduplicationJob.DedupReducer() - Constructor for class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
-
- DeduplicationJob.StatusUpdateReducer - Class in org.apache.nutch.crawl
-
Combine multiple new entries for a url.
- DeduplicationJob.StatusUpdateReducer() - Constructor for class org.apache.nutch.crawl.DeduplicationJob.StatusUpdateReducer
-
- DEFAULT_BOOST - Static variable in class org.apache.nutch.util.domain.DomainSuffix
-
- DEFAULT_DELAY - Static variable in class org.apache.nutch.tools.proxy.DelayHandler
-
- DEFAULT_FILE_NAME - Static variable in class org.apache.nutch.collection.CollectionManager
-
- DEFAULT_PLUGIN - Static variable in class org.apache.nutch.parse.ParserFactory
-
Wildcard for default plugins.
- DEFAULT_STATUS - Static variable in class org.apache.nutch.util.domain.DomainSuffix
-
- DefaultFetchSchedule - Class in org.apache.nutch.crawl
-
This class implements the default re-fetch schedule.
- DefaultFetchSchedule() - Constructor for class org.apache.nutch.crawl.DefaultFetchSchedule
-
- defaultInterval - Variable in class org.apache.nutch.crawl.AbstractFetchSchedule
-
- deflate(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns a deflated copy of the input array.
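For orientation, here is a minimal sketch of deflating a byte array with the JDK's `java.util.zip` classes. The class name `DeflateSketch` and the buffer size are assumptions, and this is not the DeflateUtils source; the `inflate` helper is included only to show the round trip.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative sketch only; not the Nutch DeflateUtils code.
final class DeflateSketch {

    // Returns a deflated (compressed) copy of the input array.
    static byte[] deflate(byte[] in) {
        Deflater deflater = new Deflater();
        deflater.setInput(in);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Returns an inflated (decompressed) copy of the input array.
    static byte[] inflate(byte[] in) {
        Inflater inflater = new Inflater();
        inflater.setInput(in);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // Stop on truncated input instead of looping forever.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break;
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("input is not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```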
- DeflateUtils - Class in org.apache.nutch.util
-
A collection of utility methods for working on deflated data.
- DeflateUtils() - Constructor for class org.apache.nutch.util.DeflateUtils
-
- DelayHandler - Class in org.apache.nutch.tools.proxy
-
- DelayHandler(int) - Constructor for class org.apache.nutch.tools.proxy.DelayHandler
-
- delete(String, boolean) - Method in class org.apache.nutch.indexer.CleaningJob
-
- delete(String) - Method in interface org.apache.nutch.indexer.IndexWriter
-
- delete(String) - Method in class org.apache.nutch.indexer.IndexWriters
-
- DELETE - Static variable in class org.apache.nutch.indexer.NutchIndexAction
-
- delete(String) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- deleteSubCollection(String) - Method in class org.apache.nutch.collection.CollectionManager
-
Delete named subcollection
- describe() - Method in interface org.apache.nutch.indexer.IndexWriter
-
Returns a String describing the IndexWriter instance and the specific parameters it can take
- describe() - Method in class org.apache.nutch.indexer.IndexWriters
-
- describe() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- DESCRIPTION - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An account of the content of the resource.
- DIR_NAME - Static variable in class org.apache.nutch.parse.ParseData
-
- DIR_NAME - Static variable in class org.apache.nutch.parse.ParseText
-
- DIR_NAME - Static variable in class org.apache.nutch.protocol.Content
-
- disconnect() - Method in class org.apache.nutch.protocol.ftp.Client
-
Closes the connection to the FTP server and restores
connection parameters to the default values.
- distributeScoreToOutlink(Text, Text, ParseData, CrawlDatum, CrawlDatum, int, int) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
-
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.
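The even split described above can be sketched as follows. This is a simplified illustration with hypothetical names (`ScoreSplitSketch`, `distribute`); it assumes "apply" means adding each share to the outlink's current score, and it leaves out the metadata lookup of the score key entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the OPICScoringFilter code.
final class ScoreSplitSketch {

    // Splits pageScore evenly among the outlinks and adds each share to the
    // outlink's current score. Returns a new map; the input is not modified.
    static Map<String, Float> distribute(float pageScore, Map<String, Float> outlinkScores) {
        Map<String, Float> updated = new LinkedHashMap<>(outlinkScores);
        if (updated.isEmpty()) {
            return updated; // nothing to distribute to
        }
        float share = pageScore / updated.size();
        for (Map.Entry<String, Float> e : updated.entrySet()) {
            e.setValue(e.getValue() + share);
        }
        return updated;
    }
}
```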
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Distribute score value from the current page to all its outlinked pages.
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.ScoringFilters
-
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Takes the metatags listed in your "urlmeta.tags" property and looks for
them inside the parseData object.
- DmozParser - Class in org.apache.nutch.tools
-
Utility that converts DMOZ RDF into a flat file of URLs to be injected.
- DmozParser() - Constructor for class org.apache.nutch.tools.DmozParser
-
- doc - Variable in class org.apache.nutch.indexer.NutchIndexAction
-
- doFilter(ServletRequest, ServletResponse, FilterChain) - Method in class org.apache.nutch.tools.proxy.LogDebugHandler
-
- DomainBlacklistURLFilter - Class in org.apache.nutch.urlfilter.domainblacklist
-
Filters URLs based on a file containing domain suffixes, domain names, and
hostnames.
- DomainBlacklistURLFilter() - Constructor for class org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter
-
Default constructor.
- DomainBlacklistURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter
-
Constructor that specifies the domain file to use.
- DomainStatistics - Class in org.apache.nutch.util.domain
-
Extracts some very basic statistics about domains from the crawldb
- DomainStatistics() - Constructor for class org.apache.nutch.util.domain.DomainStatistics
-
- DomainStatistics.DomainStatisticsCombiner - Class in org.apache.nutch.util.domain
-
- DomainStatistics.DomainStatisticsCombiner() - Constructor for class org.apache.nutch.util.domain.DomainStatistics.DomainStatisticsCombiner
-
- DomainStatistics.MyCounter - Enum in org.apache.nutch.util.domain
-
- DomainSuffix - Class in org.apache.nutch.util.domain
-
This class represents the last part of the host name,
which is operated by authorities, not individuals.
- DomainSuffix(String, DomainSuffix.Status, float) - Constructor for class org.apache.nutch.util.domain.DomainSuffix
-
- DomainSuffix(String) - Constructor for class org.apache.nutch.util.domain.DomainSuffix
-
- DomainSuffix.Status - Enum in org.apache.nutch.util.domain
-
Enumeration of the status of the TLD.
- DomainSuffixes - Class in org.apache.nutch.util.domain
-
Storage class for DomainSuffix objects. Note: this class is a singleton.
- DomainURLFilter - Class in org.apache.nutch.urlfilter.domain
-
Filters URLs based on a file containing domain suffixes, domain names, and
hostnames.
- DomainURLFilter() - Constructor for class org.apache.nutch.urlfilter.domain.DomainURLFilter
-
Default constructor.
- DomainURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.domain.DomainURLFilter
-
Constructor that specifies the domain file to use.
- DOMBuilder - Class in org.apache.nutch.parse.html
-
This class takes SAX events (in addition to some extra events
that SAX doesn't handle yet) and adds the result to a document
or document fragment.
- DOMBuilder(Document, Node) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
-
DOMBuilder instance constructor...
- DOMBuilder(Document, DocumentFragment) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
-
DOMBuilder instance constructor...
- DOMBuilder(Document) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
-
DOMBuilder instance constructor...
- DOMContentUtils - Class in org.apache.nutch.parse.html
-
A collection of methods for extracting content from DOM trees.
- DOMContentUtils(Configuration) - Constructor for class org.apache.nutch.parse.html.DOMContentUtils
-
- DOMContentUtils - Class in org.apache.nutch.parse.tika
-
A collection of methods for extracting content from DOM trees.
- DOMContentUtils(Configuration) - Constructor for class org.apache.nutch.parse.tika.DOMContentUtils
-
- DOMContentUtils.LinkParams - Class in org.apache.nutch.parse.html
-
- DOMContentUtils.LinkParams(String, String, int) - Constructor for class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
-
- DomUtil - Class in org.apache.nutch.util
-
- DomUtil() - Constructor for class org.apache.nutch.util.DomUtil
-
- DublinCore - Interface in org.apache.nutch.metadata
-
A collection of Dublin Core metadata names.
- DummySSLProtocolSocketFactory - Class in org.apache.nutch.protocol.httpclient
-
- DummySSLProtocolSocketFactory() - Constructor for class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
Constructor for DummySSLProtocolSocketFactory.
- DummyX509TrustManager - Class in org.apache.nutch.protocol.httpclient
-
- DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
Constructor for DummyX509TrustManager.
- dump(Path, Path) - Method in class org.apache.nutch.segment.SegmentReader
-
- DUMP_DIR - Static variable in class org.apache.nutch.scoring.webgraph.LinkDumper
-
- dumpLinks(Path) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper
-
Runs the inverter and merger jobs of the LinkDumper tool to create the
url to inlink node database.
- dumpNodes(Path, NodeDumper.DumpType, long, Path, boolean, NodeDumper.NameType, NodeDumper.AggrType, boolean) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper
-
Runs the process to dump the top urls out to a text file.
- dumpUrl(Path, String) - Method in class org.apache.nutch.scoring.webgraph.LoopReader
-
Prints loopset for a single url.
- dumpUrl(Path, String) - Method in class org.apache.nutch.scoring.webgraph.NodeReader
-
Prints the content of the Node represented by the url to system out.
- FAILED - Static variable in class org.apache.nutch.parse.ParseStatus
-
General failure.
- FAILED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Content was not retrieved.
- FAILED_EXCEPTION - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_INVALID_FORMAT - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_MISSING_CONTENT - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_MISSING_PARTS - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_TRUNCATED - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FakeHandler - Class in org.apache.nutch.tools.proxy
-
- FakeHandler() - Constructor for class org.apache.nutch.tools.proxy.FakeHandler
-
- Feed - Interface in org.apache.nutch.metadata
-
A collection of Feed property names extracted by the ROME library.
- FEED - Static variable in interface org.apache.nutch.metadata.Feed
-
- FEED_AUTHOR - Static variable in interface org.apache.nutch.metadata.Feed
-
- FEED_PUBLISHED - Static variable in interface org.apache.nutch.metadata.Feed
-
- FEED_TAGS - Static variable in interface org.apache.nutch.metadata.Feed
-
- FEED_UPDATED - Static variable in interface org.apache.nutch.metadata.Feed
-
- FeedIndexingFilter - Class in org.apache.nutch.indexer.feed
-
- FeedIndexingFilter() - Constructor for class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
- FeedParser - Class in org.apache.nutch.parse.feed
-
- FeedParser() - Constructor for class org.apache.nutch.parse.feed.FeedParser
-
- fetch(Path, int) - Method in class org.apache.nutch.fetcher.Fetcher
-
- fetch(Path, int) - Method in class org.apache.nutch.fetcher.OldFetcher
-
- FETCH_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
- FETCH_STATUS_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- FETCH_TIME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- fetched - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- Fetcher - Class in org.apache.nutch.fetcher
-
A queue-based fetcher.
- Fetcher() - Constructor for class org.apache.nutch.fetcher.Fetcher
-
- Fetcher(Configuration) - Constructor for class org.apache.nutch.fetcher.Fetcher
-
- Fetcher.InputFormat - Class in org.apache.nutch.fetcher
-
- Fetcher.InputFormat() - Constructor for class org.apache.nutch.fetcher.Fetcher.InputFormat
-
- FetcherOutputFormat - Class in org.apache.nutch.fetcher
-
Splits FetcherOutput entries into multiple map files.
- FetcherOutputFormat() - Constructor for class org.apache.nutch.fetcher.FetcherOutputFormat
-
- fetchErrors - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- FetchSchedule - Interface in org.apache.nutch.crawl
-
This interface defines the contract for implementations that manipulate
fetch times and re-fetch intervals.
- FetchScheduleFactory - Class in org.apache.nutch.crawl
-
- FIELD - Static variable in class org.creativecommons.nutch.CCIndexingFilter
-
The name of the document field we use.
- fieldName - Static variable in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
Doc field name
- File - Class in org.apache.nutch.protocol.file
-
This class is a protocol plugin used for file: scheme.
- File() - Constructor for class org.apache.nutch.protocol.file.File
-
- FileError - Exception in org.apache.nutch.protocol.file
-
Thrown for File error codes.
- FileError(int) - Constructor for exception org.apache.nutch.protocol.file.FileError
-
- FileException - Exception in org.apache.nutch.protocol.file
-
- FileException() - Constructor for exception org.apache.nutch.protocol.file.FileException
-
- FileException(String) - Constructor for exception org.apache.nutch.protocol.file.FileException
-
- FileException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.file.FileException
-
- FileException(Throwable) - Constructor for exception org.apache.nutch.protocol.file.FileException
-
- fileLen - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- FileResponse - Class in org.apache.nutch.protocol.file
-
FileResponse.java mimics file replies as HTTP responses.
- FileResponse(URL, CrawlDatum, File, Configuration) - Constructor for class org.apache.nutch.protocol.file.FileResponse
-
Default public constructor
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
-
Scan the HTML document looking for possible indications of the content
language.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
-
- filter(String) - Method in class org.apache.nutch.collection.Subcollection
-
Simple "indexOf" filter for matching patterns.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
-
The AnchorIndexingFilter filter object, which supports boolean
configuration settings for the deduplication of anchors.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
-
The BasicIndexingFilter filter object, which supports a few
configuration settings for adding basic searchable fields.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
Extracts the relevant fields (FEED_AUTHOR, FEED_TAGS, FEED_PUBLISHED,
FEED_UPDATED, FEED) and sends them to the Indexer for indexing within
the Nutch index.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in interface org.apache.nutch.indexer.IndexingFilter
-
Adds fields or otherwise modifies the document that will be indexed for a
parse.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.IndexingFilters
-
Run all defined filters.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
-
Takes the metatags listed in your "urlmeta.tags" property and looks for
them inside the CrawlDatum object.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
-
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.microformats.reltag.RelTagParser
-
Scan the HTML document looking for possible rel-tags.
- filter(String) - Method in interface org.apache.nutch.net.URLFilter
-
- filter(String) - Method in class org.apache.nutch.net.URLFilters
-
Run all defined filters.
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in interface org.apache.nutch.parse.HtmlParseFilter
-
Adds metadata or otherwise modifies a parse of HTML content, given
the DOM tree of a page.
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.HtmlParseFilters
-
Run all defined filters.
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.js.JSParseFilter
-
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.MetaTagsParser
-
- filter() - Method in class org.apache.nutch.parse.ParseResult
-
Remove all results where status is not successful (as determined
by ParseStatus#isSuccess()).
- filter(Text, CrawlDatum, CrawlDatum, CrawlDatum, Content, ParseData, ParseText, Collection<CrawlDatum>) - Method in interface org.apache.nutch.segment.SegmentMergeFilter
-
The filtering method which gets all information being merged for a given
key (URL).
- filter(Text, CrawlDatum, CrawlDatum, CrawlDatum, Content, ParseData, ParseText, Collection<CrawlDatum>) - Method in class org.apache.nutch.segment.SegmentMergeFilters
-
Iterates over all SegmentMergeFilter extensions; if any of them
returns false, it returns false as well.
- filter(String) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
- filter(String) - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
-
- filter(String) - Method in class org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter
-
- filter(String) - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- filter(String) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- filter(String) - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
-
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.creativecommons.nutch.CCIndexingFilter
-
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.creativecommons.nutch.CCParseFilter
-
Adds metadata or otherwise modifies a parse of an HTML document, given
the DOM tree of a page.
- filterNormalize(String, String, String, boolean, URLFilters, URLNormalizers) - Static method in class org.apache.nutch.parse.ParseOutputFormat
-
- finalize() - Method in class org.apache.nutch.plugin.Plugin
-
- finalize() - Method in class org.apache.nutch.plugin.PluginRepository
-
- finalize() - Method in class org.apache.nutch.protocol.ftp.Ftp
-
- findAuthentication(Metadata) - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
-
- findLoops(Path) - Method in class org.apache.nutch.scoring.webgraph.Loops
-
Runs the various loop jobs.
- FIXED_INTERVAL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
Used by AdaptiveFetchSchedule to maintain custom fetch interval
- FORBID_ALL_RULES - Static variable in class org.apache.nutch.protocol.RobotRulesParser
-
A BaseRobotRules object appropriate for use when the robots.txt file
is not fetched due to a 403/Forbidden response; all requests are disallowed.
- forceRefetch(Text, CrawlDatum, boolean) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method resets fetchTime, fetchInterval, modifiedTime,
retriesSinceFetch and page signature, so that it forces refetching.
- forceRefetch(Text, CrawlDatum, boolean) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method resets fetchTime, fetchInterval, modifiedTime and
page signature, so that it forces refetching.
- FORMAT - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Typically, Format may include the media-type or dimensions of the
resource.
- format - Static variable in class org.apache.nutch.net.protocols.HttpDateFormat
-
- forName(String) - Method in class org.apache.nutch.util.MimeUtil
-
A facade interface to Tika's underlying MimeTypes.forName(String)
method.
- FreeGenerator - Class in org.apache.nutch.tools
-
This tool generates fetchlists (segments to be fetched) from plain text
files containing one URL per line.
- FreeGenerator() - Constructor for class org.apache.nutch.tools.FreeGenerator
-
- FreeGenerator.FG - Class in org.apache.nutch.tools
-
- FreeGenerator.FG() - Constructor for class org.apache.nutch.tools.FreeGenerator.FG
-
- fromHexString(String) - Static method in class org.apache.nutch.util.StringUtil
-
Convert a String containing consecutive (no inside whitespace) hexadecimal
digits into a corresponding byte array.
- FSUtils - Class in org.apache.nutch.util
-
Utility methods for common filesystem operations.
- FSUtils() - Constructor for class org.apache.nutch.util.FSUtils
-
- Ftp - Class in org.apache.nutch.protocol.ftp
-
This class is a protocol plugin used for ftp: scheme.
- Ftp() - Constructor for class org.apache.nutch.protocol.ftp.Ftp
-
- FtpError - Exception in org.apache.nutch.protocol.ftp
-
Thrown for Ftp error codes.
- FtpError(int) - Constructor for exception org.apache.nutch.protocol.ftp.FtpError
-
- FtpException - Exception in org.apache.nutch.protocol.ftp
-
Superclass for important exceptions thrown during FTP talk,
that must be handled with care.
- FtpException() - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
-
- FtpException(String) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
-
- FtpException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
-
- FtpException(Throwable) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
-
- FtpExceptionBadSystResponse - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating bad reply of SYST command.
- FtpExceptionCanNotHaveDataConnection - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating failure of opening data connection.
- FtpExceptionControlClosedByForcedDataClose - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating control channel is closed by server end, due to
forced closure of data channel at client (our) end.
- FtpExceptionUnknownForcedDataClose - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating unrecognizable reply from server after
forced closure of data channel by client (our) side.
- FtpResponse - Class in org.apache.nutch.protocol.ftp
-
FtpResponse.java mimics FTP replies as HTTP responses.
- FtpResponse(URL, CrawlDatum, Ftp, Configuration) - Constructor for class org.apache.nutch.protocol.ftp.FtpResponse
-
- FtpRobotRulesParser - Class in org.apache.nutch.protocol.ftp
-
This class is used for parsing robots rules for URLs belonging to the FTP protocol.
- FtpRobotRulesParser(Configuration) - Constructor for class org.apache.nutch.protocol.ftp.FtpRobotRulesParser
-
- generate(Path, Path, int, long, long) - Method in class org.apache.nutch.crawl.Generator
-
- generate(Path, Path, int, long, long, boolean, boolean) - Method in class org.apache.nutch.crawl.Generator
-
Old signature kept for compatibility; it does not specify whether or not to
normalise, and sets the number of segments to 1.
- generate(Path, Path, int, long, long, boolean, boolean, boolean, int) - Method in class org.apache.nutch.crawl.Generator
-
Generate fetchlists in one or more segments.
- GENERATE_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
- GENERATE_MAX_PER_HOST_BY_IP - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATE_TIME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- GENERATE_UPDATE_CRAWLDB - Static variable in class org.apache.nutch.crawl.Generator
-
- generated - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- generateFileNameForKeyValue(FloatWritable, Generator.SelectorEntry, String) - Method in class org.apache.nutch.crawl.Generator.GeneratorOutputFormat
-
- generateSegmentName() - Static method in class org.apache.nutch.crawl.Generator
-
- generateSegmentName() - Static method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
Generates a random name for the segments.
- Generator - Class in org.apache.nutch.crawl
-
Generates a subset of a crawl db to fetch.
- Generator() - Constructor for class org.apache.nutch.crawl.Generator
-
- Generator(Configuration) - Constructor for class org.apache.nutch.crawl.Generator
-
- Generator.CrawlDbUpdater - Class in org.apache.nutch.crawl
-
Update the CrawlDB so that the next generate won't include the same URLs.
- Generator.CrawlDbUpdater() - Constructor for class org.apache.nutch.crawl.Generator.CrawlDbUpdater
-
- Generator.DecreasingFloatComparator - Class in org.apache.nutch.crawl
-
- Generator.DecreasingFloatComparator() - Constructor for class org.apache.nutch.crawl.Generator.DecreasingFloatComparator
-
- Generator.GeneratorOutputFormat - Class in org.apache.nutch.crawl
-
- Generator.GeneratorOutputFormat() - Constructor for class org.apache.nutch.crawl.Generator.GeneratorOutputFormat
-
- Generator.HashComparator - Class in org.apache.nutch.crawl
-
Sort fetch lists by hash of URL.
- Generator.HashComparator() - Constructor for class org.apache.nutch.crawl.Generator.HashComparator
-
- Generator.PartitionReducer - Class in org.apache.nutch.crawl
-
- Generator.PartitionReducer() - Constructor for class org.apache.nutch.crawl.Generator.PartitionReducer
-
- Generator.Selector - Class in org.apache.nutch.crawl
-
Selects entries due for fetch.
- Generator.Selector() - Constructor for class org.apache.nutch.crawl.Generator.Selector
-
- Generator.SelectorEntry - Class in org.apache.nutch.crawl
-
- Generator.SelectorEntry() - Constructor for class org.apache.nutch.crawl.Generator.SelectorEntry
-
- Generator.SelectorInverseMapper - Class in org.apache.nutch.crawl
-
- Generator.SelectorInverseMapper() - Constructor for class org.apache.nutch.crawl.Generator.SelectorInverseMapper
-
- GENERATOR_COUNT_MODE - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_COUNT_VALUE_DOMAIN - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_COUNT_VALUE_HOST - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_CUR_TIME - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_DELAY - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_FILTER - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_MAX_COUNT - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_MAX_NUM_SEGMENTS - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_MIN_INTERVAL - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_MIN_SCORE - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_NORMALISE - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_RESTRICT_STATUS - Static variable in class org.apache.nutch.crawl.Generator
-
- GENERATOR_TOP_N - Static variable in class org.apache.nutch.crawl.Generator
-
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
-
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
- generatorSortValue(Text, CrawlDatum, float) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
This method prepares a sort value for the purpose of sorting and
selecting top N scoring pages during fetchlist generation.
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.ScoringFilters
-
Calculate a sort value for Generate.
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Boilerplate
- GenericWritableConfigurable - Class in org.apache.nutch.util
-
A generic Writable wrapper that can inject Configuration to Configurables.
- GenericWritableConfigurable() - Constructor for class org.apache.nutch.util.GenericWritableConfigurable
-
- get(String, String, Configuration) - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- get(Writable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- get(String) - Method in class org.apache.nutch.metadata.Metadata
-
Get the value associated with a metadata name.
- get(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
-
- get(String) - Method in class org.apache.nutch.parse.ParseResult
-
Retrieve a single parse output.
- get(Text) - Method in class org.apache.nutch.parse.ParseResult
-
Retrieve a single parse output.
- get(Configuration) - Static method in class org.apache.nutch.plugin.PluginRepository
-
- get(FileSplit) - Static method in class org.apache.nutch.segment.SegmentPart
-
Create SegmentPart from a FileSplit.
- get(String) - Static method in class org.apache.nutch.segment.SegmentPart
-
Create SegmentPart from a full path of a location inside any segment part.
- get(Path, Text, Writer, Map<String, List<Writable>>) - Method in class org.apache.nutch.segment.SegmentReader
-
- get(String) - Method in class org.apache.nutch.util.domain.DomainSuffixes
-
Returns the DomainSuffix object for the extension; if the extension
is a top-level domain, the returned object will be an
instance of TopLevelDomain.
- get(Configuration) - Static method in class org.apache.nutch.util.ObjectCache
-
- getAccept() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getAcceptedIssuers() - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
- getAcceptLanguage() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
Value of "Accept-Language" request header sent by Nutch.
- getAll() - Method in class org.apache.nutch.collection.CollectionManager
-
Returns all collections
- getAnchor() - Method in class org.apache.nutch.crawl.Inlink
-
- getAnchor() - Method in class org.apache.nutch.parse.Outlink
-
- getAnchor() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- getAnchors() - Method in class org.apache.nutch.crawl.Inlinks
-
Return the set of anchor texts.
- getAnchors(Text) - Method in class org.apache.nutch.crawl.LinkDbReader
-
- getArgs() - Method in class org.apache.nutch.parse.ParseStatus
-
- getArgs() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- getAttribute(String) - Method in class org.apache.nutch.plugin.Extension
-
Returns an attribute value that is set up in the manifest file and is
defined by the extension point XML schema.
- getAuthentication(String, Configuration) - Static method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
This method is responsible for providing Basic authentication information.
- getBase(Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
If Node contains a BASE tag then its HREF is returned.
- getBaseHref() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getBaseUrl() - Method in class org.apache.nutch.protocol.Content
-
The base url for relative links contained in the content.
- getBasicPattern() - Static method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
Provides a pattern which can be used by an outside resource to determine if
this class can provide credentials based on simple header information.
- getBlackListString() - Method in class org.apache.nutch.collection.Subcollection
-
Returns blacklist String
- getBoost() - Method in class org.apache.nutch.util.domain.DomainSuffix
-
- getBufferSize() - Method in class org.apache.nutch.protocol.ftp.Ftp
-
- getCachedClass(PluginDescriptor, String) - Method in class org.apache.nutch.plugin.PluginRepository
-
- getClassLoader() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns a cached classloader for a plugin.
- getClazz() - Method in class org.apache.nutch.plugin.Extension
-
Returns the full class name of the extension point implementation
- getCode() - Method in interface org.apache.nutch.net.protocols.Response
-
Returns the response code.
- getCode(int) - Method in exception org.apache.nutch.protocol.file.FileError
-
- getCode() - Method in class org.apache.nutch.protocol.file.FileResponse
-
Returns the response code.
- getCode(int) - Method in exception org.apache.nutch.protocol.ftp.FtpError
-
- getCode() - Method in class org.apache.nutch.protocol.ftp.FtpResponse
-
Returns the response code.
- getCode() - Method in class org.apache.nutch.protocol.http.HttpResponse
-
- getCode() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
-
- getCode() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- getCollectionManager(Configuration) - Static method in class org.apache.nutch.collection.CollectionManager
-
- getCommand() - Method in class org.apache.nutch.util.CommandRunner
-
- getCommonsHttpSolrServer(JobConf) - Static method in class org.apache.nutch.indexwriter.solr.SolrUtils
-
- getConf() - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
-
- getConf() - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
-
- getConf() - Method in class org.apache.nutch.crawl.Signature
-
- getConf() - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.indexer.CleaningJob
-
- getConf() - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
- getConf() - Method in class org.apache.nutch.indexer.IndexingFiltersChecker
-
- getConf() - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
-
- getConf() - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
-
- getConf() - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- getConf() - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
-
- getConf() - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
-
Boilerplate
- getConf() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- getConf() - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
-
- getConf() - Method in class org.apache.nutch.microformats.reltag.RelTagParser
-
- getConf() - Method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
-
- getConf() - Method in class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
-
- getConf() - Method in class org.apache.nutch.parse.ext.ExtParser
-
- getConf() - Method in class org.apache.nutch.parse.feed.FeedParser
-
- getConf() - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
- getConf() - Method in class org.apache.nutch.parse.html.HtmlParser
-
- getConf() - Method in class org.apache.nutch.parse.js.JSParseFilter
-
- getConf() - Method in class org.apache.nutch.parse.MetaTagsParser
-
- getConf() - Method in class org.apache.nutch.parse.ParserChecker
-
- getConf() - Method in class org.apache.nutch.parse.swf.SWFParser
-
- getConf() - Method in class org.apache.nutch.parse.tika.TikaParser
-
- getConf() - Method in class org.apache.nutch.parse.zip.ZipParser
-
- getConf() - Method in class org.apache.nutch.protocol.file.File
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getConf() - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
-
- getConf() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
- getConf() - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Get the Configuration object.
- getConf() - Method in class org.apache.nutch.scoring.AbstractScoringFilter
-
- getConf() - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- getConf() - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
- getConf() - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- getConf() - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Boilerplate
- getConf() - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
- getConf() - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
-
- getConf() - Method in class org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter
-
- getConf() - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- getConf() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- getConf() - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
-
- getConf() - Method in class org.apache.nutch.util.GenericWritableConfigurable
-
- getConf() - Method in class org.creativecommons.nutch.CCIndexingFilter
-
- getConf() - Method in class org.creativecommons.nutch.CCParseFilter
-
- getContent() - Method in interface org.apache.nutch.net.protocols.Response
-
Returns the full content of the response.
- getContent() - Method in class org.apache.nutch.protocol.Content
-
The binary content retrieved.
- getContent() - Method in class org.apache.nutch.protocol.file.FileResponse
-
- getContent() - Method in class org.apache.nutch.protocol.ftp.FtpResponse
-
- getContent() - Method in class org.apache.nutch.protocol.http.HttpResponse
-
- getContent() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
-
- getContent() - Method in class org.apache.nutch.protocol.ProtocolOutput
-
- getContentMeta() - Method in class org.apache.nutch.parse.ParseData
-
The original Metadata retrieved from content
- getContentType() - Method in exception org.apache.nutch.parse.ParserNotFound
-
- getContentType() - Method in class org.apache.nutch.protocol.Content
-
The media type of the retrieved content.
- getCopyMap() - Method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- getCountryName() - Method in class org.apache.nutch.util.domain.TopLevelDomain
-
Returns the country name if the TLD is a country-code TLD.
- getCrawlDelay() - Method in interface org.apache.nutch.protocol.RobotRules
-
Get Crawl-Delay, in milliseconds.
- getCredentials() - Method in interface org.apache.nutch.protocol.httpclient.HttpAuthentication
-
Gets the credentials generated by the HttpAuthentication object.
- getCredentials() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
Gets the Basic credentials generated by this HttpBasicAuthentication object
- getCurrentNode() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Get the node currently being processed.
- getData() - Method in interface org.apache.nutch.parse.Parse
-
Other data extracted from the page.
- getData() - Method in class org.apache.nutch.parse.ParseImpl
-
- getDependencies() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of plugin ids.
- getDescriptor() - Method in class org.apache.nutch.plugin.Extension
-
Returns the plugin descriptor.
- getDescriptor() - Method in class org.apache.nutch.plugin.Plugin
-
Returns the plugin descriptor
- getDocumentMeta() - Method in class org.apache.nutch.indexer.NutchDocument
-
- getDom(InputStream) - Static method in class org.apache.nutch.util.DomUtil
-
Returns the parsed DOM tree, or null if any error occurs.
- getDomain() - Method in class org.apache.nutch.util.domain.DomainSuffix
-
- getDomainName(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the domain name of the url.
- getDomainName(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the domain name of the url.
- getDomainSuffix(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the DomainSuffix corresponding to the last public part of the hostname
- getDomainSuffix(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the DomainSuffix corresponding to the last public part of the hostname
- getElement(DocumentFragment, String) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
Finds the specified element and returns its value
- getEmptyParse(Configuration) - Method in class org.apache.nutch.parse.ParseStatus
-
A convenience method.
- getEmptyParseResult(String, Configuration) - Method in class org.apache.nutch.parse.ParseStatus
-
A convenience method.
- getExitValue() - Method in class org.apache.nutch.util.CommandRunner
-
- getExpireTime() - Method in interface org.apache.nutch.protocol.RobotRules
-
Get expire time
- getExportedLibUrls() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of exported libraries as URLs.
- getExtensionInstance() - Method in class org.apache.nutch.plugin.Extension
-
Returns an instance of the extension implementation.
- getExtensionPoint(String) - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns an extension point identified by an extension point id.
- getExtensions(String) - Method in class org.apache.nutch.parse.ParserFactory
-
Finds the best-suited parse plugin for a given contentType.
- getExtensions() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns an array of extensions that listen to this extension point.
- getExtensions() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of extensions.
- getExtenstionPoints() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of extension points.
- getFetchInterval() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getFetchSchedule(Configuration) - Static method in class org.apache.nutch.crawl.FetchScheduleFactory
-
Return the FetchSchedule implementation.
- getFetchTime() - Method in class org.apache.nutch.crawl.CrawlDatum
-
Returns either the time of the last fetch, or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time.
- getField(String) - Method in class org.apache.nutch.indexer.NutchDocument
-
- getFieldNames() - Method in class org.apache.nutch.indexer.NutchDocument
-
- getFieldValue(String) - Method in class org.apache.nutch.indexer.NutchDocument
-
- getFromUrl() - Method in class org.apache.nutch.crawl.Inlink
-
- getGeneralTags() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Returns all collected values of the general meta tags.
- getHeader(String) - Method in interface org.apache.nutch.net.protocols.Response
-
Returns the value of a named header.
- getHeader(String) - Method in class org.apache.nutch.protocol.file.FileResponse
-
Returns the value of a named header.
- getHeader(String) - Method in class org.apache.nutch.protocol.ftp.FtpResponse
-
Returns the value of a named header.
- getHeader(String) - Method in class org.apache.nutch.protocol.http.HttpResponse
-
- getHeader(String) - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
-
- getHeaders() - Method in interface org.apache.nutch.net.protocols.Response
-
Returns all the headers.
- getHeaders() - Method in class org.apache.nutch.protocol.http.HttpResponse
-
- getHeaders() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
-
- getHost(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the lowercased hostname for the url or null if the url is not well formed.
- getHostSegments(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Partitions the hostname of the url by "."
- getHostSegments(String) - Static method in class org.apache.nutch.util.URLUtil
-
Partitions the hostname of the url by "."
- getHttpEquivTags() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Returns all collected values of the "http-equiv" meta tags.
- getId() - Method in class org.apache.nutch.collection.Subcollection
-
- getId() - Method in class org.apache.nutch.plugin.Extension
-
Return the unique id of the extension.
- getId() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns the unique id of the extension point.
- getInlinks(Text) - Method in class org.apache.nutch.crawl.LinkDbReader
-
- getInlinkScore() - Method in class org.apache.nutch.scoring.webgraph.Node
-
- getInstance(Configuration) - Static method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- getInstance() - Static method in class org.apache.nutch.util.domain.DomainSuffixes
-
Singleton instance, lazy instantiation
- getKey() - Method in class org.apache.nutch.collection.Subcollection
-
- getKeyMap() - Method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- getLastModified() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- getLinks() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
-
- getLinkType() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- getLookingFor() - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- getLoopSet() - Method in class org.apache.nutch.scoring.webgraph.Loops.LoopSet
-
- getMajorCode() - Method in class org.apache.nutch.parse.ParseStatus
-
- getMaxContent() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getMessage() - Method in class org.apache.nutch.parse.ParseStatus
-
A convenience method.
- getMessage() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- getMeta(String) - Method in class org.apache.nutch.metadata.MetaWrapper
-
Get metadata.
- getMeta(String) - Method in class org.apache.nutch.parse.ParseData
-
Get a metadata single value.
- getMetaData() - Method in class org.apache.nutch.crawl.CrawlDatum
-
Returns a MapWritable if it was set or read in readFields(DataInput); returns an empty map if the CrawlDatum was freshly created (lazily instantiated).
- getMetadata() - Method in class org.apache.nutch.metadata.MetaWrapper
-
Get all metadata.
- getMetadata() - Method in class org.apache.nutch.parse.Outlink
-
- getMetadata() - Method in class org.apache.nutch.protocol.Content
-
Other protocol-specific data.
- getMetadata() - Method in class org.apache.nutch.scoring.webgraph.Node
-
- getMetaTags(HTMLMetaTags, Node, URL) - Static method in class org.apache.nutch.parse.html.HTMLMetaProcessor
-
Sets the indicators in robotsMeta to appropriate values, based on any META tags found under the given node.
- getMetaTags(HTMLMetaTags, Node, URL) - Static method in class org.apache.nutch.parse.tika.HTMLMetaProcessor
-
Sets the indicators in robotsMeta to appropriate values, based on any META tags found under the given node.
- getMetaValues(String) - Method in class org.apache.nutch.metadata.MetaWrapper
-
Get multiple metadata.
- getMimeType(String) - Method in class org.apache.nutch.util.MimeUtil
-
Facade interface to Tika's underlying MimeTypes.getMimeType(String) method.
- getMimeType(File) - Method in class org.apache.nutch.util.MimeUtil
-
Facade interface to Tika's underlying MimeTypes.getMimeType(File) method.
- getMinorCode() - Method in class org.apache.nutch.parse.ParseStatus
-
- getModifiedTime() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getName() - Method in class org.apache.nutch.collection.Subcollection
-
- getName() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns the name of the extension point.
- getName() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the name of the plugin.
- getName() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- getNoCache() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getNode() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- getNodeValue(Node) - Static method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
Returns the text value of the specified Node and child nodes
- getNoFollow() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getNoIndex() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getNormalizedName(String) - Static method in class org.apache.nutch.metadata.SpellCheckedMetadata
-
Get the normalized name of metadata attribute name.
- getNotExportedLibUrls() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of libraries as URLs that are not exported by the plugin.
- getNumInlinks() - Method in class org.apache.nutch.scoring.webgraph.Node
-
- getNumOutlinks() - Method in class org.apache.nutch.scoring.webgraph.Node
-
- getObject(String) - Method in class org.apache.nutch.util.ObjectCache
-
- getOrderedPlugins(Class<?>, String, String) - Method in class org.apache.nutch.plugin.PluginRepository
-
Get ordered list of plugins.
- getOutlinks(URL, ArrayList<Outlink>, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
- getOutlinks(String, Configuration) - Static method in class org.apache.nutch.parse.OutlinkExtractor
-
Extracts Outlink from given plain text.
- getOutlinks(String, String, Configuration) - Static method in class org.apache.nutch.parse.OutlinkExtractor
-
Extracts Outlink from given plain text and adds anchor to the extracted Outlinks
- getOutlinks() - Method in class org.apache.nutch.parse.ParseData
-
The outlinks of the page.
- getOutlinks(URL, ArrayList<Outlink>, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
-
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
- getOutlinkScore() - Method in class org.apache.nutch.scoring.webgraph.Node
-
- getOutlinkUrl() - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- getPage(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the page for the url.
- getParse(Content) - Method in class org.apache.nutch.parse.ext.ExtParser
-
- getParse(Content) - Method in class org.apache.nutch.parse.feed.FeedParser
-
Parses the given feed and extracts and parses all linked items within the feed, using the underlying ROME feed parsing library.
- getParse(Content) - Method in class org.apache.nutch.parse.html.HtmlParser
-
- getParse(Content) - Method in class org.apache.nutch.parse.js.JSParseFilter
-
- getParse(Content) - Method in interface org.apache.nutch.parse.Parser
-
This method parses the given content and returns a map of <key, parse> pairs.
- getParse(Content) - Method in class org.apache.nutch.parse.swf.SWFParser
-
- getParse(Content) - Method in class org.apache.nutch.parse.tika.TikaParser
-
- getParse(Content) - Method in class org.apache.nutch.parse.zip.ZipParser
-
- getParseMeta() - Method in class org.apache.nutch.parse.ParseData
-
Other content properties.
- getParserById(String) - Method in class org.apache.nutch.parse.ParserFactory
-
Function returns a Parser instance with the specified extId, representing its extension ID.
- getParsers(String, String) - Method in class org.apache.nutch.parse.ParserFactory
-
Function returns an array of Parsers for a given content type.
- getPartition(FloatWritable, Writable, int) - Method in class org.apache.nutch.crawl.Generator.Selector
-
Partition by host / domain or IP.
- getPartition(Text, Writable, int) - Method in class org.apache.nutch.crawl.URLPartitioner
-
Hash by domain name.
- getPassAllFilter() - Static method in class org.apache.nutch.util.HadoopFSUtil
-
Returns PathFilter that passes all paths through.
- getPassDirectoriesFilter(FileSystem) - Static method in class org.apache.nutch.util.HadoopFSUtil
-
Returns PathFilter that passes directories through.
- getPaths(FileStatus[]) - Static method in class org.apache.nutch.util.HadoopFSUtil
-
Turns an array of FileStatus into an array of Paths.
- getPluginClass() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the fully qualified name of the class which implements the abstract Plugin class.
- getPluginDescriptor(String) - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns the descriptor of one plugin identified by a plugin id.
- getPluginDescriptors() - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns all registered plugin descriptors.
- getPluginFolder(String) - Method in class org.apache.nutch.plugin.PluginManifestParser
-
Return the named plugin folder.
- getPluginId() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the unique identifier of the plug-in or null.
- getPluginInstance(PluginDescriptor) - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns an instance of a plugin.
- getPluginPath() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the directory path of the plugin.
- getPos() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Returns the current position in the file.
- getProgress() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Returns the percentage of progress in processing the file.
- getProtocol(String) - Method in class org.apache.nutch.protocol.ProtocolFactory
-
Returns the appropriate Protocol implementation for a url.
- getProtocol(String) - Static method in class org.apache.nutch.util.URLUtil
-
- getProtocol(URL) - Static method in class org.apache.nutch.util.URLUtil
-
- getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.file.File
-
- getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
- getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getProtocolOutput(Text, CrawlDatum) - Method in interface org.apache.nutch.protocol.Protocol
-
Returns the Content for a fetchlist entry.
- getProviderName() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
- getProxyHost() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getProxyPort() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getRealm() - Method in interface org.apache.nutch.protocol.httpclient.HttpAuthentication
-
Gets the realm used by the HttpAuthentication object during creation.
- getRealm() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
Gets the realm attribute of the HttpBasicAuthentication object.
- getRecordReader(InputSplit, JobConf, Reporter) - Method in class org.apache.nutch.segment.ContentAsTextInputFormat
-
- getRecordReader(InputSplit, JobConf, Reporter) - Method in class org.apache.nutch.segment.SegmentMerger.ObjectInputFormat
-
- getRecordReader(InputSplit, JobConf, Reporter) - Method in class org.apache.nutch.tools.arc.ArcInputFormat
-
Returns the RecordReader for reading the arc file.
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat
-
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.fetcher.FetcherOutputFormat
-
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.indexer.IndexerOutputFormat
-
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.parse.ParseOutputFormat
-
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentOutputFormat
-
- getRecordWriter(FileSystem, JobConf, String, Progressable) - Method in class org.apache.nutch.segment.SegmentReader.TextOutputFormat
-
- getRefresh() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getRefreshHref() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getRefreshTime() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
A convenience method.
- getResourceString(String, Locale) - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an I18N'd resource string.
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.http.Http
-
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.httpclient.Http
-
Fetches the url with a configured HTTP client and gets the response.
- getRetriesSinceFetch() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getRobotRules(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.file.File
-
No robots parsing is done for file protocol.
- getRobotRules(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Get the robots rules for a given url
- getRobotRules(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getRobotRules(Text, CrawlDatum) - Method in interface org.apache.nutch.protocol.Protocol
-
Retrieve robot rules applicable for this url.
- getRobotRulesSet(Protocol, URL) - Method in class org.apache.nutch.protocol.ftp.FtpRobotRulesParser
-
For hosts whose robots rules are not yet cached, sends an FTP request to the host corresponding to the URL passed, gets the robots file, parses the rules, and caches the rules object to avoid re-work in future.
- getRobotRulesSet(Protocol, URL) - Method in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
-
For hosts whose robots rules are not yet cached, sends an HTTP request to the host corresponding to the URL passed, gets the robots file, parses the rules, and caches the rules object to avoid re-work in future.
- getRobotRulesSet(Protocol, Text) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
- getRobotRulesSet(Protocol, URL) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
- getRootNode() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Get the root node of the DOM being created.
- getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Returns the name of the file of rules to use for a particular implementation.
- getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
-
Rules specified as a config property will override rules specified as a config file.
- getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
-
Rules specified as a config property will override rules specified as a config file.
- getRuns() - Method in class org.apache.nutch.tools.Benchmark.BenchmarkResults
-
- getSchema() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns a path to the XML schema of an extension point.
- getScopedRules() - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
- getScore() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getScore() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- getSignature() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getSignature(Configuration) - Static method in class org.apache.nutch.crawl.SignatureFactory
-
Return the default Signature implementation.
- getSplits(JobConf, int) - Method in class org.apache.nutch.fetcher.Fetcher.InputFormat
-
Don't split inputs, to keep things polite.
- getSplits(JobConf, int) - Method in class org.apache.nutch.fetcher.OldFetcher.InputFormat
-
Don't split inputs, to keep things polite.
- getStages() - Method in class org.apache.nutch.tools.Benchmark.BenchmarkResults
-
- getStats(Path, SegmentReader.SegmentReaderStats) - Method in class org.apache.nutch.segment.SegmentReader
-
- getStatus() - Method in class org.apache.nutch.crawl.CrawlDatum
-
- getStatus() - Method in class org.apache.nutch.parse.ParseData
-
The status of parsing the page.
- getStatus() - Method in class org.apache.nutch.protocol.ProtocolOutput
-
- getStatus() - Method in class org.apache.nutch.util.domain.DomainSuffix
-
- getStatusName(byte) - Static method in class org.apache.nutch.crawl.CrawlDatum
-
- getSubColection(String) - Method in class org.apache.nutch.collection.CollectionManager
-
Returns named subcollection
- getSubCollections(String) - Method in class org.apache.nutch.collection.CollectionManager
-
Returns the names of the collections the url is part of
- getSystemName() - Method in class org.apache.nutch.protocol.ftp.Client
-
Fetches the system type name from the server and returns the string.
- getTargetPoint() - Method in class org.apache.nutch.plugin.Extension
-
Returns the id of the extension point that is implemented by this extension.
- getText(StringBuffer, Node, boolean) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
This method takes a StringBuffer and a DOM Node, and will append all the content text found beneath the DOM node to the StringBuffer.
- getText(StringBuffer, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
- getText() - Method in interface org.apache.nutch.parse.Parse
-
The textual content of the page.
- getText() - Method in class org.apache.nutch.parse.ParseImpl
-
- getText() - Method in class org.apache.nutch.parse.ParseText
-
- getText(StringBuffer, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
-
- getThrownError() - Method in class org.apache.nutch.util.CommandRunner
-
- getTimeout() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getTimeout() - Method in class org.apache.nutch.util.CommandRunner
-
- getTimestamp() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- getTitle(StringBuffer, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.
- getTitle() - Method in class org.apache.nutch.parse.ParseData
-
The title of the page.
- getTitle(StringBuffer, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
-
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.
- getTopLevelDomainName(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the top level domain name of the url.
- getTopLevelDomainName(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the top level domain name of the url.
- getToUrl() - Method in class org.apache.nutch.parse.Outlink
-
- getType() - Method in class org.apache.nutch.util.domain.TopLevelDomain
-
- getTypes() - Method in class org.apache.nutch.crawl.NutchWritable
-
- getUniqueKey() - Method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- getUrl() - Method in interface org.apache.nutch.net.protocols.Response
-
Returns the URL used to retrieve this response.
- getUrl() - Method in exception org.apache.nutch.parse.ParserNotFound
-
- getUrl() - Method in class org.apache.nutch.protocol.Content
-
The url fetched.
- getUrl() - Method in class org.apache.nutch.protocol.http.HttpResponse
-
- getUrl() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
-
- getUrl() - Method in exception org.apache.nutch.protocol.ProtocolNotFound
-
- getUrl() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- getUrl() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- getUseHttp11() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getUserAgent() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- getUUID(Configuration) - Static method in class org.apache.nutch.util.NutchConfiguration
-
Retrieve a Nutch UUID of this configuration object, or null if the configuration was created elsewhere.
- getValues() - Method in class org.apache.nutch.indexer.NutchField
-
- getValues(String) - Method in class org.apache.nutch.metadata.Metadata
-
Get the values associated with a metadata name.
- getValues(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
-
- getVersion() - Method in class org.apache.nutch.parse.ParseData
-
- getVersion() - Method in class org.apache.nutch.parse.ParseStatus
-
- getVersion() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
- getWaitForExit() - Method in class org.apache.nutch.util.CommandRunner
-
- getWeight() - Method in class org.apache.nutch.indexer.NutchDocument
-
- getWeight() - Method in class org.apache.nutch.indexer.NutchField
-
- getWhiteList() - Method in class org.apache.nutch.collection.Subcollection
-
Returns whitelist
- getWhiteListString() - Method in class org.apache.nutch.collection.Subcollection
-
Returns whitelist String
- getWriter() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Return null since there is no Writer for this class.
- GONE - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Resource is gone.
- guessEncoding(Content, String) - Method in class org.apache.nutch.util.EncodingDetector
-
Guess the encoding with the previously specified list of clues.
- GZIPUtils - Class in org.apache.nutch.util
-
A collection of utility methods for working on GZIPed data.
- GZIPUtils() - Constructor for class org.apache.nutch.util.GZIPUtils
-
- IDENTIFIER - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.
- ignorableWhitespace(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of ignorable whitespace in element content.
- IGNORE_INTERNAL_LINKS - Static variable in class org.apache.nutch.crawl.LinkDb
-
- in - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- INC_RATE - Variable in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- index(Path, Path, List<Path>, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
-
- index(Path, Path, List<Path>, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
-
- index(Path, Path, List<Path>, boolean, boolean, String) - Method in class org.apache.nutch.indexer.IndexingJob
-
- index(Path, Path, List<Path>, boolean, boolean, String, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
-
- INDEXER_DELETE - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
-
- INDEXER_DELETE_ROBOTS_NOINDEX - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
-
- INDEXER_PARAMS - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
-
- INDEXER_SKIP_NOTMODIFIED - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
-
- IndexerMapReduce - Class in org.apache.nutch.indexer
-
- IndexerMapReduce() - Constructor for class org.apache.nutch.indexer.IndexerMapReduce
-
- IndexerOutputFormat - Class in org.apache.nutch.indexer
-
- IndexerOutputFormat() - Constructor for class org.apache.nutch.indexer.IndexerOutputFormat
-
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
-
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Dampen the boost value by scorePower.
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
This method calculates a Lucene document boost.
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.ScoringFilters
-
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Boilerplate
- IndexingException - Exception in org.apache.nutch.indexer
-
- IndexingException() - Constructor for exception org.apache.nutch.indexer.IndexingException
-
- IndexingException(String) - Constructor for exception org.apache.nutch.indexer.IndexingException
-
- IndexingException(String, Throwable) - Constructor for exception org.apache.nutch.indexer.IndexingException
-
- IndexingException(Throwable) - Constructor for exception org.apache.nutch.indexer.IndexingException
-
- IndexingFilter - Interface in org.apache.nutch.indexer
-
Extension point for indexing.
- INDEXINGFILTER_ORDER - Static variable in class org.apache.nutch.indexer.IndexingFilters
-
- IndexingFilters - Class in org.apache.nutch.indexer
-
- IndexingFilters(Configuration) - Constructor for class org.apache.nutch.indexer.IndexingFilters
-
- IndexingFiltersChecker - Class in org.apache.nutch.indexer
-
Reads and parses a URL and runs the indexers on it.
- IndexingFiltersChecker() - Constructor for class org.apache.nutch.indexer.IndexingFiltersChecker
-
- IndexingJob - Class in org.apache.nutch.indexer
-
Generic indexer which relies on the plugins implementing IndexWriter
- IndexingJob() - Constructor for class org.apache.nutch.indexer.IndexingJob
-
- IndexingJob(Configuration) - Constructor for class org.apache.nutch.indexer.IndexingJob
-
- IndexWriter - Interface in org.apache.nutch.indexer
-
- IndexWriters - Class in org.apache.nutch.indexer
-
- IndexWriters(Configuration) - Constructor for class org.apache.nutch.indexer.IndexWriters
-
- inflate(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns an inflated copy of the input array.
- inflateBestEffort(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns an inflated copy of the input array.
- inflateBestEffort(byte[], int) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns an inflated copy of the input array, truncated to sizeLimit bytes, if necessary.
- init() - Method in class org.apache.nutch.collection.CollectionManager
-
- init(Path) - Method in class org.apache.nutch.crawl.LinkDbReader
-
- init(FilterConfig) - Method in class org.apache.nutch.tools.proxy.LogDebugHandler
-
- initialize(Element) - Method in class org.apache.nutch.collection.Subcollection
-
Initializes the Subcollection from a DOM element.
- initializeSchedule(Text, CrawlDatum) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
Initialize fetch schedule related data.
- initializeSchedule(Text, CrawlDatum) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
Initialize fetch schedule related data.
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
-
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Set to 0.0f (unknown value) - inlink contributions will bring it to
a correct level.
- initialScore(Text, CrawlDatum) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Set an initial score for newly discovered pages.
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.ScoringFilters
-
Calculate a new initial score, used when adding newly discovered pages.
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Boilerplate
- initMRJob(Path, Path, Collection<Path>, JobConf) - Static method in class org.apache.nutch.indexer.IndexerMapReduce
-
- inject(Path, Path) - Method in class org.apache.nutch.crawl.Injector
-
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
-
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
- injectedScore(Text, CrawlDatum) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Set an initial score for newly injected pages.
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.ScoringFilters
-
Calculate a new initial score, used when injecting new pages.
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Boilerplate
- Injector - Class in org.apache.nutch.crawl
-
This class takes a flat file of URLs and adds them to the list of pages to be
crawled.
- Injector() - Constructor for class org.apache.nutch.crawl.Injector
-
- Injector(Configuration) - Constructor for class org.apache.nutch.crawl.Injector
-
- Injector.InjectMapper - Class in org.apache.nutch.crawl
-
Normalize and filter injected urls.
- Injector.InjectMapper() - Constructor for class org.apache.nutch.crawl.Injector.InjectMapper
-
- Injector.InjectReducer - Class in org.apache.nutch.crawl
-
Combine multiple new entries for a url.
- Injector.InjectReducer() - Constructor for class org.apache.nutch.crawl.Injector.InjectReducer
-
- Inlink - Class in org.apache.nutch.crawl
-
- Inlink() - Constructor for class org.apache.nutch.crawl.Inlink
-
- Inlink(String, String) - Constructor for class org.apache.nutch.crawl.Inlink
-
- INLINK - Static variable in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- INLINK_DIR - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
-
- Inlinks - Class in org.apache.nutch.crawl
-
- Inlinks() - Constructor for class org.apache.nutch.crawl.Inlinks
-
- install(JobConf, Path) - Static method in class org.apache.nutch.crawl.CrawlDb
-
- install(JobConf, Path) - Static method in class org.apache.nutch.crawl.LinkDb
-
- invert(Path, Path, boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDb
-
- invert(Path, Path[], boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDb
-
- isAllowed(URL) - Method in interface org.apache.nutch.protocol.RobotRules
-
Returns false if the robots.txt file prohibits us from accessing the given url, or true otherwise.
- isCanonical() - Method in interface org.apache.nutch.parse.Parse
-
Indicates if the parse is coming from a url or a sub-url
- isCanonical() - Method in class org.apache.nutch.parse.ParseImpl
-
- isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
- isDomainSuffix(String) - Method in class org.apache.nutch.util.domain.DomainSuffixes
-
Returns whether the extension is a registered domain entry.
- isEmpty() - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- isEmpty() - Method in class org.apache.nutch.parse.ParseResult
-
Checks whether the result is empty.
- isEmpty(String) - Static method in class org.apache.nutch.util.StringUtil
-
Checks if a string is empty (i.e. is null or has zero length).
- isFound() - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- isIgnoreCase() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- isMagic(byte[]) - Static method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Returns true if the byte array passed matches the gzip header magic
number.
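For reference, a gzip stream starts with the two identification bytes 0x1f and 0x8b (RFC 1952). The following is a minimal sketch of such a header check; the class and method names are hypothetical, and the actual ArcRecordReader may compare more bytes than shown here.

```java
final class GzipMagicSketch {

  // Hypothetical check (not the Nutch code): true if the array begins with
  // the two gzip identification bytes ID1 = 0x1f and ID2 = 0x8b.
  static boolean isGzipMagic(byte[] input) {
    return input != null
        && input.length >= 2
        && (input[0] & 0xff) == 0x1f   // mask to compare as unsigned values
        && (input[1] & 0xff) == 0x8b;
  }

  public static void main(String[] args) {
    byte[] gzipStart = {0x1f, (byte) 0x8b, 0x08};
    System.out.println(isGzipMagic(gzipStart));          // prints "true"
    System.out.println(isGzipMagic("plain".getBytes())); // prints "false"
  }
}
```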
- isModeAccept() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- isMultiValued(String) - Method in class org.apache.nutch.metadata.Metadata
-
Returns true if named value is multivalued.
- isParsing(Configuration) - Static method in class org.apache.nutch.fetcher.Fetcher
-
- isParsing(Configuration) - Static method in class org.apache.nutch.fetcher.OldFetcher
-
- isPermanentFailure() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- isRemoteVerificationEnabled() - Method in class org.apache.nutch.protocol.ftp.Client
-
Return whether or not verification of the remote host participating
in data connections is enabled.
- isSameDomainName(URL, URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns whether the given urls have the same domain name.
- isSameDomainName(String, String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns whether the given urls have the same domain name.
- isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
- isStoringContent(Configuration) - Static method in class org.apache.nutch.fetcher.Fetcher
-
- isStoringContent(Configuration) - Static method in class org.apache.nutch.fetcher.OldFetcher
-
- isSuccess() - Method in class org.apache.nutch.parse.ParseResult
-
A convenience method which returns true only if all parses are successful.
- isSuccess() - Method in class org.apache.nutch.parse.ParseStatus
-
A convenience method.
- isSuccess() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- isTransientFailure() - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- isTruncated(Content) - Static method in class org.apache.nutch.parse.ParseSegment
-
Checks if the page's content is truncated.
- isWhiteSpace(char) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Returns whether the specified ch conforms to the XML 1.0 definition
of whitespace.
- isWhiteSpace(char[], int, int) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Tell if the string is whitespace.
- isWhiteSpace(StringBuffer) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Tell if the string is whitespace.
- isWhiteSpace(String) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Tell if the string is whitespace.
- iterator() - Method in class org.apache.nutch.crawl.Inlinks
-
- iterator() - Method in class org.apache.nutch.indexer.NutchDocument
-
Iterate over all fields.
- iterator() - Method in class org.apache.nutch.parse.ParseResult
-
Iterate over all entries in the <url, Parse> map.
- LANGUAGE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A language of the intellectual content of the resource.
- LanguageIndexingFilter - Class in org.apache.nutch.analysis.lang
-
- LanguageIndexingFilter() - Constructor for class org.apache.nutch.analysis.lang.LanguageIndexingFilter
-
Constructs a new Language Indexing Filter.
- LAST_MODIFIED - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- leftPad(String, int) - Static method in class org.apache.nutch.util.StringUtil
-
Returns a copy of s padded with leading spaces so that its length is length.
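A minimal sketch of the padding behaviour described above. The class name is hypothetical, and returning strings that are already long enough unchanged is an assumption, not something this index states.

```java
final class LeftPadSketch {

  // Hypothetical re-implementation for illustration: prepends spaces until
  // the result is at least `length` characters long.
  static String leftPad(String s, int length) {
    StringBuilder sb = new StringBuilder();
    for (int i = s.length(); i < length; i++) {
      sb.append(' ');
    }
    return sb.append(s).toString();
  }

  public static void main(String[] args) {
    System.out.println("[" + leftPad("ab", 5) + "]"); // prints "[   ab]"
  }
}
```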
- LICENSE_LOCATION - Static variable in interface org.apache.nutch.metadata.CreativeCommons
-
- LICENSE_URL - Static variable in interface org.apache.nutch.metadata.CreativeCommons
-
- LinkAnalysisScoringFilter - Class in org.apache.nutch.scoring.link
-
- LinkAnalysisScoringFilter() - Constructor for class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- LinkDatum - Class in org.apache.nutch.scoring.webgraph
-
A class for holding link information including the url, anchor text, a score,
the timestamp of the link and a link type.
- LinkDatum() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
-
Default constructor, no url, timestamp, score, or link type.
- LinkDatum(String) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
-
Creates a LinkDatum with a given url.
- LinkDatum(String, String) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
-
Creates a LinkDatum with a url and an anchor text.
- LinkDatum(String, String, long) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
-
- LinkDb - Class in org.apache.nutch.crawl
-
Maintains an inverted link map, listing incoming links for each url.
- LinkDb() - Constructor for class org.apache.nutch.crawl.LinkDb
-
- LinkDb(Configuration) - Constructor for class org.apache.nutch.crawl.LinkDb
-
- LinkDbFilter - Class in org.apache.nutch.crawl
-
This class provides a way to separate the URL normalization
and filtering steps from the rest of LinkDb manipulation code.
- LinkDbFilter() - Constructor for class org.apache.nutch.crawl.LinkDbFilter
-
- LinkDbMerger - Class in org.apache.nutch.crawl
-
This tool merges several LinkDb-s into one, optionally filtering
URLs through the current URLFilters, to skip prohibited URLs and
links.
- LinkDbMerger() - Constructor for class org.apache.nutch.crawl.LinkDbMerger
-
- LinkDbMerger(Configuration) - Constructor for class org.apache.nutch.crawl.LinkDbMerger
-
- LinkDbReader - Class in org.apache.nutch.crawl
-
- LinkDbReader() - Constructor for class org.apache.nutch.crawl.LinkDbReader
-
- LinkDbReader(Configuration, Path) - Constructor for class org.apache.nutch.crawl.LinkDbReader
-
- LinkDumper - Class in org.apache.nutch.scoring.webgraph
-
The LinkDumper tool creates a database of node to inlink information that can
be read using the nested Reader class.
- LinkDumper() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper
-
- LinkDumper.Inverter - Class in org.apache.nutch.scoring.webgraph
-
Inverts outlinks from the WebGraph to inlinks and attaches node
information.
- LinkDumper.Inverter() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
-
- LinkDumper.LinkNode - Class in org.apache.nutch.scoring.webgraph
-
Bean class which holds url to node information.
- LinkDumper.LinkNode() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- LinkDumper.LinkNode(String, Node) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- LinkDumper.LinkNodes - Class in org.apache.nutch.scoring.webgraph
-
Writable class which holds an array of LinkNode objects.
- LinkDumper.LinkNodes() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
-
- LinkDumper.LinkNodes(LinkDumper.LinkNode[]) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
-
- LinkDumper.Merger - Class in org.apache.nutch.scoring.webgraph
-
Merges LinkNode objects into a single array value per url.
- LinkDumper.Merger() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
-
- LinkDumper.Reader - Class in org.apache.nutch.scoring.webgraph
-
Reader class which will print out the url and all of its inlinks to system
out.
- LinkDumper.Reader() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Reader
-
- LinkRank - Class in org.apache.nutch.scoring.webgraph
-
- LinkRank() - Constructor for class org.apache.nutch.scoring.webgraph.LinkRank
-
Default constructor.
- LinkRank(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.LinkRank
-
Configurable constructor.
- list(List<Path>, Writer) - Method in class org.apache.nutch.segment.SegmentReader
-
- LOCATION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
-
- LOCK_NAME - Static variable in class org.apache.nutch.crawl.CrawlDb
-
- LOCK_NAME - Static variable in class org.apache.nutch.crawl.LinkDb
-
- LOCK_NAME - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
-
- LockUtil - Class in org.apache.nutch.util
-
Utility methods for handling application-level locking.
- LockUtil() - Constructor for class org.apache.nutch.util.LockUtil
-
- LOG - Static variable in class org.apache.nutch.analysis.lang.HTMLLanguageParser
-
- LOG - Static variable in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- LOG - Static variable in class org.apache.nutch.crawl.CrawlDb
-
- LOG - Static variable in class org.apache.nutch.crawl.CrawlDbFilter
-
- LOG - Static variable in class org.apache.nutch.crawl.CrawlDbReader
-
- LOG - Static variable in class org.apache.nutch.crawl.CrawlDbReducer
-
- LOG - Static variable in class org.apache.nutch.crawl.DeduplicationJob
-
- LOG - Static variable in class org.apache.nutch.crawl.FetchScheduleFactory
-
- LOG - Static variable in class org.apache.nutch.crawl.Generator
-
- LOG - Static variable in class org.apache.nutch.crawl.Injector
-
- LOG - Static variable in class org.apache.nutch.crawl.LinkDb
-
- LOG - Static variable in class org.apache.nutch.crawl.LinkDbFilter
-
- LOG - Static variable in class org.apache.nutch.crawl.LinkDbReader
-
- LOG - Static variable in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- LOG - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- LOG - Static variable in class org.apache.nutch.fetcher.Fetcher
-
- LOG - Static variable in class org.apache.nutch.fetcher.OldFetcher
-
- LOG - Static variable in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
-
- LOG - Static variable in class org.apache.nutch.indexer.basic.BasicIndexingFilter
-
- LOG - Static variable in class org.apache.nutch.indexer.CleaningJob
-
- LOG - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
-
- LOG - Static variable in class org.apache.nutch.indexer.IndexingFilters
-
- LOG - Static variable in class org.apache.nutch.indexer.IndexingFiltersChecker
-
- LOG - Static variable in class org.apache.nutch.indexer.IndexingJob
-
- LOG - Static variable in class org.apache.nutch.indexer.IndexWriters
-
- LOG - Static variable in class org.apache.nutch.indexer.more.MoreIndexingFilter
-
- LOG - Static variable in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
Logger
- LOG - Static variable in class org.apache.nutch.indexer.tld.TLDIndexingFilter
-
- LOG - Static variable in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- LOG - Static variable in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- LOG - Static variable in class org.apache.nutch.indexwriter.solr.SolrUtils
-
- LOG - Static variable in class org.apache.nutch.microformats.reltag.RelTagParser
-
- LOG - Static variable in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
-
- LOG - Static variable in class org.apache.nutch.net.URLNormalizers
-
- LOG - Static variable in class org.apache.nutch.parse.ext.ExtParser
-
- LOG - Static variable in class org.apache.nutch.parse.feed.FeedParser
-
- LOG - Static variable in class org.apache.nutch.parse.html.HtmlParser
-
- LOG - Static variable in class org.apache.nutch.parse.js.JSParseFilter
-
- LOG - Static variable in class org.apache.nutch.parse.ParserChecker
-
- LOG - Static variable in class org.apache.nutch.parse.ParseResult
-
- LOG - Static variable in class org.apache.nutch.parse.ParserFactory
-
- LOG - Static variable in class org.apache.nutch.parse.ParseSegment
-
- LOG - Static variable in class org.apache.nutch.parse.ParseUtil
-
- LOG - Static variable in class org.apache.nutch.parse.swf.SWFParser
-
- LOG - Static variable in class org.apache.nutch.parse.tika.TikaParser
-
- LOG - Static variable in class org.apache.nutch.parse.zip.ZipTextExtractor
-
- LOG - Static variable in class org.apache.nutch.plugin.PluginDescriptor
-
- LOG - Static variable in class org.apache.nutch.plugin.PluginManifestParser
-
- LOG - Static variable in class org.apache.nutch.plugin.PluginRepository
-
- LOG - Static variable in class org.apache.nutch.protocol.file.File
-
- LOG - Static variable in class org.apache.nutch.protocol.ftp.Ftp
-
- LOG - Static variable in class org.apache.nutch.protocol.ftp.FtpRobotRulesParser
-
- LOG - Static variable in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
-
- LOG - Static variable in class org.apache.nutch.protocol.http.Http
-
- LOG - Static variable in class org.apache.nutch.protocol.httpclient.Http
-
- LOG - Static variable in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
-
- LOG - Static variable in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
- LOG - Static variable in class org.apache.nutch.protocol.ProtocolFactory
-
- LOG - Static variable in class org.apache.nutch.protocol.RobotRulesParser
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.LinkDumper
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.LinkRank
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.Loops
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.NodeDumper
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
- LOG - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
-
- LOG - Static variable in class org.apache.nutch.segment.SegmentReader
-
- LOG - Static variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- LOG - Static variable in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
- LOG - Static variable in class org.apache.nutch.tools.DmozParser
-
- LOG - Static variable in class org.apache.nutch.tools.ResolveUrls
-
- LOG - Static variable in class org.apache.nutch.util.EncodingDetector
-
- LOG - Static variable in class org.creativecommons.nutch.CCIndexingFilter
-
- LOG - Static variable in class org.creativecommons.nutch.CCParseFilter
-
- logConf() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- LogDebugHandler - Class in org.apache.nutch.tools.proxy
-
- LogDebugHandler() - Constructor for class org.apache.nutch.tools.proxy.LogDebugHandler
-
- login(String, String) - Method in class org.apache.nutch.protocol.ftp.Client
-
Login to the FTP server using the provided username and password.
- logout() - Method in class org.apache.nutch.protocol.ftp.Client
-
Logout of the FTP server by sending the QUIT command.
- longestMatch(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
-
Returns the longest prefix of input that is matched,
or null if no match exists.
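The longest-prefix lookup described above can be sketched with a small character trie. This is an illustrative stand-in, not the PrefixStringMatcher source: the class LongestPrefixSketch, its constructor, and the node layout are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class LongestPrefixSketch {

  // One trie node per character; `terminal` marks the end of a pattern.
  private static final class Node {
    final Map<Character, Node> children = new HashMap<>();
    boolean terminal;
  }

  private final Node root = new Node();

  // Hypothetical constructor: inserts every pattern into the trie.
  LongestPrefixSketch(String... patterns) {
    for (String p : patterns) {
      Node n = root;
      for (char c : p.toCharArray()) {
        n = n.children.computeIfAbsent(c, k -> new Node());
      }
      n.terminal = true;
    }
  }

  // Walks the input once and remembers the last position at which a
  // complete pattern ended; returns null if no pattern is a prefix.
  String longestMatch(String input) {
    Node n = root;
    int best = root.terminal ? 0 : -1;
    for (int i = 0; i < input.length(); i++) {
      n = n.children.get(input.charAt(i));
      if (n == null) {
        break;
      }
      if (n.terminal) {
        best = i + 1;
      }
    }
    return best < 0 ? null : input.substring(0, best);
  }

  public static void main(String[] args) {
    LongestPrefixSketch m = new LongestPrefixSketch("ab", "abcd", "x");
    System.out.println(m.longestMatch("abcde")); // prints "abcd"
    System.out.println(m.longestMatch("zzz"));   // prints "null"
  }
}
```

A suffix matcher can reuse the same structure by inserting and looking up reversed strings.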
- longestMatch(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
-
Returns the longest suffix of input that is matched,
or null if no match exists.
- longestMatch(String) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Returns the longest substring of input that is
matched by a pattern in the trie, or null if no match
exists.
- LoopReader - Class in org.apache.nutch.scoring.webgraph
-
The LoopReader tool prints the loopset information for a single url.
- LoopReader() - Constructor for class org.apache.nutch.scoring.webgraph.LoopReader
-
- LoopReader(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.LoopReader
-
- Loops - Class in org.apache.nutch.scoring.webgraph
-
The Loops job identifies link cycles inside the web graph.
- Loops() - Constructor for class org.apache.nutch.scoring.webgraph.Loops
-
- Loops.Finalizer - Class in org.apache.nutch.scoring.webgraph
-
Finishes the Loops job by aggregating and collecting any found routes.
- Loops.Finalizer() - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
Default constructor.
- Loops.Finalizer(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
Configurable constructor.
- Loops.Initializer - Class in org.apache.nutch.scoring.webgraph
-
Initializes the Loop routes.
- Loops.Initializer() - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
Default constructor.
- Loops.Initializer(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
Configurable constructor.
- Loops.Looper - Class in org.apache.nutch.scoring.webgraph
-
Follows a route path looking for the start url of the route.
- Loops.Looper() - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Looper
-
Default constructor.
- Loops.Looper(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Looper
-
Configurable constructor.
- Loops.LoopSet - Class in org.apache.nutch.scoring.webgraph
-
A set of loops.
- Loops.LoopSet() - Constructor for class org.apache.nutch.scoring.webgraph.Loops.LoopSet
-
- Loops.Route - Class in org.apache.nutch.scoring.webgraph
-
A link path or route looking to identify a link cycle.
- Loops.Route() - Constructor for class org.apache.nutch.scoring.webgraph.Loops.Route
-
- LOOPS_DIR - Static variable in class org.apache.nutch.scoring.webgraph.Loops
-
- m_currentNode - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Current node
- m_doc - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Root document
- m_docFrag - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
First node of document fragment or null if not a DocumentFragment
- m_elemStack - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Vector of element nodes
- m_inCData - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Flag indicating that we are processing a CData section
- main(String[]) - Static method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDb
-
- main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDbMerger
-
- main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDbReader
-
- main(String[]) - Static method in class org.apache.nutch.crawl.DeduplicationJob
-
- main(String[]) - Static method in class org.apache.nutch.crawl.Generator
-
Generate a fetchlist from the crawldb.
- main(String[]) - Static method in class org.apache.nutch.crawl.Injector
-
- main(String[]) - Static method in class org.apache.nutch.crawl.LinkDb
-
- main(String[]) - Static method in class org.apache.nutch.crawl.LinkDbMerger
-
- main(String[]) - Static method in class org.apache.nutch.crawl.LinkDbReader
-
- main(String[]) - Static method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- main(String[]) - Static method in class org.apache.nutch.crawl.TextProfileSignature
-
- main(String[]) - Static method in class org.apache.nutch.fetcher.Fetcher
-
Run the fetcher.
- main(String[]) - Static method in class org.apache.nutch.fetcher.OldFetcher
-
Run the fetcher.
- main(String[]) - Static method in class org.apache.nutch.indexer.CleaningJob
-
- main(String[]) - Static method in class org.apache.nutch.indexer.IndexingFiltersChecker
-
- main(String[]) - Static method in class org.apache.nutch.indexer.IndexingJob
-
- main(String[]) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
-
- main(String[]) - Static method in class org.apache.nutch.net.URLFilterChecker
-
- main(String[]) - Static method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
Spits out patterns and substitutions that are in the configuration file.
- main(String[]) - Static method in class org.apache.nutch.net.URLNormalizerChecker
-
- main(String[]) - Static method in class org.apache.nutch.parse.feed.FeedParser
-
Runs a command line version of this Parser.
- main(String[]) - Static method in class org.apache.nutch.parse.html.HtmlParser
-
- main(String[]) - Static method in class org.apache.nutch.parse.js.JSParseFilter
-
- main(String[]) - Static method in class org.apache.nutch.parse.ParseData
-
- main(String[]) - Static method in class org.apache.nutch.parse.ParserChecker
-
- main(String[]) - Static method in class org.apache.nutch.parse.ParseSegment
-
- main(String[]) - Static method in class org.apache.nutch.parse.ParseText
-
- main(String[]) - Static method in class org.apache.nutch.parse.swf.SWFParser
-
Arguments are: 0.
- main(String[]) - Static method in class org.apache.nutch.plugin.PluginRepository
-
Loads all necessary dependencies for a selected plugin, and then runs one
of the classes' main() method.
- main(String[]) - Static method in class org.apache.nutch.protocol.Content
-
- main(String[]) - Static method in class org.apache.nutch.protocol.file.File
-
Quick way for running this class.
- main(String[]) - Static method in class org.apache.nutch.protocol.ftp.Ftp
-
For debugging.
- main(HttpBase, String[]) - Static method in class org.apache.nutch.protocol.http.api.HttpBase
-
- main(String[]) - Static method in class org.apache.nutch.protocol.http.Http
-
- main(String[]) - Static method in class org.apache.nutch.protocol.httpclient.Http
-
Main method.
- main(String[]) - Static method in class org.apache.nutch.protocol.RobotRulesParser
-
command-line main for testing
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkDumper
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkDumper.Reader
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkRank
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LoopReader
-
Runs the LoopReader tool.
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.Loops
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.NodeDumper
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.NodeReader
-
Runs the NodeReader tool.
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.WebGraph
-
- main(String[]) - Static method in class org.apache.nutch.segment.SegmentMerger
-
- main(String[]) - Static method in class org.apache.nutch.segment.SegmentReader
-
- main(String[]) - Static method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
- main(String[]) - Static method in class org.apache.nutch.tools.Benchmark
-
- main(String[]) - Static method in class org.apache.nutch.tools.DmozParser
-
Command-line access.
- main(String[]) - Static method in class org.apache.nutch.tools.FreeGenerator
-
- main(String[]) - Static method in class org.apache.nutch.tools.proxy.TestbedProxy
-
- main(String[]) - Static method in class org.apache.nutch.tools.ResolveUrls
-
Runs the resolve urls tool.
- main(RegexURLFilterBase, String[]) - Static method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Filter the standard input using a RegexURLFilterBase.
- main(String[]) - Static method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
-
- main(String[]) - Static method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- main(String[]) - Static method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
-
- main(String[]) - Static method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- main(String[]) - Static method in class org.apache.nutch.util.CommandRunner
-
- main(String[]) - Static method in class org.apache.nutch.util.domain.DomainStatistics
-
- main(String[]) - Static method in class org.apache.nutch.util.EncodingDetector
-
- main(String[]) - Static method in class org.apache.nutch.util.PrefixStringMatcher
-
- main(String[]) - Static method in class org.apache.nutch.util.StringUtil
-
- main(String[]) - Static method in class org.apache.nutch.util.SuffixStringMatcher
-
- main(String[]) - Static method in class org.apache.nutch.util.URLUtil
-
For testing
- majorCodes - Static variable in class org.apache.nutch.parse.ParseStatus
-
- makeIOException(SolrServerException) - Static method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- map(Text, CrawlDatum, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbFilter
-
- map(Text, CrawlDatum, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
-
- map(Text, CrawlDatum, OutputCollector<Text, LongWritable>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
-
- map(Text, CrawlDatum, OutputCollector<FloatWritable, Text>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
-
- map(Text, CrawlDatum, OutputCollector<BytesWritable, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.DeduplicationJob.DBFilter
-
- map(Text, CrawlDatum, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater
-
- map(Text, CrawlDatum, OutputCollector<FloatWritable, Generator.SelectorEntry>, Reporter) - Method in class org.apache.nutch.crawl.Generator.Selector
-
Select & invert subset due for fetch.
- map(FloatWritable, Generator.SelectorEntry, OutputCollector<Text, Generator.SelectorEntry>, Reporter) - Method in class org.apache.nutch.crawl.Generator.SelectorInverseMapper
-
- map(WritableComparable<?>, Text, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.Injector.InjectMapper
-
- map(Text, ParseData, OutputCollector<Text, Inlinks>, Reporter) - Method in class org.apache.nutch.crawl.LinkDb
-
- map(Text, Inlinks, OutputCollector<Text, Inlinks>, Reporter) - Method in class org.apache.nutch.crawl.LinkDbFilter
-
- map(Text, CrawlDatum, OutputCollector<ByteWritable, Text>, Reporter) - Method in class org.apache.nutch.indexer.CleaningJob.DBFilter
-
- map(Text, Writable, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.indexer.IndexerMapReduce
-
- map(WritableComparable<?>, Content, OutputCollector<Text, ParseImpl>, Reporter) - Method in class org.apache.nutch.parse.ParseSegment
-
- map(Text, Writable, OutputCollector<Text, ObjectWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
-
Wraps all values in ObjectWritables.
- map(Text, Loops.Route, OutputCollector<Text, Loops.Route>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
Maps out any found routes; those will be the link cycles.
- map(Text, Writable, OutputCollector<Text, ObjectWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
Wraps values in ObjectWritable.
- map(Text, Writable, OutputCollector<Text, ObjectWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Looper
-
Wrap values in ObjectWritable.
- map(Text, Node, OutputCollector<Text, FloatWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper
-
Outputs the host or domain as key for this record and numInlinks, numOutlinks
or score as the value.
- map(Text, Node, OutputCollector<FloatWritable, Text>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter
-
Outputs the url with the appropriate number of inlinks, outlinks, or
score.
- map(Text, Writable, OutputCollector<Text, ObjectWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
Changes input into ObjectWritables.
- map(Text, Writable, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
-
Passes through existing LinkDatum objects from an existing OutlinkDb and
maps out new LinkDatum objects from new crawls' ParseData.
- map(Text, MetaWrapper, OutputCollector<Text, MetaWrapper>, Reporter) - Method in class org.apache.nutch.segment.SegmentMerger
-
- map(WritableComparable<?>, Writable, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.segment.SegmentReader.InputCompatMapper
-
- map(Text, BytesWritable, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
Runs the Map job to translate an arc record into output for Nutch
segments.
- map(WritableComparable<?>, Text, OutputCollector<Text, Generator.SelectorEntry>, Reporter) - Method in class org.apache.nutch.tools.FreeGenerator.FG
-
- mapCopyKey(String) - Method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- mapKey(String) - Method in class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- MAPPING_FILE - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
- MapWritable - Class in org.apache.nutch.crawl
-
Deprecated.
Use org.apache.hadoop.io.MapWritable instead.
- MapWritable() - Constructor for class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- MapWritable(MapWritable) - Constructor for class org.apache.nutch.crawl.MapWritable
-
Deprecated.
Copy constructor.
- match(String) - Method in class org.apache.nutch.urlfilter.api.RegexRule
-
Checks if a url matches this rule.
- matchChar(TrieStringMatcher.TrieNode, String, int) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Returns the next TrieStringMatcher.TrieNode visited, given that you are at
node, and the next character in the input is the idx'th character of s.
- matches(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
-
Returns true if the given String is matched by a prefix in the trie
- matches(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
-
Returns true if the given String is matched by a suffix in the trie
- matches(String) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Returns true if the given String is matched by a pattern in the trie
- maxContent - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The length limit for downloaded content, in bytes.
- maxCrawlDelay - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Skip page if Crawl-Delay is longer than this value.
- maxInterval - Variable in class org.apache.nutch.crawl.AbstractFetchSchedule
-
- MD5Signature - Class in org.apache.nutch.crawl
-
Default implementation of a page signature.
- MD5Signature() - Constructor for class org.apache.nutch.crawl.MD5Signature
-
- merge(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.crawl.CrawlDbMerger
-
- merge(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDbMerger
-
- merge(Path, Path[], boolean, boolean, long) - Method in class org.apache.nutch.segment.SegmentMerger
-
- Metadata - Class in org.apache.nutch.metadata
-
A multi-valued metadata container.
- Metadata() - Constructor for class org.apache.nutch.metadata.Metadata
-
Constructs a new, empty metadata.
- MetadataIndexer - Class in org.apache.nutch.indexer.metadata
-
Indexer which can be configured to extract metadata from the crawldb, parse metadata or content metadata.
- MetadataIndexer() - Constructor for class org.apache.nutch.indexer.metadata.MetadataIndexer
-
- MetaTagsParser - Class in org.apache.nutch.parse
-
Parse HTML meta tags (keywords, description) and store them in the parse metadata so that
they can be indexed with the index-metadata plugin with the prefix 'metatag.'
- MetaTagsParser() - Constructor for class org.apache.nutch.parse.MetaTagsParser
-
- MetaWrapper - Class in org.apache.nutch.metadata
-
This is a simple decorator that adds metadata to any Writable-s that can be
serialized by NutchWritable.
- MetaWrapper() - Constructor for class org.apache.nutch.metadata.MetaWrapper
-
- MetaWrapper(Writable, Configuration) - Constructor for class org.apache.nutch.metadata.MetaWrapper
-
- MetaWrapper(Metadata, Writable, Configuration) - Constructor for class org.apache.nutch.metadata.MetaWrapper
-
- MimeAdaptiveFetchSchedule - Class in org.apache.nutch.crawl
-
Extension of AdaptiveFetchSchedule that allows for more flexible configuration
of DEC and INC factors for various MIME-types.
- MimeAdaptiveFetchSchedule() - Constructor for class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- MimeUtil - Class in org.apache.nutch.util
-
- MimeUtil(Configuration) - Constructor for class org.apache.nutch.util.MimeUtil
-
- MIN_CONFIDENCE_KEY - Static variable in class org.apache.nutch.util.EncodingDetector
-
- MissingDependencyException - Exception in org.apache.nutch.plugin
-
MissingDependencyException will be thrown if a plugin dependency cannot be found.
- MissingDependencyException(Throwable) - Constructor for exception org.apache.nutch.plugin.MissingDependencyException
-
- MissingDependencyException(String) - Constructor for exception org.apache.nutch.plugin.MissingDependencyException
-
- MODIFIED - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Date on which the resource was changed.
- MoreIndexingFilter - Class in org.apache.nutch.indexer.more
-
Add (or reset) a few metaData properties as respective fields (if they are
available), so that they can be accurately used within the search index.
- MoreIndexingFilter() - Constructor for class org.apache.nutch.indexer.more.MoreIndexingFilter
-
- MOVED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Resource has moved permanently.
- PARAMS - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
Deprecated.
- parse(InputStream) - Method in class org.apache.nutch.collection.CollectionManager
-
- Parse - Interface in org.apache.nutch.parse
-
The result of parsing a page's raw content.
- parse(Path) - Method in class org.apache.nutch.parse.ParseSegment
-
- parse(Content) - Method in class org.apache.nutch.parse.ParseUtil
-
Performs a parse by iterating through a List of preferred Parsers
until a successful parse is performed and a Parse object is returned.
- parse(String) - Static method in class org.apache.nutch.segment.SegmentPart
-
Create SegmentPart from a String in format "segmentName/partName".
- PARSE_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
- parseByExtensionId(String, Content) - Method in class org.apache.nutch.parse.ParseUtil
-
Method parses a Content object using the Parser specified
by the parameter extId, i.e., the Parser's extension ID.
- parseCharacterEncoding(String) - Static method in class org.apache.nutch.util.EncodingDetector
-
Parse the character encoding from the specified content type header.
- parsed - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- ParseData - Class in org.apache.nutch.parse
-
Data extracted from a page's content.
- ParseData() - Constructor for class org.apache.nutch.parse.ParseData
-
- ParseData(ParseStatus, String, Outlink[], Metadata) - Constructor for class org.apache.nutch.parse.ParseData
-
- ParseData(ParseStatus, String, Outlink[], Metadata, Metadata) - Constructor for class org.apache.nutch.parse.ParseData
-
- parseDmozFile(File, int, boolean, int, Pattern) - Method in class org.apache.nutch.tools.DmozParser
-
Iterate through all the items in this structured DMOZ file.
- parseErrors - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- ParseException - Exception in org.apache.nutch.parse
-
- ParseException() - Constructor for exception org.apache.nutch.parse.ParseException
-
- ParseException(String) - Constructor for exception org.apache.nutch.parse.ParseException
-
- ParseException(String, Throwable) - Constructor for exception org.apache.nutch.parse.ParseException
-
- ParseException(Throwable) - Constructor for exception org.apache.nutch.parse.ParseException
-
- ParseImpl - Class in org.apache.nutch.parse
-
The result of parsing a page's raw content.
- ParseImpl() - Constructor for class org.apache.nutch.parse.ParseImpl
-
- ParseImpl(Parse) - Constructor for class org.apache.nutch.parse.ParseImpl
-
- ParseImpl(String, ParseData) - Constructor for class org.apache.nutch.parse.ParseImpl
-
- ParseImpl(ParseText, ParseData) - Constructor for class org.apache.nutch.parse.ParseImpl
-
- ParseImpl(ParseText, ParseData, boolean) - Constructor for class org.apache.nutch.parse.ParseImpl
-
- parseList(List<String>, String) - Method in class org.apache.nutch.collection.Subcollection
-
Create a list of patterns from a chunk of text; patterns are separated by
newlines.
- ParseOutputFormat - Class in org.apache.nutch.parse
-
- ParseOutputFormat() - Constructor for class org.apache.nutch.parse.ParseOutputFormat
-
- parsePluginFolder(String[]) - Method in class org.apache.nutch.plugin.PluginManifestParser
-
Returns a list of all found plugin descriptors.
- Parser - Interface in org.apache.nutch.parse
-
A parser for content generated by a Protocol implementation.
- ParserChecker - Class in org.apache.nutch.parse
-
Parser checker, useful for testing a parser.
- ParserChecker() - Constructor for class org.apache.nutch.parse.ParserChecker
-
- ParseResult - Class in org.apache.nutch.parse
-
A utility class that stores the result of a parse.
- ParseResult(String) - Constructor for class org.apache.nutch.parse.ParseResult
-
Create a container for parse results.
- ParserFactory - Class in org.apache.nutch.parse
-
Creates and caches Parser plugins.
- ParserFactory(Configuration) - Constructor for class org.apache.nutch.parse.ParserFactory
-
- ParserNotFound - Exception in org.apache.nutch.parse
-
- ParserNotFound(String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
-
- ParserNotFound(String, String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
-
- ParserNotFound(String, String, String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
-
- parseRules(String, byte[], String, String) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Parses the robots content using the SimpleRobotRulesParser
from crawler commons
- ParseSegment - Class in org.apache.nutch.parse
-
- ParseSegment() - Constructor for class org.apache.nutch.parse.ParseSegment
-
- ParseSegment(Configuration) - Constructor for class org.apache.nutch.parse.ParseSegment
-
- ParseStatus - Class in org.apache.nutch.parse
-
- ParseStatus() - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseStatus(int, int, String[]) - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseStatus(int) - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseStatus(int, String[]) - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseStatus(int, int) - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseStatus(int, int, String) - Constructor for class org.apache.nutch.parse.ParseStatus
-
Simplified constructor for passing just a text message.
- ParseStatus(int, String) - Constructor for class org.apache.nutch.parse.ParseStatus
-
Simplified constructor for passing just a text message.
- ParseStatus(Throwable) - Constructor for class org.apache.nutch.parse.ParseStatus
-
- ParseText - Class in org.apache.nutch.parse
-
- ParseText() - Constructor for class org.apache.nutch.parse.ParseText
-
- ParseText(String) - Constructor for class org.apache.nutch.parse.ParseText
-
- ParseUtil - Class in org.apache.nutch.parse
-
A Utility class containing methods to simply perform parsing utilities such
as iterating through a preferred list of Parsers to obtain Parse objects.
- ParseUtil(Configuration) - Constructor for class org.apache.nutch.parse.ParseUtil
-
- PARTITION_MODE_DOMAIN - Static variable in class org.apache.nutch.crawl.URLPartitioner
-
- PARTITION_MODE_HOST - Static variable in class org.apache.nutch.crawl.URLPartitioner
-
- PARTITION_MODE_IP - Static variable in class org.apache.nutch.crawl.URLPartitioner
-
- PARTITION_MODE_KEY - Static variable in class org.apache.nutch.crawl.URLPartitioner
-
- partName - Variable in class org.apache.nutch.segment.SegmentPart
-
Name of the segment part (ie.
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
-
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
- passScoreAfterParsing(Text, Content, Parse) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Currently a part of score distribution is performed using only data coming
from the parsing process.
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.ScoringFilters
-
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Takes the metadata, which was lumped inside the content, and replicates it
within your parse data.
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
-
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
This method takes all relevant score information from the current datum
(coming from a generated fetchlist) and stores it into Content metadata.
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.ScoringFilters
-
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Takes the metadata, specified in your "urlmeta.tags" property, from the
datum object and injects it into the content.
- PassURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.pass
-
This URLNormalizer doesn't change urls.
- PassURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
-
- PASSWORD - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
- PERM_REFRESH_TIME - Static variable in class org.apache.nutch.fetcher.Fetcher
-
- PERM_REFRESH_TIME - Static variable in class org.apache.nutch.fetcher.OldFetcher
-
- Pluggable - Interface in org.apache.nutch.plugin
-
Defines the capability of a class to be plugged into Nutch.
- Plugin - Class in org.apache.nutch.plugin
-
A nutch-plugin is a container for a set of custom logic that provides
extensions to the nutch core functionality or to another plugin that provides an
API for extending.
- Plugin(PluginDescriptor, Configuration) - Constructor for class org.apache.nutch.plugin.Plugin
-
Constructor
- PluginClassLoader - Class in org.apache.nutch.plugin
-
The PluginClassLoader contains only classes of the runtime
libraries set up in the plugin manifest file and exported libraries of
required plugins.
- PluginClassLoader(URL[], ClassLoader) - Constructor for class org.apache.nutch.plugin.PluginClassLoader
-
Constructor
- PluginDescriptor - Class in org.apache.nutch.plugin
-
The PluginDescriptor provides access to all meta information of
a nutch-plugin, as well as to the internationalizable resources and the
plugin's own classloader.
- PluginDescriptor(String, String, String, String, String, String, Configuration) - Constructor for class org.apache.nutch.plugin.PluginDescriptor
-
Constructor
- PluginManifestParser - Class in org.apache.nutch.plugin
-
The PluginManifestParser just parses the manifest file
in all plugin directories.
- PluginManifestParser(Configuration, PluginRepository) - Constructor for class org.apache.nutch.plugin.PluginManifestParser
-
- PluginRepository - Class in org.apache.nutch.plugin
-
The plugin repository is a registry of all plugins.
- PluginRepository(Configuration) - Constructor for class org.apache.nutch.plugin.PluginRepository
-
- PluginRuntimeException - Exception in org.apache.nutch.plugin
-
PluginRuntimeException will be thrown when an exception occurs in the
plugin management.
- PluginRuntimeException(Throwable) - Constructor for exception org.apache.nutch.plugin.PluginRuntimeException
-
- PluginRuntimeException(String) - Constructor for exception org.apache.nutch.plugin.PluginRuntimeException
-
- pos - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- PrefixStringMatcher - Class in org.apache.nutch.util
-
A class for efficiently matching Strings against a set of prefixes.
- PrefixStringMatcher(String[]) - Constructor for class org.apache.nutch.util.PrefixStringMatcher
-
Creates a new PrefixStringMatcher which will match Strings
with any prefix in the supplied array.
- PrefixStringMatcher(Collection<String>) - Constructor for class org.apache.nutch.util.PrefixStringMatcher
-
Creates a new PrefixStringMatcher which will match Strings
with any prefix in the supplied Collection.
- PrefixURLFilter - Class in org.apache.nutch.urlfilter.prefix
-
Filters URLs based on a file of URL prefixes.
- PrefixURLFilter() - Constructor for class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- PrefixURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- PrintCommandListener - Class in org.apache.nutch.protocol.ftp
-
This is a support class for logging all ftp command/reply traffic.
- PrintCommandListener(Logger) - Constructor for class org.apache.nutch.protocol.ftp.PrintCommandListener
-
- processDeflateEncoded(byte[], URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- processDumpJob(String, String, Configuration, String, String, String, Integer) - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- processDumpJob(String, String) - Method in class org.apache.nutch.crawl.LinkDbReader
-
- processGzipEncoded(byte[], URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- processingInstruction(String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of a processing instruction.
- processStatJob(String, Configuration, boolean) - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- processTopNJob(String, long, float, String, Configuration) - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- PROTO_NOT_FOUND - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
This protocol was not found.
- PROTO_STATUS_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- Protocol - Interface in org.apache.nutch.protocol
-
A retriever of url content.
- PROTOCOL_REDIR - Static variable in class org.apache.nutch.fetcher.Fetcher
-
- PROTOCOL_REDIR - Static variable in class org.apache.nutch.fetcher.OldFetcher
-
- protocolCommandSent(ProtocolCommandEvent) - Method in class org.apache.nutch.protocol.ftp.PrintCommandListener
-
- ProtocolException - Exception in org.apache.nutch.net.protocols
-
- ProtocolException() - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException(String) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException(String, Throwable) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException(Throwable) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException - Exception in org.apache.nutch.protocol
-
- ProtocolException() - Constructor for exception org.apache.nutch.protocol.ProtocolException
-
- ProtocolException(String) - Constructor for exception org.apache.nutch.protocol.ProtocolException
-
- ProtocolException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.ProtocolException
-
- ProtocolException(Throwable) - Constructor for exception org.apache.nutch.protocol.ProtocolException
-
- ProtocolFactory - Class in org.apache.nutch.protocol
-
- ProtocolFactory(Configuration) - Constructor for class org.apache.nutch.protocol.ProtocolFactory
-
- ProtocolNotFound - Exception in org.apache.nutch.protocol
-
- ProtocolNotFound(String) - Constructor for exception org.apache.nutch.protocol.ProtocolNotFound
-
- ProtocolNotFound(String, String) - Constructor for exception org.apache.nutch.protocol.ProtocolNotFound
-
- ProtocolOutput - Class in org.apache.nutch.protocol
-
Simple aggregate to pass from protocol plugins both content and
protocol status.
- ProtocolOutput(Content, ProtocolStatus) - Constructor for class org.apache.nutch.protocol.ProtocolOutput
-
- ProtocolOutput(Content) - Constructor for class org.apache.nutch.protocol.ProtocolOutput
-
- protocolReplyReceived(ProtocolCommandEvent) - Method in class org.apache.nutch.protocol.ftp.PrintCommandListener
-
- ProtocolStatus - Class in org.apache.nutch.protocol
-
- ProtocolStatus() - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int, String[]) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int, String[], long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int, long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int, Object) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(int, Object, long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- ProtocolStatus(Throwable) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
-
- proxyHost - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The proxy hostname.
- proxyPort - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The proxy port.
- PUBLISHER - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An entity responsible for making the resource available.
- put(Writable, Writable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- put(Text, ParseText, ParseData) - Method in class org.apache.nutch.parse.ParseResult
-
Store a result of parsing.
- put(String, ParseText, ParseData) - Method in class org.apache.nutch.parse.ParseResult
-
Store a result of parsing.
- putAll(MapWritable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- putAllMetaData(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Add all metadata from other CrawlDatum to this CrawlDatum.
- read(DataInput) - Static method in class org.apache.nutch.crawl.CrawlDatum
-
- read(DataInput) - Static method in class org.apache.nutch.crawl.Inlink
-
- read(DataInput) - Static method in class org.apache.nutch.parse.Outlink
-
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseData
-
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseImpl
-
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseStatus
-
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseText
-
- read(DataInput) - Static method in class org.apache.nutch.protocol.Content
-
- read(DataInput) - Static method in class org.apache.nutch.protocol.ProtocolStatus
-
- readConfiguration(Reader) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- readFields(DataInput) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- readFields(DataInput) - Method in class org.apache.nutch.crawl.Generator.SelectorEntry
-
- readFields(DataInput) - Method in class org.apache.nutch.crawl.Inlink
-
- readFields(DataInput) - Method in class org.apache.nutch.crawl.Inlinks
-
- readFields(DataInput) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchDocument
-
- readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchField
-
- readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchIndexAction
-
- readFields(DataInput) - Method in class org.apache.nutch.metadata.Metadata
-
- readFields(DataInput) - Method in class org.apache.nutch.metadata.MetaWrapper
-
- readFields(DataInput) - Method in class org.apache.nutch.parse.Outlink
-
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseData
-
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseImpl
-
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseStatus
-
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseText
-
- readFields(DataInput) - Method in class org.apache.nutch.protocol.Content
-
- readFields(DataInput) - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.Loops.LoopSet
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.Node
-
- readFields(DataInput) - Method in class org.apache.nutch.util.GenericWritableConfigurable
-
- readUrl(String, String, Configuration) - Method in class org.apache.nutch.crawl.CrawlDbReader
-
- REDIR_EXCEEDED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Too many redirects.
- reduce(Text, Iterator<CrawlDatum>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbMerger.Merger
-
- reduce(Text, Iterator<LongWritable>, OutputCollector<Text, LongWritable>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatCombiner
-
- reduce(Text, Iterator<LongWritable>, OutputCollector<Text, LongWritable>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
-
- reduce(FloatWritable, Iterator<Text>, OutputCollector<FloatWritable, Text>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
-
- reduce(Text, Iterator<CrawlDatum>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.CrawlDbReducer
-
- reduce(BytesWritable, Iterator<CrawlDatum>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
-
- reduce(Text, Iterator<CrawlDatum>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.DeduplicationJob.StatusUpdateReducer
-
- reduce(Text, Iterator<CrawlDatum>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater
-
- reduce(Text, Iterator<Generator.SelectorEntry>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.Generator.PartitionReducer
-
- reduce(FloatWritable, Iterator<Generator.SelectorEntry>, OutputCollector<FloatWritable, Generator.SelectorEntry>, Reporter) - Method in class org.apache.nutch.crawl.Generator.Selector
-
Collect until limit is reached.
- reduce(Text, Iterator<CrawlDatum>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.crawl.Injector.InjectReducer
-
- reduce(Text, Iterator<Inlinks>, OutputCollector<Text, Inlinks>, Reporter) - Method in class org.apache.nutch.crawl.LinkDbMerger
-
- reduce(ByteWritable, Iterator<Text>, OutputCollector<Text, ByteWritable>, Reporter) - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
-
- reduce(Text, Iterator<NutchWritable>, OutputCollector<Text, NutchIndexAction>, Reporter) - Method in class org.apache.nutch.indexer.IndexerMapReduce
-
- reduce(Text, Iterator<Writable>, OutputCollector<Text, Writable>, Reporter) - Method in class org.apache.nutch.parse.ParseSegment
-
- reduce(Text, Iterator<ObjectWritable>, OutputCollector<Text, LinkDumper.LinkNode>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
-
Inverts outlinks to inlinks while attaching node information to the
outlink.
- reduce(Text, Iterator<LinkDumper.LinkNode>, OutputCollector<Text, LinkDumper.LinkNodes>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
-
Aggregate all LinkNode objects for a given url.
- reduce(Text, Iterator<Loops.Route>, OutputCollector<Text, Loops.LoopSet>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Finalizer
-
Aggregates all found routes for a given start url into a loopset and
collects the loopset.
- reduce(Text, Iterator<ObjectWritable>, OutputCollector<Text, Loops.Route>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Initializer
-
Takes any node that has inlinks and sets up a route for all of its
outlinks.
- reduce(Text, Iterator<ObjectWritable>, OutputCollector<Text, Loops.Route>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.Loops.Looper
-
Performs a single loop pass looking for loop cycles within routes.
- reduce(Text, Iterator<FloatWritable>, OutputCollector<Text, FloatWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper
-
Outputs either the sum or the top value for this record.
- reduce(FloatWritable, Iterator<Text>, OutputCollector<Text, FloatWritable>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter
-
Flips and collects the url and numeric sort value.
- reduce(Text, Iterator<ObjectWritable>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
Creates new CrawlDatum objects with the updated score from the NodeDb or
with a cleared score.
- reduce(Text, Iterator<NutchWritable>, OutputCollector<Text, LinkDatum>, Reporter) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
-
- reduce(Text, Iterator<MetaWrapper>, OutputCollector<Text, MetaWrapper>, Reporter) - Method in class org.apache.nutch.segment.SegmentMerger
-
NOTE: in selecting the latest version we rely exclusively on the segment
name (not all segment data contain time information).
- reduce(Text, Iterator<NutchWritable>, OutputCollector<Text, Text>, Reporter) - Method in class org.apache.nutch.segment.SegmentReader
-
- reduce(Text, Iterator<Generator.SelectorEntry>, OutputCollector<Text, CrawlDatum>, Reporter) - Method in class org.apache.nutch.tools.FreeGenerator.FG
-
- reduce(Text, Iterable<LongWritable>, Reducer<Text, LongWritable, Text, LongWritable>.Context) - Method in class org.apache.nutch.util.domain.DomainStatistics.DomainStatisticsCombiner
-
- regexNormalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
This function does the replacements by iterating through all the regex
patterns.
- RegexRule - Class in org.apache.nutch.urlfilter.api
-
A generic regular expression rule.
- RegexRule(boolean, String) - Constructor for class org.apache.nutch.urlfilter.api.RegexRule
-
Constructs a new regular expression rule.
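A minimal sketch of how signed regular expression rules might be evaluated, assuming the boolean is an accept/reject sign and that the first matching rule decides. The "+regex" / "-regex" line format and all names here are illustrative assumptions, not the actual RegexRule API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SignedRuleFilterSketch {
    /** One rule: a sign (true = accept, false = reject) plus a regex. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
        boolean matches(String url) {
            return pattern.matcher(url).find();
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses one rule per line; blank lines and lines starting with '#' are skipped. */
    SignedRuleFilterSketch(String ruleText) {
        for (String raw : ruleText.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', line.substring(1)));
        }
    }

    /** Returns the URL if the first matching rule accepts it, otherwise null. */
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.matches(url)) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched: reject (an assumption of this sketch)
    }

    public static void main(String[] args) {
        SignedRuleFilterSketch f = new SignedRuleFilterSketch(
            "# skip images\n-\\.(gif|jpg|png)$\n+^https?://example\\.com/");
        System.out.println(f.filter("http://example.com/index.html"));
    }
}
```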
- RegexURLFilter - Class in org.apache.nutch.urlfilter.regex
-
- RegexURLFilter() - Constructor for class org.apache.nutch.urlfilter.regex.RegexURLFilter
-
- RegexURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.regex.RegexURLFilter
-
- RegexURLFilterBase - Class in org.apache.nutch.urlfilter.api
-
- RegexURLFilterBase() - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new empty RegexURLFilterBase
- RegexURLFilterBase(File) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new RegexURLFilter and initializes it with a file of rules.
- RegexURLFilterBase(String) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new RegexURLFilter and initializes it with a list of rules.
- RegexURLFilterBase(Reader) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new RegexURLFilter and initializes it with a Reader of rules.
- RegexURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.regex
-
Allows users to do regex substitutions on all/any URLs that are encountered,
which is useful for stripping session IDs from URLs.
- RegexURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
The default constructor, which is called from UrlNormalizerFactory
(normalizerClass.newInstance()) in the method getNormalizer().
- RegexURLNormalizer(Configuration) - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
- RegexURLNormalizer(Configuration, String) - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
Constructor which can be passed the file name, so it doesn't look in the
configuration files for it.
- REL_TAG - Static variable in class org.apache.nutch.microformats.reltag.RelTagParser
-
- RELATION - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A reference to a related resource.
- RelTagIndexingFilter - Class in org.apache.nutch.microformats.reltag
-
- RelTagIndexingFilter() - Constructor for class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
-
- RelTagParser - Class in org.apache.nutch.microformats.reltag
-
Adds microformat rel-tags of document if found.
- RelTagParser() - Constructor for class org.apache.nutch.microformats.reltag.RelTagParser
-
- remove(Writable) - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- remove(String) - Method in class org.apache.nutch.metadata.Metadata
-
Remove a metadata and all its associated values.
- remove(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
-
- removeField(String) - Method in class org.apache.nutch.indexer.NutchDocument
-
- removeLockFile(FileSystem, Path) - Static method in class org.apache.nutch.util.LockUtil
-
Remove lock file.
- replace(FileSystem, Path, Path, boolean) - Static method in class org.apache.nutch.util.FSUtils
-
Replaces the current path with the new path and if set removes the old
path.
- REPR_URL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- reset() - Method in class org.apache.nutch.indexer.NutchField
-
- reset() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets all boolean values to false.
- resolveEncodingAlias(String) - Static method in class org.apache.nutch.util.EncodingDetector
-
- resolveURL(URL, String) - Static method in class org.apache.nutch.util.URLUtil
-
Resolves relative URLs and fixes a few java.net.URL errors
in the handling of URLs with embedded params and pure query
targets.
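One way to handle a pure query target (a relative reference such as "?y=2") is to keep the base path and swap only the query. The sketch below is an assumption about the general idea, not the URLUtil implementation; the class and method names are hypothetical, and it takes and returns strings for simplicity.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ResolveUrlSketch {
    /**
     * Resolves target against base. A target that starts with '?' replaces
     * only the query of the base URL; anything else goes through java.net.URL.
     * Note: the URL constructors used here are deprecated in recent JDKs
     * but remain available.
     */
    static String resolve(String base, String target) {
        try {
            URL baseUrl = new URL(base);
            String t = target.trim();
            if (t.startsWith("?")) {
                // Pure query target: keep protocol, host, port and path of the base.
                return new URL(baseUrl.getProtocol(), baseUrl.getHost(),
                        baseUrl.getPort(), baseUrl.getPath() + t).toString();
            }
            return new URL(baseUrl, t).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + target + " against " + base, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.com/a/b.html?x=1", "?y=2"));
        System.out.println(resolve("http://example.com/a/b.html?x=1", "c.html"));
    }
}
```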
- ResolveUrls - Class in org.apache.nutch.tools
-
A simple tool that will spin up multiple threads to resolve urls to ip
addresses.
- ResolveUrls(String) - Constructor for class org.apache.nutch.tools.ResolveUrls
-
Create a new ResolveUrls with a file from the local file system.
- ResolveUrls(String, int) - Constructor for class org.apache.nutch.tools.ResolveUrls
-
Create a new ResolveUrls with a urls file and a number of threads for the
Thread pool.
- resolveUrls() - Method in class org.apache.nutch.tools.ResolveUrls
-
Creates a thread pool for resolving urls.
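The thread-pool approach can be sketched with the standard java.util.concurrent classes. This is a simplified stand-in, not the ResolveUrls tool itself: the class name, the method signature, and the 10-second timeout are assumptions made for the example.

```java
import java.net.InetAddress;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ResolveHostsSketch {
    /** Resolves each host on a fixed-size pool; unresolved hosts map to null. */
    static Map<String, String> resolveAll(List<String> hosts, int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        Map<String, Future<String>> pending = new LinkedHashMap<>();
        for (String host : hosts) {
            Callable<String> task = () -> InetAddress.getByName(host).getHostAddress();
            pending.put(host, pool.submit(task));
        }
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
            try {
                // Timeout value is arbitrary, chosen for the sketch.
                resolved.put(entry.getKey(), entry.getValue().get(10, TimeUnit.SECONDS));
            } catch (ExecutionException | TimeoutException e) {
                resolved.put(entry.getKey(), null); // lookup failed or took too long
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                resolved.put(entry.getKey(), null);
            }
        }
        pool.shutdownNow();
        return resolved;
    }

    public static void main(String[] args) {
        System.out.println(resolveAll(java.util.Arrays.asList("127.0.0.1", "localhost"), 2));
    }
}
```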
- Response - Interface in org.apache.nutch.net.protocols
-
A response interface.
- RESPONSE_TIME - Static variable in class org.apache.nutch.protocol.http.api.HttpBase
-
- responseTime - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Record response time in CrawlDatum's meta data, see property
http.store.responsetime.
- retrieveFile(String, OutputStream, int) - Method in class org.apache.nutch.protocol.ftp.Client
-
retrieve file for path
- retrieveList(String, List<FTPFile>, int, FTPFileEntryParser) - Method in class org.apache.nutch.protocol.ftp.Client
-
retrieve list reply for path
- RETRY - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Temporary failure.
- rightPad(String, int) - Static method in class org.apache.nutch.util.StringUtil
-
Returns a copy of s padded with trailing spaces so that its length is length.
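A minimal sketch of right-padding with spaces. Returning the input unchanged when it is already at least as long as the requested length is an assumption of this example, and the class name is hypothetical.

```java
class RightPadSketch {
    /** Returns s followed by enough spaces to reach the given length. */
    static String rightPad(String s, int length) {
        StringBuilder sb = new StringBuilder(s);
        for (int i = s.length(); i < length; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + rightPad("ab", 5) + "]");
    }
}
```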
- RIGHTS - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Information about rights held in and over the resource.
- RobotRules - Interface in org.apache.nutch.protocol
-
This class holds the rules which were parsed from a robots.txt file, and can
test paths against those rules.
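To make "testing paths against rules" concrete, here is a deliberately simplified sketch that only honors Disallow lines under "User-agent: *" and uses plain prefix matching. It is not the crawler-commons parser and ignores Allow lines, wildcards, and per-agent groups; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotRulesSketch {
    private final List<String> disallowed = new ArrayList<>();

    /** Collects Disallow prefixes from the "User-agent: *" group only. */
    RobotRulesSketch(String robotsTxt) {
        boolean applies = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash).trim(); // strip comments
            }
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("user-agent:")) {
                applies = line.substring("user-agent:".length()).trim().equals("*");
            } else if (applies && lower.startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) {
                    disallowed.add(path);
                }
            }
        }
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotRulesSketch rules = new RobotRulesSketch("User-agent: *\nDisallow: /private/\n");
        System.out.println(rules.isAllowed("/private/x.html"));
    }
}
```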
- RobotRulesParser - Class in org.apache.nutch.protocol
-
This class uses crawler-commons for handling the parsing of robots.txt
files.
- RobotRulesParser() - Constructor for class org.apache.nutch.protocol.RobotRulesParser
-
- RobotRulesParser(Configuration) - Constructor for class org.apache.nutch.protocol.RobotRulesParser
-
- ROBOTS_DENIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Access denied by robots.txt rules.
- root - Variable in class org.apache.nutch.util.TrieStringMatcher
-
- ROUTES_DIR - Static variable in class org.apache.nutch.scoring.webgraph.Loops
-
- run(String[]) - Method in class org.apache.nutch.crawl.CrawlDb
-
- run(String[]) - Method in class org.apache.nutch.crawl.CrawlDbMerger
-
- run(String[]) - Method in class org.apache.nutch.crawl.DeduplicationJob
-
- run(String[]) - Method in class org.apache.nutch.crawl.Generator
-
- run(String[]) - Method in class org.apache.nutch.crawl.Injector
-
- run(String[]) - Method in class org.apache.nutch.crawl.LinkDb
-
- run(String[]) - Method in class org.apache.nutch.crawl.LinkDbMerger
-
- run(String[]) - Method in class org.apache.nutch.crawl.LinkDbReader
-
- run(RecordReader<Text, CrawlDatum>, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.fetcher.Fetcher
-
- run(String[]) - Method in class org.apache.nutch.fetcher.Fetcher
-
- run(RecordReader<WritableComparable<?>, Writable>, OutputCollector<Text, NutchWritable>, Reporter) - Method in class org.apache.nutch.fetcher.OldFetcher
-
- run(String[]) - Method in class org.apache.nutch.fetcher.OldFetcher
-
- run(String[]) - Method in class org.apache.nutch.indexer.CleaningJob
-
- run(String[]) - Method in class org.apache.nutch.indexer.IndexingFiltersChecker
-
- run(String[]) - Method in class org.apache.nutch.indexer.IndexingJob
-
- run(String[]) - Method in class org.apache.nutch.parse.ParserChecker
-
- run(String[]) - Method in class org.apache.nutch.parse.ParseSegment
-
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper
-
Runs the LinkDumper tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.LinkRank
-
Runs the LinkRank tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.Loops
-
Runs the Loops tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper
-
Runs the node dumper tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
Runs the ScoreUpdater tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.WebGraph
-
Parses command line arguments and runs the WebGraph jobs.
- run(String[]) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
- run(String[]) - Method in class org.apache.nutch.tools.Benchmark
-
- run(String[]) - Method in class org.apache.nutch.tools.FreeGenerator
-
- run(String[]) - Method in class org.apache.nutch.util.domain.DomainStatistics
-
- save() - Method in class org.apache.nutch.collection.CollectionManager
-
Save collections into file
- saveDom(OutputStream, Element) - Static method in class org.apache.nutch.util.DomUtil
-
Saves a DOM element into an OutputStream.
- SCHEDULE_DEC_RATE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- SCHEDULE_INC_RATE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- SCHEDULE_MIME_FILE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- SCOPE_CRAWLDB - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when updating the CrawlDb with new URLs.
- SCOPE_DEFAULT - Static variable in class org.apache.nutch.net.URLNormalizers
-
Default scope.
- SCOPE_FETCHER - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used by Fetcher when processing redirect URLs.
- SCOPE_GENERATE_HOST_COUNT - Static variable in class org.apache.nutch.net.URLNormalizers
-
- SCOPE_INDEXER - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when indexing URLs.
- SCOPE_INJECT - Static variable in class org.apache.nutch.net.URLNormalizers
-
- SCOPE_LINKDB - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when updating the LinkDb with new URLs.
- SCOPE_OUTLINK - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when constructing new Outlink instances.
- SCOPE_PARTITION - Static variable in class org.apache.nutch.net.URLNormalizers
-
- SCORE_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- ScoreUpdater - Class in org.apache.nutch.scoring.webgraph
-
Updates the score from the WebGraph node database into the crawl database.
- ScoreUpdater() - Constructor for class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
- ScoringFilter - Interface in org.apache.nutch.scoring
-
A contract defining behavior of scoring plugins.
- ScoringFilterException - Exception in org.apache.nutch.scoring
-
Specialized exception for errors during scoring.
- ScoringFilterException() - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
-
- ScoringFilterException(String) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
-
- ScoringFilterException(String, Throwable) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
-
- ScoringFilterException(Throwable) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
-
- ScoringFilters - Class in org.apache.nutch.scoring
-
- ScoringFilters(Configuration) - Constructor for class org.apache.nutch.scoring.ScoringFilters
-
- SECONDS_PER_DAY - Static variable in interface org.apache.nutch.crawl.FetchSchedule
-
- SEGMENT_NAME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- SegmentHandler - Class in org.apache.nutch.tools.proxy
-
XXX should turn this into a plugin?
- SegmentHandler(Configuration, Path) - Constructor for class org.apache.nutch.tools.proxy.SegmentHandler
-
- SegmentMergeFilter - Interface in org.apache.nutch.segment
-
Interface used to filter segments during segment merge.
- SegmentMergeFilters - Class in org.apache.nutch.segment
-
This class wraps all SegmentMergeFilter extensions in a single object
so it is easier to operate on them.
- SegmentMergeFilters(Configuration) - Constructor for class org.apache.nutch.segment.SegmentMergeFilters
-
- SegmentMerger - Class in org.apache.nutch.segment
-
This tool takes several segments and merges their data together.
- SegmentMerger() - Constructor for class org.apache.nutch.segment.SegmentMerger
-
- SegmentMerger(Configuration) - Constructor for class org.apache.nutch.segment.SegmentMerger
-
- SegmentMerger.ObjectInputFormat - Class in org.apache.nutch.segment
-
Wraps inputs in a MetaWrapper, to permit merging different
types in reduce and use additional metadata.
- SegmentMerger.ObjectInputFormat() - Constructor for class org.apache.nutch.segment.SegmentMerger.ObjectInputFormat
-
- SegmentMerger.SegmentOutputFormat - Class in org.apache.nutch.segment
-
- SegmentMerger.SegmentOutputFormat() - Constructor for class org.apache.nutch.segment.SegmentMerger.SegmentOutputFormat
-
- segmentName - Variable in class org.apache.nutch.segment.SegmentPart
-
Name of the segment (just the last path component).
- SegmentPart - Class in org.apache.nutch.segment
-
Utility class for handling information about segment parts.
- SegmentPart() - Constructor for class org.apache.nutch.segment.SegmentPart
-
- SegmentPart(String, String) - Constructor for class org.apache.nutch.segment.SegmentPart
-
- SegmentReader - Class in org.apache.nutch.segment
-
Dump the content of a segment.
- SegmentReader() - Constructor for class org.apache.nutch.segment.SegmentReader
-
- SegmentReader(Configuration, boolean, boolean, boolean, boolean, boolean, boolean) - Constructor for class org.apache.nutch.segment.SegmentReader
-
- SegmentReader.InputCompatMapper - Class in org.apache.nutch.segment
-
- SegmentReader.InputCompatMapper() - Constructor for class org.apache.nutch.segment.SegmentReader.InputCompatMapper
-
- SegmentReader.SegmentReaderStats - Class in org.apache.nutch.segment
-
- SegmentReader.SegmentReaderStats() - Constructor for class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- SegmentReader.TextOutputFormat - Class in org.apache.nutch.segment
-
Implements a text output format
- SegmentReader.TextOutputFormat() - Constructor for class org.apache.nutch.segment.SegmentReader.TextOutputFormat
-
- segnum - Variable in class org.apache.nutch.crawl.Generator.SelectorEntry
-
- sendNoOp() - Method in class org.apache.nutch.protocol.ftp.Client
-
Sends a NOOP command to the FTP server.
- SERVER_URL - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
- set(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Copy the contents of another instance into this instance.
- set(String, String) - Method in class org.apache.nutch.metadata.Metadata
-
Set metadata name/value.
- set(String, String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
-
- setAll(Properties) - Method in class org.apache.nutch.metadata.Metadata
-
Copy All key-value pairs from properties.
- setAnchor(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- setArgs(String[]) - Method in class org.apache.nutch.parse.ParseStatus
-
- setArgs(String[]) - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- setBaseHref(URL) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets the baseHref.
- setBlackList(String) - Method in class org.apache.nutch.collection.Subcollection
-
Set contents of blacklist from String
- setClazz(String) - Method in class org.apache.nutch.plugin.Extension
-
Sets the Class that implements the concrete extension and is only used until
model creation at system start up.
- setCode(int) - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- setCommand(String) - Method in class org.apache.nutch.util.CommandRunner
-
- setConf(Configuration) - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
-
- setConf(Configuration) - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
- setConf(Configuration) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- setConf(Configuration) - Method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- setConf(Configuration) - Method in class org.apache.nutch.crawl.Signature
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
-
Set the Configuration object.
- setConf(Configuration) - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
-
Set the Configuration object.
- setConf(Configuration) - Method in class org.apache.nutch.indexer.CleaningJob
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.IndexingFiltersChecker
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
Set the Configuration object.
- setConf(Configuration) - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
-
handles conf assignment and pulls the value assignment from the
"urlmeta.tags" property
- setConf(Configuration) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- setConf(Configuration) - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.microformats.reltag.RelTagParser
-
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
-
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
-
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.ext.ExtParser
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.feed.FeedParser
-
Sets the Configuration object for this Parser.
- setConf(Configuration) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.html.HtmlParser
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.js.JSParseFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.MetaTagsParser
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.ParserChecker
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.swf.SWFParser
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.tika.TikaParser
-
- setConf(Configuration) - Method in class org.apache.nutch.parse.zip.ZipParser
-
- setConf(Configuration) - Method in class org.apache.nutch.protocol.file.File
-
Set the Configuration object.
- setConf(Configuration) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set the Configuration object.
- setConf(Configuration) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
- setConf(Configuration) - Method in class org.apache.nutch.protocol.http.Http
-
Set the Configuration object.
- setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.Http
-
Reads the configuration from the Nutch configuration files and sets
the configuration.
- setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
-
- setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
- setConf(Configuration) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Set the Configuration object.
- setConf(Configuration) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
handles conf assignment and pulls the value assignment from the
"urlmeta.tags" property
- setConf(Configuration) - Method in class org.apache.nutch.segment.SegmentMerger
-
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
-
Sets the configuration.
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter
-
Sets the configuration.
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
-
- setConf(Configuration) - Method in class org.apache.nutch.util.GenericWritableConfigurable
-
- setConf(Configuration) - Method in class org.creativecommons.nutch.CCIndexingFilter
-
- setConf(Configuration) - Method in class org.creativecommons.nutch.CCParseFilter
-
- setContent(byte[]) - Method in class org.apache.nutch.protocol.Content
-
- setContent(Content) - Method in class org.apache.nutch.protocol.ProtocolOutput
-
- setContentType(String) - Method in class org.apache.nutch.protocol.Content
-
- setDataTimeout(int) - Method in class org.apache.nutch.protocol.ftp.Client
-
Sets the timeout in milliseconds to use for data connection.
- setDescriptor(PluginDescriptor) - Method in class org.apache.nutch.plugin.Extension
-
Sets the plugin descriptor and is only used until model creation at system
start up.
- setDocumentLocator(Locator) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive an object for locating the origin of SAX document events.
- setFetchInterval(int) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setFetchInterval(float) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
Sets the fetchInterval and fetchTime on a successfully fetched page.
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.DefaultFetchSchedule
-
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
Sets the fetchInterval and fetchTime on a successfully fetched page.
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
-
- setFetchTime(long) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Sets either the time of the last fetch or the next fetch time,
depending on whether Fetcher or CrawlDbReducer set the time.
- setFileType(int) - Method in class org.apache.nutch.protocol.ftp.Client
-
Sets the file type to be transferred.
- setFilterFromPath(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- setFollowTalk(boolean) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set followTalk
- setFound(boolean) - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- setId(String) - Method in class org.apache.nutch.plugin.Extension
-
Sets the unique extension Id and is only used until model creation at
system start up.
- setIDAttribute(String, Element) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Set an ID string to node association in the ID table.
- setIgnoreCase(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- setInlinkScore(float) - Method in class org.apache.nutch.scoring.webgraph.Node
-
- setInputStream(InputStream) - Method in class org.apache.nutch.util.CommandRunner
-
- setKeepConnection(boolean) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set keepConnection
- setLastModified(long) - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- setLinks(LinkDumper.LinkNode[]) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
-
- setLinkType(byte) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- setLookingFor(String) - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- setLoopSet(Set<String>) - Method in class org.apache.nutch.scoring.webgraph.Loops.LoopSet
-
- setMajorCode(byte) - Method in class org.apache.nutch.parse.ParseStatus
-
- setMaxContentLength(int) - Method in class org.apache.nutch.protocol.file.File
-
Set the length at which content is truncated.
- setMaxContentLength(int) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set the point at which content is truncated.
- setMessage(String) - Method in class org.apache.nutch.parse.ParseStatus
-
- setMessage(String) - Method in class org.apache.nutch.protocol.ProtocolStatus
-
- setMeta(String, String) - Method in class org.apache.nutch.metadata.MetaWrapper
-
Set metadata.
- setMetaData(MapWritable) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setMetadata(MapWritable) - Method in class org.apache.nutch.parse.Outlink
-
- setMetadata(Metadata) - Method in class org.apache.nutch.protocol.Content
-
Other protocol-specific data.
- setMetadata(Metadata) - Method in class org.apache.nutch.scoring.webgraph.Node
-
- setMinorCode(short) - Method in class org.apache.nutch.parse.ParseStatus
-
- setModeAccept(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- setModifiedTime(long) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setNoCache() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noCache to true.
- setNode(Node) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- setNoFollow() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noFollow to true.
- setNoIndex() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noIndex to true.
- setNumInlinks(int) - Method in class org.apache.nutch.scoring.webgraph.Node
-
- setNumOutlinks(int) - Method in class org.apache.nutch.scoring.webgraph.Node
-
- setObject(String, Object) - Method in class org.apache.nutch.util.ObjectCache
-
- setOutlinks(Outlink[]) - Method in class org.apache.nutch.parse.ParseData
-
- setOutlinkUrl(String) - Method in class org.apache.nutch.scoring.webgraph.Loops.Route
-
- setPageGoneSchedule(Text, CrawlDatum, long, long, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method specifies how to schedule refetching of pages
marked as GONE.
- setPageGoneSchedule(Text, CrawlDatum, long, long, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method specifies how to schedule refetching of pages
marked as GONE.
- setPageRetrySchedule(Text, CrawlDatum, long, long, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method adjusts the fetch schedule if fetching needs to be
re-tried due to transient errors.
- setPageRetrySchedule(Text, CrawlDatum, long, long, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method adjusts the fetch schedule if fetching needs to be
re-tried due to transient errors.
- setParseMeta(Metadata) - Method in class org.apache.nutch.parse.ParseData
-
- setRefresh(boolean) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets refresh to the supplied value.
- setRefreshHref(URL) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets the refreshHref.
- setRefreshTime(int) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets the refreshTime.
- setRemoteVerificationEnabled(boolean) - Method in class org.apache.nutch.protocol.ftp.Client
-
Enable or disable verification that the remote host taking part
in a data connection is the same as the host to which the control
connection is attached.
- setRetriesSinceFetch(int) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setScore(float) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setScore(float) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- setSignature(byte[]) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setStatus(int) - Method in class org.apache.nutch.crawl.CrawlDatum
-
- setStatus(ProtocolStatus) - Method in class org.apache.nutch.protocol.ProtocolOutput
-
- setStdErrorStream(OutputStream) - Method in class org.apache.nutch.util.CommandRunner
-
- setStdOutputStream(OutputStream) - Method in class org.apache.nutch.util.CommandRunner
-
- setTimeout(int) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set the timeout.
- setTimeout(int) - Method in class org.apache.nutch.util.CommandRunner
-
- setTimestamp(long) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- setUrl(String) - Method in class org.apache.nutch.parse.Outlink
-
- setUrl(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
-
- setUrl(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
-
- setWaitForExit(boolean) - Method in class org.apache.nutch.util.CommandRunner
-
- setWeight(float) - Method in class org.apache.nutch.indexer.NutchDocument
-
- setWeight(float) - Method in class org.apache.nutch.indexer.NutchField
-
- setWhiteList(ArrayList<String>) - Method in class org.apache.nutch.collection.Subcollection
-
- setWhiteList(String) - Method in class org.apache.nutch.collection.Subcollection
-
Set contents of whitelist from String
- shortestMatch(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
-
Returns the shortest prefix of input that is matched,
or null if no match exists.
- shortestMatch(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
-
Returns the shortest suffix of input that is matched,
or null if no match exists.
- shortestMatch(String) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Returns the shortest substring of input that is
matched by a pattern in the trie, or null if no match
exists.
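Shortest-match lookup over a trie can be sketched as follows: walk the input character by character and stop at the first node that ends a stored pattern. This standalone example covers the prefix case only; the class layout is an assumption and does not mirror TrieStringMatcher's internals.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixTrieSketch {
    /** Trie node: child links plus a flag marking the end of a stored prefix. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    private final Node root = new Node();

    PrefixTrieSketch(String... prefixes) {
        for (String prefix : prefixes) {
            add(prefix);
        }
    }

    private void add(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    /** Returns the shortest stored prefix of input, or null if none matches. */
    String shortestMatch(String input) {
        Node node = root;
        for (int i = 0; i < input.length(); i++) {
            node = node.children.get(input.charAt(i));
            if (node == null) {
                return null; // fell off the trie before reaching a stored prefix
            }
            if (node.terminal) {
                return input.substring(0, i + 1); // first terminal node is the shortest match
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PrefixTrieSketch trie = new PrefixTrieSketch("http://", "http://www.");
        System.out.println(trie.shortestMatch("http://www.example.com"));
    }
}
```

A suffix variant could reuse the same structure by inserting and looking up reversed strings.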
- shouldFetch(Text, CrawlDatum, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method provides information whether the page is suitable for
selection in the current fetchlist.
- shouldFetch(Text, CrawlDatum, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method provides information whether the page is suitable for
selection in the current fetchlist.
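The scheduling idea behind these methods can be sketched in a few lines, under the assumption that a page is due once its scheduled fetch time is not later than the current time, and that fetch intervals are expressed in seconds. Both points are assumptions of this example; the names below are hypothetical and the real schedules apply further checks.

```java
class FetchScheduleSketch {
    /** Next fetch time in milliseconds, given the last fetch and an interval in seconds. */
    static long nextFetchTime(long fetchedAtMs, int intervalSeconds) {
        return fetchedAtMs + intervalSeconds * 1000L;
    }

    /** A page is eligible for the fetchlist once its fetch time has been reached. */
    static boolean shouldFetch(long fetchTimeMs, long curTimeMs) {
        return fetchTimeMs <= curTimeMs;
    }

    public static void main(String[] args) {
        long next = nextFetchTime(1_000L, 60);
        System.out.println(shouldFetch(next, 60_000L));
    }
}
```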
- shutDown() - Method in class org.apache.nutch.plugin.Plugin
-
Shutdown the plugin.
- Signature - Class in org.apache.nutch.crawl
-
- Signature() - Constructor for class org.apache.nutch.crawl.Signature
-
- SIGNATURE_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
- SignatureComparator - Class in org.apache.nutch.crawl
-
- SignatureComparator() - Constructor for class org.apache.nutch.crawl.SignatureComparator
-
- SignatureFactory - Class in org.apache.nutch.crawl
-
Factory class, which instantiates a Signature implementation according to the
current Configuration configuration.
- size() - Method in class org.apache.nutch.crawl.Inlinks
-
- size() - Method in class org.apache.nutch.crawl.MapWritable
-
Deprecated.
- size() - Method in class org.apache.nutch.metadata.Metadata
-
Returns the number of metadata names in this metadata.
- size() - Method in class org.apache.nutch.parse.ParseResult
-
Return the number of parse outputs (both successful and failed)
- skip(DataInput) - Static method in class org.apache.nutch.crawl.Inlink
-
Skips over one Inlink in the input.
- skip(DataInput) - Static method in class org.apache.nutch.parse.Outlink
-
Skips over one Outlink in the input.
- SKIP_TRUNCATED - Static variable in class org.apache.nutch.parse.ParseSegment
-
- skipChildren() - Method in class org.apache.nutch.util.NodeWalker
-
Skips over and removes from the node stack the children of the last
node.
- skippedEntity(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of a skipped entity.
- SOLR_PREFIX - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
-
- SolrConstants - Interface in org.apache.nutch.indexwriter.solr
-
- SolrIndexWriter - Class in org.apache.nutch.indexwriter.solr
-
- SolrIndexWriter() - Constructor for class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
- SolrMappingReader - Class in org.apache.nutch.indexwriter.solr
-
- SolrMappingReader(Configuration) - Constructor for class org.apache.nutch.indexwriter.solr.SolrMappingReader
-
- SolrUtils - Class in org.apache.nutch.indexwriter.solr
-
- SolrUtils() - Constructor for class org.apache.nutch.indexwriter.solr.SolrUtils
-
- SOURCE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A reference to a resource from which the present resource is derived.
- SpellCheckedMetadata - Class in org.apache.nutch.metadata
-
A decorator to Metadata that adds spellchecking capabilities to property
names.
- SpellCheckedMetadata() - Constructor for class org.apache.nutch.metadata.SpellCheckedMetadata
-
- splitEnd - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- splitLen - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- splitStart - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- start - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- startCDATA() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the start of a CDATA section.
- startDocument() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of the beginning of a document.
- startDTD(String, String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the start of DTD declarations, if any.
- startElement(String, String, String, Attributes) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of the beginning of an element.
- startEntity(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the beginning of an entity.
- startPrefixMapping(String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Begin the scope of a prefix-URI Namespace mapping.
- startUp() - Method in class org.apache.nutch.plugin.Plugin
-
Invoked at plugin start-up.
- StaticFieldIndexer - Class in org.apache.nutch.indexer.staticfield
-
A simple plugin called at indexing that adds fields with static data.
- StaticFieldIndexer() - Constructor for class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
- statNames - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
- STATUS_BLOCKED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_DB_DUPLICATE - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
- STATUS_DB_FETCHED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was successfully fetched.
- STATUS_DB_GONE - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page no longer exists.
- STATUS_DB_MAX - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Maximum value of DB-related status.
- STATUS_DB_NOTMODIFIED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was successfully fetched and found not modified.
- STATUS_DB_REDIR_PERM - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page permanently redirects to another page.
- STATUS_DB_REDIR_TEMP - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page temporarily redirects to another page.
- STATUS_DB_UNFETCHED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page has not been fetched yet.
- STATUS_FAILED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_FAILURE - Static variable in class org.apache.nutch.parse.ParseStatus
-
- STATUS_FETCH_GONE - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching unsuccessful - page is gone.
- STATUS_FETCH_MAX - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Maximum value of fetch-related status.
- STATUS_FETCH_NOTMODIFIED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching successful - page is not modified.
- STATUS_FETCH_REDIR_PERM - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching permanently redirected to another page.
- STATUS_FETCH_REDIR_TEMP - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching temporarily redirected to another page.
- STATUS_FETCH_RETRY - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching unsuccessful, needs to be retried (transient errors).
- STATUS_FETCH_SUCCESS - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching was successful.
- STATUS_GONE - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_INJECTED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was newly injected.
- STATUS_LINKED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page discovered through a link.
- STATUS_MODIFIED - Static variable in interface org.apache.nutch.crawl.FetchSchedule
-
Page is known to have been modified since our last visit.
- STATUS_NOTFETCHING - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_NOTFOUND - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_NOTMODIFIED - Static variable in interface org.apache.nutch.crawl.FetchSchedule
-
Page is known to remain unmodified since our last visit.
- STATUS_NOTMODIFIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_NOTPARSED - Static variable in class org.apache.nutch.parse.ParseStatus
-
- STATUS_PARSE_META - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page got metadata from a parser.
- STATUS_REDIR_EXCEEDED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_RETRY - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_ROBOTS_DENIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_SIGNATURE - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page signature.
- STATUS_SUCCESS - Static variable in class org.apache.nutch.parse.ParseStatus
-
- STATUS_SUCCESS - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_UNKNOWN - Static variable in interface org.apache.nutch.crawl.FetchSchedule
-
It is unknown whether the page was changed since our last visit.
- STATUS_WOULDBLOCK - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- StringUtil - Class in org.apache.nutch.util
-
A collection of String processing utility methods.
- StringUtil() - Constructor for class org.apache.nutch.util.StringUtil
-
- stripNonCharCodepoints(String) - Static method in class org.apache.nutch.indexwriter.solr.SolrUtils
-
- Subcollection - Class in org.apache.nutch.collection
-
A Subcollection represents a subset of the index. URL patterns can be defined
to indicate that a particular page (URL) is part of the Subcollection.
- Subcollection(String, String, Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
-
Public constructor.
- Subcollection(String, String, String, Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
-
Public constructor.
- Subcollection(Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
-
- SubcollectionIndexingFilter - Class in org.apache.nutch.indexer.subcollection
-
- SubcollectionIndexingFilter() - Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- SubcollectionIndexingFilter(Configuration) - Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- SUBJECT - Static variable in interface org.apache.nutch.metadata.DublinCore
-
The topic of the content of the resource.
- SUCCESS - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing succeeded.
- SUCCESS - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Content was retrieved without errors.
- SUCCESS_REDIRECT - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsed content contains a directive to redirect to another URL.
- SuffixStringMatcher - Class in org.apache.nutch.util
-
A class for efficiently matching Strings against a set of suffixes.
- SuffixStringMatcher(String[]) - Constructor for class org.apache.nutch.util.SuffixStringMatcher
-
Creates a new SuffixStringMatcher which will match Strings with any suffix in the supplied array.
- SuffixStringMatcher(Collection<String>) - Constructor for class org.apache.nutch.util.SuffixStringMatcher
-
Creates a new SuffixStringMatcher which will match Strings with any suffix in the supplied Collection.
- SuffixURLFilter - Class in org.apache.nutch.urlfilter.suffix
-
Filters URLs based on a file of URL suffixes.
- SuffixURLFilter() - Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- SuffixURLFilter(Reader) - Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- SWFParser - Class in org.apache.nutch.parse.swf
-
Parser for Flash SWF files.
- SWFParser() - Constructor for class org.apache.nutch.parse.swf.SWFParser
-