Modifier and Type | Method and Description |
---|---|
ParseResult |
HTMLLanguageParser.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Scan the HTML document looking for possible indications of content
language.
|
Modifier and Type | Method and Description |
---|---|
abstract byte[] |
Signature.calculate(Content content,
Parse parse) |
byte[] |
MD5Signature.calculate(Content content,
Parse parse) |
byte[] |
TextProfileSignature.calculate(Content content,
Parse parse) |
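
The `calculate` variants listed above each reduce a page to a `byte[]` signature. As a rough illustration only, the sketch below hashes raw content bytes with the JDK's `MessageDigest`. The class `Md5SignatureSketch` and its `calculate(byte[])` signature are hypothetical stand-ins; they do not reproduce `MD5Signature`, which takes a `Content` and a `Parse`.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in, not the Nutch class: it hashes raw bytes only.
final class Md5SignatureSketch {

    // Returns the 16-byte MD5 digest of the given content bytes.
    static byte[] calculate(byte[] contentBytes) {
        try {
            return MessageDigest.getInstance("MD5").digest(contentBytes);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on the Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Renders a digest as lowercase hex, e.g. for logging or comparison.
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Two pages with identical bytes produce identical signatures under this sketch, which is the property an exact-duplicate check relies on.
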
Modifier and Type | Method and Description |
---|---|
ParseResult |
RelTagParser.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Scan the HTML document looking for possible rel-tags.
|
Modifier and Type | Method and Description |
---|---|
ParseResult |
MetaTagsParser.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc) |
ParseResult |
HtmlParseFilters.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Run all defined filters.
|
ParseResult |
HtmlParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Adds metadata or otherwise modifies a parse of HTML content, given
the DOM tree of a page.
|
ParseResult |
Parser.getParse(Content c)
This method parses the given content and returns a map of
<key, parse> pairs.
|
static boolean |
ParseSegment.isTruncated(Content content)
Checks if the page's content is truncated.
|
void |
ParseSegment.map(org.apache.hadoop.io.WritableComparable<?> key,
Content content,
org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,ParseImpl> output,
org.apache.hadoop.mapred.Reporter reporter) |
ParseResult |
ParseUtil.parse(Content content)
|
ParseResult |
ParseUtil.parseByExtensionId(String extId,
Content content)
|
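
`HtmlParseFilters.filter` is described above as running all defined filters. One way to picture that is a chain in which each filter receives the result of the previous one. The sketch below assumes exactly that; `FilterChainSketch`, `PageFilter`, and the use of a plain `Map` in place of `ParseResult` are hypothetical simplifications, not the Nutch types.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a filter chain; a Map stands in for ParseResult.
final class FilterChainSketch {

    // Hypothetical stand-in for an HTML parse filter.
    interface PageFilter {
        Map<String, String> filter(String html, Map<String, String> parseMeta);
    }

    // Runs every filter in order, feeding each one the previous result.
    static Map<String, String> runAll(List<PageFilter> filters, String html) {
        Map<String, String> result = new LinkedHashMap<>();
        for (PageFilter f : filters) {
            result = f.filter(html, result);
        }
        return result;
    }
}
```

Because each filter returns the (possibly modified) result, a filter can add metadata, replace the result, or pass it through untouched.
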
Modifier and Type | Method and Description |
---|---|
ParseResult |
ExtParser.getParse(Content content) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
FeedParser.getParse(Content content)
Parses the given feed, then extracts and parses all linked items within
the feed, using the underlying ROME feed parsing library.
|
Modifier and Type | Method and Description |
---|---|
ParseResult |
HeadingsParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
HtmlParser.getParse(Content content) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
JSParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc) |
ParseResult |
JSParseFilter.getParse(Content c) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
SWFParser.getParse(Content content) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
TikaParser.getParse(Content content) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
ZipParser.getParse(Content content) |
Modifier and Type | Method and Description |
---|---|
Content |
ProtocolOutput.getContent() |
static Content |
Content.read(DataInput in) |
Modifier and Type | Method and Description |
---|---|
void |
ProtocolOutput.setContent(Content content) |
Constructor and Description |
---|
ProtocolOutput(Content content) |
ProtocolOutput(Content content,
ProtocolStatus status) |
Modifier and Type | Method and Description |
---|---|
Content |
FileResponse.toContent() |
Modifier and Type | Method and Description |
---|---|
Content |
FtpResponse.toContent() |
Modifier and Type | Method and Description |
---|---|
void |
ScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Currently, part of the score distribution is performed using only data
from the parsing process.
|
void |
AbstractScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse) |
void |
ScoringFilters.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse) |
void |
ScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
This method takes all relevant score information from the current datum
(coming from a generated fetchlist) and stores it into
Content metadata. |
void |
AbstractScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content) |
void |
ScoringFilters.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content) |
Modifier and Type | Method and Description |
---|---|
void |
LinkAnalysisScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse) |
void |
LinkAnalysisScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content) |
Modifier and Type | Method and Description |
---|---|
void |
OPICScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
|
void |
OPICScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
|
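
The two `OPICScoringFilter` methods above describe a hand-off: the score is written into content metadata before parsing and copied into parse data afterwards. A minimal sketch of that flow follows, assuming metadata can be modelled as `Map<String, String>`. The class name, the method signatures, and the key value `"sketch.score"` are all hypothetical; the real key is `Fetcher.SCORE_KEY`, whose value is not given here.

```java
import java.util.Map;

// Hypothetical sketch of the score hand-off; plain maps stand in for metadata.
final class ScoreHandOffSketch {

    // Placeholder key for illustration; not the actual value of Fetcher.SCORE_KEY.
    static final String SCORE_KEY = "sketch.score";

    // Before parsing: store the datum's score as text in the content metadata.
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    // After parsing: copy the stored score, if present, into the parse metadata.
    static void passScoreAfterParsing(Map<String, String> contentMeta,
                                      Map<String, String> parseMeta) {
        String score = contentMeta.get(SCORE_KEY);
        if (score != null) {
            parseMeta.put(SCORE_KEY, score);
        }
    }
}
```
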
Modifier and Type | Method and Description |
---|---|
void |
TLDScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse) |
void |
TLDScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content) |
Modifier and Type | Method and Description |
---|---|
void |
URLMetaScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Takes the metadata, which was lumped inside the content, and replicates it
within your parse data.
|
void |
URLMetaScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
Takes the metadata, specified in your "urlmeta.tags" property, from the
datum object and injects it into the content.
|
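
The `URLMetaScoringFilter` descriptions above amount to copying a configured set of tags from one metadata holder to the next (datum to content, then content to parse data). The sketch below shows one such copy step under the assumption that metadata is a `Map<String, String>` and that the tag names from `urlmeta.tags` have already been split into an array. `UrlMetaCopySketch` and `copyTags` are hypothetical names, and how the property is actually parsed is not shown.

```java
import java.util.Map;

// Hypothetical sketch: copies only the configured tags between metadata maps.
final class UrlMetaCopySketch {

    // Copies each configured tag that is present in 'from' into 'to'.
    // Tags absent from the source are skipped rather than written as null.
    static void copyTags(String[] tags, Map<String, String> from, Map<String, String> to) {
        for (String tag : tags) {
            String value = from.get(tag);
            if (value != null) {
                to.put(tag, value);
            }
        }
    }
}
```

Calling the same helper twice, first from datum to content and then from content to parse data, mirrors the before and after parsing steps.
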
Modifier and Type | Method and Description |
---|---|
boolean |
SegmentMergeFilter.filter(org.apache.hadoop.io.Text key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection<CrawlDatum> linked)
The filtering method which gets all information being merged for a given
key (URL).
|
boolean |
SegmentMergeFilters.filter(org.apache.hadoop.io.Text key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection<CrawlDatum> linked)
Iterates over all
SegmentMergeFilter extensions; if any of them
returns false, this method returns false as well. |
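
The behaviour described for `SegmentMergeFilters.filter` is a veto chain: one `false` from any filter rejects the record. The sketch below expresses that with `java.util.function.Predicate` over a single value. The class `MergeFilterChainSketch` is hypothetical, and collapsing the eight documented arguments into one generic record is a simplification made for brevity.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of a veto chain; one Predicate stands in for each filter.
final class MergeFilterChainSketch {

    // Returns false as soon as any filter rejects the record, true otherwise.
    // An empty filter list accepts every record.
    static <T> boolean filter(List<Predicate<T>> filters, T record) {
        for (Predicate<T> f : filters) {
            if (!f.test(record)) {
                return false;
            }
        }
        return true;
    }
}
```
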
Modifier and Type | Method and Description |
---|---|
void |
EncodingDetector.autoDetectClues(Content content,
boolean filter) |
String |
EncodingDetector.guessEncoding(Content content,
String defaultValue)
Guess the encoding with the previously specified list of clues.
|
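
`guessEncoding` is described above as choosing from previously collected clues, with a caller-supplied default. The sketch below is one plausible reading, assuming each clue carries a confidence value and the highest-confidence clue wins. `EncodingGuessSketch`, `Clue`, `addClue`, and the selection rule are all assumptions made for illustration and may differ from what `EncodingDetector` actually does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: picks the highest-confidence clue, else the default.
final class EncodingGuessSketch {

    // A clue pairs a charset name with a confidence; both fields are assumptions.
    static final class Clue {
        final String charset;
        final int confidence;

        Clue(String charset, int confidence) {
            this.charset = charset;
            this.confidence = confidence;
        }
    }

    private final List<Clue> clues = new ArrayList<>();

    // Records a clue for later use by guessEncoding.
    void addClue(String charset, int confidence) {
        clues.add(new Clue(charset, confidence));
    }

    // Returns the charset of the strongest clue, or defaultValue if none exist.
    // On a tie, the clue that was added first is kept.
    String guessEncoding(String defaultValue) {
        Clue best = null;
        for (Clue c : clues) {
            if (best == null || c.confidence > best.confidence) {
                best = c;
            }
        }
        return best == null ? defaultValue : best.charset;
    }
}
```
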
Modifier and Type | Method and Description |
---|---|
ParseResult |
CCParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Adds metadata or otherwise modifies a parse of an HTML document, given
the DOM tree of a page.
|
Copyright © 2014 The Apache Software Foundation