Modifier and Type | Field and Description |
---|---|
protected String | accept The "Accept" request header value. |
protected String | acceptLanguage The "Accept-Language" request header value. |
static int | BUFFER_SIZE |
protected int | maxContent The length limit for downloaded content, in bytes. |
protected long | maxCrawlDelay Skip the page if its Crawl-Delay is longer than this value. |
protected String | proxyHost The proxy hostname. |
protected int | proxyPort The proxy port. |
static org.apache.hadoop.io.Text | RESPONSE_TIME |
protected boolean | responseTime Record the response time in the CrawlDatum's metadata; see the property http.store.responsetime. |
protected int | timeout The network timeout, in milliseconds. |
protected boolean | useHttp11 Whether to use HTTP/1.1. |
protected boolean | useProxy Indicates whether a proxy is used. |
protected String | userAgent The Nutch "User-Agent" request header. |
Fields inherited from interface Protocol: CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID
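Most of these fields are populated from the Hadoop configuration in setConf. A sketch of the corresponding nutch-site.xml entries is below; http.store.responsetime is named on this page, while the other property names follow common Nutch configuration and should be checked against nutch-default.xml:

```xml
<!-- Hypothetical nutch-site.xml fragment backing HttpBase fields. -->
<configuration>
  <property>
    <name>http.agent.name</name>         <!-- userAgent -->
    <value>MyCrawler</value>
  </property>
  <property>
    <name>http.timeout</name>            <!-- timeout, in milliseconds -->
    <value>10000</value>
  </property>
  <property>
    <name>http.content.limit</name>      <!-- maxContent, in bytes -->
    <value>65536</value>
  </property>
  <property>
    <name>http.store.responsetime</name> <!-- responseTime -->
    <value>true</value>
  </property>
</configuration>
```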
Constructor and Description |
---|
HttpBase() Creates a new instance of HttpBase. |
HttpBase(org.slf4j.Logger logger) Creates a new instance of HttpBase. |
Modifier and Type | Method and Description |
---|---|
String | getAccept() |
String | getAcceptLanguage() Value of the "Accept-Language" request header sent by Nutch. |
org.apache.hadoop.conf.Configuration | getConf() |
int | getMaxContent() |
ProtocolOutput | getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) Returns the Content for a fetchlist entry. |
String | getProxyHost() |
int | getProxyPort() |
protected abstract Response | getResponse(URL url, CrawlDatum datum, boolean followRedirects) |
crawlercommons.robots.BaseRobotRules | getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) Retrieves the robot rules applicable to this URL. |
int | getTimeout() |
boolean | getUseHttp11() |
String | getUserAgent() |
protected void | logConf() |
protected static void | main(HttpBase http, String[] args) |
byte[] | processDeflateEncoded(byte[] compressed, URL url) |
byte[] | processGzipEncoded(byte[] compressed, URL url) |
void | setConf(org.apache.hadoop.conf.Configuration conf) |
boolean | useProxy() |
public static final org.apache.hadoop.io.Text RESPONSE_TIME
public static final int BUFFER_SIZE
protected String proxyHost
protected int proxyPort
protected boolean useProxy
protected int timeout
protected int maxContent
protected String userAgent
protected String acceptLanguage
protected String accept
protected boolean useHttp11
protected boolean responseTime
protected long maxCrawlDelay
public HttpBase()
public HttpBase(org.slf4j.Logger logger)
public void setConf(org.apache.hadoop.conf.Configuration conf)
Specified by: setConf in interface org.apache.hadoop.conf.Configurable
public org.apache.hadoop.conf.Configuration getConf()
Specified by: getConf in interface org.apache.hadoop.conf.Configurable
public ProtocolOutput getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum)
Description copied from interface: Protocol
Returns the Content for a fetchlist entry.
Specified by: getProtocolOutput in interface Protocol
public String getProxyHost()
public int getProxyPort()
public boolean useProxy()
public int getTimeout()
public int getMaxContent()
public String getUserAgent()
public String getAcceptLanguage()
public String getAccept()
public boolean getUseHttp11()
protected void logConf()
public byte[] processGzipEncoded(byte[] compressed, URL url) throws IOException
Throws: IOException
public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException
Throws: IOException
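The processGzipEncoded and processDeflateEncoded helpers decompress response bodies while respecting the maxContent length limit. A minimal, stdlib-only sketch of the same idea for gzip (gunzipCapped is a hypothetical name for illustration, not the Nutch API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSketch {
    // Decompress gzip data, stopping once maxContent bytes have been
    // produced -- the same kind of cap HttpBase applies via maxContent.
    static byte[] gunzipCapped(byte[] compressed, int maxContent) throws IOException {
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while (out.size() < maxContent && (n = in.read(buf)) != -1) {
                // Never write past the cap, even mid-buffer.
                out.write(buf, 0, Math.min(n, maxContent - out.size()));
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hello, nutch".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            gz.write(original);
        }
        byte[] restored = gunzipCapped(baos.toByteArray(), 1024);
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

Truncating at the cap rather than failing lets the fetcher keep a usable prefix of oversized pages instead of discarding them outright.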
protected abstract Response getResponse(URL url, CrawlDatum datum, boolean followRedirects) throws ProtocolException, IOException
Throws: ProtocolException, IOException
public crawlercommons.robots.BaseRobotRules getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum)
Description copied from interface: Protocol
Retrieve the robot rules applicable to this URL.
Specified by: getRobotRules in interface Protocol
Parameters:
url - the URL to check
datum - the page datum

Copyright © 2014 The Apache Software Foundation