file.content.limit
65536
The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.
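The truncation rule described above can be sketched as follows. This is a minimal illustration, not Nutch's own code; the function name `apply_content_limit` is hypothetical.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Truncate content to `limit` bytes when limit >= 0; otherwise keep it whole."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


page = b"x" * 70000
print(len(apply_content_limit(page, 65536)))  # default limit: truncated
print(len(apply_content_limit(page, -1)))     # negative limit: no truncation
```

Note that under this rule a limit of 0 is still nonnegative, so it truncates everything rather than disabling the limit.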
file.content.ignored
true
If true, no file content will be saved during fetch.
This is probably what we want most of the time, since file:// URLs
are meant to be local and we can always use them directly at the parsing
and indexing stages. Otherwise file contents will be saved.
!! NOT IMPLEMENTED YET !!
http.agent.name
HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
http.robots.agents
*
The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
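The precedence rule can be illustrated with a small sketch. It assumes case-insensitive comparison of agent names, which the text above does not state; `pick_agent` is a hypothetical helper, not part of Nutch.

```python
def pick_agent(robots_agents_value, agents_in_robots_txt):
    """Return the first configured agent that robots.txt has rules for.

    robots_agents_value: the comma-separated http.robots.agents value,
    in decreasing order of precedence.
    agents_in_robots_txt: the User-agent names found in a robots.txt file.
    Returns None when no configured agent is mentioned.
    """
    # Assumption: names are compared case-insensitively.
    present = {name.strip().lower() for name in agents_in_robots_txt}
    for candidate in robots_agents_value.split(","):
        candidate = candidate.strip()
        if candidate.lower() in present:
            return candidate
    return None


print(pick_agent("BlurflDev,Blurfl,*", ["*", "Blurfl"]))
```

Keeping `*` last means that rules written for the crawler by name win over the catch-all rules whenever both are present.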
http.robots.403.allow
true
Some servers return HTTP status 403 (Forbidden) if
/robots.txt doesn't exist. This should probably mean that we are
allowed to crawl the site nonetheless. If this is set to false,
then such sites will be treated as forbidden.
http.agent.description
Further description of our bot. This text is used in
the User-Agent header. It appears in parentheses after the agent name.
http.agent.url
A URL to advertise in the User-Agent header. This will
appear in parentheses after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
http.agent.email
An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
http.agent.version
Nutch-0.8.1
A version string to advertise in the User-Agent
header.
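As a rough sketch of how the http.agent.* properties could combine into one User-Agent value: the descriptions above only say that the description and URL appear in parentheses after the agent name, so the exact layout used here (a slash before the version, semicolons between the parenthesized parts) is an assumption, and `build_user_agent` is a hypothetical helper.

```python
def build_user_agent(name, version="", description="", url="", email=""):
    """Assemble a User-Agent string from the http.agent.* values.

    Assumed layout: name/version (description; url; email)
    """
    if not name.strip():
        # http.agent.name MUST NOT be empty.
        raise ValueError("http.agent.name must not be empty")
    agent = name.strip()
    if version:
        agent += "/" + version
    # Only the non-empty optional parts go inside the parentheses.
    details = [part for part in (description, url, email) if part]
    if details:
        agent += " (" + "; ".join(details) + ")"
    return agent


print(build_user_agent(
    "Blurfl",
    "Nutch-0.8.1",
    "test crawler",
    "http://www.example.com/bot.html",
    "info at example dot com",
))
```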
http.timeout
10000
The default network timeout, in milliseconds.
http.max.delays
100
The number of times a thread will delay when trying to
fetch a page. Each time it finds that a host is busy, it will wait
fetcher.server.delay. After http.max.delays attempts, it will give
up on the page for now.
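The retry behaviour can be sketched as a loop. This is only an illustration of the description above: `wait_for_host` and the `is_busy` callback are hypothetical, and the sketch adds up the waiting time instead of actually sleeping.

```python
def wait_for_host(is_busy, max_delays=100, server_delay_s=5.0):
    """Return (ready, seconds_waited) for one page fetch attempt.

    is_busy: callable returning True while the host is busy.
    """
    waited = 0.0
    for _ in range(max_delays):
        if not is_busy():
            return True, waited
        waited += server_delay_s  # a real fetcher thread would sleep here
    # http.max.delays attempts used up: give up on the page for now.
    return False, waited


print(wait_for_host(lambda: True))
```

With the defaults (100 delays of 5.0 seconds each), a thread gives up on a permanently busy host after 500 seconds of waiting.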
http.content.limit
65536
The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.
http.proxy.host
The proxy hostname. If empty, no proxy is used.
http.proxy.port
The proxy port.
http.verbose
false
If true, HTTP will log more verbosely.
http.redirect.max
3
The maximum number of redirects the fetcher will follow when
trying to fetch a page.
http.useHttp11
false
NOTE: at the moment this works only for protocol-httpclient.
If true, use HTTP 1.1; if false, use HTTP 1.0.
ftp.username
anonymous
ftp login username.
ftp.password
anonymous@example.com
ftp login password.
ftp.content.limit
65536
The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.
Caution: the classical ftp RFCs never define partial transfer and, in fact,
some ftp servers out there do not handle client side forced close-down very
well. Our implementation tries its best to handle such situations smoothly.
ftp.timeout
60000
Default timeout for ftp client socket, in millisec.
Please also see ftp.keep.connection below.
ftp.server.timeout
100000
An estimate of the ftp server idle timeout, in millisec.
Typically it is 120000 millisec for many ftp servers out there.
It is better to be conservative here. Together with ftp.timeout, it is used to
decide whether to discard the current ftp.client instance and
start a new one. This is necessary because
a fetcher thread may not be able to obtain the next request from the queue in time
(due to idleness) before our ftp client times out or the remote server
disconnects. Used only when ftp.keep.connection is true (please see below).
ftp.keep.connection
false
Whether to keep the ftp connection open. Useful if crawling the same host
again and again. When set to true, it avoids connection, login and dir list
parser setup for subsequent urls. If it is set to true, however, you must
make sure (roughly):
(1) ftp.timeout is less than ftp.server.timeout
(2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay)
Otherwise there will be too many "delete client because idled too long"
messages in thread logs.
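The two rough conditions can be checked mechanically. In this sketch the function name is hypothetical, and the conversion of fetcher.server.delay from seconds to milliseconds before comparing it with ftp.timeout is an assumption about how condition (2) is meant to be read.

```python
def check_ftp_keep_connection(ftp_timeout_ms, ftp_server_timeout_ms,
                              threads_fetch, server_delay_s):
    """Return a list of violated conditions (empty if the settings look sane)."""
    problems = []
    # (1) ftp.timeout is less than ftp.server.timeout
    if not ftp_timeout_ms < ftp_server_timeout_ms:
        problems.append("ftp.timeout should be less than ftp.server.timeout")
    # (2) ftp.timeout is larger than fetcher.threads.fetch * fetcher.server.delay
    # Assumption: the delay is in seconds, so convert it to milliseconds.
    queue_wait_ms = threads_fetch * server_delay_s * 1000
    if not ftp_timeout_ms > queue_wait_ms:
        problems.append(
            "ftp.timeout should be larger than "
            "fetcher.threads.fetch * fetcher.server.delay"
        )
    return problems


# Default values from this file: 60000, 100000, 10 threads, 5.0 seconds.
print(check_ftp_keep_connection(60000, 100000, 10, 5.0))
```

Under that reading the defaults satisfy both conditions (60000 < 100000 and 60000 > 50000), while doubling the thread count to 20 would violate the second.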
ftp.follow.talk
false
Whether to log dialogue between our client and remote
server. Useful for debugging.
db.default.fetch.interval
30
The default number of days between re-fetches of a page.
db.ignore.internal.links
true
If true, when adding new links to a page, links from
the same host are ignored. This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
db.ignore.external.links
false
If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
db.score.injected
1.0
The score of new pages added by the injector.
db.score.link.external
1.0
The score factor for new pages added due to a link from
another host relative to the referencing page's score. Scoring plugins
may use this value to affect initial scores of external links.
db.score.link.internal
1.0
The score factor for pages added due to a link from the
same host, relative to the referencing page's score. Scoring plugins
may use this value to affect initial scores of internal links.
db.score.count.filtered
false
The score value passed to newly discovered pages is
calculated as a fraction of the original page score divided by the
number of outlinks. If this option is false, only the outlinks that passed
URLFilters will count; if it is true, all outlinks will count.
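A small sketch of how the option changes the divisor. The `factor` argument stands in for the unspecified "fraction" mentioned above and is an assumption, as is the function name; actual scoring plugins may compute this differently.

```python
def outlink_score(page_score, total_outlinks, passed_outlinks,
                  count_filtered=False, factor=1.0):
    """Score passed to each newly discovered page.

    count_filtered=False: divide by the outlinks that passed the URLFilters.
    count_filtered=True:  divide by all outlinks found on the page.
    """
    divisor = total_outlinks if count_filtered else passed_outlinks
    if divisor == 0:
        return 0.0
    return factor * page_score / divisor


# A page with score 1.0 and 10 outlinks, 4 of which passed the filters.
print(outlink_score(1.0, 10, 4, count_filtered=False))
print(outlink_score(1.0, 10, 4, count_filtered=True))
```

With the option set to false, heavy filtering leaves more score for each surviving outlink; with it set to true, the filtered-out links still dilute the score.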
db.max.inlinks
10000
Maximum number of Inlinks per URL to be kept in LinkDb.
If "invertlinks" finds more inlinks than this number, only the first
N inlinks will be stored, and the rest will be discarded.
db.max.outlinks.per.page
100
The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
db.max.anchor.length
100
The maximum number of characters permitted in an anchor.
db.fetch.retry.max
3
The maximum number of times a url that has encountered
recoverable errors is generated for fetch.
db.signature.class
org.apache.nutch.crawl.MD5Signature
The default implementation of a page signature. Signatures
created with this implementation will be used for duplicate detection
and removal.
db.signature.text_profile.min_token_len
2
Minimum token length to be included in the signature.
db.signature.text_profile.quant_rate
0.01
Profile frequencies will be rounded down to a multiple of
QUANT = (int)(QUANT_RATE * maxFreq), where maxFreq is a maximum token
frequency. If maxFreq > 1 then QUANT will be at least 2, which means that
for longer texts tokens with frequency 1 will always be discarded.
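The quantization step can be sketched as below. It follows the description above (round frequencies down to a multiple of QUANT, with QUANT forced to at least 2 when maxFreq > 1); the function name, the dict-based profile, and the choice of QUANT = 1 when maxFreq is 1 are assumptions made for the illustration.

```python
def quantize_profile(freqs, quant_rate=0.01):
    """Round token frequencies down to a multiple of QUANT, dropping zeros.

    freqs: dict mapping token -> frequency.
    """
    max_freq = max(freqs.values(), default=0)
    quant = int(quant_rate * max_freq)
    if quant < 2:
        # QUANT is at least 2 whenever some token occurs more than once.
        quant = 2 if max_freq > 1 else 1
    result = {}
    for token, count in freqs.items():
        rounded = (count // quant) * quant
        if rounded > 0:  # tokens rounded down to 0 are discarded
            result[token] = rounded
    return result


print(quantize_profile({"nutch": 5, "crawl": 3, "rare": 1}))
```

Here maxFreq is 5, so QUANT becomes 2 and the token with frequency 1 drops out, which makes the signature insensitive to words that appear only once.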
generate.max.per.host
-1
The maximum number of urls per host in a single
fetchlist. -1 if unlimited.
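The per-host cap can be sketched as a simple counting pass. The function name is hypothetical, hosts are compared by name only, and treating every negative value (not just -1) as "unlimited" is an assumption.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_per_host(urls, max_per_host=-1):
    """Keep at most max_per_host URLs for each host name."""
    counts = defaultdict(int)
    selected = []
    for url in urls:
        host = urlsplit(url).hostname
        if max_per_host >= 0 and counts[host] >= max_per_host:
            continue  # this host already filled its share of the fetchlist
        counts[host] += 1
        selected.append(url)
    return selected


candidates = [
    "http://a.example.com/1",
    "http://a.example.com/2",
    "http://b.example.com/1",
    "http://a.example.com/3",
]
print(cap_per_host(candidates, max_per_host=2))
```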
generate.max.per.host.by.ip
false
If false, same host names are counted. If true,
hosts' IP addresses are resolved and identical IPs are counted.
-+-+-+- WARNING !!! -+-+-+-
When set to true, Generator will create a lot of DNS lookup
requests, rapidly. This may cause a DoS attack on
remote DNS servers, not to mention increased external traffic
and latency. For these reasons when using this option it is
required that a local caching DNS be used.
fetcher.server.delay
5.0
The number of seconds the fetcher will delay between
successive requests to the same server.
fetcher.max.crawl.delay
30
If the Crawl-Delay in robots.txt is set to greater than this value (in
seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the
amount of time retrieved from robots.txt Crawl-Delay, however long that
might be.
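The skip-or-wait decision can be sketched as follows. The function is hypothetical, and falling back to fetcher.server.delay when robots.txt carries no Crawl-Delay is an assumption not stated above.

```python
def effective_delay(robots_crawl_delay_s, max_crawl_delay_s=30,
                    server_delay_s=5.0):
    """Return the delay to honour in seconds, or None to skip the page."""
    if robots_crawl_delay_s is None:
        # Assumption: without a Crawl-Delay, fetcher.server.delay applies.
        return server_delay_s
    if max_crawl_delay_s >= 0 and robots_crawl_delay_s > max_crawl_delay_s:
        return None  # Crawl-Delay exceeds fetcher.max.crawl.delay: skip
    return robots_crawl_delay_s


print(effective_delay(60))                        # too long: page is skipped
print(effective_delay(60, max_crawl_delay_s=-1))  # -1: wait however long
```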
fetcher.threads.fetch
10
The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).
fetcher.threads.per.host
1
This number is the maximum number of threads that
should be allowed to access a host at one time.
fetcher.threads.per.host.by.ip
true
If true, then fetcher will count threads by IP address,
to which the URL's host name resolves. If false, only host name will be
used. NOTE: this should be set to the same value as
"generate.max.per.host.by.ip" - default settings are different only for
reasons of backward-compatibility.
fetcher.verbose
false
If true, fetcher will log more verbosely.
fetcher.parse
true
If true, fetcher will parse content.
fetcher.store.content
true
If true, fetcher will store content.
indexer.score.power
0.5
Determines the power of link analysis scores. Each
page's boost is set to score^scorePower, where
score is its link analysis score and scorePower is the
value of this parameter. This is compiled into indexes, so when
this is changed, pages must be re-indexed for it to take
effect.
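The boost formula, score raised to the power scorePower, is easy to check numerically. A minimal sketch with a hypothetical function name:

```python
def page_boost(score, score_power=0.5):
    """Index-time boost: the link analysis score raised to score_power."""
    return score ** score_power


print(page_boost(16.0))        # default power 0.5 is a square root
print(page_boost(16.0, 1.0))   # power 1.0 uses the score unchanged
```

With the default of 0.5 the boost grows with the square root of the score, so a page with 16 times the score gets only 4 times the boost.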
indexer.max.title.length
100
The maximum number of characters of a title that are indexed.
indexer.max.tokens
10000
The maximum number of tokens that will be indexed for a single field
in a document. This limits the amount of memory required for
indexing, so that collections with very large files will not crash
the indexing process by running out of memory.
Note that this effectively truncates large documents, excluding
from the index tokens that occur further in the document. If you
know your source documents are large, be sure to set this value
high enough to accommodate the expected size. If you set it to
Integer.MAX_VALUE, then the only limit is your memory, but you
should anticipate an OutOfMemoryError.
indexer.mergeFactor
50
The factor that determines the frequency of Lucene segment
merges. This must not be less than 2. Higher values increase indexing
speed but lead to increased RAM usage, and increase the number of
open file handles (which may lead to "Too many open files" errors).
NOTE: the "segments" here have nothing to do with Nutch segments, they
are a low-level data unit used by Lucene.
indexer.minMergeDocs
50
This number determines the minimum number of Lucene
Documents buffered in memory between Lucene segment merges. Larger
values increase indexing speed and increase RAM usage.
indexer.maxMergeDocs
2147483647
This number determines the maximum number of Lucene
Documents to be merged into a new Lucene segment. Larger values
increase batch indexing speed and reduce the number of Lucene segments,
which reduces the number of open file handles; however, this also
decreases incremental indexing performance.
indexer.termIndexInterval
128
Determines the fraction of terms which Lucene keeps in
RAM when searching, to facilitate random-access. Smaller values use
more memory but make searches somewhat faster. Larger values use
less memory but make searches somewhat slower.
analysis.common.terms.file
common-terms.utf8
The name of a file containing a list of common terms
that should be indexed in n-grams.
searcher.dir
crawl
Path to root of crawl. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
searcher.filter.cache.size
16
Maximum number of filters to cache. Filters can accelerate certain
field-based queries, like language, document format, etc. Each
filter requires one bit of RAM per page. So, with a 10 million page
index, a cache size of 16 consumes two bytes per page, or 20MB.
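The memory estimate above follows from one bit per page per cached filter. A minimal sketch of the arithmetic, with a hypothetical function name:

```python
def filter_cache_bytes(num_pages, cache_size=16):
    """RAM needed for cache_size filters at one bit per page each."""
    total_bits = num_pages * cache_size
    return total_bits // 8


print(filter_cache_bytes(10_000_000))  # bytes for a 10 million page index
```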
searcher.filter.cache.threshold
0.05
Filters are cached when their term is matched by more than this
fraction of pages. For example, with a threshold of 0.05, and 10
million pages, the term must match more than 1/20, or 500,000 pages.
So, if out of 10 million pages, 50% of pages are in English, and 2%
are in Finnish, then, with a threshold of 0.05, searches for
"lang:en" will use a cached filter, while searches for "lang:fi"
will score all 200,000 Finnish documents.
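The caching decision for the example's 10 million page index can be sketched as a single comparison. The function name is hypothetical; the page counts are derived from the percentages given above.

```python
def uses_cached_filter(matching_pages, total_pages, threshold=0.05):
    """A filter is cached when its term matches more than threshold of pages."""
    return matching_pages / total_pages > threshold


total = 10_000_000
english = total * 50 // 100   # 50% of pages
finnish = total * 2 // 100    # 2% of pages
print(english, uses_cached_filter(english, total))
print(finnish, uses_cached_filter(finnish, total))
```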
searcher.hostgrouping.rawhits.factor
2.0
A factor that is used to determine the number of raw hits
initially fetched, before host grouping is done.
searcher.summary.context
5
The number of context terms to display preceding and following
matching terms in a hit summary.
searcher.summary.length
20
The total number of terms to display in a hit summary.
searcher.max.hits
-1
If positive, search stops after this many hits are
found. Setting this to small, positive values (e.g., 1000) can make
searches much faster. With a sorted index, the quality of the hits
suffers little.
searcher.max.time.tick_count
-1
If a positive value is defined here, limit the search time for
every request to this number of elapsed ticks (see the tick_length
property below). The total maximum time for any search request will be
then limited to tick_count * tick_length milliseconds. When search time
is exceeded, partial results will be returned, and the total number of
hits will be estimated.
searcher.max.time.tick_length
200
The number of milliseconds between ticks. Larger values
reduce the timer granularity (precision). Smaller values bring more
overhead.
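The two tick properties combine into one time budget per request. A minimal sketch, with a hypothetical function name and `None` standing for "no limit":

```python
def max_search_time_ms(tick_count=-1, tick_length_ms=200):
    """Total search time limit in milliseconds, or None when unlimited."""
    if tick_count <= 0:
        return None  # only positive tick counts enable the limit
    return tick_count * tick_length_ms


print(max_search_time_ms())       # defaults: no limit
print(max_search_time_ms(10))     # 10 ticks of 200 ms each
```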
urlnormalizer.class
org.apache.nutch.net.BasicUrlNormalizer
Name of the class used to normalize URLs.
urlnormalizer.regex.file
regex-normalize.xml
Name of the config file used by the RegexUrlNormalizer class.
mime.types.file
mime-types.xml
Name of the file in CLASSPATH containing the mapping from filename
extensions and magic sequences to mime types.
mime.type.magic
true
Defines if the mime content type detector uses magic resolution.
plugin.folders
plugins
Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.
plugin.auto-activation
true
Defines whether plugins that are not activated by
the plugin.includes and plugin.excludes properties must be automatically
activated if they are needed by some activated plugins.
plugin.includes
protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic
Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to include at least the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
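To see which plugin directory names the default expression selects, it can be tried against a few names. This sketch uses Python's `re` module and assumes the expression must match the whole directory name; the candidate names other than those spelled out in the default value are only examples.

```python
import re

PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|"
    r"query-(basic|site|url)|summary-basic|scoring-opic"
)


def is_included(plugin_dir, includes=PLUGIN_INCLUDES):
    """True if the whole directory name matches the includes expression."""
    return re.fullmatch(includes, plugin_dir) is not None


for name in ["protocol-http", "parse-html", "parse-pdf", "protocol-httpclient"]:
    print(name, is_included(name))
```

Adding another parser under this assumption means extending the alternation, for example changing `parse-(text|html|js)` to `parse-(text|html|js|pdf)`.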
plugin.excludes
Regular expression naming plugin directory names to exclude.
parse.plugin.file
parse-plugins.xml
The name of the file that defines the associations between
content-types and parsers.
parser.character.encoding.default
windows-1252
The character encoding to fall back to when no other information
is available.
parser.html.impl
neko
HTML Parser implementation. Currently the following keywords
are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
parser.html.form.use_action
false
If true, HTML parser will collect URLs from form action
attributes. This may lead to undesirable behavior (submitting empty
forms during next fetch cycle). If false, form action attribute will
be ignored.
urlfilter.regex.file
regex-urlfilter.txt
Name of file on CLASSPATH containing regular expressions
used by urlfilter-regex (RegexURLFilter) plugin.
urlfilter.automaton.file
automaton-urlfilter.txt
Name of file on CLASSPATH containing regular expressions
used by urlfilter-automaton (AutomatonURLFilter) plugin.
urlfilter.prefix.file
prefix-urlfilter.txt
Name of file on CLASSPATH containing url prefixes
used by urlfilter-prefix (PrefixURLFilter) plugin.
urlfilter.suffix.file
suffix-urlfilter.txt
Name of file on CLASSPATH containing url suffixes
used by urlfilter-suffix (SuffixURLFilter) plugin.
urlfilter.order
The order by which url filters are applied.
If empty, all available url filters (as dictated by properties
plugin.includes and plugin.excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order. For example, if this property has value:
org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.
Since all filters are AND'ed, filter ordering does not affect the
end result, but it may affect performance, depending
on the relative cost of the filters.
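Why ordering changes cost but not outcome can be shown with a toy filter chain. The two filters below are hypothetical stand-ins written for this sketch, not the Nutch plugin classes; each returns the URL to accept it or None to reject it, and the chain stops at the first rejection.

```python
import re

calls = {"prefix": 0, "regex": 0}


def prefix_filter(url):
    """Accept only URLs under one example prefix (illustrative rule)."""
    calls["prefix"] += 1
    return url if url.startswith("http://www.example.com/") else None


def regex_filter(url):
    """Reject image URLs by suffix (illustrative rule)."""
    calls["regex"] += 1
    return None if re.search(r"\.(gif|jpg|png)$", url) else url


def apply_filters(url, filters):
    """AND the filters together: the first rejection ends the chain."""
    for url_filter in filters:
        url = url_filter(url)
        if url is None:
            return None
    return url


urls = [
    "http://www.example.com/a.html",
    "http://www.example.com/logo.gif",
    "http://other.example.org/b.html",
]
print([apply_filters(u, [prefix_filter, regex_filter]) for u in urls])
print(calls)  # the second filter in the chain is called less often
```

Putting the cheaper or more selective filter first means the later, more expensive ones run on fewer URLs.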
scoring.filter.order
The order in which scoring filters are applied.
This may be left empty (in which case all available scoring
filters will be applied in the order defined in plugin.includes
and plugin.excludes), or a space separated list of implementation
classes.
extension.clustering.hits-to-cluster
100
Number of snippets retrieved for the clustering extension
if clustering extension is available and user requested results
to be clustered.
extension.clustering.extension-name
Use the specified online clustering extension. If empty,
the first available extension will be used. The "name" here refers to an 'id'
attribute of the 'implementation' element in the plugin descriptor XML
file.
extension.ontology.extension-name
Use the specified online ontology extension. If empty,
the first available extension will be used. The "name" here refers to an 'id'
attribute of the 'implementation' element in the plugin descriptor XML
file.
extension.ontology.urls
Urls of owl files, separated by spaces, such as
http://www.example.com/ontology/time.owl
http://www.example.com/ontology/space.owl
http://www.example.com/ontology/wine.owl
Or
file:/ontology/time.owl
file:/ontology/space.owl
file:/ontology/wine.owl
You have to make sure each url is valid.
By default, there is no owl file, so query refinement based on ontology
is silently ignored.
query.url.boost
4.0
Used as a boost for url field in Lucene query.
query.anchor.boost
2.0
Used as a boost for anchor field in Lucene query.
query.title.boost
1.5
Used as a boost for title field in Lucene query.
query.host.boost
2.0
Used as a boost for host field in Lucene query.
query.phrase.boost
1.0
Used as a boost for phrase in Lucene query.
Multiplied by the boost for the field the phrase is matched in.
query.cc.boost
0.0
Used as a boost for cc field in Lucene query.
query.type.boost
0.0
Used as a boost for type field in Lucene query.
query.site.boost
0.0
Used as a boost for site field in Lucene query.
query.tag.boost
1.0
Used as a boost for tag field in Lucene query.
lang.ngram.min.length
1
The minimum size of ngrams to use to identify the
language (must be between 1 and lang.ngram.max.length).
The larger the range between lang.ngram.min.length and
lang.ngram.max.length, the better the identification, but
the slower it is.
lang.ngram.max.length
4
The maximum size of ngrams to use to identify the
language (must be between lang.ngram.min.length and 4).
The larger the range between lang.ngram.min.length and
lang.ngram.max.length, the better the identification, but
the slower it is.
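To make the cost of a wider range concrete, here is a sketch that extracts every character n-gram between the two lengths. It is an illustration only: the function name is hypothetical, and the language identifier's real tokenization (for example any padding around words) is not described here.

```python
def ngrams(text, min_len=1, max_len=4):
    """Return all character n-grams of text with min_len <= n <= max_len."""
    grams = []
    for n in range(min_len, max_len + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


print(ngrams("abc", 1, 2))
print(len(ngrams("language", 1, 4)), len(ngrams("language", 3, 4)))
```

A wider range produces more n-grams per word, which gives the identifier more evidence but also more work.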
lang.analyze.max.length
2048
The maximum number of bytes of data to use to identify
the language (0 means full content analysis).
The larger this value, the better the analysis, but the
slower it is.
query.lang.boost
0.0
Used as a boost for lang field in Lucene query.