apache-nutch 1.8 API

Apache Nutch is an open source web-search software project.


Core 
Package Description
org.apache.nutch.crawl
Crawl control code.
org.apache.nutch.fetcher
The Nutch robot.
org.apache.nutch.indexer
Maintain Lucene full-text indexes.
org.apache.nutch.indexer.more
A more indexing plugin.
org.apache.nutch.metadata
A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
org.apache.nutch.microformats.reltag
A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.apache.nutch.net  
org.apache.nutch.net.protocols  
org.apache.nutch.parse  
org.apache.nutch.parse.ext  
org.apache.nutch.parse.feed  
org.apache.nutch.parse.html
An HTML document parsing plugin.
org.apache.nutch.parse.js  
org.apache.nutch.parse.swf  
org.apache.nutch.parse.tika  
org.apache.nutch.parse.zip  
org.apache.nutch.plugin
The Nutch Plugin System.
org.apache.nutch.protocol  
org.apache.nutch.scoring  
org.apache.nutch.scoring.webgraph  
org.apache.nutch.segment  
org.apache.nutch.tools  
org.apache.nutch.tools.arc  
org.apache.nutch.tools.proxy  
org.apache.nutch.urlfilter.automaton
A url filter plugin based on dk.brics.automaton Finite-State Automata for Java™.
org.apache.nutch.urlfilter.domain
A url filter plugin that filters by domain.
org.apache.nutch.urlfilter.domainblacklist  
org.apache.nutch.urlfilter.prefix
A url filter plugin.
org.apache.nutch.urlfilter.regex
A url filter plugin.
org.apache.nutch.urlfilter.suffix  
org.apache.nutch.urlfilter.validator
A url filter plugin that validates given urls.
org.apache.nutch.util  
org.apache.nutch.util.domain
Plugins API 
Package Description
org.apache.nutch.protocol.http.api
Common API used by HTTP plugins (http, httpclient)
org.apache.nutch.urlfilter.api  
Protocol Plugins 
Package Description
org.apache.nutch.protocol.file
Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp
Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.http
Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.httpclient
Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
URL Filter Plugins 
Package Description
org.apache.nutch.net.urlnormalizer.basic  
org.apache.nutch.net.urlnormalizer.pass  
org.apache.nutch.net.urlnormalizer.regex  
Scoring Plugins 
Package Description
org.apache.nutch.scoring.link  
org.apache.nutch.scoring.opic  
org.apache.nutch.scoring.tld
Top Level Domain Scoring plugin.
org.apache.nutch.scoring.urlmeta
URL Meta Tag Scoring Plugin
Parse Plugins 
Package Description
org.apache.nutch.parse.headings  
Indexing Filter Plugins 
Package Description
org.apache.nutch.indexer.anchor
An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic
A basic indexing plugin.
org.apache.nutch.indexer.feed  
org.apache.nutch.indexer.metadata  
org.apache.nutch.indexer.staticfield
A simple plugin called at indexing that adds fields with static data.
org.apache.nutch.indexer.subcollection  
org.apache.nutch.indexer.tld
Top Level Domain Indexing plugin.
org.apache.nutch.indexer.urlmeta
URL Meta Tag Indexing Plugin
Indexer Plugins 
Package Description
org.apache.nutch.indexwriter.solr  
Misc. Plugins 
Package Description
org.apache.nutch.analysis.lang
Text document language identifier.
org.apache.nutch.collection
Subcollection is a subset of an index.
org.creativecommons.nutch
Sample plugins that parse and index Creative Commons metadata.

Copyright © 2014 The Apache Software Foundation