apache-nutch 1.5 API

Apache Nutch is an open source web-search software project.

Core
org.apache.nutch.crawl Crawl control code.
org.apache.nutch.fetcher The Nutch robot.
org.apache.nutch.indexer Maintain Lucene full-text indexes.
org.apache.nutch.indexer.more The "more" indexing plugin.
org.apache.nutch.indexer.solr  
org.apache.nutch.metadata A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
org.apache.nutch.microformats.reltag A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.apache.nutch.net  
org.apache.nutch.net.protocols  
org.apache.nutch.parse  
org.apache.nutch.parse.ext  
org.apache.nutch.parse.feed  
org.apache.nutch.parse.html An HTML document parsing plugin.
org.apache.nutch.parse.js  
org.apache.nutch.parse.swf  
org.apache.nutch.parse.tika  
org.apache.nutch.parse.zip  
org.apache.nutch.plugin The Nutch Plugin System.
org.apache.nutch.protocol  
org.apache.nutch.scoring  
org.apache.nutch.scoring.webgraph  
org.apache.nutch.segment  
org.apache.nutch.tools  
org.apache.nutch.tools.arc  
org.apache.nutch.tools.proxy  
org.apache.nutch.urlfilter.automaton A URL filter plugin based on dk.brics.automaton (Finite-State Automata for Java).
org.apache.nutch.urlfilter.domain A URL filter plugin that filters by domain.
org.apache.nutch.urlfilter.domainblacklist  
org.apache.nutch.urlfilter.prefix A URL filter plugin that filters by URL prefix.
org.apache.nutch.urlfilter.regex A URL filter plugin that filters by regular expression.
org.apache.nutch.urlfilter.suffix  
org.apache.nutch.urlfilter.validator A URL filter plugin that validates given URLs.
org.apache.nutch.util  
org.apache.nutch.util.domain  

Plugins API
org.apache.nutch.protocol.http.api Common API used by HTTP plugins (http, httpclient)
org.apache.nutch.urlfilter.api  

Protocol Plugins
org.apache.nutch.protocol.file Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.http Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.httpclient Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web servers as well as proxy servers.

URL Normalizer Plugins
org.apache.nutch.net.urlnormalizer.basic  
org.apache.nutch.net.urlnormalizer.pass  
org.apache.nutch.net.urlnormalizer.regex  

Scoring Plugins
org.apache.nutch.scoring.link  
org.apache.nutch.scoring.opic  
org.apache.nutch.scoring.tld Top Level Domain Scoring plugin.
org.apache.nutch.scoring.urlmeta URL Meta Tag Scoring Plugin

Parse Plugins
org.apache.nutch.parse.headings  

Indexing Filter Plugins
org.apache.nutch.indexer.anchor An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic A basic indexing plugin.
org.apache.nutch.indexer.feed  
org.apache.nutch.indexer.metadata  
org.apache.nutch.indexer.staticfield A simple plugin called at indexing that adds fields with static data.
org.apache.nutch.indexer.subcollection  
org.apache.nutch.indexer.tld Top Level Domain Indexing plugin.
org.apache.nutch.indexer.urlmeta URL Meta Tag Indexing Plugin

Misc. Plugins
org.apache.nutch.analysis.lang Text document language identifier.
org.apache.nutch.collection Subcollection is a subset of an index.
org.creativecommons.nutch Sample plugins that parse and index Creative Commons metadata.

Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.



Copyright © 2012 The Apache Software Foundation