Overview (apache-nutch 1.13 API)

Core
Package	Description
org.apache.nutch.crawl	Crawl control code and tools to run the crawler.
org.apache.nutch.fetcher	The Nutch robot.
org.apache.nutch.hostdb
org.apache.nutch.indexer	Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
org.apache.nutch.indexer.links
org.apache.nutch.indexer.replace	Indexing filter to allow pattern replacements on metadata.
org.apache.nutch.metadata	A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
org.apache.nutch.net	Web-related interfaces: URL `filters` and `normalizers`.
org.apache.nutch.net.protocols	Helper classes related to the `Protocol` interface, see also `org.apache.nutch.protocol`.
org.apache.nutch.net.urlnormalizer.slash
org.apache.nutch.parse	The `Parse` interface and related classes.
org.apache.nutch.plugin	The Nutch `Plugin` System.
org.apache.nutch.protocol	Classes related to the `Protocol` interface, see also `org.apache.nutch.net.protocols`.
org.apache.nutch.publisher
org.apache.nutch.publisher.rabbitmq	Publisher package to implement queues
org.apache.nutch.scoring	The `ScoringFilter` interface.
org.apache.nutch.scoring.webgraph	Scoring implementation based on link analysis (`LinkRank`), see `WebGraph`.
org.apache.nutch.segment	A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
org.apache.nutch.service
org.apache.nutch.service.impl
org.apache.nutch.service.model.request
org.apache.nutch.service.model.response
org.apache.nutch.service.resources
org.apache.nutch.tools	Miscellaneous tools.
org.apache.nutch.tools.arc	Tools to read the Arc file format.
org.apache.nutch.tools.warc	Tools to import / export between Nutch segments and WARC archives.
org.apache.nutch.urlfilter.ignoreexempt	URL filter plugin which identifies exemptions to external URLs when external URLs are set to be ignored.
org.apache.nutch.util	Miscellaneous utility classes.
org.apache.nutch.util.domain	Classes for domain name analysis.
org.apache.nutch.webui
org.apache.nutch.webui.client
org.apache.nutch.webui.client.impl
org.apache.nutch.webui.client.model
org.apache.nutch.webui.config
org.apache.nutch.webui.model
org.apache.nutch.webui.pages
org.apache.nutch.webui.pages.assets
org.apache.nutch.webui.pages.components
org.apache.nutch.webui.pages.crawls
org.apache.nutch.webui.pages.instances
org.apache.nutch.webui.pages.menu
org.apache.nutch.webui.pages.seed
org.apache.nutch.webui.pages.settings
org.apache.nutch.webui.service
org.apache.nutch.webui.service.impl

Plugins API
Package	Description
org.apache.nutch.protocol.http.api	Common API used by HTTP plugins (`http`, `httpclient`)
org.apache.nutch.urlfilter.api	Generic `URL filter` library, abstracting away from regular expression implementations.

Protocol Plugins
Package	Description
org.apache.nutch.protocol.file	Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp	Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.htmlunit	Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.http	Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.httpclient	Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
org.apache.nutch.protocol.selenium	Protocol plugin which supports retrieving documents via selenium.

URL Filter Plugins
Package	Description
org.apache.nutch.urlfilter.automaton	URL filter plugin based on dk.brics.automaton Finite-State Automata for Java™.
org.apache.nutch.urlfilter.domain	URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.domainblacklist	URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.prefix	URL filter plugin to include only URLs which match one of a given list of URL prefixes.
org.apache.nutch.urlfilter.regex	URL filter plugin to include and/or exclude URLs matching Java regular expressions.
org.apache.nutch.urlfilter.suffix	URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes.
org.apache.nutch.urlfilter.validator	URL filter plugin that validates given URLs.
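
As a rough illustration of what a prefix-based URL filter such as the one in org.apache.nutch.urlfilter.prefix does, the sketch below keeps a URL only if it starts with one of a list of allowed prefixes. It is a standalone example, not the plugin's code: the class name `PrefixFilterSketch` and the example prefixes are hypothetical, and returning `null` for a rejected URL is an assumption made to mirror the accept-or-drop convention of URL filters.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical, standalone sketch of prefix-based URL filtering. */
public class PrefixFilterSketch {
    private final List<String> prefixes;

    public PrefixFilterSketch(List<String> prefixes) {
        this.prefixes = prefixes;
    }

    /** Returns the URL unchanged if it starts with an allowed prefix, otherwise null. */
    public String filter(String url) {
        if (url == null) {
            return null;
        }
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return url; // accepted
            }
        }
        return null; // rejected: no prefix matched
    }

    public static void main(String[] args) {
        // Example prefixes are made up for illustration.
        PrefixFilterSketch f = new PrefixFilterSketch(
            Arrays.asList("http://example.org/", "https://example.org/"));
        System.out.println(f.filter("http://example.org/docs/index.html"));
        System.out.println(f.filter("http://other.example.com/"));
    }
}
```

A linear scan over the prefixes keeps the sketch short; with a large prefix list, a trie or sorted structure would likely be a better fit.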

URL Normalizer Plugins
Package	Description
org.apache.nutch.net.urlnormalizer.basic	URL normalizer performing basic normalizations: remove default ports and dot segments in path.
org.apache.nutch.net.urlnormalizer.host	URL normalizer renaming hosts to a canonical form listed in the configuration file.
org.apache.nutch.net.urlnormalizer.pass	URL normalizer dummy which does not change URLs.
org.apache.nutch.net.urlnormalizer.protocol
org.apache.nutch.net.urlnormalizer.querystring	URL normalizer which sorts the elements in the query part to avoid duplicates caused by permutations.
org.apache.nutch.net.urlnormalizer.regex	URL normalizer with configurable rules based on regular expressions (`Pattern`).
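
To make the query string normalization listed above concrete, here is a minimal sketch of sorting query elements so that URLs differing only in parameter order collapse to one form. It is a standalone illustration under simplifying assumptions (elements are separated by `&`, sorting is plain lexicographic, any fragment is kept as is); the class and method names are hypothetical and this is not the plugin's implementation.

```java
import java.util.Arrays;

/** Hypothetical, standalone sketch of query-element sorting. */
public class QuerySortSketch {

    /** Returns the URL with its '&'-separated query elements sorted lexicographically. */
    public static String sortQuery(String url) {
        // Split off the fragment first so a '?' inside it is not mistaken for a query.
        int hash = url.indexOf('#');
        String fragment = hash >= 0 ? url.substring(hash) : "";
        String base = hash >= 0 ? url.substring(0, hash) : url;

        int q = base.indexOf('?');
        if (q < 0 || q == base.length() - 1) {
            return url; // no query, or an empty one: nothing to sort
        }

        String[] parts = base.substring(q + 1).split("&");
        Arrays.sort(parts);
        return base.substring(0, q + 1) + String.join("&", parts) + fragment;
    }

    public static void main(String[] args) {
        // Both permutations normalize to the same string.
        System.out.println(sortQuery("http://example.org/p?b=2&a=1"));
        System.out.println(sortQuery("http://example.org/p?a=1&b=2"));
    }
}
```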

Scoring Plugins
Package	Description
org.apache.nutch.scoring.depth	Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
org.apache.nutch.scoring.link	Scoring filter used in conjunction with `WebGraph`.
org.apache.nutch.scoring.opic	Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.similarity
org.apache.nutch.scoring.similarity.cosine	Implements the cosine similarity metric for scoring relevant documents
org.apache.nutch.scoring.similarity.util	Utility package for Lucene functions
org.apache.nutch.scoring.tld	Top Level Domain Scoring plugin.
org.apache.nutch.scoring.urlmeta	URL Meta Tag Scoring Plugin
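
The cosine similarity metric named in the table above can be sketched as follows: each text is turned into a term-frequency vector, and the score is the dot product of the two vectors divided by the product of their lengths. This is a standalone illustration, not the plugin's code; the tokenization (lowercasing and splitting on non-word characters) and all names are assumptions chosen for brevity.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical, standalone sketch of cosine similarity over term frequencies. */
public class CosineSketch {

    /** Builds a term-frequency map using a deliberately naive tokenizer. */
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Returns dot(a, b) / (|a| * |b|), or 0.0 if either vector is empty. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 0.0;
        }
        return dot / (na * nb);
    }

    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> goal = termFrequencies("open source web crawler");
        Map<String, Integer> page = termFrequencies("Nutch is an open source web crawler project");
        System.out.println(cosine(goal, page));
    }
}
```

Scores fall between 0 (no shared terms) and 1 (identical term distribution), so a crawler could compare them against a threshold when deciding which pages are relevant.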

Parse Plugins
Package	Description
org.apache.nutch.parse.ext	Parse wrapper to run external command to do the parsing.
org.apache.nutch.parse.feed	Parse RSS feeds.
org.apache.nutch.parse.html	An HTML document parsing plugin.
org.apache.nutch.parse.js	Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
org.apache.nutch.parse.swf	Parse Flash SWF files.
org.apache.nutch.parse.tika	Parse various document formats with help of Apache Tika.
org.apache.nutch.parse.zip	Parse ZIP files: embedded files are recursively passed to appropriate parsers.

Parse Filter Plugins
Package	Description
org.apache.nutch.parse.headings	Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
org.apache.nutch.parse.metatags	Parse filter to extract meta tags: keywords, description, etc.
org.apache.nutch.parsefilter.naivebayes	HTML parse filter that classifies the outlinks from the parse result as relevant or irrelevant based on the relevancy of the parse text (using a training file with positive and negative example texts; see the description of parsefilter.naivebayes.trainfile). If the text is found irrelevant, a link gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist.
org.apache.nutch.parsefilter.regex	RegexParseFilter.

Indexing Filter Plugins
Package	Description
org.apache.nutch.indexer.anchor	An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic	A basic indexing plugin, adds basic fields: url, host, title, content, etc.
org.apache.nutch.indexer.feed	Indexing filter to index meta data from RSS feeds.
org.apache.nutch.indexer.filter
org.apache.nutch.indexer.geoip	This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
org.apache.nutch.indexer.metadata	Indexing filter to add document metadata to the index.
org.apache.nutch.indexer.more	A more indexing plugin, adds "more" index fields: last modified date, MIME type, content length.
org.apache.nutch.indexer.staticfield	A simple plugin called at indexing that adds fields with static data.
org.apache.nutch.indexer.subcollection	Indexing filter to assign documents to subcollections.
org.apache.nutch.indexer.tld	Top Level Domain Indexing plugin.
org.apache.nutch.indexer.urlmeta	URL Meta Tag Indexing Plugin

Indexer Plugins
Package	Description
org.apache.nutch.indexwriter.dummy	Index writer plugin for debugging, writes pairs of <action, url> to a text file, action is one of "add", "update", or "delete".
org.apache.nutch.indexwriter.elastic	Index writer plugin for Elasticsearch.
org.apache.nutch.indexwriter.solr	Index writer plugin for Apache Solr.

Misc. Plugins
Package	Description
org.apache.nutch.analysis.lang	Text document language identifier.
org.apache.nutch.collection	Subcollection is a subset of an index.
org.apache.nutch.microformats.reltag	A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.creativecommons.nutch	Sample plugins that parse and index Creative Commons metadata.

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.