org.apache.nutch.scoring.webgraph
Class WebGraph
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.nutch.scoring.webgraph.WebGraph
- All Implemented Interfaces:
- Configurable, Tool
public class WebGraph
- extends Configured
- implements Tool
Creates three databases: one for inlinks, one for outlinks, and a node
database that holds the number of inlinks and outlinks for a url and the
current score for the url.
The score is set by an analysis program such as LinkRank. The WebGraph is an
updatable database. Outlinks are stored by their fetch time or by the
current system time if no fetch time is available. Only the most recent
version of outlinks for a given url is stored. As more crawls are executed
and the WebGraph updated, newer Outlinks will replace older Outlinks. This
allows the WebGraph to adapt to changes in the link structure of the web.
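The replacement rule described above can be illustrated with a small in-memory model. This is only a sketch and not the Nutch implementation: the class `LatestOutlinksSketch`, its methods, and the tie-breaking rule for equal timestamps are assumptions made for illustration. The real WebGraph stores its data in Hadoop files and updates them with MapReduce jobs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory model only; this is not how Nutch stores outlinks.
final class LatestOutlinksSketch {

    // One recorded version of a url's outlinks and the time it was stamped with.
    private static final class Version {
        final long timestamp;
        final List<String> targets;

        Version(long timestamp, List<String> targets) {
            this.timestamp = timestamp;
            this.targets = Collections.unmodifiableList(new ArrayList<>(targets));
        }
    }

    private final Map<String, Version> bySource = new HashMap<>();

    // fetchTime may be null, in which case the current system time is used.
    void update(String sourceUrl, Long fetchTime, List<String> targets) {
        long stamp = (fetchTime != null) ? fetchTime : System.currentTimeMillis();
        Version candidate = new Version(stamp, targets);
        // Keep the version with the newer timestamp; on a tie the later call
        // wins (an assumption of this sketch).
        bySource.merge(sourceUrl, candidate,
            (old, fresh) -> fresh.timestamp >= old.timestamp ? fresh : old);
    }

    // Returns the most recent outlinks known for the url, or an empty list.
    List<String> outlinksOf(String sourceUrl) {
        Version v = bySource.get(sourceUrl);
        return (v == null) ? Collections.<String>emptyList() : v.targets;
    }
}
```

In this model, an update stamped with an older time than the stored version is ignored, so segments can be applied in any order and the newest outlinks still win.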
The Inlink database is created from the Outlink database and is regenerated
when the WebGraph is updated. The Node database is created from both the
Inlink and Outlink databases. Because the Node database is overwritten when
the WebGraph is updated, and because it holds the current scores for urls,
it is recommended that a crawl cycle (one or more full crawls) complete
fully before the WebGraph is updated. An analysis program such as LinkRank
can then be run to update the scores in the Node database in a stable fashion.
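The dependency between the three databases can be sketched in plain Java: inlinks are obtained by inverting the outlinks, and the per-url link counts are then derived from both. The class `WebGraphDerivationSketch`, the type `NodeCounts`, and the method names below are hypothetical and exist only for this illustration; scores are left out because they are set by a separate analysis program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model of how the Inlink and Node data derive from outlinks.
final class WebGraphDerivationSketch {

    // Link counts for one url (the score is omitted in this sketch).
    static final class NodeCounts {
        int numInlinks;
        int numOutlinks;
    }

    // Inverts a source -> targets map into a target -> sources map.
    static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : outlinks.entrySet()) {
            for (String target : entry.getValue()) {
                inlinks.computeIfAbsent(target, k -> new ArrayList<>())
                       .add(entry.getKey());
            }
        }
        return inlinks;
    }

    // Builds link counts for every url that appears in either map.
    static Map<String, NodeCounts> countLinks(Map<String, List<String>> outlinks,
                                              Map<String, List<String>> inlinks) {
        Map<String, NodeCounts> nodes = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : outlinks.entrySet()) {
            nodes.computeIfAbsent(entry.getKey(), k -> new NodeCounts())
                 .numOutlinks = entry.getValue().size();
        }
        for (Map.Entry<String, List<String>> entry : inlinks.entrySet()) {
            nodes.computeIfAbsent(entry.getKey(), k -> new NodeCounts())
                 .numInlinks = entry.getValue().size();
        }
        return nodes;
    }
}
```

Because both derived maps are rebuilt from scratch on each call, the sketch mirrors the point made above: anything stored only in the node data, such as a score, does not survive a rebuild.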
Nested Class Summary
static class WebGraph.OutlinkDb
    The OutlinkDb creates a database of all outlinks.
Method Summary
void createWebGraph(Path webGraphDb, Path[] segments)
    Creates the three WebGraph databases: Outlinks, Inlinks, and Node.
static void main(String[] args)
int run(String[] args)
    Parses command line arguments and runs the WebGraph jobs.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.apache.commons.logging.Log LOG
LOCK_NAME
public static final String LOCK_NAME
- See Also:
- Constant Field Values
INLINK_DIR
public static final String INLINK_DIR
- See Also:
- Constant Field Values
OUTLINK_DIR
public static final String OUTLINK_DIR
- See Also:
- Constant Field Values
NODE_DIR
public static final String NODE_DIR
- See Also:
- Constant Field Values
WebGraph
public WebGraph()
createWebGraph
public void createWebGraph(Path webGraphDb,
Path[] segments)
throws IOException
- Creates the three WebGraph databases: Outlinks, Inlinks, and Node. If a
WebGraph already exists, it is updated; otherwise a new WebGraph database
is created.
- Parameters:
webGraphDb
- The WebGraph to create or update.
segments
- The array of segments used to update the WebGraph. Newer
segments and fetch times will overwrite older segments.
- Throws:
IOException
- If an error occurs while processing the WebGraph.
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
run
public int run(String[] args)
throws Exception
- Parses command line arguments and runs the WebGraph jobs.
- Specified by:
run
in interface Tool
- Throws:
Exception
Copyright © 2006 The Apache Software Foundation