public class FileDumper extends Object
The file dumper tool lets you reverse-generate the raw content from Nutch segment data directories.
The tool has a number of immediate uses:
Upon successful completion, the tool prints a JSON snippet listing the MIME type classifications and the number of documents that fall into each. An example follows:
INFO: File Types:
  TOTAL Stats:
  [
    {"mimeType":"application/xml","count":"19"}
    {"mimeType":"image/png","count":"47"}
    {"mimeType":"image/jpeg","count":"141"}
    {"mimeType":"image/vnd.microsoft.icon","count":"4"}
    {"mimeType":"text/plain","count":"89"}
    {"mimeType":"video/quicktime","count":"2"}
    {"mimeType":"image/gif","count":"63"}
    {"mimeType":"application/xhtml+xml","count":"1670"}
    {"mimeType":"application/octet-stream","count":"40"}
    {"mimeType":"text/html","count":"1863"}
  ]
  FILTER Stats:
  [
    {"mimeType":"image/png","count":"47"}
    {"mimeType":"image/jpeg","count":"141"}
    {"mimeType":"image/vnd.microsoft.icon","count":"4"}
    {"mimeType":"video/quicktime","count":"2"}
    {"mimeType":"image/gif","count":"63"}
  ]
In the case above, the tool would have been run with the -mimeType flag followed by the values image/png image/jpeg image/vnd.microsoft.icon video/quicktime image/gif.
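The relationship between the TOTAL and FILTER statistics can be illustrated with a small, self-contained sketch. It is not the tool's actual implementation: the class `MimeStatsSketch` and its methods are hypothetical names introduced here, and the input list of detected MIME types is made-up sample data. The sketch counts documents per MIME type, keeps only the requested types, and renders both maps in the same shape as the log output above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Nutch.
class MimeStatsSketch {

  /** Counts how many documents fall under each MIME type, keeping first-seen order. */
  static Map<String, Integer> countByType(List<String> detectedTypes) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String type : detectedTypes) {
      counts.merge(type, 1, Integer::sum);
    }
    return counts;
  }

  /** Keeps only the entries whose MIME type was requested. */
  static Map<String, Integer> filter(Map<String, Integer> total, Set<String> wanted) {
    Map<String, Integer> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> e : total.entrySet()) {
      if (wanted.contains(e.getKey())) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  /** Renders counts in the shape of the log snippet (counts as quoted strings). */
  static String render(Map<String, Integer> counts) {
    StringBuilder sb = new StringBuilder("[ ");
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      sb.append("{\"mimeType\":\"").append(e.getKey())
        .append("\",\"count\":\"").append(e.getValue()).append("\"} ");
    }
    return sb.append("]").toString();
  }

  public static void main(String[] args) {
    // Made-up sample data standing in for the types detected in a segment.
    List<String> detected = Arrays.asList(
        "text/html", "image/png", "text/html", "image/gif", "image/png", "text/plain");
    Map<String, Integer> total = countByType(detected);
    Map<String, Integer> filtered =
        filter(total, new HashSet<>(Arrays.asList("image/png", "image/gif")));
    System.out.println("TOTAL Stats: " + render(total));
    System.out.println("FILTER Stats: " + render(filtered));
  }
}
```

Every entry in the FILTER map also appears in the TOTAL map with the same count, which matches the example log: the five filtered types carry the same counts in both lists.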
Constructor and Description |
---|
FileDumper() |
Modifier and Type | Method and Description |
---|---|
void | dump(File outputDir, File segmentRootDir, String[] mimeTypes): Dumps the reverse-engineered raw content from the provided segment directories. Pass a parent directory if it contains more than one segment; otherwise a single segment can be passed as the argument. |
static void | main(String[] args): Main method for invoking this tool. |
public void dump(File outputDir, File segmentRootDir, String[] mimeTypes) throws Exception
Parameters:
outputDir - the directory you wish to dump the raw content to. This directory will be created.
segmentRootDir - a directory containing one or more segments.
mimeTypes - an array of MIME types to dump; all others will be filtered out.
Throws:
Exception
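A call to dump might be wrapped as in the following sketch. To keep it runnable without Nutch on the classpath, it uses a hypothetical `Dumper` interface that only mirrors the documented signature; the class `DumpCallSketch`, the helper `dumpSegments`, and the paths are all assumptions made for illustration. With Nutch available, `new FileDumper()::dump` could presumably be passed in place of the stub.

```java
import java.io.File;

// Hypothetical illustration only; not part of Nutch.
class DumpCallSketch {

  /** Stand-in that mirrors the documented dump signature. */
  interface Dumper {
    void dump(File outputDir, File segmentRootDir, String[] mimeTypes) throws Exception;
  }

  /**
   * Checks that the segment directory exists, then delegates to the dumper.
   * Exceptions from the dumper are wrapped so callers need no throws clause.
   */
  static void dumpSegments(Dumper dumper, File outputDir, File segmentRootDir,
                           String... mimeTypes) {
    if (!segmentRootDir.isDirectory()) {
      throw new IllegalArgumentException("Not a directory: " + segmentRootDir);
    }
    try {
      dumper.dump(outputDir, segmentRootDir, mimeTypes);
    } catch (Exception e) {
      throw new IllegalStateException("Dump failed for " + segmentRootDir, e);
    }
  }

  public static void main(String[] args) {
    // Stub that only prints its arguments instead of dumping anything.
    Dumper stub = (out, seg, types) ->
        System.out.println("dump " + seg + " -> " + out + " " + String.join(",", types));
    // Placeholder path; a real run would point at a Nutch segment directory.
    File segmentRoot = new File(System.getProperty("java.io.tmpdir"));
    dumpSegments(stub, new File(segmentRoot, "dump-out"), segmentRoot,
        "image/png", "image/gif");
  }
}
```

Because dump is declared with `throws Exception`, callers must either propagate or wrap it. The sketch wraps it in an unchecked exception, which is one possible design choice rather than a requirement of the API.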
Copyright © 2015 The Apache Software Foundation