public class CommonCrawlDataDumper extends Object
The Common Crawl Data Dumper tool reverse generates the raw content from Nutch segment data directories into a common crawling data format consumed by many applications. The data is then serialized as CBOR.
Text content is stored in a structured document format. Below is a schema for the storage of data and metadata related to a crawling request, with the response body truncated for readability. This document must be encoded using CBOR and should be compressed with gzip after encoding. The timestamped URL key for these records follows the same layout as the media file directory structure, with underscores in place of directory separators.
Thus, the timestamped URL key for the record is provided below, followed by an example record:
com_somepage_33a3e36bbef59c2a5242c2ccee59239ab30d51f3_1411623696000
{
"url": "http:\/\/somepage.com\/22\/14560817",
"timestamp": "1411623696000",
"request": {
"method": "GET",
"client": {
"hostname": "crawler01.local",
"address": "74.347.129.200",
"software": "Apache Nutch v1.10",
"robots": "classic",
"contact": {
"name": "Nutch Admin",
"email": "nutch.pro@nutchadmin.org"
}
},
"headers": {
"Accept": "text\/html,application\/xhtml+xml,application\/xml",
"Accept-Encoding": "gzip,deflate,sdch",
"Accept-Language": "en-US,en",
"User-Agent": "Mozilla\/5.0",
"...": "..."
},
"body": null
},
"response": {
"status": "200",
"server": {
"hostname": "somepage.com",
"address": "55.33.51.19",
},
"headers": {
"Content-Encoding": "gzip",
"Content-Type": "text\/html",
"Date": "Thu, 25 Sep 2014 04:16:58 GMT",
"Expires": "Thu, 25 Sep 2014 04:16:57 GMT",
"Server": "nginx",
"...": "..."
},
"body": "\r\n <!DOCTYPE html PUBLIC ... \r\n\r\n \r\n </body>\r\n </html>\r\n \r\n\r\n",
},
"key": "com_somepage_33a3e36bbef59c2a5242c2ccee59239ab30d51f3_1411623696000",
"imported": "1411623698000"
}
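The record key above joins a reversed host name, a 40-character hex hash, and an epoch-millisecond timestamp with underscores. A minimal sketch of building such a key is shown below; it assumes the hash component is a hex-encoded SHA-1 digest of the URL, which is an illustrative assumption rather than the tool's documented behavior (see `reverseUrl` in the method summary for the actual API).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TimestampedKeySketch {
    // Reverse the host components: "somepage.com" -> "com_somepage".
    static String reverseHost(String host) {
        List<String> parts = Arrays.asList(host.split("\\."));
        Collections.reverse(parts);
        return String.join("_", parts);
    }

    // Hex-encode a SHA-1 digest (assumption: the 40-char component
    // of the key is a SHA-1 hash; the hashed input is illustrative).
    static String sha1Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // SHA-1 is always present in the JDK
        }
    }

    // Assemble host, hash, and timestamp into the underscore-joined key.
    static String key(String host, String url, long epochMillis) {
        return reverseHost(host) + "_" + sha1Hex(url) + "_" + epochMillis;
    }

    public static void main(String[] args) {
        System.out.println(key("somepage.com",
                "http://somepage.com/22/14560817", 1411623696000L));
    }
}
```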
Upon successful completion, the tool displays a JSON snippet detailing the mimetype classifications and the counts of documents that fall into each classification. An example is as follows:
INFO: File Types:
TOTAL Stats: {
{"mimeType":"application/xml","count":19"}
{"mimeType":"image/png","count":47"}
{"mimeType":"image/jpeg","count":141"}
{"mimeType":"image/vnd.microsoft.icon","count":4"}
{"mimeType":"text/plain","count":89"}
{"mimeType":"video/quicktime","count":2"}
{"mimeType":"image/gif","count":63"}
{"mimeType":"application/xhtml+xml","count":1670"}
{"mimeType":"application/octet-stream","count":40"}
{"mimeType":"text/html","count":1863"}
}
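The summary above is a per-type tally over the dumped documents. A stdlib-only sketch of such a tally is shown below; the input list and class name are hypothetical, not the tool's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MimeStatsSketch {
    // Count occurrences of each MIME type, preserving first-seen order.
    static Map<String, Integer> tally(List<String> mimeTypes) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String mt : mimeTypes) {
            counts.merge(mt, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> stats = tally(List.of(
                "text/html", "image/png", "text/html", "application/xml"));
        // Print one JSON-like line per type, mirroring the summary format.
        stats.forEach((mt, c) ->
                System.out.println("{\"mimeType\":\"" + mt + "\",\"count\":" + c + "}"));
    }
}
```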
| Constructor and Description |
|---|
| `CommonCrawlDataDumper(CommonCrawlConfig config)` Constructor |
| Modifier and Type | Method and Description |
|---|---|
| `void` | `dump(File outputDir, File segmentRootDir, boolean gzip, String[] mimeTypes, boolean epochFilename)` Dumps the reverse engineered CBOR content from the provided segment directories if a parent directory contains more than one segment; otherwise a single segment can be passed as an argument. |
| `static void` | `main(String[] args)` Main method for invoking this tool |
| `static String` | `reverseUrl(String urlString)` |
public CommonCrawlDataDumper(CommonCrawlConfig config)
public static void main(String[] args) throws Exception
Parameters:
    args - 1) the output directory (which will be created if it does not already exist) to host the CBOR data, and 2) a directory containing one or more segments from which to generate the CBOR data. Optionally, 3) a list of mimetypes and 4) the gzip option may be provided.
Throws:
    Exception
public void dump(File outputDir, File segmentRootDir, boolean gzip, String[] mimeTypes, boolean epochFilename) throws Exception
Parameters:
    outputDir - the directory to which the raw content is dumped. This directory will be created.
    segmentRootDir - a directory containing one or more segments.
    gzip - a boolean flag indicating whether the CBOR content should also be gzipped.
    mimeTypes - an array of mime types to dump; all others will be filtered out.
Throws:
    Exception
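When the gzip flag is set, the serialized CBOR bytes are additionally gzip-compressed. A stdlib-only sketch of that post-encoding step is shown below; the byte payload is a placeholder, not real CBOR, and the class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipStepSketch {
    // Gzip-compress an already-encoded byte payload.
    static byte[] gzip(byte[] encoded) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(encoded);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decompress, e.g. when a consumer reads a dumped record back.
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "placeholder-for-cbor-bytes".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = gunzip(gzip(payload));
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```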
Copyright © 2015 The Apache Software Foundation