Parser checker, useful for testing the parser.
It also accurately reports possible fetching and
parsing failures and presents protocol status signals to aid
debugging. The tool lets us retrieve the following data from
any URL:
- contentType: The content type of the URL.
- signature: A digest used to identify pages (like a unique ID) and to remove
duplicates during the dedup procedure.
It is calculated using MD5Signature or TextProfileSignature.
- Version: From ParseData.
- Status: From ParseData.
- Title: of the URL
- Outlinks: associated with the URL
- Content Metadata: such as X-AspNet-Version, Date,
Content-length, servedBy, Content-Type, Cache-Control, etc.
- Parse Metadata: such as CharEncodingForConversion,
OriginalCharEncoding, language, etc.
- ParseText: The parsed text of the page, which varies in length depending on
the content.length configuration.
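
To illustrate the role of the signature field, here is a minimal Python sketch of how an MD5 digest over page content can serve as a page ID for deduplication. It is not the tool's own MD5Signature implementation; the function names `md5_signature` and `dedup_by_signature` and the sample URLs are hypothetical, and the assumption is simply that two pages with byte-identical content should be treated as duplicates.

```python
import hashlib
from typing import Dict


def md5_signature(content: bytes) -> str:
    """Return the hex MD5 digest of the raw page content (illustrative only)."""
    return hashlib.md5(content).hexdigest()


def dedup_by_signature(pages: Dict[str, bytes]) -> Dict[str, str]:
    """Map each distinct signature to the first URL seen with that content.

    `pages` maps URL -> raw content. URLs whose content hashes to an
    already-seen signature are treated as duplicates and skipped.
    """
    kept: Dict[str, str] = {}
    for url, content in pages.items():
        kept.setdefault(md5_signature(content), url)
    return kept


# Hypothetical sample data: two URLs serve identical content.
pages = {
    "http://example.com/a": b"<html>same body</html>",
    "http://example.com/b": b"<html>same body</html>",
    "http://example.com/c": b"<html>different body</html>",
}
unique = dedup_by_signature(pages)
```

In this sketch `unique` holds two entries, because the first two URLs share one signature. An exact hash like MD5 only catches byte-identical pages; a profile-based signature such as TextProfileSignature is intended to tolerate small differences, which this sketch does not attempt to model.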