Parser checker, useful for testing parsers. It also reports possible
fetching and parsing failures accurately and presents protocol status signals
to aid debugging. The tool retrieves the following data from any URL:
- contentType: The content type of the URL.
- signature: A digest used to identify pages (like a unique ID) and
to remove duplicates during the dedup procedure. It is calculated
using MD5Signature or TextProfileSignature.
- Version: From ParseData.
- Status: From ParseData.
- Title: of the URL
- Outlinks: associated with the URL
- Content Metadata: such as X-AspNet-Version, Date,
Content-length, servedBy, Content-Type,
Cache-Control, etc.
- Parse Metadata: such as CharEncodingForConversion,
OriginalCharEncoding, language, etc.
- ParseText: The parsed text of the page, which varies in length depending
on the content.length configuration.
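
To illustrate how a signature of the kind listed above can serve as a page ID for deduplication, here is a minimal sketch using only the JDK. It is not the MD5Signature implementation; the class SignatureDedupSketch and its methods signatureOf and dedup are hypothetical names introduced for this example, and it assumes the digest is taken over the UTF-8 bytes of the page text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the tool described above.
public class SignatureDedupSketch {

    // Returns the hex-encoded MD5 digest of the text's UTF-8 bytes.
    static String signatureOf(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keeps the first URL seen for each distinct signature (assumed dedup policy).
    static Map<String, String> dedup(Map<String, String> textByUrl) {
        Map<String, String> urlBySignature = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : textByUrl.entrySet()) {
            urlBySignature.putIfAbsent(signatureOf(entry.getValue()), entry.getKey());
        }
        return urlBySignature;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "same body");
        pages.put("http://example.com/b", "same body");   // duplicate content
        pages.put("http://example.com/c", "other body");
        dedup(pages).forEach((sig, url) -> System.out.println(sig + " " + url));
    }
}
```

An exact digest like this only collapses byte-identical content. A profile-based signature such as TextProfileSignature is meant to tolerate small differences between pages, which this sketch does not attempt to reproduce.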