NUTCH-840: moved tests to parse/tika and added TestDOMContentUtil, which currently fails but will help us track progress on Tika's processing of HTML