Data Extraction

+----------------------------------------------------------------------------------------------
/*1*/ Any23 runner = new Any23();
/*2*/ runner.setHTTPUserAgent("test-user-agent");
/*3*/ HTTPClient httpClient = runner.getHTTPClient();
/*4*/ DocumentSource source = new HTTPDocumentSource(
         httpClient,
         "http://www.rentalinrome.com/semanticloft/semanticloft.htm"
      );
/*5*/ ByteArrayOutputStream out = new ByteArrayOutputStream();
/*6*/ TripleHandler handler = new NTriplesWriter(out);
/*7*/ runner.extract(source, handler);
/*8*/ String n3 = out.toString("UTF-8");
+----------------------------------------------------------------------------------------------

  This second example demonstrates data extraction, which is the main purpose of the <<Any23>>
  library. At <<line 1>> we define the <<Any23>> facade instance. As described before, the
  constructor allows enforcing the use of specific extractors. <<Line 2>> defines the
  <HTTP user agent>, used to identify the client during data collection. At <<line 3>> we use
  the runner to obtain an instance of
  {{{./xref/org/deri/any23/http/HTTPClient.html}HTTPClient}}, used by
  {{{./xref/org/deri/any23/source/HTTPDocumentSource.html}HTTPDocumentSource}} for content
  fetching. <<Line 4>> instantiates an
  {{{./xref/org/deri/any23/source/HTTPDocumentSource.html}HTTPDocumentSource}}, specifying the
  {{{./xref/org/deri/any23/http/HTTPClient.html}HTTPClient}} and the URL addressing the content
  to be processed. At <<line 5>> we define an in-memory output stream used to store the data
  produced by the {{{./xref/org/deri/any23/writer/TripleHandler.html}TripleHandler}} defined at
  <<line 6>>. The extraction method at <<line 7>> runs the metadata extraction. As discussed in
  the previous example, it needs at least a
  {{{./xref/org/deri/any23/writer/TripleHandler.html}TripleHandler}} instance.
  The expected output is UTF-8 encoded at <<line 8>> and is:

+----------------------------------------------------------------------------------------------
"Semantic Loft (beta) - Trastevere apartments | Rental in Rome - rentalinrome.com" . . . . . _:node14r93a8dex1 .
[The complete output is omitted for brevity.]
+----------------------------------------------------------------------------------------------

Filter Out Accidental Triples

  To remove accidental triples, <<Any23>> provides a set of filters, located in the
  <<org.deri.any23.filter>> package. The filter
  {{{./xref/org/deri/any23/filter/IgnoreTitlesOfEmptyDocuments.html}IgnoreTitlesOfEmptyDocuments}}
  removes the triples generated by the
  {{{./xref/org/deri/any23/extractor/html/TitleExtractor.html}TitleExtractor}} when the document
  is empty. The filter
  {{{./xref/org/deri/any23/filter/IgnoreAccidentalRDFa.html}IgnoreAccidentalRDFa}} removes
  accidental <<CSS>>-related triples.

+------------------------------------
RDFWriter rdfWriter = ...
TripleHandler rdfWriterHandler = new RDFWriterTripleHandler(rdfWriter);
TripleHandler tripleHandler = new ReportingTripleHandler(
    new IgnoreAccidentalRDFa(
        new IgnoreTitlesOfEmptyDocuments(rdfWriterHandler),
        true // if true the CSS triples will be removed in any case.
    )
);
DocumentSource documentSource = ...
// Pass the outermost handler of the chain so that the filters are actually applied.
any23.extract(documentSource, tripleHandler);
+------------------------------------
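
  Filters of this kind wrap another handler and forward only the triples they accept, so the
  extraction must be fed the outermost handler of the chain. The following is a minimal,
  self-contained sketch of that wrapping pattern. It does not use Any23: the types
  SimpleTripleHandler, CollectingHandler and DropPredicateFilter, the example.org URIs and the
  blocked stylesheet predicate are all hypothetical stand-ins chosen for illustration, not the
  library's API or its actual filtering rules.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: none of these types belong to Any23. */
public class FilterChainSketch {

    /** Hypothetical, simplified stand-in for a triple handler. */
    interface SimpleTripleHandler {
        void receiveTriple(String subject, String predicate, String object);
    }

    /** Terminal handler: stores each received triple as one N-Triples-like line. */
    static class CollectingHandler implements SimpleTripleHandler {
        final List<String> lines = new ArrayList<>();

        @Override
        public void receiveTriple(String subject, String predicate, String object) {
            lines.add(subject + " " + predicate + " " + object + " .");
        }
    }

    /** Filter: forwards a triple to the wrapped handler unless its predicate is blocked. */
    static class DropPredicateFilter implements SimpleTripleHandler {
        private final SimpleTripleHandler wrapped;
        private final String blockedPredicate;

        DropPredicateFilter(SimpleTripleHandler wrapped, String blockedPredicate) {
            this.wrapped = wrapped;
            this.blockedPredicate = blockedPredicate;
        }

        @Override
        public void receiveTriple(String subject, String predicate, String object) {
            if (!blockedPredicate.equals(predicate)) {
                wrapped.receiveTriple(subject, predicate, object);
            }
        }
    }

    // Assumed example predicate; picked only to have something to filter out.
    static final String STYLESHEET = "<http://www.w3.org/1999/xhtml/vocab#stylesheet>";

    /**
     * Emits two sample triples either through the filter chain or straight
     * into the sink, and returns what the sink collected.
     */
    public static List<String> collect(boolean useFilterChain) {
        CollectingHandler sink = new CollectingHandler();
        SimpleTripleHandler chain = new DropPredicateFilter(sink, STYLESHEET);
        SimpleTripleHandler target = useFilterChain ? chain : sink;

        target.receiveTriple("<http://example.org/page>",
                "<http://purl.org/dc/terms/title>", "\"Example\"");
        target.receiveTriple("<http://example.org/page>",
                STYLESHEET, "<http://example.org/style.css>");
        return sink.lines;
    }

    public static void main(String[] args) {
        System.out.println("through the chain: " + collect(true).size() + " triple(s)");
        System.out.println("bypassing the chain: " + collect(false).size() + " triple(s)");
    }
}
```

  In this sketch, feeding the chain leaves one triple in the sink, while feeding the sink
  directly leaves both, which is why the innermost writer handler should not be the one passed
  to the extraction call.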