 ------
 Apache Any23 - Developers Guide
 ------
 The Apache Software Foundation
 ------
 2011-2012

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

Architectural Overview

[./images/any23-overall.png]

 The informal architectural diagram above shows the <<Any23>> logical modules, the main data flow, and the code packages that implement each module.

 The first module, <>, retrieves raw data from the Web; its implementation package is <>. The data it collects is analyzed by the <> module, implemented in package <>, which determines the data encoding and the content type. The detected MIME type is then used to select the list of extractors that can be activated for the subsequent metadata extraction.

 The next phase is performed by the <<Validation and Patching>> module (<>). It is required because most of the data exposed on the Web is affected by minor issues that prevent some extractors from working correctly. To overcome such problems, <<Any23>> introduced a mechanism that detects issues and, in most cases, fixes them. Detection and fixing are performed using an extensible collection of <>. Currently, Validation and Patching is applied only to HTML documents.
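 The analysis and patching steps described above can be pictured with a small, self-contained sketch. It does not use the Any23 API: the class <<<AnalysisSketch>>>, the <<<Rule>>> interface, the extractor registry and the sniffing heuristic are all hypothetical names and simplifications chosen for illustration. It only shows the shape of the flow, in which a detected MIME type selects the extractors and an extensible list of rules detects and patches issues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: none of these types or names belong to the Any23 API. */
final class AnalysisSketch {

    /** Hypothetical rule: detects one issue and knows how to patch it. */
    interface Rule {
        String name();
        boolean detects(String document);
        String fix(String document);
    }

    /** Hypothetical registry: which extractors may be activated for a MIME type. */
    static final Map<String, List<String>> EXTRACTORS_BY_MIME = Map.of(
            "text/html", List.of("microformats", "microdata", "xpath"),
            "text/csv", List.of("csv"));

    /** Naive content sniffing that stands in for real MIME type detection. */
    static String detectMimeType(String content) {
        String head = content.trim().toLowerCase(Locale.ROOT);
        if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
            return "text/html";
        }
        // Assumption: a comma on the first line is treated as a sign of CSV data.
        String firstLine = head.split("\n", 2)[0];
        return firstLine.contains(",") ? "text/csv" : "text/plain";
    }

    /** Selects the extractors that can be activated for the detected MIME type. */
    static List<String> selectExtractors(String mimeType) {
        return EXTRACTORS_BY_MIME.getOrDefault(mimeType, List.of());
    }

    /** Applies every rule in order; each detected issue is reported and patched. */
    static String validateAndPatch(String document, List<Rule> rules, List<String> report) {
        String patched = document;
        for (Rule rule : rules) {
            if (rule.detects(patched)) {
                report.add(rule.name());
                patched = rule.fix(patched);
            }
        }
        return patched;
    }

    /** Example rule (hypothetical): an HTML document that lacks its closing tag. */
    static final Rule MISSING_HTML_CLOSE = new Rule() {
        public String name() { return "missing-html-close"; }
        public boolean detects(String document) { return !document.contains("</html>"); }
        public String fix(String document) { return document + "</html>"; }
    };

    public static void main(String[] args) {
        String raw = "<html><body><p>Hello</p></body>";
        String mimeType = detectMimeType(raw);
        List<String> report = new ArrayList<>();
        // Mirrors the text: patching is attempted only for HTML input.
        String patched = "text/html".equals(mimeType)
                ? validateAndPatch(raw, List.of(MISSING_HTML_CLOSE), report)
                : raw;
        System.out.println(mimeType + " -> " + selectExtractors(mimeType));
        System.out.println("issues: " + report);
        System.out.println(patched);
    }
}
```

 Keeping the rules in a plain list is what makes the collection extensible in this sketch: a new check is added by appending another <<<Rule>>> instance, without touching the loop that applies them.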
 The <> module, implemented within the <> package, applies all the extractors activated by the analysis phase and generates a stream of RDF statements together with an issue report. The statements produced by the extractors can be filtered to remove spurious, repeated or unwanted triples using the <> module (<>).

 The last metadata extraction phase converts the filtered statements into an RDF representation format. This is done with one of the RDF writers provided by the <> module (<>).

 The modules shown at the bottom of the diagram add auxiliary functionality on top of the core application. The <<Plugin Manager>> module (<>) extends the platform by detecting and registering, at runtime, additional components found on the classpath. The Plugin Manager can currently detect and register new Extractors, Writers and CLI tools. Plugin support is planned for all the modules marked as (P).

 The <> module (org.apache.any23.cli) runs all the available CLI tools through a unified interface.

 The <> module (org.apache.any23.service) implements a REST service that exposes Any23 as a Web service.

Developers Guide

 This section introduces some <<Any23>> programming fundamentals.

 * {{{./dev-data-extraction.html}Data Extraction}} Explains how to extract RDF data from HTTP resources with <<Any23>>.

 * {{{./dev-data-conversion.html}Data Conversion}} Shows how to perform RDF data conversion with <<Any23>>.

 * {{{./dev-validation-fix.html}Validation and Fixing}} Demonstrates how to define validation and correction rules for HTML content with <<Any23>>.

 * {{{./dev-xpath-extractor.html}XPath Extractor}} Explains how to write custom scraping rules for extracting RDF data from any HTML content with <<Any23>>.

 * {{{./dev-microformat-extractors.html}Microformat Extractors}} Explains how to write new Microformat extractors with <<Any23>> and also reports some notes on how microformat nesting is represented.
 * {{{./dev-microdata-extractor.html}Microdata Extractor}} Explains how the Microdata Extractor embedded in <<Any23>> works.

 * {{{./dev-csv-extractor.html}CSV Extractor}} Explains how the CSV Extractor embedded in <<Any23>> works.
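 Before following the guides listed above, it may help to see the last three phases from the architectural overview (extraction output, filtering, writing) as a minimal sketch. This is not the Any23 API: <<<ExtractionSketch>>>, <<<Statement>>>, <<<filter>>> and <<<write>>> are hypothetical names, "spurious" is approximated here as "has an empty object", and the writer emits N-Triples-style lines with simplified escaping.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Illustrative only: none of these types or names belong to the Any23 API. */
final class ExtractionSketch {

    /** Minimal RDF statement: IRI subject and predicate, plain literal object. */
    static final class Statement {
        final String subject;
        final String predicate;
        final String object;

        Statement(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Statement)) {
                return false;
            }
            Statement that = (Statement) other;
            return subject.equals(that.subject)
                    && predicate.equals(that.predicate)
                    && object.equals(that.object);
        }

        @Override
        public int hashCode() {
            return Objects.hash(subject, predicate, object);
        }
    }

    /** Filter phase: drops repeated statements and those with an empty object. */
    static List<Statement> filter(List<Statement> statements) {
        Set<Statement> unique = new LinkedHashSet<>(); // keeps first-seen order
        for (Statement s : statements) {
            if (!s.object.isEmpty()) {
                unique.add(s);
            }
        }
        return new ArrayList<>(unique);
    }

    /** Writer phase: one N-Triples-style line per statement (escaping simplified). */
    static String write(List<Statement> statements) {
        StringBuilder out = new StringBuilder();
        for (Statement s : statements) {
            String literal = s.object.replace("\\", "\\\\").replace("\"", "\\\"");
            out.append('<').append(s.subject).append("> <")
               .append(s.predicate).append("> \"")
               .append(literal).append("\" .\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String doc = "http://example.org/doc";
        String title = "http://purl.org/dc/terms/title";
        // Stand-in for the statement stream produced by the extractors.
        List<Statement> extracted = List.of(
                new Statement(doc, title, "Hello"),
                new Statement(doc, title, "Hello"),   // repeated
                new Statement(doc, title, ""));       // treated as spurious
        System.out.print(write(filter(extracted)));
    }
}
```

 Separating <<<filter>>> from <<<write>>> mirrors the overview: the same filtered statements could be handed to a different writer to obtain another RDF representation format.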