Apache Any23 - Getting started

Getting started with Apache Any23

Apache Any23 can be used:

* via CLI (command line interface) from your preferred shell environment;
* as a RESTful Webservice;
* as a library.

Apache Any23 Modules

Apache Any23 is composed of the following modules:

any23-core/ The core library.
any23-service/ The REST service.
any23-plugins/ The core additional plugins.

Use the Apache Any23 CLI

The command-line tools support is provided by the any23-core module.

Once Apache Any23 has been correctly installed, you can use it as a command line tool through the shell scripts in the any23-core/bin directory. Scripts are provided for both Unix (Linux/OSX) and Windows.

The any23 script provides analysis, documentation, testing and debugging utilities.

Running ./bin/any23 without a command prints the usage information:

any23-core$ ./bin/any23
A command must be specified.
Usage: any23 [options] [command] [command options]
Options:
-h, --help Display help information.
Default: false
--plugins-dir The Any23 plugins directory.
Default: ~/.any23/plugins
-X, --verbose Produce execution verbose output.
Default: false
-v, --version Display version information.
Default: false
Commands:
extractor Utility for obtaining documentation about metadata extractors.
Usage: extractor [options] Extractor name
Options:
-a, --all shows a report about all available extractors
Default: false
-i, --input shows example input for the given extractor
Default: false
-l, --list shows the names of all available extractors
Default: false
-o, --outut shows example output for the given extractor
Default: false

microdata Commandline Tool for extracting Microdata from file/HTTP source.
Usage: microdata [options] Input document URL, {http://path/to/resource.html|file:/path/to/local.file}
mimes MIME Type Detector Tool.
Usage: mimes [options] Input document URL, {http://path/to/resource.html|file:///path/to/local.file|inline:// some inline content}
verify Utility for plugin management verification.
Usage: verify [options] plugins-dir
rover Any23 Command Line Tool.
Usage: rover [options] input URIs {<url>|<file>}+
Options:
-d, --defaultns Override the default namespace used to produce
statements.
-e, --extractors a comma-separated list of extractors, e.g.
rdf-xml,rdf-turtle
Default: []
-f, --format the output format
Default: turtle
-l, --log Produce log within a file.
-n, --nesting Disable production of nesting triples.
Default: false
-t, --notrivial Filter trivial statements (e.g. CSS related ones).
Default: false
-o, --output Specify Output file (defaults to standard output)
Default: java.io.PrintStream@79dfc547
-p, --pedantic Validate and fixes HTML content detecting commons
issues.
Default: false
-s, --stats Print out extraction statistics.
Default: false

vocab Prints out the RDF Schema of the vocabularies used by Any23.
Usage: vocab [options]
Options:
-f, --format Vocabulary output format
Default: NQuads

The any23 script detects the utilities available on the any23-core and plugins classpath and lets you run them.

The any23-core CLI tools are:

extractor: a utility for obtaining useful information about extractors.
microdata: commandline parser to extract specific Microdata content from a web page (local or remote) and produce a JSON output compliant with the Microdata specification (http://www.w3.org/TR/microdata/).
mimes: detects the MIME Type for any HTTP / file / direct input resource.
verify: a utility for verifying Apache Any23 plugins.
rover: the RDF extraction tool.
vocab: dumps all the RDF Schema vocabularies declared within Apache Any23.

The Rover tool

Rover is the main extraction tool. It extracts metadata from local and remote (HTTP) resources and lets you specify a custom list of extractors, the desired output format, and other flags that suppress noise and generate advanced reports.

Extract metadata from an HTML page:

any23-core$ ./bin/any23 rover http://yourdomain/yourfile

Extract metadata from a local resource:

any23-core$ ./bin/any23 rover myfoaf.rdf

To specify the output format, use the option "-f" or "--format" (the default output format is TURTLE):

any23-core$ ./bin/any23 rover -f quad myfoaf.rdf

Filtering trivial statements

By default, Apache Any23 extracts HTML/head meta information, such as links to CSS stylesheets or metadata like the author or the software used to create the HTML. If you are only interested in the structured content of the HTML/body tag, you can filter out these trivial statements with the "-t" command line argument.

any23-core$ ./bin/any23 rover -t -f quad myfoaf.rdf
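
When several inputs need the same rover options, the invocation above can be wrapped in a small shell function. The following is a minimal sketch and not part of the Any23 distribution: the function name rover_each and the ANY23_BIN variable are hypothetical, and it assumes the launcher is run from the any23-core directory. Only the "-t" and "-f" options described above are used.

```shell
#!/bin/sh
# Path to the any23 launcher (assumption: run from the any23-core directory).
ANY23_BIN="${ANY23_BIN:-./bin/any23}"

# rover_each FORMAT INPUT...
# Hypothetical helper: runs rover once per input, filtering trivial
# statements (-t) and writing the requested output format (-f).
# Stops and returns 1 at the first input that fails.
rover_each() {
    format="$1"
    shift
    for input in "$@"; do
        "$ANY23_BIN" rover -t -f "$format" "$input" || return 1
    done
}
```

For example, rover_each quad myfoaf.rdf http://yourdomain/yourfile would process both inputs with the same settings and write the results to standard output.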

The ExtractorDocumentation tool

The ExtractorDocumentation tool returns human-readable information about the registered extractors.

List all the available extractors:

any23-core/core$ ./bin/any23 extractor --list
                      csv [class org.apache.any23.extractor.csv.CSVExtractor]
           html-head-icbm [class org.apache.any23.extractor.html.ICBMExtractor]
          html-head-links [class org.apache.any23.extractor.html.HeadLinkExtractor]
          html-head-title [class org.apache.any23.extractor.html.TitleExtractor]
              html-mf-adr [class org.apache.any23.extractor.html.AdrExtractor]
              html-mf-geo [class org.apache.any23.extractor.html.GeoExtractor]
        html-mf-hcalendar [class org.apache.any23.extractor.html.HCalendarExtractor]
            html-mf-hcard [class org.apache.any23.extractor.html.HCardExtractor]
         html-mf-hlisting [class org.apache.any23.extractor.html.HListingExtractor]
          html-mf-hrecipe [class org.apache.any23.extractor.html.HRecipeExtractor]
          html-mf-hresume [class org.apache.any23.extractor.html.HResumeExtractor]
          html-mf-hreview [class org.apache.any23.extractor.html.HReviewExtractor]
          html-mf-license [class org.apache.any23.extractor.html.LicenseExtractor]
          html-mf-species [class org.apache.any23.extractor.html.SpeciesExtractor]
              html-mf-xfn [class org.apache.any23.extractor.html.XFNExtractor]
           html-microdata [class org.apache.any23.extractor.microdata.MicrodataExtractor]
              html-rdfa11 [class org.apache.any23.extractor.rdfa.RDFa11Extractor]
       html-script-turtle [class org.apache.any23.extractor.html.TurtleHTMLExtractor]
                   rdf-nq [class org.apache.any23.extractor.rdf.NQuadsExtractor]
                   rdf-nt [class org.apache.any23.extractor.rdf.NTriplesExtractor]
                 rdf-trix [class org.apache.any23.extractor.rdf.TriXExtractor]
               rdf-turtle [class org.apache.any23.extractor.rdf.TurtleExtractor]
                  rdf-xml [class org.apache.any23.extractor.rdf.RDFXMLExtractor]
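
The names in this list are the values that rover's "-e" option accepts as a comma-separated list. As a rough sketch of how the two fit together, the helper below joins a set of extractor names and passes them to rover. The function names join_extractors and rover_with and the ANY23_BIN variable are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Path to the any23 launcher (assumption: run from the any23-core directory).
ANY23_BIN="${ANY23_BIN:-./bin/any23}"

# join_extractors NAME...
# Joins the given extractor names with commas. The subshell keeps the
# IFS change from leaking into the caller.
join_extractors() {
    (IFS=,; printf '%s\n' "$*")
}

# rover_with INPUT NAME...
# Hypothetical helper: runs rover on INPUT using only the named extractors.
rover_with() {
    input="$1"
    shift
    "$ANY23_BIN" rover -e "$(join_extractors "$@")" "$input"
}
```

A call such as rover_with http://yourdomain/yourfile html-microdata html-rdfa11 would then restrict the extraction to those two extractors.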

The MicrodataParser tool

The MicrodataParser tool applies only the MicrodataExtractor to a specific input source and returns the extracted data in the JSON format defined in the JSON section of the Microdata specification.

any23-core/core$ ./bin/any23 microdata http://path/to/resource.html

The VocabPrinter tool

The VocabPrinter tool prints out the RDF Schema of all the vocabularies declared by Apache Any23.

Just launch the command below to see all the managed vocabularies.

any23-core/core$ ./bin/any23 vocab

NOTE: This tool is still in beta version.

The MimeDetector tool

The MimeDetector tool detects the MIME type of a given source (http://, file://, inline://).

Examples:

any23-core$ ./bin/any23 mimes http://www.michelemostarda.com/foaf.rdf
application/rdf+xml

any23-core$ ./bin/any23 mimes file://../src/test/resources/application/trix/test1.trx
application/trix

any23-core$ ./bin/any23 mimes 'inline://<http://s> <http://p> <http://o> .'
text/n3
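
To check several sources in one go, the mimes command can be called in a loop. The sketch below assumes, as in the examples above, that mimes prints only the detected MIME type on standard output. The function name mimes_each and the ANY23_BIN variable are hypothetical.

```shell
#!/bin/sh
# Path to the any23 launcher (assumption: run from the any23-core directory).
ANY23_BIN="${ANY23_BIN:-./bin/any23}"

# mimes_each SOURCE...
# Hypothetical helper: prints "SOURCE<TAB>MIME type" for every source.
# Each SOURCE is passed to mimes unchanged, so it must already be an
# http://, file:// or inline:// URL. Returns 1 at the first failure.
mimes_each() {
    for src in "$@"; do
        mime="$("$ANY23_BIN" mimes "$src")" || return 1
        printf '%s\t%s\n' "$src" "$mime"
    done
}
```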

The PluginVerifier tool

The PluginVerifier tool checks the plugins installed in the specified input directory.

Launch the command below to sanity-check the input plugins directory:

any23-core$ ./bin/any23 verify [/path/to/plugins/dir]
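
A script that calls verify may want to fail early when the directory does not exist. The following sketch falls back to ~/.any23/plugins, the default shown for --plugins-dir in the usage output. The function name verify_plugins, the ANY23_BIN variable and the exit code 2 are hypothetical choices for this illustration.

```shell
#!/bin/sh
# Path to the any23 launcher (assumption: run from the any23-core directory).
ANY23_BIN="${ANY23_BIN:-./bin/any23}"

# verify_plugins [DIR]
# Hypothetical helper: checks that DIR exists before handing it to verify.
# DIR defaults to ~/.any23/plugins, the --plugins-dir default.
verify_plugins() {
    dir="${1:-$HOME/.any23/plugins}"
    if [ ! -d "$dir" ]; then
        echo "plugins directory not found: $dir" >&2
        return 2
    fi
    "$ANY23_BIN" verify "$dir"
}
```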

Apache Any23 CLI Plugins

The Apache Any23 ToolRunner CLI (bin/any23tools) supports the auto-detection of Tool plugins within the classpath. For further details see the Plugins section.

The default any23 CLI plugins are listed below.

Crawler Plugin

crawler-tool: the Crawler Plugin provides basic site crawling and metadata extraction capabilities.

any23-core$ ./bin/any23 -h
[...]
    crawler      Any23 Crawler Command Line Tool.
      Usage: crawler [options] input URIs {<url>|<file>}+
  Options:
          -d, --defaultns          Override the default namespace used to
                                   produce statements.
          -e, --extractors         a comma-separated list of extractors, e.g.
                                   rdf-xml,rdf-turtle
                                   Default: []
          -f, --format             the output format
                                   Default: turtle
          -l, --log                Produce log within a file.
          -md, --maxdepth          Max allowed crawler depth.
                                   Default: 2147483647
          -mp, --maxpages          Max number of pages before interrupting
                                   crawl.
                                   Default: 2147483647
          -n, --nesting            Disable production of nesting triples.
                                   Default: false
          -t, --notrivial          Filter trivial statements (e.g. CSS related
                                   ones).
                                   Default: false
          -nc, --numcrawlers       Sets the number of crawlers.
                                   Default: 10
          -o, --output             Specify Output file (defaults to standard
                                   output)
                                   Default: java.io.PrintStream@2911a3a4
          -pf, --pagefilter        Regex used to filter out page URLs during
                                   crawling.
                                   Default: .*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$
          -p, --pedantic           Validate and fixes HTML content detecting
                                   commons issues.
                                   Default: false
          -pd, --politenessdelay   Politeness delay in milliseconds.
                                   Default: 2147483647
          -s, --stats              Print out extraction statistics.
                                   Default: false
          -sf, --storagefolder     Folder used to store crawler temporary data.
                                   Default: /var/folders/zz/9vvv_lbn1cs8dpwz859nmq080000gn/T/crawler-metadata-9ff4c650-10c2-41a1-9d99-ebeb3e7d21ce

A usage example:

any23-core$ ./bin/any23 crawler -s -f ntriples http://www.repubblica.it 1> out.nt 2> repubblica.log
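
According to the help output, --maxdepth and --maxpages both default to 2147483647, so a crawl started without them is effectively unbounded. The sketch below sets explicit limits and an explicit politeness delay. The values 2, 50 and 1000 are illustrative only, and the function name crawl_bounded and the ANY23_BIN variable are hypothetical.

```shell
#!/bin/sh
# Path to the any23 launcher (assumption: run from the any23-core directory).
ANY23_BIN="${ANY23_BIN:-./bin/any23}"

# crawl_bounded SEED_URL OUT_FILE LOG_FILE
# Hypothetical helper: crawls SEED_URL with explicit limits, writing the
# extracted N-Triples to OUT_FILE and the error stream to LOG_FILE.
crawl_bounded() {
    "$ANY23_BIN" crawler \
        --maxdepth 2 \
        --maxpages 50 \
        --politenessdelay 1000 \
        -s -f ntriples \
        "$1" 1> "$2" 2> "$3"
}
```

For instance, crawl_bounded http://www.repubblica.it out.nt repubblica.log mirrors the usage example above with the added limits.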

Use Apache Any23 as a RESTful Web Service

Apache Any23 provides a Web Service that can be used to extract RDF from Web documents. Apache Any23 services can be accessed through a RESTful API.

Running the server

The server command line tool is defined within the any23-service module. To start the server, run the any23server script from the command line:

any23-service$ ./bin/any23server

Once the server is up, open its address in a browser to access the web interface. A live demo of the service is also available online. You can also start the server from Java by running the Apache Any23 Servlet class. Maven can be used to create a WAR file for deployment into an existing servlet container such as Apache Tomcat.

Use Apache Any23 as a Library

See our Developers guide for more details.