Apache Any23 can be used:
Apache Any23 is composed of the following modules:
The command-line tools support is provided by the any23-core module.
Once Apache Any23 has been correctly installed, you can use it as a command-line tool through the shell scripts in the any23-core/bin directory. Scripts are provided for both Unix (Linux/OSX) and Windows.
The any23 script provides analysis, documentation, testing and debugging utilities.
Running the any23 script without a command prints the usage information:
any23-core$ ./bin/any23
A command must be specified.
Usage: any23 [options] [command] [command options]
  Options:
    -h, --help
       Display help information.
       Default: false
        --plugins-dir
       The Any23 plugins directory.
       Default: ~/.any23/plugins
    -X, --verbose
       Produce execution verbose output.
       Default: false
    -v, --version
       Display version information.
       Default: false
  Commands:
    extractor      Utility for obtaining documentation about metadata extractors.
      Usage: extractor [options] Extractor name
        Options:
          -a, --all
             shows a report about all available extractors
             Default: false
          -i, --input
             shows example input for the given extractor
             Default: false
          -l, --list
             shows the names of all available extractors
             Default: false
          -o, --outut
             shows example output for the given extractor
             Default: false

    microdata      Commandline Tool for extracting Microdata from file/HTTP source.
      Usage: microdata [options] Input document URL, {http://path/to/resource.html|file:/path/to/local.file}

    mimes      MIME Type Detector Tool.
      Usage: mimes [options] Input document URL, {http://path/to/resource.html|file:///path/to/local.file|inline:// some inline content}

    verify      Utility for plugin management verification.
      Usage: verify [options] plugins-dir

    rover      Any23 Command Line Tool.
      Usage: rover [options] input URIs {<url>|<file>}+
        Options:
          -d, --defaultns
             Override the default namespace used to produce statements.
          -e, --extractors
             a comma-separated list of extractors, e.g. rdf-xml,rdf-turtle
             Default: []
          -f, --format
             the output format
             Default: turtle
          -l, --log
             Produce log within a file.
          -n, --nesting
             Disable production of nesting triples.
             Default: false
          -t, --notrivial
             Filter trivial statements (e.g. CSS related ones).
             Default: false
          -o, --output
             Specify Output file (defaults to standard output)
             Default: java.io.PrintStream@79dfc547
          -p, --pedantic
             Validate and fixes HTML content detecting commons issues.
             Default: false
          -s, --stats
             Print out extraction statistics.
             Default: false

    vocab      Prints out the RDF Schema of the vocabularies used by Any23.
      Usage: vocab [options]
        Options:
          -f, --format
             Vocabulary output format
             Default: NQuads
The any23 script detects the utilities available on the any23-core and plugins classpath and lets you run them as commands.
The any23-core CLI tools are:
Rover is the main extraction tool. It extracts metadata from local and remote (HTTP) resources and lets you specify a custom list of extractors, the desired output format, and other flags that suppress noise or generate advanced reports.
Extract metadata from an HTML page:
any23-core$ ./bin/any23 rover http://yourdomain/yourfile
Extract metadata from a local resource:
any23-core$ ./bin/any23 rover myfoaf.rdf
To specify the output format, use the "-f" or "--format" option (the default output format is Turtle):
any23-core$ ./bin/any23 rover -f quad myfoaf.rdf
Filtering trivial statements
By default, Apache Any23 also extracts meta information from the HTML head, such as links to CSS stylesheets, the author, or the software used to create the page. If you are only interested in the structured content of the HTML body, you can filter these trivial statements out with the "-t" command-line argument:
any23-core$ ./bin/any23 rover -t -f quad myfoaf.rdf
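The rover options described above can be combined in a small batch wrapper. The following is only a sketch under stated assumptions: the function name extract_all, the ANY23_BIN and DRY_RUN variables, and the choice of naming each output file after the input plus the format name are all hypothetical and not part of Any23. It uses only the "-t", "-f" and "-o" options listed in the rover usage text, and by default it prints the commands instead of running them.

```shell
#!/bin/sh
# Assumed location of the any23 script (hypothetical variable, override as needed).
ANY23_BIN="${ANY23_BIN:-./bin/any23}"
# When DRY_RUN is "yes" (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-yes}"

# extract_all FORMAT INPUT...
#   Runs rover once per input with trivial statements filtered (-t),
#   writing each result to "<input basename>.<FORMAT>" via -o.
extract_all() {
    format="$1"
    shift
    for input in "$@"; do
        out="$(basename "$input").$format"
        if [ "$DRY_RUN" = "yes" ]; then
            printf '%s rover -t -f %s -o %s %s\n' \
                "$ANY23_BIN" "$format" "$out" "$input"
        else
            "$ANY23_BIN" rover -t -f "$format" -o "$out" "$input"
        fi
    done
}

# Dry run over one local file and one remote resource.
extract_all quad myfoaf.rdf http://yourdomain/yourfile
```

Setting DRY_RUN=no makes the same call invoke the any23 script for each input; the output file naming is a convenience of this sketch and can be changed freely.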
The ExtractorDocumentation tool returns human-readable information about the registered extractors.
List all the available extractors:
any23-core/core$ ./bin/any23 extractor --list
csv [class org.apache.any23.extractor.csv.CSVExtractor]
html-head-icbm [class org.apache.any23.extractor.html.ICBMExtractor]
html-head-links [class org.apache.any23.extractor.html.HeadLinkExtractor]
html-head-title [class org.apache.any23.extractor.html.TitleExtractor]
html-mf-adr [class org.apache.any23.extractor.html.AdrExtractor]
html-mf-geo [class org.apache.any23.extractor.html.GeoExtractor]
html-mf-hcalendar [class org.apache.any23.extractor.html.HCalendarExtractor]
html-mf-hcard [class org.apache.any23.extractor.html.HCardExtractor]
html-mf-hlisting [class org.apache.any23.extractor.html.HListingExtractor]
html-mf-hrecipe [class org.apache.any23.extractor.html.HRecipeExtractor]
html-mf-hresume [class org.apache.any23.extractor.html.HResumeExtractor]
html-mf-hreview [class org.apache.any23.extractor.html.HReviewExtractor]
html-mf-license [class org.apache.any23.extractor.html.LicenseExtractor]
html-mf-species [class org.apache.any23.extractor.html.SpeciesExtractor]
html-mf-xfn [class org.apache.any23.extractor.html.XFNExtractor]
html-microdata [class org.apache.any23.extractor.microdata.MicrodataExtractor]
html-rdfa11 [class org.apache.any23.extractor.rdfa.RDFa11Extractor]
html-script-turtle [class org.apache.any23.extractor.html.TurtleHTMLExtractor]
rdf-nq [class org.apache.any23.extractor.rdf.NQuadsExtractor]
rdf-nt [class org.apache.any23.extractor.rdf.NTriplesExtractor]
rdf-trix [class org.apache.any23.extractor.rdf.TriXExtractor]
rdf-turtle [class org.apache.any23.extractor.rdf.TurtleExtractor]
rdf-xml [class org.apache.any23.extractor.rdf.RDFXMLExtractor]
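The extractor names in this listing are the values that rover accepts in its comma-separated "-e" option. As a rough sketch, the listing can be turned into such a list with awk. The helper name extractor_names is hypothetical, and the sketch assumes the tool prints one "name [class ...]" entry per line; the sample input below is copied from the listing rather than read from a live run.

```shell
#!/bin/sh
# extractor_names PREFIX
#   Reads "name [class ...]" lines on standard input and prints the names
#   that start with PREFIX as one comma-separated line.
#   (Hypothetical helper; assumes one extractor entry per input line.)
extractor_names() {
    awk -v prefix="$1" '
        index($1, prefix) == 1 {
            names = names (names == "" ? "" : ",") $1
        }
        END { print names }
    '
}

# Sample lines taken from the listing above.
extractor_names rdf- <<'EOF'
csv [class org.apache.any23.extractor.csv.CSVExtractor]
rdf-turtle [class org.apache.any23.extractor.rdf.TurtleExtractor]
rdf-xml [class org.apache.any23.extractor.rdf.RDFXMLExtractor]
EOF
```

With Any23 installed, the same function could read from the tool directly, for example by piping the output of ./bin/any23 extractor --list into extractor_names html-mf- and passing the result to rover with "-e".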
The MicrodataParser tool applies only the MicrodataExtractor to a given input source and returns the extracted data in the JSON format defined in the JSON section of the Microdata specification.
any23-core/core$ ./bin/any23 microdata http://path/to/resource.html
The VocabPrinter tool prints out the RDF Schema of all the vocabularies declared by Apache Any23.
Just launch the command below to see all the managed vocabularies.
any23-core/core$ ./bin/any23 vocab
NOTE: This tool is still in beta version.
The MimeDetector tool detects the MIME type of a given source (http://, file:// or inline://).
Examples:
any23-core$ ./bin/any23 mimes http://www.michelemostarda.com/foaf.rdf
application/rdf+xml
any23-core$ ./bin/any23 mimes file://../src/test/resources/application/trix/test1.trx
application/trix
any23-core$ ./bin/any23 mimes 'inline://<http://s> <http://p> <http://o> .'
text/n3
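The detected type can drive a simple decision in a script, for instance whether an input is worth passing to an RDF extractor. The sketch below is illustrative only: is_rdf_mime and ANY23_BIN are hypothetical names, the list of types is limited to the three that appear in the examples above, and it assumes that the mimes tool writes nothing but the MIME type to standard output.

```shell
#!/bin/sh
# Assumed location of the any23 script (hypothetical variable).
ANY23_BIN="${ANY23_BIN:-./bin/any23}"

# is_rdf_mime TYPE
#   Succeeds for the RDF MIME types shown in the examples above.
#   The list is deliberately short and can be extended as needed.
is_rdf_mime() {
    case "$1" in
        application/rdf+xml|application/trix|text/n3) return 0 ;;
        *) return 1 ;;
    esac
}

# Classify a type string directly.
if is_rdf_mime "application/rdf+xml"; then
    echo "application/rdf+xml: RDF"
fi

# With Any23 installed, take the type from the mimes tool instead.
# Assumption: the tool prints only the MIME type on standard output.
if [ -x "$ANY23_BIN" ]; then
    detected="$("$ANY23_BIN" mimes "file://../src/test/resources/application/trix/test1.trx")"
    if is_rdf_mime "$detected"; then
        echo "detected RDF type: $detected"
    fi
fi
```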
The Apache Any23 ToolRunner CLI (bin/any23tools) supports the automatic detection of Tool plugins on the classpath. For further details, see the Plugins section.
The default any23 CLI plugins are listed below.
crawler-tool
The Crawler Plugin provides basic site crawling and metadata extraction capabilities.
any23-core$ ./bin/any23 -h
[...]
    crawler      Any23 Crawler Command Line Tool.
      Usage: crawler [options] input URIs {<url>|<file>}+
        Options:
          -d, --defaultns
             Override the default namespace used to produce statements.
          -e, --extractors
             a comma-separated list of extractors, e.g. rdf-xml,rdf-turtle
             Default: []
          -f, --format
             the output format
             Default: turtle
          -l, --log
             Produce log within a file.
          -md, --maxdepth
             Max allowed crawler depth.
             Default: 2147483647
          -mp, --maxpages
             Max number of pages before interrupting crawl.
             Default: 2147483647
          -n, --nesting
             Disable production of nesting triples.
             Default: false
          -t, --notrivial
             Filter trivial statements (e.g. CSS related ones).
             Default: false
          -nc, --numcrawlers
             Sets the number of crawlers.
             Default: 10
          -o, --output
             Specify Output file (defaults to standard output)
             Default: java.io.PrintStream@2911a3a4
          -pf, --pagefilter
             Regex used to filter out page URLs during crawling.
             Default: .*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$
          -p, --pedantic
             Validate and fixes HTML content detecting commons issues.
             Default: false
          -pd, --politenessdelay
             Politeness delay in milliseconds.
             Default: 2147483647
          -s, --stats
             Print out extraction statistics.
             Default: false
          -sf, --storagefolder
             Folder used to store crawler temporary data.
             Default: /var/folders/zz/9vvv_lbn1cs8dpwz859nmq080000gn/T/crawler-metadata-9ff4c650-10c2-41a1-9d99-ebeb3e7d21ce
A usage example:
any23-core$ ./bin/any23 crawler -s -f ntriples http://www.repubblica.it 1> out.nt 2> repubblica.log
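The usage text lists 2147483647 as the default for both the maximum depth and the maximum number of pages, so the example above places no practical bound on the crawl. A bounded variant might look like the sketch below. The function name bounded_crawl_cmd and the ANY23_BIN variable are hypothetical, the limits shown are arbitrary illustration values, and the function only prints the command line so that it can be reviewed before being run.

```shell
#!/bin/sh
# Assumed location of the any23 script (hypothetical variable).
ANY23_BIN="${ANY23_BIN:-./bin/any23}"

# bounded_crawl_cmd DEPTH PAGES DELAY_MS SEED_URL
#   Prints a crawler command line with explicit -md, -mp and -pd values.
#   Fails if any of the three limits is not a non-negative integer.
bounded_crawl_cmd() {
    for n in "$1" "$2" "$3"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "not a non-negative integer: $n" >&2
                return 1
                ;;
        esac
    done
    printf '%s crawler -md %s -mp %s -pd %s -s -f ntriples %s\n' \
        "$ANY23_BIN" "$1" "$2" "$3" "$4"
}

# Depth 2, at most 100 pages, 500 ms between requests (illustrative values).
bounded_crawl_cmd 2 100 500 http://www.repubblica.it
```

The printed command can be run as is, with the same output redirections as in the usage example above if the triples and the log should go to separate files.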
Apache Any23 provides a Web Service that can be used to extract RDF from Web documents. Apache Any23 services can be accessed through a RESTful API.
Running the server
The server command line tool is defined within the any23-service module. Run the any23server script
any23-service$ ./bin/any23server
from the command line to start the server, then open the server's address in a browser to access the web interface. A live demo of the service is also available. You can also start the server from Java by running the Apache Any23 Servlet class. Maven can be used to create a WAR file for deployment into an existing servlet container such as Apache Tomcat.
See our Developers guide for more details.