 ------
 Apache Any23 - Plugins - Basic Crawler
 ------
 The Apache Software Foundation
 ------
 2011-2012

~~  Licensed to the Apache Software Foundation (ASF) under one or more
~~  contributor license agreements.  See the NOTICE file distributed with
~~  this work for additional information regarding copyright ownership.
~~  The ASF licenses this file to You under the Apache License, Version 2.0
~~  (the "License"); you may not use this file except in compliance with
~~  the License.  You may obtain a copy of the License at
~~
~~       http://www.apache.org/licenses/LICENSE-2.0
~~
~~  Unless required by applicable law or agreed to in writing, software
~~  distributed under the License is distributed on an "AS IS" BASIS,
~~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~  See the License for the specific language governing permissions and
~~  limitations under the License.

Basic Crawler Plugin

    The Basic Crawler Plugin implements a {{{./xref/org/apache/any23/cli/Tool.html}Tool}}
    extending {{{./xref/org/apache/any23/cli/Rover.html}Rover}} to add crawling capabilities.
    The tool can be used to extract semantic content from small and medium-sized sites.

    To use it, make sure the basic-crawler plugin is configured so that the any23tools script
    can find it (follow the instructions in the {{{./any23-plugins.html}Plugins}} section):

+--------------------------------------------------------------
core/bin/$ ./any23tools Crawler
usage: [{|}]+ [-d ] [-e ] [-f ] [-h] [-l ] [-maxdepth ] [-maxpages ] [-n]
       [-numcrawlers ] [-o ] [-p] [-pagefilter ] [-politenessdelay ] [-s]
       [-storagefolder ] [-t] [-v]
 -d,--defaultns         Override the default namespace used to
                        produce statements.
 -e                     Specify a comma-separated list of
                        extractors, e.g. rdf-xml,rdf-turtle.
 -f,--Output format     [turtle (default), rdfxml, ntriples, nquads,
                        trix, json, uri]
 -h,--help              Print this help.
 -l,--log               Produce log within a file.
 -maxdepth              Max allowed crawler depth. Default: no limit.
 -maxpages              Max number of pages before interrupting
                        crawl. Default: no limit.
 -n,--nesting           Disable production of nesting triples.
 -numcrawlers           Sets the number of crawlers. Default: 10
 -o,--output            Specify Output file (defaults to standard
                        output).
 -p,--pedantic          Validate and fixes HTML content detecting
                        commons issues.
 -pagefilter            Regex used to filter out page URLs during
                        crawling. Default:
                        '.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|
                        mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|sm
                        il|pdf|swf|zip|rar|gz|xml|txt))$'
 -politenessdelay       Politeness delay in milliseconds. Default:
                        no limit.
 -s,--stats             Print out extraction statistics.
 -storagefolder         Folder used to store crawler temporary data.
                        Default:
                        [/var/folders/d5/c_0b4h1d7t1gx6tzz_dn5cj40000g
                        q/T/]
 -t,--notrivial         Filter trivial statements (e.g. CSS related
                        ones).
 -v,--verbose           Show debug and progress information.
+--------------------------------------------------------------
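
    The default <<<-pagefilter>>> expression listed in the help output can be tried out on its
    own before starting a crawl. The sketch below is not part of Any23: it copies the default
    regex (with the line-wrap whitespace removed) into Python and checks which URLs it would
    match. The helper name <<<is_filtered_out>>> and the example.org URLs are hypothetical, and
    whole-URL, case-sensitive matching is an assumption; the crawler itself may normalize URLs
    (for example by lowercasing them) before applying the filter.

```python
import re

# Default -pagefilter value, copied from the help output with the
# whitespace introduced by line wrapping removed.
DEFAULT_PAGE_FILTER = re.compile(
    r".*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov"
    r"|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$"
)


def is_filtered_out(url: str) -> bool:
    """Return True if the URL matches the filter, i.e. would be skipped.

    Assumption: the regex is applied to the whole URL, case-sensitively.
    """
    return DEFAULT_PAGE_FILTER.fullmatch(url) is not None


if __name__ == "__main__":
    # Hypothetical URLs, used only to illustrate the pattern.
    samples = [
        "http://example.org/index.html",
        "http://example.org/style.css",
        "http://example.org/photo.jpeg",
        "http://example.org/feed.xml",
        "http://example.org/doc.pdf?download=1",
    ]
    for url in samples:
        status = "filtered out" if is_filtered_out(url) else "crawled"
        print(f"{url}: {status}")
```

    Note that the default expression also excludes <<<.xml>>> and <<<.txt>>> resources, and that
    a URL carrying a query string after the extension no longer ends with that extension, so
    under these assumptions it is not matched. A custom expression passed to <<<-pagefilter>>>
    can be checked the same way.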