------
                                    Apache Any23 - Plugins - Basic Crawler
                                    ------
                              The Apache Software Foundation
                                    ------
                                     2011-2012

~~  Licensed to the Apache Software Foundation (ASF) under one or more
~~  contributor license agreements.  See the NOTICE file distributed with
~~  this work for additional information regarding copyright ownership.
~~  The ASF licenses this file to You under the Apache License, Version 2.0
~~  (the "License"); you may not use this file except in compliance with
~~  the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~  Unless required by applicable law or agreed to in writing, software
~~  distributed under the License is distributed on an "AS IS" BASIS,
~~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~  See the License for the specific language governing permissions and
~~  limitations under the License.

Basic Crawler Plugin

  The <Basic Crawler Plugin> implements a <CLI> {{{./xref/org/apache/any23/cli/Tool.html}Tool}} extending
  {{{./xref/org/apache/any23/cli/Rover.html}Rover}} to add <site crawling> capabilities.

  The tool can be used to extract semantic content from a small/medium size sites.

  To use it make sure to have correctly configured the basic-crawler plugin to be found by the
  <any23tools> script (follow the {{{./any23-plugins.html}Plugins}} section instructions):

+--------------------------------------------------------------
core/bin/$ ./any23tools Crawler
usage: [{<url>|<file>}]+ [-d <arg>] [-e <arg>] [-f <arg>] [-h] [-l <arg>]
       [-maxdepth <arg>] [-maxpages <arg>] [-n] [-numcrawlers <arg>] [-o
       <arg>] [-p] [-pagefilter <arg>] [-politenessdelay <arg>] [-s]
       [-storagefolder <arg>] [-t] [-v]
 -d,--defaultns <arg>       Override the default namespace used to produce
                            statements.
 -e <arg>                   Specify a comma-separated list of extractors,
                            e.g. rdf-xml,rdf-turtle.
 -f,--Output format <arg>   [turtle (default), rdfxml, ntriples, nquads,
                            trix, json, uri]
 -h,--help                  Print this help.
 -l,--log <arg>             Produce log within a file.
 -maxdepth <arg>            Max allowed crawler depth. Default: no limit.
 -maxpages <arg>            Max number of pages before interrupting crawl.
                            Default: no limit.
 -n,--nesting               Disable production of nesting triples.
 -numcrawlers <arg>         Sets the number of crawlers. Default: 10
 -o,--output <arg>          Specify Output file (defaults to standard
                            output).
 -p,--pedantic              Validate and fixes HTML content detecting
                            commons issues.
 -pagefilter <arg>          Regex used to filter out page URLs during
                            crawling. Default:
                            '.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|
                            mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|sm
                            il|pdf|swf|zip|rar|gz|xml|txt))$'
 -politenessdelay <arg>     Politeness delay in milliseconds. Default: no
                            limit.
 -s,--stats                 Print out extraction statistics.
 -storagefolder <arg>       Folder used to store crawler temporary data.
                            Default:
                            [/var/folders/d5/c_0b4h1d7t1gx6tzz_dn5cj40000g
                            q/T/]
 -t,--notrivial             Filter trivial statements (e.g. CSS related
                            ones).
 -v,--verbose               Show debug and progress information.
+--------------------------------------------------------------