A p a c h e D r o i d s
--------------------------
by Thorsten Scherler

+-----------------------------------------------------------+
| HEADSUP:                                                  |
| !!! Please ONLY crawl localhost NEVER an internet site!!! |
| The first implementation does not honor robots.txt rules! |
| DO NOT TRY ANY APACHE SITE!                               |
+-----------------------------------------------------------+

+-----------------------------------------------------------+
| HEADSUP:                                                  |
| The underlying API is in PoC state, meaning it is totally |
| open for discussion and HIGHLY likely to change without   |
| further notice.                                           |
|                                                           |
| The plugins implemented are likewise in PoC state.        |
|                                                           |
| If you are a brave soul, subscribe to                     |
| labs@labs.apache.org and let us know that you are         |
| developing with Droids. We will then try to stabilize     |
| the API with your feedback.                               |
+-----------------------------------------------------------+

What is this?
-------------
Droids aims to be an intelligent standalone robot framework that lets
you create robots as plugins which automatically seek out relevant
online information based on the user's specifications.

For the core I took Nutch and ripped out and modified its awesome
plugin/extension framework. Droids makes it very easy to extend an
existing robot or write a new one.

The first implementation is crawler-x-m02y07, a simple crawler that is
easily extensible via plugins. If a project or application needs
special processing for a crawled URL, you can either write a few
plugins for an existing crawler or write a new crawler, which is also
very easy.

Why was it created?
-------------------
Mainly out of personal curiosity. The background of this work is that
Cocoon trunk no longer provides a crawler, and Forrest is based on it,
meaning we cannot update until we find a crawler replacement. Having
become more involved in Solr and Nutch, I also see requests for a
generic standalone crawler.

What does the first implementation crawler-x-m02y07 look like?
--------------------------------------------------------------
I wrote some proof-of-concept plugins that make up crawler-x-m02y07 to
- crawl a URL
- extract links (the only extraction implemented at the moment) via a
  parse-html plugin
- merge the extracted links with the queue
- save or print out the crawled pages.
A sketch of this loop appears in the appendix at the end of this
README.

Why crawler-x-m02y07?
---------------------
Droids tries to be a framework for different droids. The first
implementation is a "crawler" with the name "x", first archived in the
second "m"onth of the "y"ear 20"07".

Requirements
************
* Apache Ant version 1.6.5
** copy ./tools/ivy/i
* JDK 1.5 or higher
** If using JDK 1.5 (which does not bundle the StAX API):
** cd lib/
** wget http://www.ibiblio.org/maven2/stax/stax-api/1.0/stax-api-1.0.jar

Running
*******
Build:
  ant

Set the initial URL:
  echo "droids.initial.url=http://localhost/index.html" > build.properties

Run:
  ant crawl

HEADSUP
*******
!!! Please ONLY crawl localhost NEVER an internet site!!!

The parse-html plugin assumes that the incoming stream is valid XML!

You will need to adjust the urlfilters to limit crawl loops (see the
filter sketch in the appendix).

Links
-----
http://lucene.apache.org/nutch/ - Nutch web-search software
http://www.robotstxt.org/wc/robots.html - The Web Robots Pages
http://www.few.vu.nl/~andreas/programming/webcrawler/index.html - How to write a multi-threaded webcrawler
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/ - Writing a Web Crawler in the Java Programming Language

Outro
-----
Hope you enjoy it. Please report feedback to the labs mailing list.

TIA
salu2
thorsten
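
Appendix: illustrative sketches (NOT the Droids API)
----------------------------------------------------
The snippets below are sketches only. Every class and method name in
them is invented for this README; none of it is the actual Droids API,
which (see the HEADSUP above) is completely in flux. The code sticks to
JDK 1.5 constructs to match the requirements above.

The first sketch shows the loop that crawler-x-m02y07's plugins
implement: take a URL from the queue, fetch it, extract links, merge
the unseen ones back into the queue, and save or print the page.

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Hypothetical crawl loop; all names invented for illustration. */
    public class CrawlLoopSketch {

        // Crude href extraction; a real parse-html plugin would do
        // this properly (see the next sketch).
        private static final Pattern HREF =
                Pattern.compile("href=\"(http://localhost[^\"]*)\"");

        public static void main(String[] args) throws Exception {
            // droids.initial.url, normally read from build.properties
            String initial = "http://localhost/index.html";

            Queue<String> queue = new LinkedList<String>();
            Set<String> seen = new HashSet<String>();
            queue.add(initial);
            seen.add(initial);

            while (!queue.isEmpty()) {
                String url = queue.poll();

                // 1. crawl the URL
                String page = fetch(url);

                // 2. extract links
                List<String> links = extractLinks(page);

                // 3. merge unseen links with the queue (a real crawler
                //    would also apply its urlfilters here; see the
                //    last sketch in this appendix)
                for (String link : links) {
                    if (seen.add(link)) {
                        queue.add(link);
                    }
                }

                // 4. save or print out the crawled page
                System.out.println("crawled " + url
                        + " (" + page.length() + " chars)");
            }
        }

        private static String fetch(String url) throws Exception {
            InputStream in = new URL(url).openStream();
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toString("UTF-8");
            } finally {
                in.close();
            }
        }

        private static List<String> extractLinks(String page) {
            List<String> links = new LinkedList<String>();
            Matcher m = HREF.matcher(page);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        }
    }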
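
The HEADSUP section notes that the parse-html plugin assumes valid XML.
The next sketch shows why such a constraint arises: if link extraction
is built on a strict XML parser, anything that is not well-formed makes
the parse fail. The class is again hypothetical; only the JDK parser
calls are real.

    import java.io.ByteArrayInputStream;
    import java.util.LinkedList;
    import java.util.List;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    /** Hypothetical parse-html style step on a strict XML parser. */
    public class LinkExtractorSketch {

        public static List<String> extractLinks(byte[] page)
                throws Exception {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Throws a SAXException on anything that is not
            // well-formed XML -- hence the "valid XML" warning in the
            // HEADSUP section above.
            Document doc = builder.parse(new ByteArrayInputStream(page));

            List<String> links = new LinkedList<String>();
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                String href = a.getAttribute("href");
                if (href.length() > 0) {
                    links.add(href);
                }
            }
            return links;
        }

        public static void main(String[] args) throws Exception {
            String xhtml = "<html><body>"
                    + "<a href=\"http://localhost/a.html\">a</a>"
                    + "<a href=\"http://localhost/b.html\">b</a>"
                    + "</body></html>";
            System.out.println(extractLinks(xhtml.getBytes("UTF-8")));
        }
    }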
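
Finally, a guess at what adjusting the urlfilters could amount to: an
include/exclude pair of regular expressions, in the spirit of Nutch's
regex URL filters, that keeps the crawl on localhost and skips URLs
likely to produce loops. The patterns are examples to adapt, not
shipped defaults.

    import java.util.regex.Pattern;

    /** Hypothetical regex URL filter; adjust patterns to your site. */
    public class UrlFilterSketch {

        // Only follow plain http URLs on localhost.
        private static final Pattern INCLUDE =
                Pattern.compile("^http://localhost(/.*)?$");

        // Skip binary payloads and query strings -- a cheap way to
        // dodge calendar-style infinite loops.
        private static final Pattern EXCLUDE =
                Pattern.compile(".*(\\.(gif|jpg|png|css|js|zip|gz)|\\?.*)$");

        public static boolean accept(String url) {
            return INCLUDE.matcher(url).matches()
                    && !EXCLUDE.matcher(url).matches();
        }

        public static void main(String[] args) {
            System.out.println(accept("http://localhost/index.html")); // true
            System.out.println(accept("http://example.org/"));         // false
            System.out.println(accept("http://localhost/logo.png"));   // false
        }
    }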