A p a c h e D r o i d s
--------------------------
by Thorsten Scherler

+-----------------------------------------------------------+
| HEADSUP:                                                  |
| !!! Please ONLY crawl localhost NEVER an internet site!!! |
| The first implementation does not honor robots.txt rules! |
| DO NOT TRY ANY APACHE SITE!                               |
+-----------------------------------------------------------+

+-----------------------------------------------------------+
| HEADSUP:                                                  |
| The underlying API is in PoC state, meaning it is totally |
| open for discussion and HIGHLY likely to change without   |
| further notice.                                           |
|                                                           |
| The plugins implemented are likewise in PoC state.        |
|                                                           |
| If you are a brave soul, subscribe to                     |
| labs@labs.apache.org and let us know that you are         |
| developing with Droids. We will then try to stabilize     |
| the API with your feedback.                               |
+-----------------------------------------------------------+

What is this?
-------------
Droids aims to be an intelligent standalone robot framework that lets
you create robots as plugins which automatically seek out relevant
online information based on the user's specifications.

For the core I took Nutch and ripped out and modified its awesome
plugin/extension framework. Droids makes it very easy to extend an
existing robot or write a new one.

The first implementation is crawler-x-m02y07, a simple crawler that is
easily extensible via plugins. If a project or application needs
special processing for a crawled URL, you can either write a few
plugins for an existing crawler or write a new crawler, which is also
very easy.

Why was it created?
-------------------
Mainly out of personal curiosity. The background of this work is that
Cocoon trunk no longer provides a crawler, and Forrest is based on it,
meaning we cannot update until we find a crawler replacement. Having
become more involved in Solr and Nutch, I also see requests for a
generic standalone crawler.

What does the first implementation crawler-x-m02y07 look like?
--------------------------------------------------------------
I wrote some proof-of-concept plugins that make up crawler-x-m02y07 to
- crawl a URL
- extract links (the only extraction implemented at the moment) via a
  parse-html plugin
- merge the extracted links with the queue
- save or print out the crawled pages.
A sketch of this loop appears in the appendix at the end of this
README.

Why crawler-x-m02y07?
---------------------
Droids tries to be a framework for different droids. The first
implementation is a "crawler" with the name "x", first archived in the
second "m"onth of the "y"ear 20"07".

Requirements
************
* Apache Ant version 1.6.5
** copy ./tools/ivy/i
* JDK 1.5 or higher
** If using JDK 1.5 (which does not bundle the StAX API):
** cd lib/
** wget http://www.ibiblio.org/maven2/stax/stax-api/1.0/stax-api-1.0.jar

Running
*******
Build:
  ant

Set the initial URL:
  echo "droids.initial.url=http://localhost/index.html" > build.properties

Run:
  ant crawl

HEADSUP
*******
!!! Please ONLY crawl localhost NEVER an internet site!!!

The parse-html plugin assumes that the incoming stream is valid XML!

You will need to adjust the urlfilters to limit crawl loops (see the
filter sketch in the appendix).

Links
-----
http://lucene.apache.org/nutch/ - Nutch web-search software
http://www.robotstxt.org/wc/robots.html - The Web Robots Pages
http://www.few.vu.nl/~andreas/programming/webcrawler/index.html - How to write a multi-threaded webcrawler
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/ - Writing a Web Crawler in the Java Programming Language

Outro
-----
Hope you enjoy it. Please report feedback to the labs mailing list.

TIA
salu2
thorsten
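
Appendix: illustrative sketches (NOT the Droids API)
----------------------------------------------------
The snippets below are sketches only. Every class and method name in
them is invented for this README; none of it is the actual Droids API,
which (see the HEADSUP above) is completely in flux. The code sticks to
JDK 1.5 constructs to match the requirements above.

The first sketch shows the loop that crawler-x-m02y07's plugins
implement: take a URL from the queue, fetch it, extract links, merge
the unseen ones back into the queue, and save or print the page.

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Hypothetical crawl loop; all names invented for illustration. */
    public class CrawlLoopSketch {

        // Crude href extraction; a real parse-html plugin would do
        // this properly (see the next sketch).
        private static final Pattern HREF =
                Pattern.compile("href=\"(http://localhost[^\"]*)\"");

        public static void main(String[] args) throws Exception {
            // droids.initial.url, normally read from build.properties
            String initial = "http://localhost/index.html";

            Queue<String> queue = new LinkedList<String>();
            Set<String> seen = new HashSet<String>();
            queue.add(initial);
            seen.add(initial);

            while (!queue.isEmpty()) {
                String url = queue.poll();

                // 1. crawl the URL
                String page = fetch(url);

                // 2. extract links
                List<String> links = extractLinks(page);

                // 3. merge unseen links with the queue (a real crawler
                //    would also apply its urlfilters here; see the
                //    last sketch in this appendix)
                for (String link : links) {
                    if (seen.add(link)) {
                        queue.add(link);
                    }
                }

                // 4. save or print out the crawled page
                System.out.println("crawled " + url
                        + " (" + page.length() + " chars)");
            }
        }

        private static String fetch(String url) throws Exception {
            InputStream in = new URL(url).openStream();
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toString("UTF-8");
            } finally {
                in.close();
            }
        }

        private static List<String> extractLinks(String page) {
            List<String> links = new LinkedList<String>();
            Matcher m = HREF.matcher(page);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        }
    }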
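
The HEADSUP section notes that the parse-html plugin assumes valid XML.
The next sketch shows why such a constraint arises: if link extraction
is built on a strict XML parser, anything that is not well-formed makes
the parse fail. The class is again hypothetical; only the JDK parser
calls are real.

    import java.io.ByteArrayInputStream;
    import java.util.LinkedList;
    import java.util.List;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    /** Hypothetical parse-html style step on a strict XML parser. */
    public class LinkExtractorSketch {

        public static List<String> extractLinks(byte[] page)
                throws Exception {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Throws a SAXException on anything that is not
            // well-formed XML -- hence the "valid XML" warning in the
            // HEADSUP section above.
            Document doc = builder.parse(new ByteArrayInputStream(page));

            List<String> links = new LinkedList<String>();
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                String href = a.getAttribute("href");
                if (href.length() > 0) {
                    links.add(href);
                }
            }
            return links;
        }

        public static void main(String[] args) throws Exception {
            String xhtml = "<html><body>"
                    + "<a href=\"http://localhost/a.html\">a</a>"
                    + "<a href=\"http://localhost/b.html\">b</a>"
                    + "</body></html>";
            System.out.println(extractLinks(xhtml.getBytes("UTF-8")));
        }
    }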
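
Finally, a guess at what adjusting the urlfilters could amount to: an
include/exclude pair of regular expressions, in the spirit of Nutch's
regex URL filters, that keeps the crawl on localhost and skips URLs
likely to produce loops. The patterns are examples to adapt, not
shipped defaults.

    import java.util.regex.Pattern;

    /** Hypothetical regex URL filter; adjust patterns to your site. */
    public class UrlFilterSketch {

        // Only follow plain http URLs on localhost.
        private static final Pattern INCLUDE =
                Pattern.compile("^http://localhost(/.*)?$");

        // Skip binary payloads and query strings -- a cheap way to
        // dodge calendar-style infinite loops.
        private static final Pattern EXCLUDE =
                Pattern.compile(".*(\\.(gif|jpg|png|css|js|zip|gz)|\\?.*)$");

        public static boolean accept(String url) {
            return INCLUDE.matcher(url).matches()
                    && !EXCLUDE.matcher(url).matches();
        }

        public static void main(String[] args) {
            System.out.println(accept("http://localhost/index.html")); // true
            System.out.println(accept("http://example.org/"));         // false
            System.out.println(accept("http://localhost/logo.png"));   // false
        }
    }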