Jakarta: SCRAPE JSP Tag Library

Version 1.0

Table of Contents

Overview
Requirements
Configuration
Tag Summary
Tag Reference
Examples
Javadocs
Revision History

Overview

The scrape tag library can scrape or extract content from web documents and display the content in your JSP. For example, you could scrape stock quotes from other web sites and display them in your pages.

After your JSP scrapes a document for the first time, the results of the scrape are cached for subsequent JSP requests. These results are returned unless the JSP determines that the document must be rescraped. Rescraping is determined by the following logic:

  1. The status of the scrape tags and attributes in the JSP is examined. Any modifications to the tags or attributes trigger a rescrape. If the tags have not been modified, the JSP proceeds to step 2.
  2. The minimum time for rescraping, specified by the time attribute of the page tag, is examined. The default time is 10 minutes. If this time has not passed since the last scrape, cached results are returned. If this time has passed, the JSP proceeds to step 3.
  3. The Expires header of the scraped document is examined. If the expiration date/time has not passed, cached results are returned. If the expiration date/time is not specified or the document has expired, the JSP proceeds to step 4.
  4. The headers for the scraped document are requested and examined. If the document has not been modified since the last scrape, cached results are returned. If the document has been modified, it is rescraped and the new results are returned.
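
The four steps above can be sketched as a single decision function. This is a minimal illustration, not the library's actual implementation: the class RescrapePolicy, the Decision enum, and the decide method are hypothetical names. It also assumes that a time value below the 10-minute minimum is raised to the minimum, and that a document whose modification time is unknown is treated as modified.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical model of the four-step rescrape decision; not the library's own classes. */
final class RescrapePolicy {

    enum Decision { USE_CACHE, RESCRAPE }

    /** Minimum (and default) rescrape interval stated in the documentation. */
    static final long MIN_MINUTES = 10;

    /**
     * @param tagsModified true if the scrape tags or attributes changed since the last scrape
     * @param lastScrape   when the document was last scraped
     * @param timeMinutes  value of the page tag's time attribute
     * @param expires      expiration date/time of the document, or null if not specified
     * @param lastModified modification time reported by the server, or null if unknown
     * @param now          current time
     */
    static Decision decide(boolean tagsModified, Instant lastScrape, long timeMinutes,
                           Instant expires, Instant lastModified, Instant now) {
        // Step 1: any change to the tags or attributes forces a rescrape.
        if (tagsModified) {
            return Decision.RESCRAPE;
        }
        // Step 2: inside the minimum interval, serve the cached results.
        // Assumption: values below the minimum are raised to it.
        long minutes = Math.max(timeMinutes, MIN_MINUTES);
        if (now.isBefore(lastScrape.plus(Duration.ofMinutes(minutes)))) {
            return Decision.USE_CACHE;
        }
        // Step 3: an expiration time that has not yet passed keeps the cache valid.
        if (expires != null && now.isBefore(expires)) {
            return Decision.USE_CACHE;
        }
        // Step 4: rescrape only if the document changed since the last scrape.
        // Assumption: an unknown modification time is treated as "modified".
        if (lastModified != null && !lastModified.isAfter(lastScrape)) {
            return Decision.USE_CACHE;
        }
        return Decision.RESCRAPE;
    }

    public static void main(String[] args) {
        Instant last = Instant.parse("2001-01-01T00:00:00Z");
        Instant now = last.plus(Duration.ofMinutes(5));
        // Five minutes after the last scrape with time="20": still inside the interval.
        System.out.println(decide(false, last, 20, null, null, now));
    }
}
```

Ordering the checks this way keeps the cheap, local tests (tag changes, elapsed time) ahead of the one that needs a network request for the document headers.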

Requirements

This custom tag library requires no software other than a servlet container that supports the JavaServer Pages Specification, version 1.1.

In addition to the scrape.jar file, you must also have the jakarta-oro-2.0.2-dev-2.jar file. These files are included in the tag library download.

Configuration

To configure your web application with this tag library and use its tags in your JSP pages, add the following directive at the top of each page:

<%@ taglib uri="scrape.jar" prefix="scrp" %>

where "scrp" is the tag name prefix you wish to use for tags from this library. You can change this value to any prefix you like. The prefix scrp is used in the examples below.

Tag Summary

Scrape Tags

  page    Specify the URL of the document to be scraped and the minimum time that must pass before the document is rescraped.
  url     Use this alternate tag to dynamically specify the URL of the document to be scraped.
  scrape  Specify the text anchors that mark the beginning and end of the content to be scraped.
  result  Retrieve the content from a scrape.

Tag Reference

page (Availability: version 1.0)

Specify the URL of the document to be scraped and the minimum time that must pass before the document is rescraped.

Tag Body: JSP
Script Variable: No
Restrictions: None
Attributes:

  url (Required: No; Runtime Expression Evaluation: No)
  The fully qualified URL of the document to be scraped, such as:

  http://<domain.name/directory/document.html>

  If you must generate the URL dynamically, for example with tags from a different tag library, omit the url attribute from the page tag and use the url tag instead.

  time (Required: No; Runtime Expression Evaluation: No)
  The minimum time, in minutes, that must pass before the JSP attempts to rescrape the document. The minimum value is 10 minutes, which is also the value used when the time attribute is not specified.

Properties: None
Example:
<%-- 
  Specify a document to be scraped with a rescrape time of 20 minutes
  Note that a scrape tag must be nested within the body of the page tag
--%>
<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
   <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true" />
</scrp:page>  <%-- close the page tag --%>

url (Availability: version 1.0)

Specify the URL of the document that contains the content to be scraped. Use this tag as an alternative to the page tag's url attribute when the URL must be generated dynamically.

Tag Body: JSP
Script Variable: No
Restrictions: Must be nested within a page tag.
Attributes: None
Properties: None
Example:
<%-- 
  Specify a document to be scraped
  Note that a url tag must be nested within the body of the page tag
--%>
<scrp:page>
   <scrp:url>http://finance.yahoo.com/q?s=SUNW</scrp:url>
   <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true" />
</scrp:page>  <%-- close the page tag --%>
<%--
  It is possible to use another tag set nested within
  the url tag to dynamically generate the URL.
--%>

scrape (Availability: version 1.0)

Specify the text anchors that mark the beginning and end of the content to be scraped.

Tag Body: Empty
Script Variable: Yes. The id is available from the beginning of this tag to the end of the page.
Restrictions: Must be nested within a page tag.
Attributes:

  id (Required: Yes; Runtime Expression Evaluation: No)
  A unique identifier that distinguishes this scrape from all others. The result of each scrape is accessible only through this id.

  begin (Required: Yes; Runtime Expression Evaluation: No)
  The text anchor that marks the beginning of the content to be scraped from the document.

  end (Required: Yes; Runtime Expression Evaluation: No)
  The text anchor that marks the end of the content to be scraped from the document.

  strip (Required: No; Runtime Expression Evaluation: No)
  If strip is set to true, markup tags (HTML, XML, DHTML, and so on) are removed from the output of the result tag. That is, nothing enclosed in < > is included in the scrape result. The default value is false. The strip attribute can be used in conjunction with the anchors attribute.

  anchors (Required: No; Runtime Expression Evaluation: No)
  If anchors is set to true, the begin and end text anchors are included in the scrape result. The default value is false. The anchors attribute can be used in conjunction with the strip attribute.

Properties: None
Example:
<%-- 
  Set a scrape on a page with anchors included
  Note that the page tag is first and the 
  scrape tag is nested  
--%>
<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
  <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true" />
</scrp:page>  <%-- close the page tag --%>

<%-- Set a scrape on a page with markup tags stripped from the result --%>
<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
  <scrp:scrape id="qt" begin="<table border=1" end="</table>" strip="true"/>
</scrp:page>  <%-- close the page tag --%>
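
The effect of the begin, end, anchors, and strip attributes can be illustrated with a small stand-alone sketch. It is not the library's code: the class AnchorScrape and its extract method are hypothetical, and the sketch assumes that anchors are matched as plain substrings, that the first occurrence of each anchor is used, that stripping happens after the anchors are applied, and that a missing anchor yields an empty result.

```java
/** Hypothetical illustration of anchor-based extraction; not the library's own classes. */
final class AnchorScrape {

    /**
     * Returns the text between the begin and end anchors of a document.
     *
     * @param anchors if true, the anchors themselves are kept in the result
     * @param strip   if true, everything enclosed in < > is removed from the result
     */
    static String extract(String doc, String begin, String end,
                          boolean anchors, boolean strip) {
        int b = doc.indexOf(begin);
        if (b < 0) {
            return ""; // assumption: no begin anchor means no result
        }
        int e = doc.indexOf(end, b + begin.length());
        if (e < 0) {
            return ""; // assumption: no end anchor means no result
        }
        String result = anchors
                ? doc.substring(b, e + end.length())      // keep both anchors
                : doc.substring(b + begin.length(), e);   // content between anchors only
        if (strip) {
            // Remove anything that looks like a markup tag.
            result = result.replaceAll("<[^>]*>", "");
        }
        return result;
    }

    public static void main(String[] args) {
        String doc = "<p>x</p><table border=1><tr><td>SUNW 12.5</td></tr></table><p>y</p>";
        System.out.println(extract(doc, "<table border=1", "</table>", true, false));
        System.out.println(extract(doc, "<table border=1", "</table>", true, true));
    }
}
```

Under these assumptions, a begin anchor that ends partway through a tag, such as <table border=1, leaves the remainder of that tag in the result when anchors is false. Setting both anchors and strip to true removes the whole tag, which is one reason the two attributes may be worth combining.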

result (Availability: version 1.0)

Retrieve the content from a scrape.

Tag Body: Empty
Script Variable: No
Restrictions: None
Attributes:

  scrape (Required: Yes; Runtime Expression Evaluation: No)
  The id specified for a previous scrape tag.

Properties: None
Example:
<%-- get the results of a previously performed scrape --%>
<scrp:result scrape="qt"/>

Examples

See scrape-examples.war for examples that use tags from this custom tag library.

Javadocs

Java programmers can view the Java class documentation for this tag library as Javadocs.

Revision History

Review the complete revision history of this tag library.