Jakarta: SCRAPE JSP Tag Library

Version 1.0

Overview
Requirements
Configuration
Tag Summary
Tag Reference
Examples
Javadocs
Revision History

Overview

The scrape tag library can scrape or extract content from web documents and display the content in your JSP. For example, you could scrape stock quotes from other web sites and display them in your pages.

After your JSP scrapes a document for the first time, the results of the scrape are cached for subsequent JSP requests. These results are returned unless the JSP determines that the document must be rescraped. Rescraping is determined by the following logic:

1. The status of the scrape tags and attributes in the JSP is examined. Any modification to the tags or attributes triggers a rescrape. If the tags have not been modified, the JSP proceeds to step 2.
2. The minimum time for rescraping, specified by the time attribute of the page tag, is examined. The default time is 10 minutes. If this time has not passed since the last scrape, cached results are returned. If this time has passed, the JSP proceeds to step 3.
3. The Expires header of the scraped document is examined. If the expiration date/time has not passed, cached results are returned. If the expiration date/time is not specified or the document has expired, the JSP proceeds to step 4.
4. The headers for the scraped document are requested and examined. If the document has not been modified since the last scrape, cached results are returned. If the document has been modified, it is rescraped and the new results are returned.
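
The four checks above can be sketched in Java roughly as follows. This is an illustration only, not the tag library's actual implementation. The class RescrapeDecision, its CacheState holder, and the lastModifiedLookup supplier (which stands in for the header request in the final step) are hypothetical names. Treating an unknown modification time as "modified" is also an assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical sketch of the rescrape decision; not the tag library's own code. */
final class RescrapeDecision {

    /** Hypothetical holder for what the JSP knows about its cached scrape. */
    static final class CacheState {
        final boolean tagsModified;   // step 1: were the scrape tags or attributes edited?
        final Instant lastScrape;     // time of the most recent scrape
        final Duration minInterval;   // step 2: the page tag's time attribute
        final Instant expires;        // step 3: expiration from the cached headers, null if absent

        CacheState(boolean tagsModified, Instant lastScrape, Duration minInterval, Instant expires) {
            this.tagsModified = tagsModified;
            this.lastScrape = lastScrape;
            this.minInterval = minInterval;
            this.expires = expires;
        }
    }

    /**
     * Returns true if the document should be rescraped, false if cached results can be used.
     * lastModifiedLookup is only called when the first three steps do not settle the question.
     */
    static boolean mustRescrape(CacheState s, Instant now, Supplier<Instant> lastModifiedLookup) {
        if (s.tagsModified) {
            return true;                                   // step 1: tags changed
        }
        if (now.isBefore(s.lastScrape.plus(s.minInterval))) {
            return false;                                  // step 2: minimum time not yet passed
        }
        if (s.expires != null && now.isBefore(s.expires)) {
            return false;                                  // step 3: document not yet expired
        }
        Instant lastModified = lastModifiedLookup.get();   // step 4: ask for fresh headers
        // Assumption: an unknown modification time is treated as "modified".
        return lastModified == null || lastModified.isAfter(s.lastScrape);
    }
}
```

Passing the header lookup as a supplier keeps the network request out of the common path: it runs only when the tag check, the minimum time, and the expiration check have all failed to produce an answer.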

Requirements

This custom tag library requires no software other than a servlet container that supports the JavaServer Pages Specification, version 1.1.

In addition to the scrape.jar file, you must also have the jakarta-oro-2.0.2-dev-2.jar file. These files are included in the tag library download.

Configuration

Follow these steps to configure your web application with this tag library:

1. Copy the tag library descriptor file (scrape/scrape.tld) to the /WEB-INF subdirectory of your web application.
2. Copy the tag library JAR file (scrape/scrape.jar) to the /WEB-INF/lib subdirectory of your web application.
3. Copy the jakarta oro JAR file (scrape/jakarta-oro-{version}.jar) to the /WEB-INF/lib subdirectory of your web application.

Add a <taglib> element to your web application deployment descriptor in /WEB-INF/web.xml like this:

<taglib>
  <taglib-uri>scrape.jar</taglib-uri>
  <taglib-location>/WEB-INF/scrape.tld</taglib-location>
</taglib>

To use the tags from this library in your JSP pages, add the following directive at the top of each page:

<%@ taglib uri="scrape.jar" prefix="scrp" %>

where "scrp" is the tag name prefix you wish to use for tags from this library. You can change this value to any prefix you like. The prefix scrp is used in the examples below.

Tag Summary

Scrape Tags
page	Specify the URL of the document to be scraped and the minimum time that must pass before the document is rescraped.
url	Use this alternate tag to dynamically specify the URL of the document to be scraped.
scrape	Specify the text anchors that mark the beginning and end of the content to be scraped.
result	Retrieve the content from a scrape.

Tag Reference

page

Availability: version 1.0

Specify the URL of the document to be scraped and the minimum time that must pass before the document is rescraped.

Tag Body

JSP

Script Variable

Restrictions

None

Attributes

Name	Required	Runtime Expression Evaluation
url	No	No
The fully qualified URL of the document to be scraped, such as http://<domain.name>/directory/document.html. If you must generate the URL dynamically, perhaps with tags from a different tag library, omit the url attribute of the page tag and use the url tag instead.
time	No	No
The length of time, in minutes, that the JSP waits before attempting to rescrape the document. The minimum value is 10 minutes, which is also used when no time attribute is specified.
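
One way to picture the handling of the time attribute is the short Java sketch below. It is not taken from the tag library: the class RescrapeTime and the method effectiveMinutes are hypothetical. Raising values below 10 to the minimum is an assumption, since the text states the minimum but not how smaller values are handled.

```java
/** Hypothetical sketch of how a time attribute value could be resolved. */
final class RescrapeTime {

    /** Minimum (and default) rescrape time in minutes, as stated in the attribute description. */
    static final int MINIMUM_MINUTES = 10;

    /**
     * Resolves the raw attribute text to a number of minutes.
     * A missing or blank attribute yields the minimum.
     * Assumption: values below the minimum are raised to it.
     * Non-numeric text causes a NumberFormatException.
     */
    static int effectiveMinutes(String timeAttribute) {
        if (timeAttribute == null || timeAttribute.trim().isEmpty()) {
            return MINIMUM_MINUTES;
        }
        int requested = Integer.parseInt(timeAttribute.trim());
        return Math.max(requested, MINIMUM_MINUTES);
    }
}
```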

Properties

None

Example

<%-- 
  Specify a document to be scraped with a rescrape time of 20 minutes
  Note that a scrape tag must be nested within the body of the page tag
--%>
<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
   <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true" />
</scrp:page>  <%-- close the page tag --%>

url

Availability: version 1.0

Specify the URL of the document that contains the content to be scraped. Use this tag as an alternative to the page tag's url attribute when the URL must be generated dynamically.

Tag Body	JSP
Script Variable	No
Restrictions	Tag must be nested within a page tag
Attributes	None
Properties	None
Example

<%--
  Specify a document to be scraped
  Note that a url tag must be nested within the body of the page tag
--%>
<scrp:page>
  <scrp:url>http://finance.yahoo.com/q?s=SUNW</scrp:url>
  <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true" />
</scrp:page>  <%-- close the page tag --%>

<%--
  It is possible to use another tag set nested within the url tag
  to dynamically generate the URL.
--%>

scrape

Availability: version 1.0

Specify the text anchors that mark the beginning and end of the content to be scraped.

Tag Body

Empty

Script Variable

Yes. The variable named by id is available from the beginning of this tag to the end of the page.

Restrictions

Must be nested within the page tag.

Attributes

Name	Required	Runtime Expression Evaluation
id	Yes	No
A unique identifier that distinguishes this scrape from all others. Each scrape is unique and accessible only by this id.
begin	Yes	No
The text anchor that marks the beginning of the content to be scraped from the document.
end	Yes	No
The text anchor that marks the end of the content to be scraped from the document.
strip	No	No
If strip is set to true, the output from the result tag is stripped of markup tags (HTML, XML, DHTML, and so on); nothing enclosed in < > is included in the scrape result. The default value is false. Note that strip can be used in conjunction with the anchors attribute.
anchors	No	No
If anchors is set to true, the begin and end text anchors are included in the scrape result. The default value is false. Note that anchors can be used in conjunction with the strip attribute.
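
The effect of the begin, end, anchors, and strip attributes can be illustrated with the Java sketch below. It is a simplified stand-in, not the library's code: the class AnchorScrape and the method extract are hypothetical, and the sketch uses java.util.regex (through String.replaceAll) rather than the Jakarta ORO library that ships with the tags. It also assumes that the anchors are matched as literal text and that a missing anchor yields an empty result.

```java
/** Hypothetical sketch of anchor-based extraction; not the tag library's own code. */
final class AnchorScrape {

    /**
     * Returns the text found between the first occurrence of begin and the next occurrence of end.
     * Assumption: both anchors are literal text, and a missing anchor gives an empty string.
     */
    static String extract(String document, String begin, String end,
                          boolean anchors, boolean strip) {
        int start = document.indexOf(begin);
        if (start < 0) {
            return "";
        }
        int stop = document.indexOf(end, start + begin.length());
        if (stop < 0) {
            return "";
        }
        // anchors=true keeps the begin and end text in the result.
        String result = anchors
                ? document.substring(start, stop + end.length())
                : document.substring(start + begin.length(), stop);
        if (strip) {
            // strip=true removes everything enclosed in < >.
            result = result.replaceAll("<[^>]*>", "");
        }
        return result;
    }
}
```

In this sketch, a begin anchor that ends partway through a tag, such as <table border=1, leaves the rest of that tag at the start of the result when anchors is false. Stripping then cannot recognize it as a complete tag. Setting anchors to true alongside strip avoids this here, because the whole tag is present before stripping.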

Properties

None

Example

<%-- 
  Set a scrape on a page with anchors included
  Note that the page tag is first and the 
  scrape tag is nested  
--%>
<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
  <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true" />
</scrp:page>  <%-- close the page tag --%>

<%-- Set a scrape on a page with markup tags stripped from the result --%>
<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
  <scrp:scrape id="qt" begin="<table border=1" end="</table>" strip="true"/>
</scrp:page>  <%-- close the page tag --%>

result

Availability: version 1.0

Retrieve the content from a scrape.

Tag Body

Empty

Script Variable

Restrictions

None

Attributes

Name	Required	Runtime Expression Evaluation
scrape	Yes	No
The id specified for a previous scrape tag.

Properties

None

Example

<%-- get the results of a previously performed scrape --%>
<scrp:result scrape="qt"/>

Examples

See scrape-examples.war for examples that use tags from this custom tag library.

Javadocs

Java programmers can view the Java class documentation for this tag library in the Javadocs.

Revision History

Review the complete revision history of this tag library.
