Jakarta: SCRAPE JSP Tag Library
Version 1.0
Table of Contents
Overview
Requirements
Configuration
Tag Summary
Tag Reference
Examples
Javadocs
Revision History
Overview
The scrape tag library can scrape or extract content from web
documents and display the content in your JSP. For example,
you could scrape stock quotes from other web sites
and display them in your pages.
After your JSP scrapes a document for the first time,
the results of the scrape are cached for subsequent JSP requests.
These results are returned unless the JSP determines that
the document must be rescraped. Rescraping is determined by the following logic:
1. The status of the scrape tags and attributes in the JSP is examined. Any
modifications to the tags or attributes trigger a rescrape. If the tags
have not been modified, the JSP proceeds to step 2.
2. The minimum time for rescraping, specified by the time attribute of
the page tag, is examined. The default time is 10 minutes.
If this time has not passed since the last scrape, cached results are returned.
If this time has passed, the JSP proceeds to step 3.
3. The Expires header of the scraped document is examined. If
the expiration date/time has not passed, cached
results are returned. If the expiration date/time
is not specified or the document has expired, the JSP proceeds to step 4.
4. The headers for the scraped document are requested and examined. If the
document has not been modified since the last scrape, cached results
are returned. If the document has been modified, it is rescraped and
the new results are returned.
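The four checks above form a short-circuit cascade: each step either settles the outcome or hands over to the next. The Java sketch below is one way to express that order. It is not the library's implementation. The class RescrapePolicy, its decide method, and all parameter names are hypothetical, and treating a document whose modification time is unknown as modified is an assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical sketch of the four-step rescrape check; not the library's actual code. */
final class RescrapePolicy {

    /** The two possible outcomes of the check. */
    enum Decision { USE_CACHE, RESCRAPE }

    /**
     * @param tagsModified       step 1: scrape tags or attributes changed in the JSP
     * @param lastScrape         time of the last scrape
     * @param minInterval        step 2: value of the page tag's time attribute
     * @param expires            step 3: expiration time of the document, or null if not specified
     * @param lastModifiedLookup step 4: fetches the document's modification time (may return null)
     * @param now                current time
     */
    static Decision decide(boolean tagsModified,
                           Instant lastScrape,
                           Duration minInterval,
                           Instant expires,
                           Supplier<Instant> lastModifiedLookup,
                           Instant now) {
        // Step 1: any change to the tags or attributes forces a rescrape.
        if (tagsModified) {
            return Decision.RESCRAPE;
        }
        // Step 2: the minimum rescrape time has not passed yet.
        if (now.isBefore(lastScrape.plus(minInterval))) {
            return Decision.USE_CACHE;
        }
        // Step 3: an expiration time that is still in the future keeps the cache valid.
        if (expires != null && now.isBefore(expires)) {
            return Decision.USE_CACHE;
        }
        // Step 4: only now are the document headers requested.
        Instant lastModified = lastModifiedLookup.get();
        if (lastModified != null && !lastModified.isAfter(lastScrape)) {
            return Decision.USE_CACHE;
        }
        // Assumption: an unknown modification time is treated as "modified".
        return Decision.RESCRAPE;
    }
}
```

Passing the modification time as a Supplier keeps the header request of step 4 from being issued when an earlier step already settles the outcome.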
Requirements
This custom tag library requires a servlet container
that supports the JavaServer Pages Specification, version 1.1.
In addition to the scrape.jar file, it needs the
jakarta-oro-2.0.2-dev-2.jar file. Both files are included in the tag
library download.
Configuration
Follow these steps to configure your web application with this tag library:
To use the tags from this library in your JSP pages, add the following
directive at the top of each page:
<%@ taglib uri="scrape.jar" prefix="scrp" %>
where "scrp" is the tag name prefix you wish to use for
tags from this library. You can change this value to any prefix you like.
The prefix scrp is used in the examples below.
Tag Summary
Scrape Tags
page | Specify the URL of the document to be scraped and the minimum time that must pass before the document is rescraped.
url | Use this alternate tag to dynamically specify the URL of the document to be scraped.
scrape | Specify the text anchors that mark the beginning and end of the content to be scraped.
result | Retrieve the content from a scrape.
Tag Reference
page
Availability: version 1.0
Specify the URL of the document to be scraped
and the minimum time that must pass before the document is rescraped.
Tag Body | JSP
Script Variable | No
Restrictions | None
Attributes
Name | Required | Runtime Expression Evaluation | Description
url | No | No | The fully qualified URL of the document that is to be scraped, such as: http://<domain.name/directory/document.html> Note that if you must dynamically generate the URL, perhaps via a set of tags from a different tag library, you can omit the url attribute in the page tag and instead use the url tag.
time | No | No | The length of time the JSP waits before attempting to rescrape the document. The value of time is specified in minutes. The minimum value is 10 minutes. Note that the minimum value is used if a time attribute is not specified.
Properties | None
Example
<%--
Specify a document to be scraped with a rescrape time of 20 minutes
Note that a scrape tag must be nested within the body of the page tag
--%>
<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
<scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true" />
</scrp:page> <%-- close the page tag --%>
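The handling of the time attribute described above can be sketched in Java as follows. This is an illustration only. The class RescrapeTime and the method effectiveMinutes are hypothetical names, and raising values below 10 up to the minimum is an assumption, since the text only states that 10 minutes is the minimum and the default.

```java
/** Hypothetical sketch of how a time attribute could be interpreted; not the library's code. */
final class RescrapeTime {

    /** Minimum and default rescrape time in minutes, as stated for the page tag. */
    static final long MIN_MINUTES = 10;

    /**
     * Returns the rescrape time in minutes for a raw attribute value.
     * A missing or blank attribute yields the minimum.
     * Throws NumberFormatException if the value is not a whole number.
     */
    static long effectiveMinutes(String timeAttribute) {
        if (timeAttribute == null || timeAttribute.trim().isEmpty()) {
            return MIN_MINUTES;
        }
        long minutes = Long.parseLong(timeAttribute.trim());
        // Assumption: values below the minimum are raised to it rather than rejected.
        return Math.max(minutes, MIN_MINUTES);
    }
}
```

Under these assumptions, time="20" in the example above gives a 20 minute interval, while an omitted attribute gives 10 minutes.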
url
Availability: version 1.0
Specify the URL of the document that contains the content to
be scraped. Use this tag as an alternative to the page tag's url attribute
when the URL must be generated dynamically.
Tag Body | JSP
Script Variable | No
Restrictions | Tag must be nested within a page tag
Attributes | None
Properties | None
Example
<%--
Specify a document to be scraped
Note that a url tag must be nested within the body of the page tag
--%>
<scrp:page>
<scrp:url>http://finance.yahoo.com/q?s=SUNW</scrp:url>
<scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true" />
</scrp:page> <%-- close the page tag --%>
<%--
It is possible to use another tag set nested within
the url tag to dynamically generate the URL.
--%>
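Because the URL can come either from the page tag's url attribute or from the body of a nested url tag, a tag handler has to choose one and check that it is fully qualified. The Java sketch below shows one possible approach. ScrapeUrlResolver and resolve are hypothetical names, not part of the library. Preferring the attribute when both are present, and reading "fully qualified" as "has a scheme and a host", are assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper for choosing the URL to scrape; not part of the scrape tag library. */
final class ScrapeUrlResolver {

    /**
     * Picks the URL from the page tag's url attribute or, when that is
     * absent, from the body of a nested url tag.
     */
    static URI resolve(String urlAttribute, String urlTagBody) {
        // Assumption: the attribute wins if both sources are supplied.
        String raw = (urlAttribute != null) ? urlAttribute : urlTagBody;
        if (raw == null || raw.trim().isEmpty()) {
            throw new IllegalArgumentException("no URL given by url attribute or url tag");
        }
        URI uri;
        try {
            // Tag bodies often carry surrounding whitespace, so trim before parsing.
            uri = new URI(raw.trim());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("malformed URL: " + raw.trim(), e);
        }
        // "Fully qualified" is taken here to mean a scheme plus a host.
        if (!uri.isAbsolute() || uri.getHost() == null) {
            throw new IllegalArgumentException("URL is not fully qualified: " + uri);
        }
        return uri;
    }
}
```

Trimming matters for the url tag in particular, since dynamically generated body content may be surrounded by line breaks and indentation.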
scrape
Availability: version 1.0
Specify the text anchors that mark the
beginning and end of the content to be scraped.
Tag Body | Empty
Script Variable | Yes, id exists from the beginning of this tag to the end of the page.
Restrictions | Must be nested within the page tag.
Attributes
Name | Required | Runtime Expression Evaluation | Description
id | Yes | No | A unique identifier that distinguishes this scrape from all others. Each scrape is unique and accessible only by this id.
begin | Yes | No | The text anchor that marks the beginning of the content to be scraped from the document.
end | Yes | No | The text anchor that marks the end of the content to be scraped from the document.
strip | No | No | If strip is set to true, the output from the result tag is stripped of HTML, XML, DHTML, etc. tags. That is, nothing within < > will be included in the scrape result. The default value is false. Note that strip can be used in conjunction with the anchors attribute.
anchors | No | No | If anchors is set to true, the begin and end text anchors are included in the scrape result. The default value is false. Note that anchors can be used in conjunction with the strip attribute.
Properties | None
Example
<%--
Set a scrape on a page with anchors included
Note that the page tag is first and the
scrape tag is nested
--%>
<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
<scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true" />
</scrp:page> <%-- close the page tag --%>
<%-- Set a scrape on a page with all tags stripped from the result --%>
<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
<scrp:scrape id="qt" begin="<table border=1" end="</table>" strip="true"/>
</scrp:page> <%-- close the page tag --%>
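To make the interplay of begin, end, anchors, and strip concrete, here is a minimal Java sketch of anchor-based extraction. It is not the library's implementation. AnchorScraper and its scrape method are hypothetical. The sketch assumes the anchors are matched as literal text, that the first occurrence of each anchor is used, that a missing anchor yields null, and that stripping is applied after the anchors are included or dropped.

```java
/** Hypothetical sketch of anchor-based extraction; not the library's actual code. */
final class AnchorScraper {

    /**
     * Returns the text between the first occurrence of begin and the next
     * occurrence of end, or null if either anchor is not found.
     *
     * @param anchors if true, the begin and end anchors are kept in the result
     * @param strip   if true, everything between < and > is removed from the result
     */
    static String scrape(String document, String begin, String end,
                         boolean anchors, boolean strip) {
        int start = document.indexOf(begin);
        if (start < 0) {
            return null;
        }
        int contentStart = start + begin.length();
        // Search for the end anchor only after the begin anchor.
        int stop = document.indexOf(end, contentStart);
        if (stop < 0) {
            return null;
        }
        String result = anchors
                ? document.substring(start, stop + end.length())
                : document.substring(contentStart, stop);
        if (strip) {
            // Drop every <...> sequence, leaving only the text between tags.
            result = result.replaceAll("<[^>]*>", "");
        }
        return result;
    }
}
```

With both flags set, the anchors are first kept and then removed again wherever they are themselves tags, so the two attributes can be combined without conflict in this sketch.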
result
Availability: version 1.0
Retrieve the content from a scrape.
Tag Body | Empty
Script Variable | No
Restrictions | None
Attributes
Name | Required | Runtime Expression Evaluation | Description
scrape | Yes | No | The id specified for a previous scrape tag.
Properties | None
Example
<%-- get the results of a previously performed scrape --%>
<scrp:result scrape="qt"/>
Examples
See scrape-examples.war for examples that use tags from this custom tag library.
Javadocs
Java programmers can view the Java class documentation for this tag library
as Javadocs.
Revision History
Review the complete revision history of this tag
library.