This chapter describes how to develop an application using the Unstructured Information Management Architecture (UIMA). The term application describes a program that provides end-user functionality. A UIMA application incorporates one or more UIMA components such as Analysis Engines, Collection Processing Engines, a Search Engine, and/or a Document Store and adds application-specific logic and user interfaces.
An application developer's starting point for accessing UIMA framework functionality is the com.ibm.uima.UIMAFramework class. The following is a short introduction to some important methods on this class. Several of these methods are used in examples in the rest of this chapter. For more details, see the JavaDocs (in the docs/api directory of the UIMA SDK).
This section describes how to add analysis capability to your application by using Analysis Engines developed using the UIMA SDK. An Analysis Engine (AE) is a component that analyzes artifacts (e.g. documents) and infers information about them.
An Analysis Engine consists of two parts: Java classes (typically packaged as one or more JAR files) and AE descriptors (one or more XML files). You must put the Java classes in your application’s class path, but thereafter you will not need to interact with them directly. The UIMA framework insulates you from this by providing a standard AnalysisEngine interface.
The term Text Analysis Engine (TAE) is sometimes used to describe an Analysis Engine that analyzes a text document. In the UIMA SDK v1.x, there was a TextAnalysisEngine interface that was commonly used. However, as of the UIMA SDK v2.0, this interface has been deprecated and all applications should switch to using the standard AnalysisEngine interface.
The AE descriptor XML files contain the configuration settings for the Analysis Engine as well as a description of the AE’s input and output requirements. You may need to edit these files in order to configure the AE appropriately for your application - the supplier of the AE may have provided documentation (or comments in the XML descriptor itself) about how to do this.
The following code shows how to instantiate an AE from its XML descriptor:
{
  //get Resource Specifier from XML file or PEAR
  XMLInputSource in = new XMLInputSource("MyDescriptor.xml");
  ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(in);

  //create AE here
  AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
}
The first two lines parse the XML descriptor (for AEs with multiple descriptor files, one of them is the "main" descriptor; the AE documentation should indicate which it is). The result of the parse is a ResourceSpecifier object. The third line of code invokes a static factory method, UIMAFramework.produceAnalysisEngine, which takes the specifier and instantiates an AnalysisEngine object.
There is one caveat to using this approach: the Analysis Engine instance that you create will not support multiple threads running through it concurrently. If you need to support this, see section 6.2.6.
There are two ways to use the AE interface to analyze documents. You can either use the JCas interface, which is described in detail in Chapter 27, JCas Reference, or you can directly use the CAS interface, which is described in detail in Chapter 26, CAS Reference. Besides text documents, other kinds of artifacts can also be analyzed; see Chapter 8, Annotations, Artifacts, and Sofas for more information.
The basic structure of your application will look similar in both cases:
Using the JCas
{
  //create a JCas, given an Analysis Engine (ae)
  JCas jcas = ae.newDefaultTextJCas();

  // this is shorthand for the following steps:
  // CAS aCas = ae.newCAS();
  // CAS aCasView = aCas.createDefaultTextView();
  // JCas jcas = aCasView.createJCas();

  //analyze a document
  jcas.setDocumentText(doc1text);
  ae.process(jcas);
  doSomethingWithResults(jcas);
  jcas.reset();

  //analyze another document
  jcas.setDocumentText(doc2text);
  ae.process(jcas);
  doSomethingWithResults(jcas);
  jcas.reset();
  ...
  //done
  ae.destroy();
}
Using the CAS
{
  //create a CAS
  CAS aCasView = ae.newDefaultTextCAS();

  // this is shorthand for the following steps:
  // CAS aCas = ae.newCAS();
  // CAS aCasView = aCas.createDefaultTextView();

  //analyze a document
  aCasView.setDocumentText(doc1text);
  ae.process(aCasView);
  doSomethingWithResults(aCasView);
  aCasView.reset();

  //analyze another document
  aCasView.setDocumentText(doc2text);
  ae.process(aCasView);
  doSomethingWithResults(aCasView);
  aCasView.reset();
  ...
  //done
  ae.destroy();
}
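First, you create the CAS or JCas that you will use. Then, you repeat the following four steps for each document: put the document text into the CAS (setDocumentText), call the AE's process method with the CAS as its argument, do something with the results that the AE has added to the CAS, and call the reset method to prepare the CAS for the next document.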
Analyzing non-text artifacts is similar to analyzing text documents. The main difference is that instead of using the setDocumentText method, you need to use the Sofa APIs to create an artifact plus (perhaps multiple) views of it. See Annotations, Artifacts, and Sofas for details.
See: com.ibm.uima.examples.AnnotationFilter, which is in docs\examples\src, and com.ibm.uima.jcas.impl.JCas.
Analysis results are accessed using the CAS Indexes. You obtain iterators over specified types; the iterator returns the matching elements one at a time from the CAS. For an example of this, see com.ibm.uima.examples.PrintAnnotations, which is in docs\examples\src. See also the com.ibm.uima.cas and com.ibm.uima.cas.text packages.
The simplest way to use an AE in a multi-threaded environment is to use the Java synchronized keyword to ensure that only one thread is using an AE at any given time. For example:
public class MyApplication {
  private AnalysisEngine mAnalysisEngine;
  private CAS mCAS;

  public MyApplication() {
    //get Resource Specifier from XML file or PEAR
    XMLInputSource in = new XMLInputSource("MyDescriptor.xml");
    ResourceSpecifier specifier =
        UIMAFramework.getXMLParser().parseResourceSpecifier(in);

    //create Analysis Engine here
    mAnalysisEngine = UIMAFramework.produceAnalysisEngine(specifier);
    mCAS = mAnalysisEngine.newDefaultTextCAS();
  }

  // Assume some other part of your multi-threaded application could
  // call "analyzeDocument" on different threads, asynchronously
  public synchronized void analyzeDocument(String aDoc) {
    //analyze a document
    mCAS.setDocumentText(aDoc);
    mAnalysisEngine.process(mCAS);
    doSomethingWithResults(mCAS);
    mCAS.reset();
  }
  ...
}
Without the synchronized keyword, this application would not be thread-safe. If multiple threads called the analyzeDocument method simultaneously, they would all use the same CAS and clobber each other's results. The synchronized keyword ensures that no more than one thread is executing this method at any given time. For more information on thread synchronization in Java, see http://java.sun.com/docs/books/tutorial/essential/threads/multithreaded.html.
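The effect of the synchronized keyword can be illustrated without any UIMA classes. The following is a minimal sketch under stated assumptions, not UIMA code: SharedBufferAnalyzer is a hypothetical class, and its StringBuilder field merely stands in for the single shared CAS of the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an application that owns one shared, reusable
// buffer (playing the role of the single CAS in the example above).
class SharedBufferAnalyzer {
    private final StringBuilder sharedBuffer = new StringBuilder();

    // Only one thread at a time may fill, read, and reset the shared buffer.
    public synchronized int analyzeDocument(String doc) {
        sharedBuffer.append(doc);            // plays the role of setDocumentText
        int length = sharedBuffer.length();  // plays the role of process + reading results
        sharedBuffer.setLength(0);           // plays the role of reset
        return length;
    }

    // Calls analyzeDocument for every document on its own thread and returns
    // the result observed for each document.
    static Map<String, Integer> analyzeAll(List<String> docs) {
        SharedBufferAnalyzer app = new SharedBufferAnalyzer();
        Map<String, Integer> results = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (String doc : docs) {
            Thread t = new Thread(() -> results.put(doc, app.analyzeDocument(doc)));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
        return results;
    }
}
```

Because analyzeDocument is synchronized, each thread sees only its own document in the buffer; removing the keyword would allow two threads to append to the buffer before either one resets it.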
The synchronized keyword ensures thread-safety, but does not allow you to process more than one document at a time. If you need to process multiple documents simultaneously (for example, to make use of a multiprocessor machine), you’ll need to use more than one CAS instance.
Because CAS instances use memory and can take some time to construct, you don't want to create a new CAS instance for each request. Instead, you should use a feature of the UIMA SDK called the CAS Pool, implemented by the type CasPool.
A CAS Pool contains some number of CAS instances (you specify how many when you create the pool). When a thread wants to use a CAS, it checks out an instance from the pool. When the thread is done using the CAS, it must release the CAS instance back into the pool. If all instances are checked out, additional threads will block and wait for an instance to become available. Here is some example code:
public class MyApplication {
  private CasPool mCasPool;
  private AnalysisEngine mAnalysisEngine;

  public MyApplication() {
    //get Resource Specifier from XML file or PEAR
    XMLInputSource in = new XMLInputSource("MyDescriptor.xml");
    ResourceSpecifier specifier =
        UIMAFramework.getXMLParser().parseResourceSpecifier(in);

    //create multithreadable AE that will
    //accept 3 simultaneous requests
    mAnalysisEngine = UIMAFramework.produceAnalysisEngine(specifier,3);

    //create CAS pool with 3 CAS instances
    mCasPool = new CasPool(mAnalysisEngine,3);
  }

  public void analyzeDocument(String aDoc) {
    //check out a CAS instance (argument 0 means no timeout)
    CAS cas = mCasPool.getCas(0);
    try {
      //analyze a document
      cas.setDocumentText(aDoc);
      mAnalysisEngine.process(cas);
      doSomethingWithResults(cas);
    } finally {
      //MAKE SURE we release the CAS instance
      mCasPool.releaseCas(cas);
    }
  }
  ...
}
There is not much more code required here than in the previous example. First, there is one additional parameter to the AnalysisEngine producer, specifying the number of annotator instances to create. (Both the UIMA Collection Processing Manager framework and the remote deployment services framework have implementations which use CAS pools in this manner, and thereby relieve the annotator developer of the necessity to make their annotators thread-safe.) Then, instead of creating a single CAS in the constructor, we now create a CasPool containing 3 instances. In the analyzeDocument method, we check out a CAS, use it, and then release it.
The getCas() method returns a CAS which is not specialized to any particular subject of analysis. To process anything other than this, refer to Annotations, Artifacts, and Sofas.
Note the use of the try...finally block. This is very important, as it ensures that the CAS we have checked out will be released back into the pool, even if the analysis code throws an exception. You should always use try...finally when using the CAS pool; if you do not, you risk exhausting the pool and causing deadlock.
The parameter 0 passed to the CasPool.getCas() method is a timeout value. If this is set to a positive integer, it is the maximum number of milliseconds that the thread will wait for an instance to become available in the pool. If this time elapses, the getCas method will return null, and the application can do something intelligent, like ask the user to try again later. A value of 0 will cause the thread to wait forever.
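The checkout, release, and timeout behavior described above can be sketched with a plain blocking queue. This is a minimal illustration and not the UIMA CasPool implementation: SimpleBufferPool and its method names are hypothetical, and StringBuilder stands in for a CAS.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical fixed-size pool illustrating the checkout/release protocol.
// It is NOT the UIMA CasPool; StringBuilder stands in for a CAS.
class SimpleBufferPool {
    private final BlockingQueue<StringBuilder> free;

    SimpleBufferPool(int size) {
        free = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            free.add(new StringBuilder());
        }
    }

    // A timeout of 0 (or less) waits until an instance is free. A positive
    // timeout waits at most that many milliseconds and then returns null.
    StringBuilder get(long timeoutMillis) {
        try {
            if (timeoutMillis <= 0) {
                return free.take();
            }
            return free.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    void release(StringBuilder buffer) {
        buffer.setLength(0);   // reset before handing it to the next user
        free.add(buffer);
    }

    // Analyzes one document and always returns the instance to the pool.
    // Returns -1 if no instance became available within the timeout.
    int analyze(String doc, long timeoutMillis) {
        StringBuilder buffer = get(timeoutMillis);
        if (buffer == null) {
            return -1;         // the caller can ask the user to try again later
        }
        try {
            if (doc == null) {
                throw new IllegalArgumentException("no document");
            }
            buffer.append(doc);
            return buffer.length();
        } finally {
            release(buffer);   // runs even when the analysis throws
        }
    }
}
```

The finally block is what keeps the pool from draining: if the release were skipped whenever the analysis threw, each failure would permanently remove one instance, and callers using a timeout of 0 would eventually wait forever.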
In most cases, the easiest way to use multiple Analysis Engines from within an application is to combine them into an aggregate AE. For instructions, see section 4.3, Building Aggregate Analysis Engines. Be sure that you understand this method before deciding to use the more advanced feature described in this section.
If you decide that your application does need to instantiate multiple AEs and have those AEs share a single CAS, then you will no longer be able to use the various methods on the AnalysisEngine class that create CASes (or JCases) to create your CAS. This is because these methods create a CAS with a data model specific to a single AE and which therefore cannot be shared by other AEs. Instead, you create a CAS as follows:
Suppose you have two analysis engines, and one CAS Consumer, and you want to create one type system from the merge of all of their type specifications. Then you can do the following:
AnalysisEngineDescription aeDesc1 = UIMAFramework.getXMLParser().parseAnalysisEngineDescription(...);
AnalysisEngineDescription aeDesc2 = UIMAFramework.getXMLParser().parseAnalysisEngineDescription(...);
CasConsumerDescription ccDesc = UIMAFramework.getXMLParser().parseCasConsumerDescription(...);
List list = new ArrayList();
list.add(aeDesc1); list.add(aeDesc2); list.add(ccDesc);
CAS cas = CasCreationUtils.createCas(list);
// once you have this CAS, you need to create the view you want of it, and
// also (optionally) the JCas Interface to it
CAS casView = cas.createView("mySofaName", mime-type);
// (OR) CAS casView = cas.createDefaultTextView();
// (optional)
JCas jcas = casView.getJCas();
The CasCreationUtils class takes care of the work of merging the AEs' type systems and producing a CAS for the combined type system. If the type systems are not compatible, an exception will be thrown.
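The idea of merging type systems, and of failing when they conflict, can be sketched with plain collections. This is a simplified, hypothetical model and not how CasCreationUtils is implemented: each type system is reduced to a map from a qualified feature name to the name of its range type, and the class name, the key format, and the choice of exception are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of type system merging. Each "type system"
// maps a qualified feature name (for example "RoomNumber:building") to the
// name of its range type.
class TypeSystemMergeSketch {
    static Map<String, String> merge(List<Map<String, String>> typeSystems) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> typeSystem : typeSystems) {
            for (Map.Entry<String, String> feature : typeSystem.entrySet()) {
                String existing = merged.putIfAbsent(feature.getKey(), feature.getValue());
                // The same feature declared twice is fine only if both
                // declarations agree on the range type.
                if (existing != null && !existing.equals(feature.getValue())) {
                    throw new IllegalStateException("Incompatible definitions of "
                        + feature.getKey() + ": " + existing + " vs. " + feature.getValue());
                }
            }
        }
        return merged;
    }
}
```

Identical declarations from several components collapse into one entry, while a conflicting declaration aborts the merge, which mirrors the behavior described above at a much smaller scale.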
The UIMA framework provides APIs to save and restore the contents of a CAS to streams. The CASes are stored in an XML format. There are two forms of this format. The preferred form is the XMI form (see Using XMI CAS Serialization). An older format is also available, called XCAS.
To save an XMI representation of a CAS, use the serialize method of the class com.ibm.uima.util.XmlCasSerializer. To save an XCAS representation of a CAS, use the method com.ibm.uima.cas.impl.XCASSerializer.serialize; see the JavaDocs for details.
Both of these external forms can be read back in, using the deserialize method of the class com.ibm.uima.util.XmlCasDeserializer. This method deserializes into a pre-existing CAS, which you must create ahead of time and set up with the proper type system. See the JavaDocs for details.
A Collection Processing Engine (CPE) processes collections of artifacts (documents) through the combination of the following components: a Collection Reader, an optional CAS Initializer, Analysis Engines, and CAS Consumers. Collection Processing Engines and their components are described in Chapter 5, Collection Processing Engine Developer's Guide.
Like Analysis Engines, CPEs consist of a set of Java classes and a set of descriptors. You need to make sure the Java classes are in your classpath, but otherwise you only deal with descriptors.
Section 5.3, Running a CPE from Your Own Java Application describes how to use the APIs to read a CPE descriptor and run it from an application.
For the finest level of control over the CPE descriptor settings, the CPE offers programmatic access to the descriptor via an API. With this API, a developer can create a complete descriptor and then save the result to a file. The API can also be used to read in a descriptor (using XMLParser.parseCpeDescription as shown in the previous section), modify it, and write it back out again. The CPE Descriptor API allows a developer to redefine default behavior related to error handling for each component, turn on checkpointing, change performance characteristics of the CPE, and plug in a custom timer.
Below is some example code that illustrates how this works. See the JavaDocs for package com.ibm.uima.collection.metadata for more details.
//Creates descriptor with default settings
CpeDescription cpe = CpeDescriptorFactory.produceDescriptor();

//Add CollectionReader
cpe.addCollectionReader([descriptor]);

//Add CasInitializer
cpe.addCasInitializer(<cas initializer descriptor>);

// Provide the number of CASes the CPE will use
cpe.setCasPoolsSize(2);

// Define and add Analysis Engine
CpeIntegratedCasProcessor personTitleProcessor =
    CpeDescriptorFactory.produceCasProcessor("Person");

// Provide descriptor for the Analysis Engine
personTitleProcessor.setDescriptor([descriptor]);

//Terminate CPE on max errors
personTitleProcessor.setActionOnMaxError("terminate");

//Increase amount of time in ms the CPE waits for response
//from this Analysis Engine
personTitleProcessor.setTimeout(100000);

//Add Analysis Engine to the descriptor
cpe.addCasProcessor(personTitleProcessor);

// Define and add CAS Consumer
CpeIntegratedCasProcessor consumerProcessor =
    CpeDescriptorFactory.produceCasProcessor("Printer");
consumerProcessor.setDescriptor([descriptor]);

//Define batch size
consumerProcessor.setBatchSize(100);

//Terminate CPE on max errors
consumerProcessor.setActionOnMaxError("terminate");

//Add CAS Consumer to the descriptor
cpe.addCasProcessor(consumerProcessor);

// Add Checkpoint file and define checkpoint frequency (ms)
cpe.setCheckpoint("[path]/checkpoint.dat", 3000);

// Plug in custom timer class used for timing events
cpe.setTimer("com.ibm.uima.reference_impl.util.JavaTimer");

// Define number of documents to process
cpe.setNumToProcess(1000);

// Dump the descriptor to the System.out
((CpeDescriptionImpl)cpe).toXML(System.out);
The CPE descriptor for the above configuration looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<cpeDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <collectionReader>
    <collectionIterator>
      <descriptor>
        <include href="[descriptor]"/>
      </descriptor>
      <configurationParameterSettings>...</configurationParameterSettings>
    </collectionIterator>
    <casInitializer>
      <descriptor>
        <include href="[descriptor]"/>
      </descriptor>
      <configurationParameterSettings>...</configurationParameterSettings>
    </casInitializer>
  </collectionReader>
  <casProcessors casPoolSize="2" processingUnitThreadCount="1">
    <casProcessor deployment="integrated" name="Person">
      <descriptor>
        <include href="[descriptor]"/>
      </descriptor>
      <deploymentParameters/>
      <errorHandling>
        <errorRateThreshold action="terminate" value="100/1000"/>
        <maxConsecutiveRestarts action="terminate" value="30"/>
        <timeout max="100000"/>
      </errorHandling>
      <checkpoint batch="100" time="1000ms"/>
    </casProcessor>
    <casProcessor deployment="integrated" name="Printer">
      <descriptor>
        <include href="[descriptor]"/>
      </descriptor>
      <deploymentParameters/>
      <errorHandling>
        <errorRateThreshold action="terminate" value="100/1000"/>
        <maxConsecutiveRestarts action="terminate" value="30"/>
        <timeout max="100000" default="-1"/>
      </errorHandling>
      <checkpoint batch="100" time="1000ms"/>
    </casProcessor>
  </casProcessors>
  <cpeConfig>
    <numToProcess>1000</numToProcess>
    <deployAs>immediate</deployAs>
    <checkpoint file="[path]/checkpoint.dat" time="3000ms"/>
    <timerImpl>com.ibm.uima.reference_impl.util.JavaTimer</timerImpl>
  </cpeConfig>
</cpeDescription>
Configuration parameters can be set using APIs as well as configured using the XML descriptor metadata specification (see Configuration Parameters).
There are two different places you can set the parameters via the APIs: on the component's description, before you produce the component, and on the component itself, after it has been produced.
Setting the parameters before you produce the component is done using the ConfigurationParameterSettings object. You get an instance of this for a particular component by accessing that component description's metadata. For instance, if you produced a component description by using one of the UIMAFramework.getXMLParser().parse... methods, you can use that component description's getMetaData() method to get the metadata, and then the metadata's getConfigurationParameterSettings method to get the ConfigurationParameterSettings object. Using that object, you can set individual parameters using the setParameterValue method. Here's an example, for a CAS Consumer component:
// Create a description object by reading the XML for the descriptor
CasConsumerDescription casConsumerDesc =
    UIMAFramework.getXMLParser().parseCasConsumerDescription(
        new XMLInputSource("descriptors/cas_consumer/InlineXmlCasConsumer.xml"));

// get the settings from the metadata
ConfigurationParameterSettings consumerParamSettings =
    casConsumerDesc.getMetaData().getConfigurationParameterSettings();

// Set a parameter value
consumerParamSettings.setParameterValue(
    InlineXmlCasConsumer.PARAM_OUTPUTDIR,
    outputDir.getAbsolutePath());
Then you might produce this component using:
CasConsumer component =
UIMAFramework.produceCasConsumer(casConsumerDesc);
A side effect of producing a component is calling the component's "initialize" method, allowing it to read its configuration parameters. If you want to change parameters after this, use
component.setConfigParameterValue("<parameter-name>",
"<parameter-value>");
and then signal the component to re-read its configuration by calling the component's reconfigure method:
component.reconfigure();
Although these examples are for a CAS Consumer component, the parameter APIs also work for other kinds of components.
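The difference between the two places can be sketched with a small stand-alone class. This is a hypothetical model that only mimics the pattern; ConfigurableComponentSketch, the "OutputDirectory" parameter name, and the field names are assumptions and not part of the UIMA API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical component illustrating the two points at which parameters
// can be set. It only mimics the pattern; it is not a UIMA class.
class ConfigurableComponentSketch {
    private final Map<String, String> settings = new HashMap<>();
    private String outputDir;   // the value the component actually works with

    // "Producing" the component copies the description's settings and then
    // calls initialize, which reads them.
    ConfigurableComponentSketch(Map<String, String> descriptionSettings) {
        settings.putAll(descriptionSettings);
        initialize();
    }

    private void initialize() {
        outputDir = settings.get("OutputDirectory");
    }

    // In this sketch, a value changed after production is only stored; the
    // component keeps using the old value until reconfigure is called.
    void setConfigParameterValue(String name, String value) {
        settings.put(name, value);
    }

    void reconfigure() {
        initialize();
    }

    String getOutputDir() {
        return outputDir;
    }
}
```

A value placed in the settings before construction is picked up immediately, whereas a later change becomes visible only after reconfigure, which is why the two calls appear together in the text above.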
The UIMA SDK includes a search engine that you can use to build a search index that includes the results of the analysis done by your AE. This combination of AEs with a search engine capable of indexing both words and annotations over spans of text enables what UIMA refers to as semantic search.
Semantic search is a search where the semantic intent of the query is specified using one or more entity or relation specifiers. For example, one could specify that they are looking for a person (named) "Bush." Such a query would then not return results about the kind of bushes that grow in your garden.
To build a semantic search index using the UIMA SDK, you run a Collection Processing Engine that includes your AE along with a CAS Consumer called the Semantic Search CAS Indexer, which is provided with the UIMA SDK. Your AE must include an annotator that produces Token and Sentence annotations, along with any "semantic" annotations, because the Indexer requires this. The Semantic Search CAS Indexer's descriptor is located at: docs/examples/descriptors/cas_consumer/SemanticSearchCasIndexer.xml.
Since there are several ways you might want to build a search index from the information in the CAS produced by your AE, you need to supply the Semantic Search CAS Indexer with configuration information in the form of an Index Build Specification file. An example of an Index Build Specification tailored to the AE from the tutorial in Chapter 4, Annotator and Analysis Engine Developer’s Guide is located in docs/examples/descriptors/tutorial/search/MeetingIndexBuildSpec.xml. It looks like this:
<indexBuildSpecification>
  <indexBuildItem>
    <name>com.ibm.uima.examples.tokenizer.Token</name>
    <indexRule>
      <style name="Term"/>
    </indexRule>
  </indexBuildItem>
  <indexBuildItem>
    <name>com.ibm.uima.examples.tokenizer.Sentence</name>
    <indexRule>
      <style name="Breaking"/>
    </indexRule>
  </indexBuildItem>
  <indexBuildItem>
    <name>com.ibm.uima.tutorial.Meeting</name>
    <indexRule>
      <style name="Annotation"/>
    </indexRule>
  </indexBuildItem>
  <indexBuildItem>
    <name>com.ibm.uima.tutorial.RoomNumber</name>
    <indexRule>
      <style name="Annotation">
        <attributeMappings>
          <mapping>
            <feature>building</feature>
            <indexName>building</indexName>
          </mapping>
        </attributeMappings>
      </style>
    </indexRule>
  </indexBuildItem>
  <indexBuildItem>
    <name>com.ibm.uima.tutorial.DateAnnot</name>
    <indexRule>
      <style name="Annotation"/>
    </indexRule>
  </indexBuildItem>
  <indexBuildItem>
    <name>com.ibm.uima.tutorial.TimeAnnot</name>
    <indexRule>
      <style name="Annotation"/>
    </indexRule>
  </indexBuildItem>
</indexBuildSpecification>
The index build specification is a series of index build items, each of which identifies a CAS annotation type (a subtype of uima.tcas.Annotation; see Chapter 26, CAS Reference) and a style.
The first item in this example specifies that the annotation type com.ibm.uima.examples.tokenizer.Token should be indexed with the "Term" style. This means that each span of text annotated by a Token will be considered a single token for standard text search purposes.
The second item in this example specifies that the annotation type com.ibm.uima.examples.tokenizer.Sentence should be indexed with the "Breaking" style. This means that each span of text annotated by a Sentence will be considered a single sentence, which can affect the search engine's algorithm for matching queries. The semantic search engine always requires tokens and sentences in order to index a document.
The remaining items all use the "Annotation" style. This indicates that each annotation of the specified types will be stored in the index as a searchable span, with a name equal to the annotation name (without the namespace).
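As a small illustration of the naming rule for the "Annotation" style, the span name can be derived from the fully qualified type name. This is a sketch under the assumption that the namespace is everything up to the last dot; IndexSpanNames and spanName are hypothetical names, not part of the UIMA SDK.

```java
// Hypothetical helper: derives the searchable span name from a fully
// qualified annotation type name, assuming the namespace is everything
// up to and including the last dot.
class IndexSpanNames {
    static String spanName(String typeName) {
        int lastDot = typeName.lastIndexOf('.');
        return lastDot < 0 ? typeName : typeName.substring(lastDot + 1);
    }
}
```

Under this assumption, the type com.ibm.uima.tutorial.Meeting is stored under the span name Meeting.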
Also, features of annotations can be indexed using the <attributeMappings> subelement. In the example index build specification, we declare that the building feature of the type com.ibm.uima.tutorial.RoomNumber should be indexed. The <indexName> element can be used to map the feature name to a different name in the index, but in this example we have opted to use the same name, building.
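To check which style each annotation type receives, an index build specification of this shape can be read with the standard Java XML parser. The following is a minimal sketch that relies only on the element and attribute names visible in the example above; IndexBuildSpecReader and readStyles are hypothetical names, and the Semantic Search CAS Indexer reads the file in its own way.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for index build specifications shaped like the
// example above. It lists each annotation type with its index style.
class IndexBuildSpecReader {
    // Returns a map from annotation type name to style name, in file order.
    static Map<String, String> readStyles(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> styles = new LinkedHashMap<>();
            NodeList items = doc.getElementsByTagName("indexBuildItem");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String typeName =
                    item.getElementsByTagName("name").item(0).getTextContent().trim();
                Element style = (Element) item.getElementsByTagName("style").item(0);
                styles.put(typeName, style.getAttribute("name"));
            }
            return styles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse index build specification", e);
        }
    }
}
```

For the specification shown earlier, such a reader would report "Term" for the Token type, "Breaking" for the Sentence type, and "Annotation" for the remaining types.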
At the end of the batch or collection, the Semantic Search CAS Indexer builds the index. This index can be queried with simple tokens or with XML tags.
Examples:
The query <Meeting>UIMA</Meeting> will retrieve only those documents that contain a Meeting annotation (produced by our MeetingDetector TAE, for example), where that Meeting annotation contains the word "UIMA".
A query can likewise be restricted to annotations whose building feature contains the term "Yorktown".
More information on the syntax of these kinds of queries, called XML Fragments, can be found in Chapter 28, Semantic Search Engine Reference.
For more information on the Index Build Specification format, see the UIMA JavaDocs for class com.ibm.uima.search.IndexBuildSpecification. The JavaDocs are in the docs/api directory of the UIMA SDK.
The following steps illustrate how to build and run a CPE that uses the UIMA Meeting Detector TAE and the Simple Token and Sentence Annotator, discussed in Chapter 4, Annotator and Analysis Engine Developer’s Guide, along with the Semantic Search CAS Indexer, to build an index that allows you to query for documents based not only on textual content but also on whether they contain mentions of Meetings detected by the TAE.
Run the CPE Configurator tool by executing the cpeGui shell script in the bin directory of the UIMA SDK. (For instructions on using this tool, see Chapter 13, Collection Processing Engine Configurator User's Guide.)
In the CPE Configurator tool, select the following components by browsing to their descriptors:
%UIMA_HOME%/docs/examples/descriptors/collectionReader/FileSystemCollectionReader.xml
%UIMA_HOME%/docs/examples/descriptors/analysis_engine/SimpleTokenAndSentenceAnnotator.xml and %UIMA_HOME%/docs/examples/descriptors/tutorial/ex6/UIMAMeetingDetectorTAE.xml
%UIMA_HOME%/docs/examples/descriptors/casConsumer/SemanticSearchCasIndexer.xml
%UIMA_HOME%/docs/examples/descriptors/casConsumer/XCasWriterCasConsumer.xml
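Set up parameters: point the collection reader's input at the %UIMA_HOME%/docs/examples/data directory, and point the Semantic Search CAS Indexer at the index build specification file %UIMA_HOME%/docs/examples/descriptors/tutorial/search/MeetingIndexBuildSpec.xml.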
Click on the Run Button. Once the run completes, a statistics dialog should appear, in which you can see how much time was spent in each of the components involved in the run.
The UIMA SDK contains a simple tool for running queries against a semantic search index. After building an index as described in the previous section, you can launch this tool from the command prompt by running the semanticSearch shell script, found in the /bin subdirectory of the UIMA install. If you are using Eclipse, and have installed the UIMA examples, there will be a Run configuration called UIMA Semantic Search that you can use to launch it conveniently. This will display the query window.
Configure the first three fields on this screen as follows:
%UIMA_HOME%/docs/examples/descriptors/tutorial/ex4/TutorialTypeSystem.xml
Now, in the "XML Fragments" field, you can type in single words or XML queries where the XML tags correspond to the labels in the index build specification file (e.g. <Meeting>UIMA</Meeting>). XML Fragments are described in Chapter 28.
After you enter a query and click the "Search" button, a list of hits will appear. Select one of the documents and click "View Analysis" to view the document in the UIMA Annotation Viewer.
The source code for the Semantic Search query program is in docs/examples/src/com/ibm/uima/examples/search/SemanticSearchGUI.java. A simple command-line query program is also provided in docs/examples/src/com/ibm/uima/examples/search/SemanticSearch.java. Using these as a model, you can build a query interface from your own application. For details on the Semantic Search Engine query language and interface, see Chapter 28, Semantic Search Engine Reference.
The UIMA SDK allows you to easily take any Analysis Engine or CAS Consumer and deploy it as a service. That Analysis Engine or CAS Consumer can then be called from a remote machine.
The UIMA SDK provides support for two communications protocols: SOAP and Vinci.
The UIMA framework can make use of these services in two different ways:
To deploy a UIMA component as a SOAP Web Service, you need to first install the following software components:
Later versions of these components will likely also work, but have not been tested.
Next, you need to do the following three setup steps:
Copy the JAR files from %UIMA_HOME%/lib to the %CATALINA_HOME%/webapps/axis/WEB-INF/lib in your installation.
Copy the JAR files for the components you want to deploy to the %CATALINA_HOME%/webapps/axis/WEB-INF/lib in your installation. The examples are in the uima_examples.jar file which is included in the %UIMA_HOME%/lib folder.
Place any resources that your components need in %CATALINA_HOME%/webapps/axis/WEB-INF/classes. For example, to deploy the example Analysis Engine (docs/examples/descriptors/analysis_engine/GovernmentOfficialRecognizer_RegEx_TAE) as a SOAP service, you need to copy the file docs/examples/resources/GovernmentTitlePatterns.dat into .../WEB-INF/classes.
Test your installation of Tomcat and Axis by starting Tomcat and going to http://localhost:8080/axis/happyaxis.jsp in your browser. Check to be sure that this reports that all of the required Axis libraries are present. One common missing file may be activation.jar, which you can get from java.sun.com.
After completing these setup instructions, you can deploy Analysis Engines or CAS Consumers as SOAP web services by using the deploytool utility, which is located in the /bin directory of the UIMA SDK. deploytool is a command line utility that takes as an argument a web services deployment descriptor (WSDD file); example WSDD files are provided in the docs\examples\deploy\soap directory of the UIMA SDK. Deployment descriptors have been provided for deploying and undeploying some of the example Analysis Engines that come with the SDK.
As an example, the WSDD file for deploying the example Person Title annotator looks like this (the important parts are the deployment name, the service name, and the resource specifier path):
<deployment name="PersonTitleAnnotator"
    xmlns="http://xml.apache.org/axis/wsdd/"
    xmlns:java="http://xml.apache.org/axis/wsdd/providers/java">
  <service name="urn:PersonTitleAnnotator" provider="java:RPC">
    <parameter name="scope" value="Request"/>
    <parameter name="className" value="com.ibm.uima.reference_impl.analysis_engine.service.soap.AxisAnalysisEngineService_impl"/>
    <parameter name="allowedMethods" value="getMetaData process"/>
    <parameter name="allowedRoles" value="*"/>
    <parameter name="resourceSpecifierPath" value="C:/Program Files/apache/uima/docs/examples/descriptors/analysis_engine/PersonTitleAnnotator.xml"/>
    <parameter name="numInstances" value="3"/>
    <parameter name="timeoutPeriod" value="30000"/>
    <!-- Type Mappings omitted from this document; you will not need to edit them. -->
    <typeMapping .../>
    <typeMapping .../>
    <typeMapping .../>
  </service>
</deployment>
To modify this WSDD file to deploy your own Analysis Engine or CAS Consumer, just replace the deployment name, service name, and resource specifier path with values appropriate for your component.
The timeoutPeriod parameter is used only when there are multiple clients accessing the service. When a new request comes in, if the service is busy with other requests (all instances are busy, in the case where it has multiple instances), it waits for one to become available, and this parameter specifies the maximum time for that wait. If it takes longer than this, the service wrapper will throw an exception back to the client and abort the processing for this document on the service.
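The interplay of numInstances and timeoutPeriod can be sketched with a counting semaphore. This is a hypothetical illustration and not the actual service wrapper: the class and method names are invented, the timeout is assumed to be in milliseconds, and IllegalStateException stands in for whatever exception the real wrapper reports to the client.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

// Hypothetical service wrapper: at most numInstances requests run at once,
// and a request waits at most timeoutPeriodMillis for a free instance.
class InstanceLimitedService {
    private final Semaphore instances;
    private final long timeoutPeriodMillis;

    InstanceLimitedService(int numInstances, long timeoutPeriodMillis) {
        this.instances = new Semaphore(numInstances);
        this.timeoutPeriodMillis = timeoutPeriodMillis;
    }

    // Runs the request on a free instance, or throws if none becomes
    // available within the timeout period.
    int handle(IntSupplier request) {
        boolean acquired;
        try {
            acquired = instances.tryAcquire(timeoutPeriodMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for an instance", e);
        }
        if (!acquired) {
            throw new IllegalStateException(
                "No instance became available within " + timeoutPeriodMillis + " ms");
        }
        try {
            return request.getAsInt();
        } finally {
            instances.release();   // free the instance for the next request
        }
    }
}
```

With a single client the wait never occurs, which matches the statement above that the parameter only matters when several clients compete for the available instances.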
To deploy the Person Title annotator service, issue the following command:
C:\Program Files\IBM\uima>bin\deploytool docs\examples\deploy\soap\Deploy_PersonTitleAnnotator.wsdd
Test whether the deployment was successful by starting up a browser, pointing it to your Tomcat installation's "axis" webpage (e.g., http://localhost:8080/axis) and clicking on the List link. This should bring up a page which shows the deployed services, where you should see the service you just deployed.
The other components can be deployed by replacing Deploy_PersonTitleAnnotator.wsdd with one of the other Deploy descriptors in the deploy directory. The deploytool utility can also undeploy services when passed one of the Undeploy descriptors.
Note: The deploytool shell script assumes that the web services are to be installed at http://localhost:8080/axis. If this is not the case, you will need to update the shell script appropriately.
Once you have deployed your component as a web service, you may call it from a remote machine. See "How to Call a UIMA Service," below, for instructions.
There are no software prerequisites for deploying a Vinci service. The necessary libraries are part of the UIMA SDK. However, before you can use Vinci services you need to deploy the Vinci Naming Service (VNS), as described in section 6.6.5.
To deploy a service, you have to ensure that any components you want to include can be found on the class path. One way to do this is to set the environment variable UIMA_CLASSPATH to the set of class paths you need for any included components. Then run the startVinciService shell script, which is located in the UIMA SDK bin directory, and pass it the path to a Vinci deployment descriptor, for example:
C:\UIMA>bin\startVinciService docs\examples\deploy\vinci\Deploy_PersonTitleAnnotator.xml
This example deployment descriptor looks like:
<deployment name="Vinci Person Title Annotator Service">
  <service name="uima.annotator.PersonTitleAnnotator" provider="vinci">
    <parameter name="serializerClassName" value="com.ibm.uima.reference_impl.analysis_engine.service.vinci.VinciXCASSerializer_NoDocText"/>
    <parameter name="resourceSpecifierPath" value="C:/Program Files/apache-uima/docs/examples/descriptors/analysis_engine/PersonTitleAnnotator.xml"/>
    <parameter name="numInstances" value="1"/>
    <parameter name="timeoutPeriod" value="30000"/>
    <parameter name="serverSocketTimeout" value="120000"/>
  </service>
</deployment>
To modify this deployment descriptor to deploy your own Analysis Engine or CAS Consumer, just replace the areas indicated in bold italics (deployment name, service name, and resource specifier path) with values appropriate for your component.
The timeoutPeriod parameter is used only when multiple clients access the service. When a new request arrives and the service is busy with other requests (all instances are busy, if it has multiple instances), the request waits for an instance to become available; this parameter specifies the maximum time for that wait. If the wait exceeds this limit, the service wrapper throws an exception back to the client and aborts processing of this document on the service.
The serverSocketTimeout parameter specifies the number of milliseconds (default = 5 minutes) that the service will wait between requests to process something. After this amount of time, the server presumes that the client may have gone away and cleans up, releasing any resources it is holding. The next call to process on the service then causes the client to re-establish its connection with the service, at the cost of some additional overhead.
The startVinciService script takes two additional optional parameters. The first one overrides the value of the VNS_HOST environment variable, allowing you to specify the name server to use. The second parameter if specified needs to be a unique (on this server) non-negative number, specifying the instance of this service. When used, this number allows multiple instances of the same named service to be started on one server; they will all register with the Vinci name service and be made available to client requests.
Once you have deployed your component as a web service, you may call it from a remote machine. See "How to Call a UIMA Service," below, for instructions.
Once an Analysis Engine or CAS Consumer has been deployed as a service, it can be used from any UIMA application, in the exact same way that a local Analysis Engine or CAS Consumer is used. For example, you can call an Analysis Engine service from the Document Analyzer or use the CPE Configurator to build a CPE that includes Analysis Engine and CAS Consumer services.
To do this, you use a service client descriptor in place of the usual Analysis Engine or CAS Consumer Descriptor. A service client descriptor is a simple XML file that indicates the location of the remote service and a few parameters. Example service client descriptors are provided in the UIMA SDK under the directories docs/examples/descriptors/soapService and docs/examples/descriptors/vinciService. The contents of these descriptors are explained below.
Also, before you can call a SOAP service, you need to have the necessary Axis JAR files in your classpath. If you use any of the scripts in the /bin directory of the UIMA installation to launch your application, such as documentAnalyzer, these JARs are added to the classpath automatically, using the CATALINA_HOME environment variable. The required files are the following (all part of the Apache Axis download):
The descriptor used to call the PersonTitleAnnotator SOAP service from the example above is:
<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
  <resourceType>AnalysisEngine</resourceType>
  <uri>http://localhost:8080/axis/services/urn:PersonTitleAnnotator</uri>
  <protocol>SOAP</protocol>
</uriSpecifier>
The <resourceType> element must contain either AnalysisEngine or CasConsumer. This specifies what type of component you expect to be at the specified service address.
The <uri> element describes which service to call. It specifies the host (localhost, in this example) and the service name (urn:PersonTitleAnnotator), which must match the name specified in the deployment descriptor used to deploy the service.
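To make the structure of a service client descriptor concrete, the following sketch reads its three fields with the JDK's DOM parser and rejects a resourceType other than the two values allowed above. The class UriSpecifierReader is a hypothetical helper written for illustration; UIMA reads descriptors through its own XML parser, which is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

/** Hypothetical helper; UIMA itself parses descriptors through its own XML parser. */
class UriSpecifierReader {
    static final String NS = "http://uima.apache.org/resourceSpecifier";

    /** Returns {resourceType, uri, protocol}; rejects unknown resource types. */
    static String[] read(String xml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("descriptor is not well-formed XML", e);
        }
        String type = text(doc, "resourceType");
        if (!type.equals("AnalysisEngine") && !type.equals("CasConsumer")) {
            throw new IllegalArgumentException("unexpected resourceType: " + type);
        }
        return new String[] { type, text(doc, "uri"), text(doc, "protocol") };
    }

    /** Text of the first element with the given local name in the descriptor namespace. */
    private static String text(Document doc, String tag) {
        Node node = doc.getElementsByTagNameNS(NS, tag).item(0);
        if (node == null) {
            throw new IllegalArgumentException("missing <" + tag + "> element");
        }
        return node.getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<uriSpecifier xmlns=\"" + NS + "\">"
                + "<resourceType>AnalysisEngine</resourceType>"
                + "<uri>http://localhost:8080/axis/services/urn:PersonTitleAnnotator</uri>"
                + "<protocol>SOAP</protocol></uriSpecifier>";
        System.out.println(String.join(" | ", read(xml)));
    }
}
```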
To call a Vinci service, a similar descriptor is used:
<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
  <resourceType>AnalysisEngine</resourceType>
  <uri>uima.annot.PersonTitleAnnotator</uri>
  <protocol>Vinci</protocol>
  <timeout>60000</timeout>
  <parameters>
    <parameter name="VNS_HOST" value="some.internet.ip.name-or-address"/>
    <parameter name="VNS_PORT" value="9000"/>
  </parameters>
</uriSpecifier>
Note that Vinci uses a centralized naming server, so the host where the service is deployed does not need to be specified. Only a name (uima.annot.PersonTitleAnnotator) is given, which must match the name specified in the deployment descriptor used to deploy the service.
The host and/or port where your Vinci Naming Service (VNS) server is running can be specified by the optional <parameter> elements. If they are not specified there, the values are taken from the -DVNS_HOST=<host> and -DVNS_PORT=<port> system arguments given on your Java command line, if present. If they are not specified on the Java command line either, defaults are used: localhost for VNS_HOST and 9000 for VNS_PORT. See the next section for details on setting up a VNS server.
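The lookup order for the VNS location (descriptor parameter first, then Java system property, then default) can be written out as a small sketch. The class VnsLocationSketch and its methods are hypothetical names used for illustration; only the parameter names and default values come from the text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical illustration of the lookup order; not UIMA or Vinci code. */
class VnsLocationSketch {
    /** Descriptor parameter wins, then the -D system property, then the default. */
    static String resolve(String name, Map<String, String> descriptorParams,
                          Properties systemProps, String defaultValue) {
        String value = descriptorParams.get(name);
        if (value == null) {
            value = systemProps.getProperty(name);
        }
        return value != null ? value : defaultValue;
    }

    static String host(Map<String, String> descriptorParams, Properties systemProps) {
        return resolve("VNS_HOST", descriptorParams, systemProps, "localhost");
    }

    static int port(Map<String, String> descriptorParams, Properties systemProps) {
        return Integer.parseInt(resolve("VNS_PORT", descriptorParams, systemProps, "9000"));
    }

    public static void main(String[] args) {
        // No <parameter> elements here, so -DVNS_HOST / -DVNS_PORT or the defaults apply.
        Map<String, String> none = new HashMap<>();
        System.out.println(host(none, System.getProperties())
                + ":" + port(none, System.getProperties()));
    }
}
```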
Remotely deployed services are started on remote machines, using UIMA component descriptors on those remote machines. These descriptors supply any configuration and resource parameters for the service (configuration parameters are not transmitted from the calling instance to the remote one). Likewise, the remote descriptors supply the type system specification for the remote annotators that will be run (the type system of the calling instance is not transmitted to the remote one).
The remote service wrapper, when it receives a CAS from the caller, instantiates it for the remote service, making instances of all types which the remote service specifies. Other instances in the incoming CAS for types which the remote service has no type specification for are kept aside, and when the remote service returns the CAS back to the caller, these type instances are re-merged back into the CAS being transmitted back to the caller. Because of this design, a remote service which doesn't declare a type system won't receive any type instances.
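The set-aside and re-merge behavior can be pictured with a deliberately simplified model in which a CAS is just a map from type name to instances. Everything in this sketch is an assumption made for illustration: the class TypeFilterSketch, the map representation, and the type names in the example are not part of UIMA.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model: a CAS is reduced to a map from type name to instances. */
class TypeFilterSketch {
    /** Removes instances of types the remote service does not declare and returns them. */
    static Map<String, List<String>> setAside(Map<String, List<String>> cas,
                                              Set<String> remoteTypes) {
        Map<String, List<String>> aside = new HashMap<>();
        Iterator<Map.Entry<String, List<String>>> it = cas.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, List<String>> entry = it.next();
            if (!remoteTypes.contains(entry.getKey())) {
                aside.put(entry.getKey(), entry.getValue());
                it.remove();
            }
        }
        return aside;
    }

    /** Puts the set-aside instances back before the CAS is returned to the caller. */
    static void remerge(Map<String, List<String>> returned, Map<String, List<String>> aside) {
        returned.putAll(aside);
    }

    public static void main(String[] args) {
        Map<String, List<String>> cas = new HashMap<>();
        cas.put("Token", new ArrayList<>(Arrays.asList("t1", "t2")));
        cas.put("PersonTitle", new ArrayList<>(Arrays.asList("p1")));
        Set<String> remoteTypes = new HashSet<>(Arrays.asList("Token"));

        Map<String, List<String>> aside = setAside(cas, remoteTypes);
        System.out.println("seen by remote service: " + cas.keySet());
        remerge(cas, aside);
        System.out.println("returned to caller: " + cas.keySet());
    }
}
```

If remoteTypes is empty, setAside moves every instance out of the CAS, which corresponds to the statement that a service declaring no type system receives no type instances.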
Vinci consists of components for building network-accessible services, clients for accessing those services, and an infrastructure for locating and managing services. The primary infrastructure component is the Vinci directory, known as VNS (for Vinci Naming Service).
On startup, Vinci services locate the VNS and provide it with information that the VNS uses during service discovery. A Vinci service provides the name of the host machine on which it runs and the name of the service. The VNS internally creates a binding for the service name and returns the port number on which the Vinci service will wait for client requests. The VNS stores its bindings on the filesystem in a file called vns.services.
In Vinci, services are identified by their service name. If there is more than one physical service with the same service name, then Vinci assumes they are equivalent and will route queries to them randomly, provided that they are all running on different hosts. You should therefore use a unique service name if you don't want to conflict with other services listed in whatever VNS you have configured jVinci to use.
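To run the VNS, use the startVNS script found in the /bin directory of the UIMA installation.
If startup fails with the exception
java.net.BindException: Address already in use: JVM_Bind
then the port that the VNS is trying to use is already taken. In that case, add the parameter -p <port> to the startVNS command, using <port> to specify an alternative port to use.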
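Before starting the VNS, it can be useful to check whether the intended port is free on the local machine. The following is a minimal sketch using only the JDK; the class PortCheck is a hypothetical helper and is not part of the UIMA SDK or the startVNS script.

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Hypothetical helper; not part of the UIMA SDK. */
class PortCheck {
    /** True if a server socket can currently be bound to the port on this host. */
    static boolean isFree(int port) {
        try (ServerSocket probe = new ServerSocket(port)) {
            return true;                 // bound successfully; released on close
        } catch (IOException e) {        // includes java.net.BindException
            return false;
        }
    }

    public static void main(String[] args) {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 9000;
        System.out.println("port " + port + (isFree(port) ? " is free" : " is in use"));
    }
}
```

The check is only a snapshot: another process could still take the port between the check and the moment the VNS starts.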
When started, the VNS produces output similar to the following:
[10/6/04 3:44 PM | main] WARNING: Config file doesn't exist, creating a new empty config file!
[10/6/04 3:44 PM | main] Loading config file : .\vns.services
[10/6/04 3:44 PM | main] Loading workspaces file : .\vns.workspaces
[10/6/04 3:44 PM | main] ====================================
(WARNING) Unexpected exception:
java.io.FileNotFoundException: .\vns.workspaces (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(Unknown Source)
    at java.io.FileInputStream.<init>(Unknown Source)
    at java.io.FileReader.<init>(Unknown Source)
    at com.ibm.vinci.transport.vns.service.VNS.loadWorkspaces(VNS.java:339)
    at com.ibm.vinci.transport.vns.service.VNS.startServing(VNS.java:237)
    at com.ibm.vinci.transport.vns.service.VNS.main(VNS.java:179)
[10/6/04 3:44 PM | main] WARNING: failed to load workspace.
[10/6/04 3:44 PM | main] VNS Workspace : null
[10/6/04 3:44 PM | main] Loading counter file : .\vns.counter
[10/6/04 3:44 PM | main] Could not load the counter file : .\vns.counter
[10/6/04 3:44 PM | main] Starting backup thread, using files .\vns.services.bak and .\vns.services
[10/6/04 3:44 PM | main] Serving on port : 9000
[10/6/04 3:44 PM | Thread-0] Backup thread started
[10/6/04 3:44 PM | Thread-0] Saving to config file : .\vns.services.bak
>>>>>>>>>>>>> VNS is up and running! <<<<<<<<<<<<<<<<<
>>>>>>>>>>>>> Type 'quit' and hit ENTER to terminate VNS <<<<<<<<<<<<<
[10/6/04 3:44 PM | Thread-0] Config save required 10 millis.
[10/6/04 3:44 PM | Thread-0] Saving to config file : .\vns.services
[10/6/04 3:44 PM | Thread-0] Config save required 10 millis.
[10/6/04 3:44 PM | Thread-0] Saving counter file : .\vns.counter
In this output, look for the line
[10/6/04 3:44 PM | main] Serving on port : 9000
which states the actual port on which the VNS listens for incoming requests. All Vinci services and all clients connecting to services must provide the VNS port on the command line if the port is not the default. Again, the default port is 9000. See the section "Launching Vinci Services" below for details about the command line and parameters.
The VNS maintains two external files:
vns.services
vns.counter
These files are generated by the VNS in the directory from which the VNS is launched. Since these files may contain old information, it is best to remove them before starting the VNS. This step ensures that the VNS always has the newest information and will not attempt to connect to a service that has been shut down.
When launching a Vinci service, you must indicate which VNS the service will connect to. A Vinci service is typically started using the script startVinciService, found in the /bin directory of the UIMA installation. The environment variable VNS_HOST should be set to the name or IP address of the machine hosting the Vinci Naming Service. The default is localhost, the machine the service is deployed on. This name can also be passed as the second argument to the startVinciService script. The default port for VNS is 9000, but this can be overridden with the VNS_PORT environment variable.
If you write your own startup script, to define Vinci’s default VNS you must provide the following JVM parameters:
java -DVNS_HOST=localhost -DVNS_PORT=9000 ...
The above setting is for the VNS running on the same machine as the service. Of course one can deploy the VNS on a different machine and the JVM parameter will need to be changed to this:
java -DVNS_HOST=<host> -DVNS_PORT=9000 ...
where <host> is the name or IP address of the machine on which the VNS is running.
Note: VNS runs on port 9000 by default. When you see the following exception:
(WARNING) Unexpected exception:
com.ibm.vinci.transport.ServiceDownException: VNS inaccessible: java.net.ConnectException: Connection refused: connect
then either the VNS is not running, or the VNS is running but using a different port. To correct the latter, set the environment variable VNS_PORT to the correct port before starting the service.
To find the right port, check the VNS output for a line similar to the following:
[10/6/04 3:44 PM | main] Serving on port : 9000
It is printed by the VNS on startup.
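If the VNS output has been captured, for example in a log file, the port can be extracted from that line with a regular expression. This is a minimal sketch that assumes the line format shown above; the class VnsLogPort is a hypothetical helper, not part of Vinci.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that scans captured VNS output for the port line. */
class VnsLogPort {
    private static final Pattern SERVING =
            Pattern.compile("Serving on port\\s*:\\s*(\\d+)");

    /** Returns the port from the first "Serving on port" line, if there is one. */
    static OptionalInt find(List<String> logLines) {
        for (String line : logLines) {
            Matcher m = SERVING.matcher(line);
            if (m.find()) {
                return OptionalInt.of(Integer.parseInt(m.group(1)));
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "[10/6/04 3:44 PM | main] VNS Workspace : null",
                "[10/6/04 3:44 PM | main] Serving on port : 9000");
        System.out.println(find(lines));
    }
}
```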
There are several ways to exploit parallelism to increase performance in the UIMA Framework. These range from running with additional threads within one Java virtual machine on one host (which might be a multi-processor or hyper-threaded host) to deploying analysis engines on a set of remote machines.
The Collection Processing facility in UIMA provides the ability to scale the pipe-line of analysis engines. This scale-out runs multiple threads within the Java virtual machine running the CPM, one for each pipe in the pipe-line. To activate it, in the <casProcessors> descriptor element, set the attribute processingUnitThreadCount, which specifies the number of replicated processing pipelines, to a value greater than 1, and ensure that the size of the CAS pool is equal to or greater than this number (the attribute of <casProcessors> to set is casPoolSize). For more details on these settings, see CAS Processors.
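The relationship between the two settings can be illustrated with a toy model: several worker threads stand in for the replicated pipelines, and a fixed pool of reusable buffers stands in for the CAS pool. This is a sketch under those assumptions only. The class PipelineScaleOutSketch is hypothetical, StringBuilder objects are placeholders for CASes, and none of this is the CPM implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of replicated pipelines sharing a pool of CAS-like buffers. */
class PipelineScaleOutSketch {
    /** Processes docCount documents and returns how many were processed. */
    static int run(int threadCount, int casPoolSize, int docCount) {
        if (casPoolSize < threadCount) {
            throw new IllegalArgumentException(
                    "casPoolSize must be >= processingUnitThreadCount");
        }
        BlockingQueue<StringBuilder> casPool = new ArrayBlockingQueue<>(casPoolSize);
        for (int i = 0; i < casPoolSize; i++) {
            casPool.add(new StringBuilder());
        }
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pipelines = Executors.newFixedThreadPool(threadCount);
        for (int d = 0; d < docCount; d++) {
            final int doc = d;
            pipelines.submit(() -> {
                StringBuilder cas = casPool.take();   // wait for a free buffer
                try {
                    cas.setLength(0);
                    cas.append("doc-").append(doc);   // stand-in for analysis work
                    processed.incrementAndGet();
                } finally {
                    casPool.put(cas);                 // return the buffer to the pool
                }
                return null;
            });
        }
        pipelines.shutdown();
        try {
            if (!pipelines.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("pipelines did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(run(3, 3, 20) + " documents processed");
    }
}
```

In this model, a pool smaller than the thread count would leave some threads idle while they wait for a buffer, which is one way to read the requirement that casPoolSize be at least processingUnitThreadCount.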
For deployments that incorporate remote analysis engines in the Collection Manager pipe-line, running on multiple remote hosts, scale-out is supported through the Vinci naming service. If multiple instances of a service with the same name, but running on different hosts, are registered with the Vinci Name Server, it will assign these instances to incoming requests.
There are two modes supported: a "random" assignment and an "exclusive" one. The "random" mode distributes load using an algorithm that selects a service instance at random. The UIMA framework supports this only for the case where all of the instances are running on unique hosts; the framework does not support starting 2 or more instances on the same host.
The exclusive mode dedicates a particular remote instance to each Collection Manager pipe-line instance. This mode is enabled by adding a configuration parameter in the <casProcessor> section of the CPE descriptor:
<deploymentParameters>
  <parameter name="service-access" value="exclusive"/>
</deploymentParameters>
If this is not specified, the "random" mode is used.
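The difference between the two modes can be sketched as two selection strategies over the same list of registered instances. The class ServiceAssignmentSketch, its method names, and the host names are hypothetical, and the choice to fail when no unassigned instance remains is an assumption of this sketch, not documented Vinci or CPM behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical model of "random" versus "exclusive" service assignment. */
class ServiceAssignmentSketch {
    private final List<String> instances;                 // one entry per host
    private final Random random;
    private final Map<Integer, String> dedicated = new HashMap<>();

    ServiceAssignmentSketch(List<String> instances, Random random) {
        this.instances = new ArrayList<>(instances);
        this.random = random;
    }

    /** "random" mode: any registered instance may serve any request. */
    String pickRandom() {
        return instances.get(random.nextInt(instances.size()));
    }

    /** "exclusive" mode: each pipeline is given its own instance and keeps it. */
    String pickExclusive(int pipelineId) {
        String host = dedicated.get(pipelineId);
        if (host == null) {
            if (dedicated.size() >= instances.size()) {
                // Assumption of this sketch: fail rather than share an instance.
                throw new IllegalStateException("no unassigned instance left");
            }
            host = instances.get(dedicated.size());
            dedicated.put(pipelineId, host);
        }
        return host;
    }

    public static void main(String[] args) {
        ServiceAssignmentSketch s = new ServiceAssignmentSketch(
                Arrays.asList("hostA", "hostB"), new Random());
        System.out.println("random: " + s.pickRandom());
        System.out.println("pipeline 0: " + s.pickExclusive(0));
        System.out.println("pipeline 1: " + s.pickExclusive(1));
    }
}
```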
In addition, remote UIMA engine services can be started with a parameter that specifies the number of instances the service should support (see the <parameter name="numInstances"> XML element in the remote deployment descriptor). Specifying more than one causes the service wrapper for the analysis engine to use multi-threading within the single Java Virtual Machine, which can take advantage of multi-processor and hyper-threaded architectures.
The UIMA SDK v2.0 supports remote monitoring of Analysis Engine performance via the Java Management Extensions (JMX) API. JMX is a standard part of the Java Runtime Environment v5.0. When you run a UIMA application under Java 5.0, the UIMA framework will automatically detect the presence of JMX and will register MBeans that provide access to the performance statistics.
To enable remote monitoring of these performance statistics, when you start your UIMA application specify the following JVM parameter:
-Dcom.sun.management.jmxremote
Now you can use any JMX client to view the statistics. JDK 5.0 provides a standard client that you can use. Simply open a command prompt, make sure the JDK 5.0 bin directory is in your path, and execute the jconsole command. This should bring up the following window:
Here you can choose from among your JMX-enabled applications that are currently running. Select a UIMA application from the list and click "Connect". The next screen will show a summary of information about the Java process that you connected to. Click on the "MBeans" tab, then expand "com.ibm.uima" in the tree at the left. You should see a view like this:
Each of the nodes under "com.ibm.uima" in the tree represents one of the UIMA Analysis Engines in the application that you connected to. You can select one of the analysis engines to view its performance statistics in the view at the right.
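The same MBeans can also be listed programmatically from inside the application's own JVM, using the standard javax.management API. The sketch below assumes that the "com.ibm.uima" tree node corresponds to a JMX domain of that name; the class UimaMBeanLister is a hypothetical helper, and no attribute names are assumed.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Hypothetical helper that lists MBeans registered in the local JVM. */
class UimaMBeanLister {
    /** Returns the canonical names of all MBeans registered under the given domain. */
    static Set<String> namesInDomain(String domain) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            Set<String> names = new TreeSet<>();
            for (ObjectName name : server.queryNames(new ObjectName(domain + ":*"), null)) {
                names.add(name.getCanonicalName());
            }
            return names;
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("invalid JMX domain: " + domain, e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the jconsole tree node "com.ibm.uima" is the JMX domain name.
        for (String name : namesInDomain("com.ibm.uima")) {
            System.out.println(name);
        }
    }
}
```

Run in a JVM where no UIMA Analysis Engine has been created, the listing for that domain is simply empty.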
Probably the most useful statistic is "CASes Per Second", which is the number of CASes that this AE has processed divided by the amount of time spent in the AE's process method, in seconds. Note that this is the total elapsed time, not CPU time. Even so, it can be useful to compare the "CASes Per Second" numbers of all of your Analysis Engines to discover where the bottlenecks occur in your application.
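As a worked example of this statistic, the following sketch computes the rate from a CAS count and an elapsed time in milliseconds, then picks the engine with the lowest rate as the likely bottleneck. The class ThroughputSketch, the engine names, and the numbers are made up for illustration, and returning 0 when no time has been recorded is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for comparing per-engine throughput figures. */
class ThroughputSketch {
    /** CASes processed divided by elapsed seconds; 0 if no time was recorded. */
    static double casesPerSecond(long casCount, long analysisTimeMillis) {
        if (analysisTimeMillis <= 0) {
            return 0.0;                       // assumption: avoid dividing by zero
        }
        return casCount / (analysisTimeMillis / 1000.0);
    }

    /** Name of the engine with the lowest rate, or null if the map is empty. */
    static String slowest(Map<String, Double> rateByEngine) {
        String slowest = null;
        double min = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Double> e : rateByEngine.entrySet()) {
            if (e.getValue() < min) {
                min = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }

    public static void main(String[] args) {
        Map<String, Double> rates = new LinkedHashMap<>();
        rates.put("Tokenizer", casesPerSecond(500, 2000));              // 250 per second
        rates.put("PersonTitleAnnotator", casesPerSecond(500, 12500));  // 40 per second
        System.out.println("bottleneck candidate: " + slowest(rates));
    }
}
```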
The AnalysisTime, BatchProcessCompleteTime, and CollectionProcessCompleteTime properties show the total elapsed time, in milliseconds, that has been spent in the AnalysisEngine's process(), batchProcessComplete(), and collectionProcessComplete() methods, respectively. (Note that for CAS Multipliers, time spent in the hasNext() and next() methods is also counted towards the AnalysisTime.)
Note that once your UIMA application terminates, you can no longer view the statistics through the JMX console. If you want to use JMX to view processes that have completed, you will need to write your application so that the JVM remains running after processing completes, waiting for some user signal before terminating.
For information on JMX see http://java.sun.com/j2se/1.5.0/docs/api/javax/management/package-summary.html#package_description.