Application Developer’s Guide

This chapter describes how to develop an application using the Unstructured Information Management Architecture (UIMA). The term application describes a program that provides end-user functionality. A UIMA application incorporates one or more UIMA components such as Analysis Engines, Collection Processing Engines, a Search Engine, and/or a Document Store and adds application-specific logic and user interfaces.

An application developer's starting point for accessing UIMA framework functionality is the com.ibm.uima.UIMAFramework class. The following is a short introduction to some important methods on this class. Several of these methods are used in examples in the rest of this chapter. For more details, see the JavaDocs (in the docs/api directory of the UIMA SDK).

  • UIMAFramework.getXMLParser(): Returns an instance of the UIMA XML Parser class, which then can be used to parse the various types of UIMA component descriptors. Examples of this can be found in the remainder of this chapter.
  • UIMAFramework.produceXXX(ResourceSpecifier): There are various produce methods that are used to create different types of UIMA components from their descriptors. The argument type, ResourceSpecifier, is the base interface that subsumes all types of component descriptors in UIMA. You can get a ResourceSpecifier from the XMLParser. Examples of produce methods are:
    • produceAnalysisEngine
    • produceCasConsumer
    • produceCasInitializer
    • produceCollectionProcessingEngine
    • produceCollectionReader
    • There are other variations of each of these methods that take additional, optional arguments. See the JavaDocs for details.
  • UIMAFramework.getLogger(<optional-logger-name>): Gets a reference to the UIMA Logger, to which you can write log messages. If no logger name is passed, the name of the returned logger instance is "com.ibm.uima".
  • UIMAFramework.getVersionString(): Gets the number of the UIMA version you are using.
  • UIMAFramework.newDefaultResourceManager(): Gets an instance of the UIMA ResourceManager. The key method on ResourceManager is setDataPath, which allows you to specify the location where UIMA components will go to look for their external resource files. Once you've obtained and initialized a ResourceManager, you can pass it to any of the produceXXX methods.

This section describes how to add analysis capability to your application by using Analysis Engines developed using the UIMA SDK. An Analysis Engine (AE) is a component that analyzes artifacts (e.g. documents) and infers information about them.

An Analysis Engine consists of two parts: Java classes (typically packaged as one or more JAR files) and AE descriptors (one or more XML files). You must put the Java classes in your application’s class path, but thereafter you will not need to directly interact with them. The UIMA framework insulates you from this by providing a standard AnalysisEngine interface.

The term Text Analysis Engine (TAE) is sometimes used to describe an Analysis Engine that analyzes a text document. In the UIMA SDK v1.x, there was a TextAnalysisEngine interface that was commonly used. However, as of the UIMA SDK v2.0, this interface has been deprecated and all applications should switch to using the standard AnalysisEngine interface.

The AE descriptor XML files contain the configuration settings for the Analysis Engine as well as a description of the AE’s input and output requirements. You may need to edit these files in order to configure the AE appropriately for your application - the supplier of the AE may have provided documentation (or comments in the XML descriptor itself) about how to do this.

Instantiating an Analysis Engine

The following code shows how to instantiate an AE from its XML descriptor:

{
  //get Resource Specifier from XML file or PEAR
  XMLInputSource in = new XMLInputSource("MyDescriptor.xml");
  ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(in);

  //create AE here
  AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
}

The first two lines parse the XML descriptor (for AEs with multiple descriptor files, one of them is the "main" descriptor - the AE documentation should indicate which it is). The result of the parse is a ResourceSpecifier object. The third line of code invokes a static factory method UIMAFramework.produceAnalysisEngine, which takes the specifier and instantiates an AnalysisEngine object.

There is one caveat to using this approach: the Analysis Engine instance that you create will not support multiple threads running through it concurrently. If you need to support this, see section 6.2.6.

Analyzing Text Documents

There are two ways to use the AE interface to analyze documents. You can either use the JCas interface, which is described in detail in Chapter 27, JCas Reference, or you can directly use the CAS interface, which is described in detail in Chapter 26, CAS Reference. Besides text documents, other kinds of artifacts can also be analyzed; see Chapter 8, Annotations, Artifacts, and Sofas for more information.

The basic structure of your application will look similar in both cases:

Using the JCas

{
  //create a JCas, given an Analysis Engine (ae)
  JCas jcas = ae.newDefaultTextJCas();

  // this is shorthand for the following steps:
  //   CAS aCas = ae.newCAS();
  //   CAS aCasView = aCas.createDefaultTextView();
  //   JCas jcas = aCasView.createJCas();

  //analyze a document
  jcas.setDocumentText(doc1text);
  ae.process(jcas);
  doSomethingWithResults(jcas);
  jcas.reset();

  //analyze another document
  jcas.setDocumentText(doc2text);
  ae.process(jcas);
  doSomethingWithResults(jcas);
  jcas.reset();
  ...
  //done
  ae.destroy();
}

Using the CAS

{
  //create a CAS
  CAS aCasView = ae.newDefaultTextCAS();

  // this is shorthand for the following steps:
  //   CAS aCas = ae.newCAS();
  //   CAS aCasView = aCas.createDefaultTextView();

  //analyze a document
  aCasView.setDocumentText(doc1text);
  ae.process(aCasView);
  doSomethingWithResults(aCasView);
  aCasView.reset();

  //analyze another document
  aCasView.setDocumentText(doc2text);
  ae.process(aCasView);
  doSomethingWithResults(aCasView);
  aCasView.reset();
  ...
  //done
  ae.destroy();
}

First, you create the CAS or JCas that you will use. Then, you repeat the following four steps for each document:

  • Put the document text into the CAS or JCas.
  • Call the AE's process method, passing the CAS or JCas as an argument
  • Do something with the results that the AE has added to the CAS or JCas
  • Call the CAS's or JCas's reset() method to prepare for another analysis

Analyzing Non-Text Artifacts

Analyzing non-text artifacts is similar to analyzing text documents. The main difference is that instead of using the setDocumentText method, you need to use the Sofa APIs to create an artifact plus (perhaps multiple) views of it. See Annotations, Artifacts, and Sofas for details.

Accessing Analysis Results using the JCas

See Chapter 27, JCas Reference.

Accessing Analysis Results using the CAS

Analysis results are accessed using the CAS Indexes. You obtain iterators over specified types; the iterator returns the matching elements one at a time from the CAS. For an example of this, see:

  • Chapter 26, CAS Reference
  • The source code for com.ibm.uima.examples.PrintAnnotations, which is in docs\examples\src.
  • The JavaDocs for the com.ibm.uima.cas and com.ibm.uima.cas.text packages.

Multi-threaded Applications

The simplest way to use an AE in a multi-threaded environment is to use the Java synchronized keyword to ensure that only one thread is using an AE at any given time. For example:

public class MyApplication {
  private AnalysisEngine mAnalysisEngine;
  private CAS mCAS;

  public MyApplication() {
    //get Resource Specifier from XML file or PEAR
    XMLInputSource in = new XMLInputSource("MyDescriptor.xml");
    ResourceSpecifier specifier =
        UIMAFramework.getXMLParser().parseResourceSpecifier(in);

    //create Analysis Engine here
    mAnalysisEngine = UIMAFramework.produceAnalysisEngine(specifier);
    mCAS = mAnalysisEngine.newDefaultTextCAS();
  }

  // Assume some other part of your multi-threaded application could
  // call "analyzeDocument" on different threads, asynchronously
  public synchronized void analyzeDocument(String aDoc) {
    //analyze a document
    mCAS.setDocumentText(aDoc);
    mAnalysisEngine.process(mCAS);
    doSomethingWithResults(mCAS);
    mCAS.reset();
  }
  ...
}

Without the synchronized keyword, this application would not be thread-safe. If multiple threads called the analyzeDocument method simultaneously, they would all use the same CAS and clobber each other's results. The synchronized keyword ensures that no more than one thread is executing this method at any given time. For more information on thread synchronization in Java, see http://java.sun.com/docs/books/tutorial/essential/threads/multithreaded.html.

The synchronized keyword ensures thread-safety, but does not allow you to process more than one document at a time. If you need to process multiple documents simultaneously (for example, to make use of a multiprocessor machine), you’ll need to use more than one CAS instance.

Because CAS instances use memory and can take some time to construct, you don't want to create a new CAS instance for each request. Instead, you should use a feature of the UIMA SDK called the CAS Pool, implemented by the type CasPool.

A CAS Pool contains some number of CAS instances (you specify how many when you create the pool). When a thread wants to use a CAS, it checks out an instance from the pool. When the thread is done using the CAS, it must release the CAS instance back into the pool. If all instances are checked out, additional threads will block and wait for an instance to become available. Here is some example code:

public class MyApplication {
  private CasPool mCasPool;
  private AnalysisEngine mAnalysisEngine;

  public MyApplication() {
    //get Resource Specifier from XML file or PEAR
    XMLInputSource in = new XMLInputSource("MyDescriptor.xml");
    ResourceSpecifier specifier =
        UIMAFramework.getXMLParser().parseResourceSpecifier(in);

    //create multithreadable AE that will
    //accept 3 simultaneous requests
    mAnalysisEngine = UIMAFramework.produceAnalysisEngine(specifier,3);

    //create CAS pool with 3 CAS instances
    mCasPool = new CasPool(mAnalysisEngine,3);
  }

  public void analyzeDocument(String aDoc) {
    //check out a CAS instance (argument 0 means no timeout)
    CAS cas = mCasPool.getCas(0);
    try {
      //analyze a document
      cas.setDocumentText(aDoc);
      mAnalysisEngine.process(cas);
      doSomethingWithResults(cas);
    } finally {
      //MAKE SURE we release the CAS instance
      mCasPool.releaseCas(cas);
    }
  }
  ...
}

There is not much more code required here than in the previous example. First, there is one additional parameter to the AnalysisEngine producer, specifying the number of annotator instances to create. Then, instead of creating a single CAS in the constructor, we now create a CasPool containing 3 instances. In the analyzeDocument method, we check out a CAS, use it, and then release it.

Both the UIMA Collection Processing Manager framework and the remote deployment services framework have implementations which use CAS pools in this manner, and thereby relieve the annotator developer of the necessity to make their annotators thread-safe.

  • Frequently, the two numbers (number of CASes, and the number of AEs) will be the same. It would not make sense to have the number of CASes less than the number of AEs – the extra AE instances would always block waiting for a CAS from the pool. It could make sense to have additional CASes, though – if you had other multi-threaded processes that were using the CASes, other than the AEs.

The getCas() method returns a CAS which is not specialized to any particular subject of analysis. To process things other than this, please refer to Annotations, Artifacts, and Sofas.

Note the use of the try...finally block. This is very important, as it ensures that the CAS we have checked out will be released back into the pool, even if the analysis code throws an exception. You should always use try...finally when using the CAS pool; if you do not, you risk exhausting the pool and causing deadlock.

The parameter 0 passed to the CasPool.getCas() method is a timeout value. If this is set to a positive integer, it is the maximum number of milliseconds that the thread will wait for an instance to become available in the pool. If this time elapses, the getCas method will return null, and the application can do something intelligent, like ask the user to try again later. A value of 0 will cause the thread to wait forever.
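The check-out, release, and timeout behavior described above can be illustrated without the UIMA libraries. The following is a minimal sketch only, not the CasPool implementation: SimplePool and SimplePoolDemo are hypothetical classes built on java.util.concurrent.ArrayBlockingQueue, and a StringBuilder stands in for a CAS. The sketch follows the same convention as the text: a timeout of 0 waits indefinitely, and a positive timeout yields null if no instance becomes available in time.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical fixed-size pool; illustrates the pattern only, not UIMA's CasPool.
class SimplePool<T> {
    private final BlockingQueue<T> available;

    SimplePool(int size, Supplier<T> factory) {
        available = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            available.add(factory.get());
        }
    }

    // timeoutMillis == 0 blocks until an instance is free;
    // a positive value returns null if the wait times out.
    T checkOut(long timeoutMillis) throws InterruptedException {
        if (timeoutMillis == 0) {
            return available.take();
        }
        return available.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    void release(T instance) {
        available.add(instance);
    }

    int availableCount() {
        return available.size();
    }
}

class SimplePoolDemo {
    // Checks out an instance, uses it, and always releases it.
    static String analyze(SimplePool<StringBuilder> pool, String doc)
            throws InterruptedException {
        StringBuilder buffer = pool.checkOut(0);
        try {
            buffer.setLength(0);               // clear state left by the previous user
            buffer.append(doc.toUpperCase());  // stand-in for real analysis
            return buffer.toString();
        } finally {
            pool.release(buffer);              // runs even if the work above throws
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SimplePool<StringBuilder> pool = new SimplePool<>(1, StringBuilder::new);
        System.out.println(analyze(pool, "some text"));

        StringBuilder held = pool.checkOut(0);     // the pool is now empty
        StringBuilder second = pool.checkOut(50);  // waits up to 50 ms, then gives up
        System.out.println("second check-out timed out: " + (second == null));
        pool.release(held);
    }
}
```

If the release in the finally block were skipped after an exception, the single instance would never return to the pool and every later check-out with a timeout of 0 would block forever, which is the deadlock the text warns about.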

Using Multiple Analysis Engines (and creating shared CASes)

In most cases, the easiest way to use multiple Analysis Engines from within an application is to combine them into an aggregate AE. For instructions, see section 4.3, Building Aggregate Analysis Engines. Be sure that you understand this method before deciding to use the more advanced feature described in this section.

If you decide that your application does need to instantiate multiple AEs and have those AEs share a single CAS, then you will no longer be able to use the various methods on the AnalysisEngine class that create CASes (or JCases) to create your CAS. This is because these methods create a CAS with a data model specific to a single AE and which therefore cannot be shared by other AEs. Instead, you create a CAS as follows:

Suppose you have two analysis engines, and one CAS Consumer, and you want to create one type system from the merge of all of their type specifications. Then you can do the following:

AnalysisEngineDescription aeDesc1 = UIMAFramework.getXMLParser().parseAnalysisEngineDescription(...);

AnalysisEngineDescription aeDesc2 = UIMAFramework.getXMLParser().parseAnalysisEngineDescription(...);

CasConsumerDescription ccDesc = UIMAFramework.getXMLParser().parseCasConsumerDescription(...);

List list = new ArrayList();

list.add(aeDesc1); list.add(aeDesc2); list.add(ccDesc);

CAS cas = CasCreationUtils.createCas(list);

// once you have this CAS, you need to create the view you want of it, and
// also (optionally) the JCas Interface to it

CAS casView = cas.createView("mySofaName", mime-type);
// (OR) CAS casView = cas.createDefaultTextView();
// (optional) JCas jcas = casView.getJCas();

The CasCreationUtils class takes care of the work of merging the AEs' type systems and producing a CAS for the combined type system. If the type systems are not compatible, an exception will be thrown.
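To make the idea of merging type systems concrete, here is a small sketch that uses no UIMA classes. It is not how CasCreationUtils is implemented. The data model (a map from type name to a map of feature name to range type) and the compatibility rule (a feature declared in two places must have the same range type) are assumptions chosen for illustration, as are the example.* type names.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: type name -> (feature name -> range type name).
class TypeSystemMergeSketch {

    static Map<String, Map<String, String>> merge(
            List<Map<String, Map<String, String>>> typeSystems) {
        Map<String, Map<String, String>> merged = new TreeMap<>();
        for (Map<String, Map<String, String>> typeSystem : typeSystems) {
            for (Map.Entry<String, Map<String, String>> type : typeSystem.entrySet()) {
                Map<String, String> features =
                        merged.computeIfAbsent(type.getKey(), k -> new TreeMap<>());
                for (Map.Entry<String, String> feature : type.getValue().entrySet()) {
                    String existing =
                            features.putIfAbsent(feature.getKey(), feature.getValue());
                    // Assumed rule: one feature may not have two different range types.
                    if (existing != null && !existing.equals(feature.getValue())) {
                        throw new IllegalStateException("Incompatible feature "
                                + type.getKey() + ":" + feature.getKey());
                    }
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> first =
                Map.of("example.RoomNumber", Map.of("building", "uima.cas.String"));
        Map<String, Map<String, String>> second =
                Map.of("example.RoomNumber", Map.of("floor", "uima.cas.Integer"),
                       "example.Meeting", Map.<String, String>of());
        // The merged result holds both types, and RoomNumber has both features.
        System.out.println(merge(List.of(first, second)));
    }
}
```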

Saving CASes to file systems

The UIMA framework provides APIs to save and restore the contents of a CAS to streams. The CASes are stored in an XML format. There are two forms of this format. The preferred form is the XMI form (see Using XMI CAS Serialization). An older format, called XCAS, is also available.

To save an XMI representation of a CAS, use the com.ibm.uima.util.XmlCasSerializer class. To save an XCAS representation of a CAS, use the method com.ibm.uima.cas.impl.XCASSerializer.serialize; see the JavaDocs for details.

Both of these external forms can be read back in using the com.ibm.uima.util.XmlCasDeserializer class. It deserializes into a pre-existing CAS, which you must create ahead of time and set up with the proper type system. See the JavaDocs for details.

A Collection Processing Engine (CPE) processes collections of artifacts (documents) through the combination of the following components: a Collection Reader, an optional CAS Initializer, Analysis Engines, and CAS Consumers. Collection Processing Engines and their components are described in Chapter 5, Collection Processing Engine Developer's Guide.

Like Analysis Engines, CPEs consist of a set of Java classes and a set of descriptors. You need to make sure the Java classes are in your classpath, but otherwise you only deal with descriptors.

Running a CPE from a Descriptor

Section 5.3, Running a CPE from Your Own Java Application describes how to use the APIs to read a CPE descriptor and run it from an application.

Configuring a CPE Descriptor Programmatically

For the finest level of control over the CPE descriptor settings, the CPE offers programmatic access to the descriptor via an API. With this API, a developer can create a complete descriptor and then save the result to a file. The API can also be used to read in a descriptor (using XMLParser.parseCpeDescription as shown in the previous section), modify it, and write it back out again. The CPE Descriptor API allows a developer to redefine default behavior related to error handling for each component, turn on checkpointing, change performance characteristics of the CPE, and plug in a custom timer.

Below is some example code that illustrates how this works. See the JavaDocs for package com.ibm.uima.collection.metadata for more details.

//Creates descriptor with default settings
CpeDescription cpe = CpeDescriptorFactory.produceDescriptor();

//Add CollectionReader
cpe.addCollectionReader([descriptor]);

//Add CasInitializer
cpe.addCasInitializer(<cas initializer descriptor>);

// Provide the number of CASes the CPE will use
cpe.setCasPoolsSize(2);

// Define and add Analysis Engine
CpeIntegratedCasProcessor personTitleProcessor =
    CpeDescriptorFactory.produceCasProcessor("Person");

// Provide descriptor for the Analysis Engine
personTitleProcessor.setDescriptor([descriptor]);

//Terminate CPE on max errors
personTitleProcessor.setActionOnMaxError("terminate");

//Increase amount of time in ms the CPE waits for response
//from this Analysis Engine
personTitleProcessor.setTimeout(100000);

//Add Analysis Engine to the descriptor
cpe.addCasProcessor(personTitleProcessor);

// Define and add CAS Consumer
CpeIntegratedCasProcessor consumerProcessor =
    CpeDescriptorFactory.produceCasProcessor("Printer");
consumerProcessor.setDescriptor([descriptor]);

//Define batch size
consumerProcessor.setBatchSize(100);

//Terminate CPE on max errors
consumerProcessor.setActionOnMaxError("terminate");

//Add CAS Consumer to the descriptor
cpe.addCasProcessor(consumerProcessor);

// Add Checkpoint file and define checkpoint frequency (ms)
cpe.setCheckpoint("[path]/checkpoint.dat", 3000);

// Plug in custom timer class used for timing events
cpe.setTimer("com.ibm.uima.reference_impl.util.JavaTimer");

// Define number of documents to process
cpe.setNumToProcess(1000);

// Dump the descriptor to the System.out
((CpeDescriptionImpl)cpe).toXML(System.out);


The CPE descriptor for the above configuration looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<cpeDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <collectionReader>
    <collectionIterator>
      <descriptor>
        <include href="[descriptor]"/>
      </descriptor>
      <configurationParameterSettings>...</configurationParameterSettings>
    </collectionIterator>

    <casInitializer>
      <descriptor>
        <include href="[descriptor]"/>
      </descriptor>
      <configurationParameterSettings>...</configurationParameterSettings>
    </casInitializer>
  </collectionReader>

  <casProcessors casPoolSize="2" processingUnitThreadCount="1">
    <casProcessor deployment="integrated" name="Person">
      <descriptor>
        <include href="[descriptor]"/>
      </descriptor>
      <deploymentParameters/>
      <errorHandling>
        <errorRateThreshold action="terminate" value="100/1000"/>
        <maxConsecutiveRestarts action="terminate" value="30"/>
        <timeout max="100000"/>
      </errorHandling>
      <checkpoint batch="100" time="1000ms"/>
    </casProcessor>

    <casProcessor deployment="integrated" name="Printer">
      <descriptor>
        <include href="[descriptor]"/>
      </descriptor>
      <deploymentParameters/>
      <errorHandling>
        <errorRateThreshold action="terminate" value="100/1000"/>
        <maxConsecutiveRestarts action="terminate" value="30"/>
        <timeout max="100000" default="-1"/>
      </errorHandling>
      <checkpoint batch="100" time="1000ms"/>
    </casProcessor>
  </casProcessors>

  <cpeConfig>
    <numToProcess>1000</numToProcess>
    <deployAs>immediate</deployAs>
    <checkpoint file="[path]/checkpoint.dat" time="3000ms"/>
    <timerImpl>com.ibm.uima.reference_impl.util.JavaTimer</timerImpl>
  </cpeConfig>
</cpeDescription>
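Because the generated descriptor is ordinary XML, it can be inspected with the DOM parser that ships with the JDK, for example to confirm the CAS pool size and the CAS processor names after it has been written out. The sketch below is one possible way to do that and is independent of the UIMA APIs. CpeDescriptorInspector is a hypothetical class, and the embedded XML is an abbreviated copy of the descriptor shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper that reads a few values out of a CPE descriptor.
class CpeDescriptorInspector {
    // Abbreviated copy of the descriptor shown above.
    static final String XML =
        "<cpeDescription xmlns=\"http://uima.apache.org/resourceSpecifier\">"
      + "<casProcessors casPoolSize=\"2\" processingUnitThreadCount=\"1\">"
      + "<casProcessor deployment=\"integrated\" name=\"Person\"/>"
      + "<casProcessor deployment=\"integrated\" name=\"Printer\"/>"
      + "</casProcessors>"
      + "</cpeDescription>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Reads the casPoolSize attribute of the <casProcessors> element.
    static int casPoolSize(Document doc) {
        Element processors = (Element) doc.getElementsByTagName("casProcessors").item(0);
        return Integer.parseInt(processors.getAttribute("casPoolSize"));
    }

    // Collects the name attribute of every <casProcessor> element, in document order.
    static List<String> processorNames(Document doc) {
        List<String> names = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("casProcessor");
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(((Element) nodes.item(i)).getAttribute("name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println("casPoolSize = " + casPoolSize(doc));
        System.out.println("processors  = " + processorNames(doc));
    }
}
```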

Configuration parameters can be set using APIs as well as configured using the XML descriptor metadata specification (see Configuration Parameters).

There are two different places you can set the parameters via the APIs.

  • After reading the XML descriptor for a component, but before you produce the component itself, and
  • After the component has been produced.

Setting the parameters before you produce the component is done using the ConfigurationParameterSettings object. You get an instance of this for a particular component by accessing that component description's metadata. For instance, if you produced a component description by using one of the UIMAFramework.getXMLParser().parse... methods, you can use that component description's getMetaData() method to get the metadata, and then the metadata's getConfigurationParameterSettings method to get the ConfigurationParameterSettings object. Using that object, you can set individual parameters using the setParameterValue method. Here's an example, for a CAS Consumer component:

// Create a description object by reading the XML for the descriptor

CasConsumerDescription casConsumerDesc =
UIMAFramework.getXMLParser().parseCasConsumerDescription(
new XMLInputSource("descriptors/cas_consumer/InlineXmlCasConsumer.xml"));

// get the settings from the metadata
ConfigurationParameterSettings consumerParamSettings =
casConsumerDesc.getMetaData().getConfigurationParameterSettings();

// Set a parameter value
consumerParamSettings.setParameterValue(
InlineXmlCasConsumer.PARAM_OUTPUTDIR, outputDir.getAbsolutePath());

Then you might produce this component using:

CasConsumer component = UIMAFramework.produceCasConsumer(casConsumerDesc);

A side effect of producing a component is calling the component's "initialize" method, allowing it to read its configuration parameters. If you want to change parameters after this, use

component.setConfigParameterValue("<parameter-name>", "<parameter-value>");

and then signal the component to re-read its configuration by calling the component's reconfigure method:

component.reconfigure();

Although these examples are for a CAS Consumer component, the parameter APIs also work for other kinds of components.
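The reason a reconfigure call is needed can be shown with a toy component that, like the components described above, reads its settings once during initialization. This sketch mimics the lifecycle only and uses no UIMA classes. ReconfigureSketch and the "OutputDirectory" parameter name are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical component; it mimics the lifecycle only and uses no UIMA classes.
class ReconfigureSketch {
    private final Map<String, Object> settings = new HashMap<>();
    private String outputDir;   // value cached from the settings

    void setConfigParameterValue(String name, Object value) {
        settings.put(name, value);
    }

    // Called once when the component is produced.
    void initialize() {
        outputDir = (String) settings.get("OutputDirectory");
    }

    // Re-reads the settings after they have been changed.
    void reconfigure() {
        initialize();
    }

    String getOutputDir() {
        return outputDir;
    }

    public static void main(String[] args) {
        ReconfigureSketch component = new ReconfigureSketch();
        component.setConfigParameterValue("OutputDirectory", "out/first");
        component.initialize();

        component.setConfigParameterValue("OutputDirectory", "out/second");
        System.out.println(component.getOutputDir());   // still the value read at initialize
        component.reconfigure();
        System.out.println(component.getOutputDir());   // now the updated value
    }
}
```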

The UIMA SDK includes a search engine that you can use to build a search index that includes the results of the analysis done by your AE. This combination of AEs with a search engine capable of indexing both words and annotations over spans of text enables what UIMA refers to as semantic search.

Semantic search is a search where the semantic intent of the query is specified using one or more entity or relation specifiers. For example, one could specify that they are looking for a person (named) "Bush." Such a query would then not return results about the kind of bushes that grow in your garden.

Indexing

To build a semantic search index using the UIMA SDK, you run a Collection Processing Engine that includes your AE along with a CAS Consumer called the Semantic Search CAS Indexer, which is provided with the UIMA SDK. Your AE must include an annotator that produces Token and Sentence annotations, along with any "semantic" annotations, because the Indexer requires them. The Semantic Search CAS Indexer's descriptor is located at: docs/examples/descriptors/cas_consumer/SemanticSearchCasIndexer.xml.

Configuring the Semantic Search CAS Indexer

Since there are several ways you might want to build a search index from the information in the CAS produced by your AE, you need to supply the Semantic Search CAS Indexer with configuration information in the form of an Index Build Specification file. An example of an index build specification tailored to the AE from the tutorial in Chapter 4, Annotator and Analysis Engine Developer’s Guide is located in docs/examples/descriptors/tutorial/search/MeetingIndexBuildSpec.xml. It looks like this:

<indexBuildSpecification>
  <indexBuildItem>
    <name>com.ibm.uima.examples.tokenizer.Token</name>
    <indexRule>
      <style name="Term"/>
    </indexRule>
  </indexBuildItem>
  <indexBuildItem>
    <name>com.ibm.uima.examples.tokenizer.Sentence</name>
    <indexRule>
      <style name="Breaking"/>
    </indexRule>
  </indexBuildItem>
  <indexBuildItem>
    <name>com.ibm.uima.tutorial.Meeting</name>
    <indexRule>
      <style name="Annotation"/>
    </indexRule>
  </indexBuildItem>
  <indexBuildItem>
    <name>com.ibm.uima.tutorial.RoomNumber</name>
    <indexRule>
      <style name="Annotation">
        <attributeMappings>
          <mapping>
            <feature>building</feature>
            <indexName>building</indexName>
          </mapping>
        </attributeMappings>
      </style>
    </indexRule>
  </indexBuildItem>
  <indexBuildItem>
    <name>com.ibm.uima.tutorial.DateAnnot</name>
    <indexRule>
      <style name="Annotation"/>
    </indexRule>
  </indexBuildItem>
  <indexBuildItem>
    <name>com.ibm.uima.tutorial.TimeAnnot</name>
    <indexRule>
      <style name="Annotation"/>
    </indexRule>
  </indexBuildItem>
</indexBuildSpecification>

The index build specification is a series of index build items, each of which identifies a CAS annotation type (a subtype of uima.tcas.Annotation – see Chapter 26, CAS Reference) and a style.

The first item in this example specifies that the annotation type com.ibm.uima.examples.tokenizer.Token should be indexed with the "Term" style. This means that each span of text annotated by a Token will be considered a single token for standard text search purposes.

The second item in this example specifies that the annotation type com.ibm.uima.examples.tokenizer.Sentence should be indexed with the "Breaking" style. This means that each span of text annotated by a Sentence will be considered a single sentence, which can affect the search engine's algorithm for matching queries. The semantic search engine always requires tokens and sentences in order to index a document.

  • Requirements for Term and Breaking rules: The Semantic Search indexer supplied with the UIMA SDK requires that the items to be indexed as words be designated using the Term rule.

The remaining items all use the "Annotation" style. This indicates that each annotation of the specified types will be stored in the index as a searchable span, with a name equal to the annotation name (without the namespace).

Also, features of annotations can be indexed using the <attributeMappings> subelement. In the example index build specification, we declare that the building feature of the type com.ibm.uima.tutorial.RoomNumber should be indexed. The <indexName> element can be used to map the feature name to a different name in the index, but in this example we have opted to use the same name, building.
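To check which style each annotation type will be indexed with, the specification file can be read with the JDK's DOM parser. The following sketch is independent of the UIMA search APIs. IndexBuildSpecReader is a hypothetical class, and the embedded XML contains two items copied from the example specification. The shortName helper applies the naming rule described above, under which the index name of an "Annotation" item is the annotation name without its namespace.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for an index build specification.
class IndexBuildSpecReader {
    // Two items taken from the example specification.
    static final String SPEC =
        "<indexBuildSpecification>"
      + "<indexBuildItem><name>com.ibm.uima.examples.tokenizer.Token</name>"
      + "<indexRule><style name=\"Term\"/></indexRule></indexBuildItem>"
      + "<indexBuildItem><name>com.ibm.uima.tutorial.RoomNumber</name>"
      + "<indexRule><style name=\"Annotation\"><attributeMappings><mapping>"
      + "<feature>building</feature><indexName>building</indexName>"
      + "</mapping></attributeMappings></style></indexRule></indexBuildItem>"
      + "</indexBuildSpecification>";

    // Returns annotation type name -> style name, in document order.
    static Map<String, String> stylesByType(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> styles = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagName("indexBuildItem");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String type = item.getElementsByTagName("name").item(0).getTextContent().trim();
            Element style = (Element) item.getElementsByTagName("style").item(0);
            styles.put(type, style.getAttribute("name"));
        }
        return styles;
    }

    // Name without the namespace, e.g. "RoomNumber".
    static String shortName(String typeName) {
        return typeName.substring(typeName.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) throws Exception {
        for (Map.Entry<String, String> entry : stylesByType(SPEC).entrySet()) {
            System.out.println(shortName(entry.getKey()) + " -> " + entry.getValue());
        }
    }
}
```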

At the end of the batch or collection, the Semantic Search CAS Indexer builds the index. This index can be queried with simple tokens or with XML tags.

Examples:

  • A query on the word "UIMA" will retrieve all documents that have the occurrence of the word. But a query of the type <Meeting>UIMA</Meeting> will retrieve only those documents that contain a Meeting annotation (produced by our MeetingDetector TAE, for example), where that Meeting annotation contains the word "UIMA".
  • A query for <RoomNumber building="Yorktown"/> will return documents that have a RoomNumber annotation whose building feature contains the term "Yorktown".

More information on the syntax of these kinds of queries, called XML Fragments, can be found in Chapter 28, Semantic Search Engine Reference.
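The difference between a plain word query and a query such as <Meeting>UIMA</Meeting> can be sketched with a toy in-memory model. This is an illustration of the matching idea only: the real engine answers queries from a prebuilt index, and the FragmentQuerySketch class, its Span type, and the sample text are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Toy model for illustration; the real engine works on a prebuilt index.
class FragmentQuerySketch {

    // A labelled span of text, e.g. label "Meeting" over characters [begin, end).
    static final class Span {
        final String label;
        final int begin;
        final int end;

        Span(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    // True if the word appears as a whole token in the text.
    static boolean containsWord(String text, String word) {
        return Arrays.asList(text.split("\\W+")).contains(word);
    }

    // Plain query: the word occurs anywhere in the document.
    static boolean matchesWord(String text, String word) {
        return containsWord(text, word);
    }

    // Query of the form <label>word</label>: the word occurs inside a span with that label.
    static boolean matchesFragment(String text, List<Span> spans, String label, String word) {
        for (Span span : spans) {
            if (span.label.equals(label)
                    && containsWord(text.substring(span.begin, span.end), word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "UIMA is discussed. Lunch meeting in room 12 at noon.";
        int begin = text.indexOf("Lunch");
        List<Span> spans = List.of(new Span("Meeting", begin, text.length()));

        System.out.println(matchesWord(text, "UIMA"));                        // word is in the text
        System.out.println(matchesFragment(text, spans, "Meeting", "UIMA"));  // but outside the span
        System.out.println(matchesFragment(text, spans, "Meeting", "Lunch")); // inside the span
    }
}
```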

For more information on the Index Build Specification format, see the UIMA JavaDocs for class com.ibm.uima.search.IndexBuildSpecification. The JavaDocs are in the docs/api directory of the UIMA SDK.

Building and Running a CPE including the Semantic Search CAS Indexer

The following steps illustrate how to build and run a CPE that uses the UIMA Meeting Detector TAE and the Simple Token and Sentence Annotator, discussed in Chapter 4, Annotator and Analysis Engine Developer’s Guide, along with the Semantic Search CAS Indexer. The resulting index allows you to query for documents based not only on textual content but also on whether they contain mentions of Meetings detected by the TAE.

Run the CPE Configurator tool by executing the cpeGui shell script in the bin directory of the UIMA SDK. (For instructions on using this tool, see Chapter 13, Collection Processing Engine Configurator User's Guide.)

In the CPE Configurator tool, select the following components by browsing to their descriptors:

  • Collection Reader: %UIMA_HOME%/docs/examples/descriptors/collectionReader/FileSystemCollectionReader.xml
  • Analysis Engines: include both of these; one produces the tokens and sentences that the indexer requires in all cases, and the other produces the meeting annotations of interest.
    • %UIMA_HOME%/docs/examples/descriptors/analysis_engine/SimpleTokenAndSentenceAnnotator.xml
    • %UIMA_HOME%/docs/examples/descriptors/tutorial/ex6/UIMAMeetingDetectorTAE.xml
  • Two CAS Consumers:
    • %UIMA_HOME%/docs/examples/descriptors/cas_consumer/SemanticSearchCasIndexer.xml
    • %UIMA_HOME%/docs/examples/descriptors/cas_consumer/XCasWriterCasConsumer.xml

Set up parameters:

  • Set the File System Collection Reader's "Input Directory" parameter to point to the %UIMA_HOME%/docs/examples/data directory.
  • Set the Semantic Search CAS Indexer's "Indexing Specification Descriptor" parameter to point to %UIMA_HOME%/docs/examples/descriptors/tutorial/search/MeetingIndexBuildSpec.xml
  • Set the Semantic Search CAS Indexer's "Index Dir" parameter to the directory into which you want the indexer to write its index files.
    WARNING: The Indexer erases old versions of the files it creates in this directory.
  • Set the XCAS Writer CAS Consumer's "Output Directory" parameter to the directory in which you want to store the XCAS files containing the results of your analysis for each document.

Click on the Run Button. Once the run completes, a statistics dialog should appear, in which you can see how much time was spent in each of the components involved in the run.

Semantic Search Query Tool

The UIMA SDK contains a simple tool for running queries against a semantic search index. After building an index as described in the previous section, you can launch this tool from the command prompt by running the semanticSearch shell script, found in the /bin subdirectory of the UIMA install. If you are using Eclipse and have installed the UIMA examples, there will be a Run configuration called UIMA Semantic Search that you can use to launch it conveniently. This will display the following screen:

Configure the first three fields on this screen as follows:

  • Set the "Index Directory" to the directory where you built your index. This is the same value that you supplied for the "Index Dir" parameter of the Semantic Search CAS Indexer in the CPE Configurator.
  • Set the "XCAS Directory" to the directory where you stored the XCAS files containing the results of your analysis. This is the same value that you supplied for the "Output Directory" parameter of XCAS Writer CAS Consumer in the CPE Configurator.
  • Set the "Type System Descriptor" to the location of the descriptor that describes your type system. For this example, this will be %UIMA_HOME%/docs/examples/descriptors/tutorial/ex4/TutorialTypeSystem.xml

Now, in the "XML Fragments" field, you can type in single words or XML queries where the XML tags correspond to the labels in the index build specification file (e.g. <Meeting>UIMA</Meeting>). XML Fragments are described in Chapter 28, Semantic Search Engine Reference.

After you enter a query and click the "Search" button, a list of hits will appear. Select one of the documents and click "View Analysis" to view the document in the UIMA Annotation Viewer.

The source code for the Semantic Search query program is in docs/examples/src/com/ibm/uima/examples/search/SemanticSearchGUI.java. A simple command-line query program is also provided in docs/examples/src/com/ibm/uima/examples/search/SemanticSearch.java. Using these as a model, you can build a query interface from your own application. For details on the Semantic Search Engine query language and interface, see Chapter 28, Semantic Search Engine Reference.

The UIMA SDK allows you to easily take any Analysis Engine or CAS Consumer and deploy it as a service. That Analysis Engine or CAS Consumer can then be called from a remote machine.

The UIMA SDK provides support for two communications protocols:

  • SOAP, the standard Web Services protocol
  • Vinci, an IBM-developed, lightweight version of SOAP

The UIMA framework can make use of these services in two different ways:

  1. An Analysis Engine can create a proxy to a remote service; this proxy acts like a local component, but connects to the remote service. The proxy has limited error handling and retry capabilities. Both Vinci and SOAP are supported.
  2. A Collection Processing Engine can specify non-Integrated mode (see Deploying a CPE). The CPE provides more extensive error recovery capabilities. This mode only supports the Vinci communications protocol.

How to Deploy a UIMA Component as a SOAP Web Service

To deploy a UIMA component as a SOAP Web Service, you first need to install the following software components:

  • Apache Tomcat
  • Apache Axis

Later versions of these components will likely also work, but have not been tested.

Next, complete the following setup steps:

  • Set the CATALINA_HOME environment variable to the location where Tomcat is installed.
  • Copy all of the JAR files from %UIMA_HOME%/lib to %CATALINA_HOME%/webapps/axis/WEB-INF/lib in your installation.
  • Copy the JAR files for the UIMA components that you wish to deploy to %CATALINA_HOME%/webapps/axis/WEB-INF/lib in your installation.
  • IMPORTANT: any time you add JAR files to Tomcat (for instance, in the two steps above), you must shut down and restart Tomcat before it "notices" them. So now, please shut down and restart Tomcat.
  • All the Java classes for the UIMA examples are packaged in the uima_examples.jar file, which is included in the %UIMA_HOME%/lib folder.
  • In addition, if an annotator needs to locate resource files in the classpath, those resources must be available in the Axis classpath, so copy them to %CATALINA_HOME%/webapps/axis/WEB-INF/classes.

    As an example, if you are deploying the GovernmentTitleRecognizer (found in docs/examples/descriptors/analysis_engine/GovernmentOfficialRecognizer_RegEx_TAE) as a SOAP service, you need to copy the file docs/examples/resources/GovernmentTitlePatterns.dat into .../WEB-INF/classes.
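The JAR-copying steps above can also be scripted. The following is a minimal sketch in plain Java, not part of the UIMA SDK; the class name AxisLibInstaller and its copyJars method are hypothetical, and it assumes that the UIMA_HOME and CATALINA_HOME environment variables are set as described above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not a UIMA class): copies JAR files into the Axis web application.
final class AxisLibInstaller {

    /** Copies every *.jar file in sourceDir into targetDir and returns the number copied. */
    static int copyJars(Path sourceDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        int copied = 0;
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(sourceDir, "*.jar")) {
            for (Path jar : jars) {
                Files.copy(jar, targetDir.resolve(jar.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        String uimaHome = System.getenv("UIMA_HOME");
        String catalinaHome = System.getenv("CATALINA_HOME");
        if (uimaHome == null || catalinaHome == null) {
            System.err.println("Set UIMA_HOME and CATALINA_HOME first.");
            return;
        }
        Path axisLib = Paths.get(catalinaHome, "webapps", "axis", "WEB-INF", "lib");
        int n = copyJars(Paths.get(uimaHome, "lib"), axisLib);
        System.out.println("Copied " + n + " JAR files; now shut down and restart Tomcat.");
    }
}
```

The same copyJars call can be pointed at the directory holding your own component JARs. Resource files such as GovernmentTitlePatterns.dat still need to be copied to WEB-INF/classes separately.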

Test your installation of Tomcat and Axis by starting Tomcat and going to http://localhost:8080/axis/happyaxis.jsp in your browser. Check to be sure that this reports that all of the required Axis libraries are present. One common missing file may be activation.jar, which you can get from java.sun.com.

After completing these setup instructions, you can deploy Analysis Engines or CAS Consumers as SOAP web services by using the deploytool utility, which is located in the /bin directory of the UIMA SDK. deploytool is a command line utility that takes as an argument a web services deployment descriptor (WSDD file); example WSDD files are provided in the docs/examples/deploy/soap directory of the UIMA SDK. Deployment descriptors are provided for deploying and undeploying some of the example Analysis Engines that come with the SDK.

As an example, the WSDD file for deploying the example Person Title annotator looks like this (the parts you will need to change are the deployment name, the service name, and the resource specifier path):

<deployment name="PersonTitleAnnotator"
    xmlns="http://xml.apache.org/axis/wsdd/"
    xmlns:java="http://xml.apache.org/axis/wsdd/providers/java">
  <service name="urn:PersonTitleAnnotator" provider="java:RPC">
    <parameter name="scope" value="Request"/>
    <parameter name="className" value="com.ibm.uima.reference_impl.analysis_engine.service.soap.AxisAnalysisEngineService_impl"/>
    <parameter name="allowedMethods" value="getMetaData process"/>
    <parameter name="allowedRoles" value="*"/>
    <parameter name="resourceSpecifierPath" value="C:/Program Files/apache/uima/docs/examples/descriptors/analysis_engine/PersonTitleAnnotator.xml"/>
    <parameter name="numInstances" value="3"/>
    <parameter name="timeoutPeriod" value="30000"/>
    <!-- Type Mappings omitted from this document; you will not need to edit them. -->
    <typeMapping .../>
    <typeMapping .../>
    <typeMapping .../>
  </service>
</deployment>

To modify this WSDD file to deploy your own Analysis Engine or CAS Consumer, just replace the deployment name, the service name, and the resource specifier path with values appropriate for your component.
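If you maintain several deployment descriptors, the three substitutions can be applied programmatically. The sketch below uses only the standard Java XML APIs; WsddCustomizer and its customize method are hypothetical names, and it assumes the WSDD has the shape shown above (a root deployment element with a name attribute, containing one service element with a resourceSpecifierPath parameter).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not a UIMA class): rewrites the three values that differ per component.
final class WsddCustomizer {

    static String customize(String wsddXml, String deploymentName,
                            String serviceName, String specifierPath) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(wsddXml)));

        // 1. Deployment name: the name attribute of the root element.
        doc.getDocumentElement().setAttribute("name", deploymentName);

        // 2. Service name: the name attribute of the first <service> element.
        Element service = (Element) doc.getElementsByTagNameNS("*", "service").item(0);
        service.setAttribute("name", serviceName);

        // 3. Resource specifier path: the value of the matching <parameter>.
        NodeList params = service.getElementsByTagNameNS("*", "parameter");
        for (int i = 0; i < params.getLength(); i++) {
            Element param = (Element) params.item(i);
            if ("resourceSpecifierPath".equals(param.getAttribute("name"))) {
                param.setAttribute("value", specifierPath);
            }
        }

        // Serialize the modified document back to text; everything else is left as it was.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Because only three attributes are touched, the type mappings and the remaining parameters pass through unchanged.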

The timeoutPeriod parameter is used only when there are multiple clients accessing the service. When a new request comes in and the service is busy with other requests (all instances are busy, in the case where it has multiple instances), the request waits for an instance to become available; this parameter specifies the maximum time for that wait. If the wait takes longer than this, the service wrapper throws an exception back to the client and aborts the processing for this document on the service.
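The interplay of numInstances and timeoutPeriod can be pictured as a bounded pool with a timed wait. The following sketch only illustrates that behavior and is not the UIMA service wrapper's actual implementation; the class BoundedInstancePool is hypothetical, and treating the timeout as milliseconds is an assumption.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustration only (not UIMA code): at most numInstances requests run at once,
// and a waiting request gives up after the timeout.
final class BoundedInstancePool {
    private final Semaphore freeInstances;
    private final long timeoutMillis; // assumption: the timeout is expressed in milliseconds

    BoundedInstancePool(int numInstances, long timeoutMillis) {
        this.freeInstances = new Semaphore(numInstances, true);
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs the work on a free instance, or fails if none becomes free in time. */
    <T> T call(Supplier<T> work) throws InterruptedException {
        if (!freeInstances.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException(
                    "No instance became available within " + timeoutMillis + " ms");
        }
        try {
            return work.get();
        } finally {
            freeInstances.release(); // hand the instance to the next waiting request
        }
    }
}
```

With a single client the wait never happens, which is why the parameter matters only when several clients share the service.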

To deploy the Person Title annotator service, issue the following command:

C:\Program Files\IBM\uima>bin\deploytool docs\examples\deploy\soap\Deploy_PersonTitleAnnotator.wsdd

Test whether the deployment was successful by starting a browser, pointing it to your Tomcat installation's "axis" webpage (e.g., http://localhost:8080/axis), and clicking on the List link. This should bring up a page that shows the deployed services, where you should see the service you just deployed.

The other components can be deployed by replacing Deploy_PersonTitleAnnotator.wsdd with one of the other Deploy descriptors in the deploy directory. The deploytool utility can also undeploy services when passed one of the Undeploy descriptors.

Note: The deploytool shell script assumes that the web services are to be installed at http://localhost:8080/axis. If this is not the case, you will need to update the shell script appropriately.

Once you have deployed your component as a web service, you may call it from a remote machine. See "How to Call a UIMA Service," below, for instructions.

How to Deploy a UIMA Component as a Vinci Service

There are no software prerequisites for deploying a Vinci service. The necessary libraries are part of the UIMA SDK. However, before you can use Vinci services you need to deploy the Vinci Naming Service (VNS), as described in section 6.6.5.

To deploy a service, you have to ensure that any components you want to include can be found on the class path. One way to do this is to set the environment variable UIMA_CLASSPATH to the set of class paths you need for any included components. Then run the startVinciService shell script, which is located in the UIMA SDK bin directory, and pass it the path to a Vinci deployment descriptor, for example:

C:\UIMA>bin\startVinciService docs\examples\deploy\vinci\Deploy_PersonTitleAnnotator.xml

This example deployment descriptor looks like:

<deployment name="Vinci Person Title Annotator Service">
  <service name="uima.annotator.PersonTitleAnnotator" provider="vinci">
    <parameter name="serializerClassName" value="com.ibm.uima.reference_impl.analysis_engine.service.vinci.VinciXCASSerializer_NoDocText"/>
    <parameter name="resourceSpecifierPath" value="C:/Program Files/apache-uima/docs/examples/descriptors/analysis_engine/PersonTitleAnnotator.xml"/>
    <parameter name="numInstances" value="1"/>
    <parameter name="timeoutPeriod" value="30000"/>
    <parameter name="serverSocketTimeout" value="120000"/>
  </service>
</deployment>

To modify this deployment descriptor to deploy your own Analysis Engine or CAS Consumer, just replace the deployment name, the service name, and the resource specifier path with values appropriate for your component.

The timeoutPeriod parameter is used only when there are multiple clients accessing the service. When a new request comes in and the service is busy with other requests (all instances are busy, in the case where it has multiple instances), the request waits for an instance to become available; this parameter specifies the maximum time for that wait. If the wait takes longer than this, the service wrapper throws an exception back to the client and aborts the processing for this document on the service.

The serverSocketTimeout parameter specifies the number of milliseconds (default = 5 minutes) that the service will wait between requests to process something. After this amount of time, the server presumes that the client may have gone away and "cleans up", releasing any resources it is holding. The next call to process on the service causes the client to re-establish its connection with the service, at the cost of some additional overhead.

The startVinciService script takes two additional optional parameters. The first one overrides the value of the VNS_HOST environment variable, allowing you to specify the name server to use. The second parameter, if specified, must be a non-negative number that is unique on this server and identifies the instance of this service. When used, this number allows multiple instances of the same named service to be started on one server; they will all register with the Vinci Naming Service and be made available to client requests.

Once you have deployed your component as a Vinci service, you may call it from a remote machine. See "How to Call a UIMA Service," below, for instructions.

How to Call a UIMA Service

Once an Analysis Engine or CAS Consumer has been deployed as a service, it can be used from any UIMA application, in the exact same way that a local Analysis Engine or CAS Consumer is used. For example, you can call an Analysis Engine service from the Document Analyzer or use the CPE Configurator to build a CPE that includes Analysis Engine and CAS Consumer services.

To do this, you use a service client descriptor in place of the usual Analysis Engine or CAS Consumer Descriptor. A service client descriptor is a simple XML file that indicates the location of the remote service and a few parameters. Example service client descriptors are provided in the UIMA SDK under the directories docs/examples/descriptors/soapService and docs/examples/descriptors/vinciService. The contents of these descriptors are explained below.

Also, before you can call a SOAP service, you need to have the necessary Axis JAR files in your classpath. If you use any of the scripts in the /bin directory of the UIMA installation to launch your application, such as documentAnalyzer, these JARs are added to the classpath automatically, using the CATALINA_HOME environment variable. The required files are the following (all part of the Apache Axis download):

  • activation.jar
  • axis.jar
  • commons-discovery.jar
  • commons-logging.jar
  • jaxrpc.jar
  • saaj.jar

SOAP Service Client Descriptor

The descriptor used to call the PersonTitleAnnotator SOAP service from the example above is:

<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
  <resourceType>AnalysisEngine</resourceType>
  <uri>http://localhost:8080/axis/services/urn:PersonTitleAnnotator</uri>
  <protocol>SOAP</protocol>
</uriSpecifier>

The <resourceType> element must contain either AnalysisEngine or CasConsumer. This specifies what type of component you expect to be at the specified service address.

The <uri> element describes which service to call. It specifies the host (localhost, in this example) and the service name (urn:PersonTitleAnnotator), which must match the name specified in the deployment descriptor used to deploy the service.
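To make the relationship between the host, the service name, and the <uri> element concrete, here is a small sketch that assembles a SOAP service client descriptor as text. SoapClientDescriptor is a hypothetical helper, not a UIMA class; the /axis/services/ path and the namespace are copied from the example above, and the port is passed in as an assumption about where Tomcat listens.

```java
// Hypothetical helper (not a UIMA class): builds the text of a SOAP service client descriptor.
final class SoapClientDescriptor {

    static String build(String resourceType, String host, int port, String serviceName) {
        // The descriptor format allows only these two component types.
        if (!"AnalysisEngine".equals(resourceType) && !"CasConsumer".equals(resourceType)) {
            throw new IllegalArgumentException(
                    "resourceType must be AnalysisEngine or CasConsumer: " + resourceType);
        }
        // serviceName must match the service name in the WSDD used to deploy the service.
        String uri = "http://" + host + ":" + port + "/axis/services/" + serviceName;
        return String.join("\n",
                "<uriSpecifier xmlns=\"http://uima.apache.org/resourceSpecifier\">",
                "  <resourceType>" + resourceType + "</resourceType>",
                "  <uri>" + uri + "</uri>",
                "  <protocol>SOAP</protocol>",
                "</uriSpecifier>");
    }

    public static void main(String[] args) {
        System.out.println(build("AnalysisEngine", "localhost", 8080, "urn:PersonTitleAnnotator"));
    }
}
```

Running main prints a descriptor equivalent to the PersonTitleAnnotator example shown above.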

Vinci Service Client Descriptor

To call a Vinci service, a similar descriptor is used:

<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
  <resourceType>AnalysisEngine</resourceType>
  <uri>uima.annotator.PersonTitleAnnotator</uri>
  <protocol>Vinci</protocol>
  <timeout>60000</timeout>
  <parameters>
    <parameter name="VNS_HOST" value="some.internet.ip.name-or-address"/>
    <parameter name="VNS_PORT" value="9000"/>
  </parameters>
</uriSpecifier>

Note that Vinci uses a centralized naming server, so the host where the service is deployed does not need to be specified. Only a name (uima.annotator.PersonTitleAnnotator) is given, which must match the name specified in the deployment descriptor used to deploy the service.

The host and/or port where your Vinci Naming Service (VNS) server is running can be specified by the optional <parameter> elements. If not specified there, the values are taken from the -DVNS_HOST=<host> and -DVNS_PORT=<port> system arguments on your Java command line, if present. If they are not specified on the Java command line either, defaults are used: localhost for the VNS_HOST, and 9000 for the VNS_PORT. See the next section for details on setting up a VNS server.
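The lookup order just described (descriptor parameter, then Java system property, then default) can be summarized in a few lines. This sketch mirrors the documented precedence only; VnsSettings is a hypothetical class and does not reflect how the UIMA framework implements the lookup internally.

```java
import java.util.Map;

// Illustration only (not UIMA code): documented precedence for VNS_HOST and VNS_PORT.
final class VnsSettings {

    /** Descriptor parameter first, then -D system property, then the default. */
    static String resolve(String key, Map<String, String> descriptorParams, String defaultValue) {
        String fromDescriptor = descriptorParams.get(key);
        if (fromDescriptor != null) {
            return fromDescriptor;
        }
        return System.getProperty(key, defaultValue);
    }

    static String host(Map<String, String> descriptorParams) {
        return resolve("VNS_HOST", descriptorParams, "localhost");
    }

    static int port(Map<String, String> descriptorParams) {
        return Integer.parseInt(resolve("VNS_PORT", descriptorParams, "9000"));
    }
}
```

For example, with no <parameter> elements and no -D arguments, host returns localhost and port returns 9000.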

Restrictions on remotely deployed services

Remotely deployed services are started on remote machines, using UIMA component descriptors on those remote machines. These descriptors supply any configuration and resource parameters for the service (configuration parameters are not transmitted from the calling instance to the remote one). Likewise, the remote descriptors supply the type system specification for the remote annotators that will be run (the type system of the calling instance is not transmitted to the remote one).

The remote service wrapper, when it receives a CAS from the caller, instantiates it for the remote service, making instances of all types that the remote service specifies. Instances in the incoming CAS of types for which the remote service has no type specification are kept aside, and when the remote service returns the CAS to the caller, these type instances are merged back into the CAS being transmitted. Because of this design, a remote service that doesn't declare a type system won't receive any type instances.

  • This behavior may change in future releases, to one where configuration parameters and / or type systems are transmitted to remote services.

The Vinci Naming Service (VNS)

Vinci consists of components for building network-accessible services, clients for accessing those services, and an infrastructure for locating and managing services. The primary infrastructure component is the Vinci directory, known as VNS (for Vinci Naming Service).

On startup, Vinci services locate the VNS and provide it with information that is used by VNS during service discovery. Each Vinci service provides the name of the host machine on which it runs, and the name of the service. The VNS internally creates a binding for the service name and returns the port number on which the Vinci service will wait for client requests. The VNS stores its bindings on the file system in a file called vns.services.

In Vinci, services are identified by their service name. If there is more than one physical service with the same service name, then Vinci assumes they are equivalent and will route queries to them randomly, provided that they are all running on different hosts. You should therefore use a unique service name if you don't want to conflict with other services listed in whatever VNS you have configured jVinci to use.

Starting VNS

To run the VNS, use the startVNS script found in the /bin directory of the UIMA installation.

  • VNS runs on port 9000 by default so please make sure this port is available. If you see the following exception:

    java.net.BindException: Address already in use: JVM_Bind

    it indicates that another process is running on port 9000. In this case, add the parameter -p <port> to the startVNS command, using <port> to specify an alternative port to use.
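Before starting the VNS, you can check whether the port is already taken. The following is a minimal sketch using only the standard library; PortCheck is a hypothetical helper and is not part of the UIMA SDK or of Vinci.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical helper (not a UIMA class): tests whether a TCP port can be bound locally.
final class PortCheck {

    /** Returns true if a server socket can be opened on the port right now. */
    static boolean isFree(int port) {
        try (ServerSocket probe = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            // Includes java.net.BindException ("Address already in use").
            return false;
        }
    }

    public static void main(String[] args) {
        int port = 9000; // the default VNS port
        System.out.println(isFree(port)
                ? "Port " + port + " is available."
                : "Port " + port + " is in use; start the VNS with -p <port> instead.");
    }
}
```

The check is only a snapshot: another process could still claim the port between the check and the VNS startup.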

When started, the VNS produces output similar to the following:

[10/6/04 3:44 PM | main] WARNING: Config file doesn't exist, creating a new empty config file!
[10/6/04 3:44 PM | main] Loading config file : .\vns.services
[10/6/04 3:44 PM | main] Loading workspaces file : .\vns.workspaces
[10/6/04 3:44 PM | main] ====================================
(WARNING) Unexpected exception:
java.io.FileNotFoundException: .\vns.workspaces (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(Unknown Source)
    at java.io.FileInputStream.<init>(Unknown Source)
    at java.io.FileReader.<init>(Unknown Source)
    at com.ibm.vinci.transport.vns.service.VNS.loadWorkspaces(VNS.java:339)
    at com.ibm.vinci.transport.vns.service.VNS.startServing(VNS.java:237)
    at com.ibm.vinci.transport.vns.service.VNS.main(VNS.java:179)
[10/6/04 3:44 PM | main] WARNING: failed to load workspace.
[10/6/04 3:44 PM | main] VNS Workspace : null
[10/6/04 3:44 PM | main] Loading counter file : .\vns.counter
[10/6/04 3:44 PM | main] Could not load the counter file : .\vns.counter
[10/6/04 3:44 PM | main] Starting backup thread, using files .\vns.services.bak and .\vns.services
[10/6/04 3:44 PM | main] Serving on port : 9000
[10/6/04 3:44 PM | Thread-0] Backup thread started
[10/6/04 3:44 PM | Thread-0] Saving to config file : .\vns.services.bak
>>>>>>>>>>>>> VNS is up and running! <<<<<<<<<<<<<<<<<
>>>>>>>>>>>>> Type 'quit' and hit ENTER to terminate VNS <<<<<<<<<<<<<
[10/6/04 3:44 PM | Thread-0] Config save required 10 millis.
[10/6/04 3:44 PM | Thread-0] Saving to config file : .\vns.services
[10/6/04 3:44 PM | Thread-0] Config save required 10 millis.
[10/6/04 3:44 PM | Thread-0] Saving counter file : .\vns.counter

  • Disregard the java.io.FileNotFoundException: .\vns.workspaces (The system cannot find the file specified). It is only a warning, not a serious problem: VNS Workspace is a feature of the VNS that is not critical. The important line to note is

    [10/6/04 3:44 PM | main] Serving on port : 9000

    which states the actual port on which VNS listens for incoming requests. All Vinci services and all clients connecting to services must provide the VNS port on the command line if the port is not the default, 9000. See the section Launching Vinci Services below for details about the command line and parameters.

VNS Files

The VNS maintains two external files:

  • vns.services
  • vns.counter

These files are generated by the VNS in the directory from which the VNS is launched. Since these files may contain old information, it is best to remove them before starting the VNS. This step ensures that the VNS always has the newest information and will not attempt to connect to a service that has been shut down.
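The cleanup step is easy to automate in a launcher. The sketch below is one way to do it with the standard library; VnsCleanup is a hypothetical helper, and it assumes you pass in the directory from which the VNS will be launched.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not a UIMA class): removes stale VNS state before a fresh start.
final class VnsCleanup {
    // The two files named in the text above.
    private static final String[] STATE_FILES = {"vns.services", "vns.counter"};

    /** Deletes the VNS state files in launchDir, if present, and returns how many were removed. */
    static int removeStaleFiles(Path launchDir) throws IOException {
        int removed = 0;
        for (String name : STATE_FILES) {
            if (Files.deleteIfExists(launchDir.resolve(name))) {
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) throws IOException {
        Path launchDir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("Removed " + removeStaleFiles(launchDir) + " stale VNS file(s).");
    }
}
```

Run it only while the VNS is stopped, since the running VNS periodically rewrites these files.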

Launching Vinci Services

When launching a Vinci service, you must indicate which VNS the service will connect to. A Vinci service is typically started using the startVinciService script, found in the /bin directory of the UIMA installation. The environment variable VNS_HOST should be set to the name or IP address of the machine hosting the Vinci Naming Service. The default is localhost, the machine the service is deployed on. This name can also be passed as the second argument to the startVinciService script. The default port for VNS is 9000, but it can be overridden with the VNS_PORT environment variable.

If you write your own startup script, to define Vinci’s default VNS you must provide the following JVM parameters:

java -DVNS_HOST=localhost -DVNS_PORT=9000 ...

The above setting is for the VNS running on the same machine as the service. Of course one can deploy the VNS on a different machine and the JVM parameter will need to be changed to this:

java -DVNS_HOST=<host> -DVNS_PORT=9000 ...

where <host> is the name or IP address of the machine where the VNS is running.

Note: VNS runs on port 9000 by default. If you see the following exception:

(WARNING) Unexpected exception:
com.ibm.vinci.transport.ServiceDownException: VNS inaccessible: java.net.ConnectException: Connection refused: connect

then either the VNS is not running, or it is running but using a different port. To correct the latter, set the VNS_PORT environment variable to the correct port before starting the service.

To get the right port, check the VNS output for a line similar to the following:

[10/6/04 3:44 PM | main] Serving on port : 9000

It is printed by the VNS on startup.
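If you capture the VNS console output in a file or string, the port can be extracted mechanically. This is a minimal sketch; VnsLog is a hypothetical helper, and the pattern is based solely on the sample "Serving on port" line shown above, so it may need adjusting if your VNS prints a different format.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a UIMA class): finds the port reported in VNS startup output.
final class VnsLog {
    // Matches lines such as "[10/6/04 3:44 PM | main] Serving on port : 9000".
    private static final Pattern SERVING = Pattern.compile("Serving on port\\s*:\\s*(\\d+)");

    /** Returns the first reported port, or empty if the output has no such line. */
    static OptionalInt servingPort(String vnsOutput) {
        Matcher m = SERVING.matcher(vnsOutput);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }
}
```

The result can then be passed to a service as -DVNS_PORT=<port> or exported as the VNS_PORT environment variable.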

There are several ways to exploit parallelism to increase performance in the UIMA Framework. These range from running with additional threads within one Java virtual machine on one host (which might be a multi-processor or hyper-threaded host) to deploying analysis engines on a set of remote machines.

The Collection Processing facility in UIMA provides the ability to scale the pipeline of analysis engines. This scale-out runs multiple threads within the Java virtual machine running the CPM, one for each pipe in the pipeline. To activate it, in the <casProcessors> descriptor element, set the attribute processingUnitThreadCount, which specifies the number of replicated processing pipelines, to a value greater than 1, and ensure that the size of the CAS pool is equal to or greater than this number (the attribute of <casProcessors> to set is casPoolSize). For more details on these settings, see CAS Processors.

For deployments that incorporate remote analysis engines in the Collection Manager pipeline, running on multiple remote hosts, scale-out is supported through the Vinci Naming Service. If multiple instances of a service with the same name, but running on different hosts, are registered with the VNS, it will assign these instances to incoming requests.

There are two modes supported: a "random" assignment, and an "exclusive" one. The "random" mode distributes load using an algorithm that selects a service instance at random. The UIMA framework supports this only for the case where all of the instances are running on unique hosts; the framework does not support starting two or more instances on the same host.

The exclusive mode dedicates a particular remote instance to each Collection Manager pipeline instance. This mode is enabled by adding a configuration parameter in the <casProcessor> section of the CPE descriptor:

<deploymentParameters>
<parameter name="service-access" value="exclusive" />
</deploymentParameters>

If this is not specified, the "random" mode is used.

In addition, remote UIMA engine services can be started with a parameter that specifies the number of instances the service should support (see the <parameter name="numInstances"> XML element in the remote deployment descriptor). Specifying more than one causes the service wrapper for the analysis engine to use multi-threading (within the single Java Virtual Machine, which can take advantage of multi-processor and hyper-threaded architectures).

The UIMA SDK v2.0 supports remote monitoring of Analysis Engine performance via the Java Management Extensions (JMX) API. JMX is a standard part of the Java Runtime Environment v5.0. When you run a UIMA application under Java 5.0, the UIMA framework will automatically detect the presence of JMX and will register MBeans that provide access to the performance statistics.

To enable remote monitoring of these performance statistics, when you start your UIMA application specify the following JVM parameter:

-Dcom.sun.management.jmxremote

Now, you can use any JMX client to view the statistics. JDK 5.0 provides a standard client that you can use. Simply open a command prompt, make sure the JDK 5.0 bin directory is in your path, and execute the jconsole command. This should bring up the following window:

Here you can choose from among your JMX-enabled applications that are currently running. Select a UIMA application from the list and click "Connect". The next screen will show a summary of information about the Java process that you connected to. Click on the "MBeans" tab, then expand "com.ibm.uima" in the tree at the left. You should see a view like this:

Each of the nodes under "com.ibm.uima" in the tree represents one of the UIMA Analysis Engines in the application that you connected to. You can select one of the analysis engines to view its performance statistics in the view at the right.
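The same MBeans can be listed programmatically with the standard JMX API. The sketch below queries an MBean server for everything registered under a "com.ibm.uima" domain; UimaMBeanLister is a hypothetical class, and treating "com.ibm.uima" as the JMX domain name is an assumption based on the jconsole tree described above. As written it inspects the MBean server of the JVM it runs in, so it would have to run inside the UIMA application itself; connecting to a remote JVM requires additional JMX connector setup that is not shown.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper (not a UIMA class): lists MBeans registered under the UIMA domain.
final class UimaMBeanLister {

    /** Returns the names of all MBeans in the "com.ibm.uima" domain (assumed domain name). */
    static Set<ObjectName> uimaMBeans(MBeanServer server) throws MalformedObjectNameException {
        return server.queryNames(new ObjectName("com.ibm.uima:*"), null);
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        for (ObjectName name : uimaMBeans(server)) {
            System.out.println(name);
        }
    }
}
```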

Probably the most useful statistic is "CASes Per Second", which is the number of CASes that this AE has processed divided by the amount of time spent in the AE's process method, in seconds. Note that this is the total elapsed time, not CPU time. Even so, it can be useful to compare the "CASes Per Second" numbers of all of your Analysis Engines to discover where the bottlenecks occur in your application.
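As a worked illustration of that definition, the following sketch computes the ratio from a CAS count and an elapsed time in milliseconds. Throughput is a hypothetical class and the numbers are invented for the example; it shows the arithmetic only and does not read the actual MBean attributes.

```java
// Illustration only (not UIMA code): the "CASes Per Second" arithmetic.
final class Throughput {

    /** CASes processed divided by elapsed seconds spent in process(); 0 if no time was recorded. */
    static double casesPerSecond(long casCount, long analysisTimeMillis) {
        if (analysisTimeMillis <= 0) {
            return 0.0;
        }
        return casCount / (analysisTimeMillis / 1000.0);
    }

    public static void main(String[] args) {
        // Hypothetical figures: 500 CASes with 20,000 ms of elapsed time in process().
        System.out.println(casesPerSecond(500, 20_000)); // prints 25.0
    }
}
```

An Analysis Engine whose figure is much lower than its neighbors' is a likely bottleneck, keeping in mind that the time is elapsed time rather than CPU time.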

The AnalysisTime, BatchProcessCompleteTime, and CollectionProcessCompleteTime properties show the total elapsed time, in milliseconds, that has been spent in the AnalysisEngine's process(), batchProcessComplete(), and collectionProcessComplete() methods, respectively. (Note that for CAS Multipliers, time spent in the hasNext() and next() methods is also counted towards the AnalysisTime.)

Note that once your UIMA application terminates, you can no longer view the statistics through the JMX console. If you want to use JMX to view processes that have completed, you will need to write your application so that the JVM remains running after processing completes, waiting for some user signal before terminating.
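One simple way to keep the JVM alive is to block on console input after processing finishes. The following is a minimal sketch; KeepAlive is a hypothetical class, and the comment in main marks where your own UIMA processing would go.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// Hypothetical helper (not a UIMA class): holds the JVM open until the user presses ENTER.
final class KeepAlive {

    /** Blocks until a line (or end of input) is read from the given stream. */
    static void waitForEnter(InputStream in) throws IOException {
        System.out.println("Processing complete. Press ENTER to exit.");
        new BufferedReader(new InputStreamReader(in)).readLine();
    }

    public static void main(String[] args) throws IOException {
        // ... run your UIMA application's processing here ...
        waitForEnter(System.in); // statistics stay visible in the JMX console until ENTER
    }
}
```

While the program waits, the MBeans remain registered, so the final statistics can still be inspected.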

For information on JMX see http://java.sun.com/j2se/1.5.0/docs/api/javax/management/package-summary.html#package_description.