Annotator and Analysis Engine Developer’s Guide

This chapter describes how to develop UIMA type systems, Annotators and Analysis Engines using the UIMA SDK. It is helpful to read the UIMA Conceptual Overview chapter first for a review of these concepts.

An Analysis Engine (AE) is a program that analyzes artifacts (e.g. documents) and infers information from them. A Text Analysis Engine (TAE) is a specialization of an Analysis Engine that analyzes text documents; Analysis Engines in general may also handle other kinds of artifacts, such as audio streams.

In the UIMA SDK, Analysis Engines are constructed from building blocks called Annotators. An annotator is a component that contains analysis logic. Annotators analyze an artifact (for example, a text document) and create additional data (metadata) about that artifact. It is a goal of UIMA that annotators need not be concerned with anything other than their analysis logic – for example the details of their deployment or their interaction with other annotators.

An Analysis Engine (AE) may contain a single annotator (this is referred to as a Primitive AE), or it may be a composition of others and therefore contain multiple annotators (this is referred to as an Aggregate AE). Primitive and aggregate AEs implement the same interface and can be used interchangeably by applications.

Annotators produce their analysis results in the form of typed Feature Structures, which are simply data structures that have a type and a set of (attribute, value) pairs. An annotation is a particular type of Feature Structure that is attached to a region of the artifact being analyzed (a span of text in a document, for example).

For example, an annotator may produce an Annotation over the span of text President Bush, where the type of the Annotation is Person and the attribute fullName has the value George W. Bush, and its position in the artifact is character position 12 through character position 26.
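The example above can be pictured in plain Java. The following is only a sketch and is not the UIMA CAS API: the class FeatureStructureSketch, its fields, and the sample sentence are hypothetical names chosen for illustration. It shows a feature structure as a type name plus (attribute, value) pairs, and an annotation as a feature structure that also records begin and end offsets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a feature structure; NOT the UIMA CAS API.
class FeatureStructureSketch {
    final String type;                                        // e.g. "Person"
    final Map<String, Object> features = new LinkedHashMap<>(); // (attribute, value) pairs

    FeatureStructureSketch(String type) {
        this.type = type;
    }

    // An annotation is a feature structure attached to a span of the artifact.
    static FeatureStructureSketch annotation(String type, int begin, int end) {
        FeatureStructureSketch fs = new FeatureStructureSketch(type);
        fs.features.put("begin", begin);
        fs.features.put("end", end);
        return fs;
    }

    // Returns the text covered by this annotation-like feature structure.
    String coveredText(String document) {
        int begin = (Integer) features.get("begin");
        int end = (Integer) features.get("end");
        return document.substring(begin, end);
    }

    public static void main(String[] args) {
        String doc = "Last night, President Bush spoke."; // invented sample text
        FeatureStructureSketch person = annotation("Person", 12, 26);
        person.features.put("fullName", "George W. Bush");
        System.out.println(person.type + " over \"" + person.coveredText(doc) + "\"");
    }
}
```

A feature structure that describes the whole document would simply omit the begin and end features.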

It is also possible for annotators to record information associated with the entire document rather than a particular span (these are considered Feature Structures but not Annotations).

All feature structures, including annotations, are represented in the UIMA Common Analysis Structure (CAS). The CAS is the central data structure through which all UIMA components communicate. Included with the UIMA SDK is an easy-to-use, native Java interface to the CAS called the JCas. The JCas represents each feature structure as a Java object; the example feature structure from the previous paragraph would be an instance of a Java class Person with getFullName() and setFullName() methods. Though the examples in this guide all use the JCas, it is also possible to directly access the underlying CAS system; for more information see Chapter 26, CAS Reference.

The remainder of this chapter will refer to the analysis of text documents and the creation of annotations that are attached to spans of text in those documents. Keep in mind that the CAS can represent arbitrary types of feature structures, and feature structures can refer to other feature structures. For example, you can use the CAS to represent a parse tree for a document. Also, the artifact that you are analyzing need not be a text document.

This guide is organized as follows:

Getting Started is a tutorial with step-by-step instructions for how to develop and test a simple UIMA annotator.

Configuration and Logging discusses how to make your UIMA annotator configurable, and how it can write messages to the UIMA log file.

Building Aggregate Analysis Engines describes how annotators can be combined into aggregate analysis engines. It also describes how one annotator can make use of the analysis results produced by an annotator that has run previously.

Other examples

The UIMA SDK includes several other examples you may find interesting, including:

  • SimpleTokenAndSentenceAnnotator – a simple tokenizer and sentence annotator.
  • PersonTitleDBWriterCasConsumer – a sample CAS Consumer which populates a relational database with some annotations. It uses JDBC and in this example, hooks up with the Open Source Apache Derby database.

Additional Topics describes additional features of the UIMA SDK that may help you in building your own annotators and analysis engines.

Common Pitfalls contains some useful guidelines to help you ensure that your annotators will work correctly in any UIMA application.

This guide does not discuss how to build UIMA Applications, which are programs that use Analysis Engines, along with other components, e.g. a search engine, document store, and user interface, to deliver a complete package of functionality to an end-user. For information on application development, see Chapter 6, Application Developer’s Guide.

This section is a step-by-step tutorial that will get you started developing UIMA annotators. All of the files referred to by the examples in this chapter are in the docs/examples directory of the UIMA SDK. This directory is designed to be imported into your Eclipse workspace; see section 3.2, Setting up Eclipse to view Example Code for instructions on how to do this. Also you may wish to refer to the UIMA SDK JavaDocs located in the docs/api directory.

  • In Eclipse 3.1, if you highlight a UIMA class or method defined in the UIMA SDK JavaDocs, you can conveniently have Eclipse open the corresponding JavaDoc for that class or method in a browser, by pressing Shift + F2.

The example annotator that we are going to walk through will detect room numbers for rooms at the IBM T.J. Watson Research Center (where the UIMA SDK originated). There are two Watson buildings: Yorktown and Hawthorne, and each has its own pattern for room numbers. Here are some examples, together with their corresponding regular expression patterns:

Yorktown: 20-001, 31-206, 04-123 (Pattern: ##-[0-2]##)

Hawthorne: GN-K35, 1S-L07, 4N-B21 (Pattern: [G1-4][NS]-[A-Z]##)
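To make the two patterns concrete, here is a small stand-alone sketch that uses only java.util.regex. The regular expressions mirror the ones compiled by the example annotator later in this chapter; the class name RoomPatternSketch, the helper findAll, and the sample sentence are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone illustration; does not depend on the UIMA SDK.
class RoomPatternSketch {
    // \b anchors each match at a word boundary; \d stands for the "#" digit
    // placeholder. Backslashes are doubled because these are Java string literals.
    static final Pattern YORKTOWN = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");
    static final Pattern HAWTHORNE = Pattern.compile("\\b[G1-4][NS]-[A-Z]\\d\\d\\b");

    // Collects every substring of text that matches the given pattern.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Rooms: 20-001, 31-206, GN-K35, 1S-L07 and 4N-B21.";
        System.out.println(findAll(YORKTOWN, text));  // [20-001, 31-206]
        System.out.println(findAll(HAWTHORNE, text)); // [GN-K35, 1S-L07, 4N-B21]
    }
}
```

The word-boundary anchors keep the patterns from firing inside longer tokens; for instance, 31-206x is not reported as a Yorktown room.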

There are several steps to develop and test a simple UIMA annotator.

  1. Define the CAS types that the annotator will use.
  2. Generate the Java classes for these types.
  3. Write the actual annotator Java code.
  4. Create the Analysis Engine descriptor.
  5. Test the annotator.

These steps are discussed in the next sections.

Defining Types

The first step in developing an annotator is to define the CAS Feature Structure types that it creates. This is done in an XML file called a Type System Descriptor. UIMA defines some basic built-in CAS types such as TOP, Boolean, Byte, Short, Integer, Long, Float, Double, Arrays of primitives and FSArray, and Annotation. TOP is the root of the type system, analogous to Object in Java. FSArray is an array of Feature Structures (i.e. an array of instances of TOP).

UIMA includes an Eclipse plug-in that will help you edit Type System Descriptors, so if you are using Eclipse you will not need to worry about the details of the XML syntax. See Chapter 3, UIMA SDK Setup for Eclipse for instructions on setting up Eclipse and installing the plugin.

The Type System Descriptor for our annotator is located in the file descriptors/tutorial/ex1/TutorialTypeSystem.xml. (This and all other examples are located in the docs/examples directory of the UIMA SDK, which can be imported into an Eclipse project for your convenience, as described in 3.2, Setting up Eclipse to view Example Code.)

In Eclipse, expand the uima_examples project in the Package Explorer view, and browse to the file descriptors/tutorial/ex1/TutorialTypeSystem.xml. Right-click on the file in the navigator and select Open With -> Component Descriptor Editor. Once the editor opens, click on the "Type System" tab at the bottom of the editor window. You should see a view such as the following:

Our annotator will need only one type – com.ibm.uima.tutorial.RoomNumber. (We use the same namespace conventions as are used for Java classes.) Just as in Java, types have supertypes. The supertype is listed in the second column of the left table. In this case our RoomNumber annotation extends from the built-in type uima.tcas.Annotation.

Descriptions can be included with types and features. In this example, there is a description associated with the building feature. To see it, hover the mouse over the feature.

The bottom tab labeled "Source" will show you the XML source file associated with this descriptor.

The built-in Annotation type declares two fields (called Features in CAS terminology) – begin and end. These features store the character offsets of the span of text to which the annotation refers. Our RoomNumber type will inherit these features from uima.tcas.Annotation, its supertype; they are not visible in this view because inherited features are not shown. One additional feature, building, is declared. It takes a String as its value. Instead of String, we could have declared the range-type of our feature to be any other CAS type (defined or built-in).

If you are not using Eclipse and you need to edit the type system, you can do so directly with any XML or text editor. The following is the actual XML representation of the Type System displayed above in the editor:

<?xml version="1.0" encoding="UTF-8" ?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>TutorialTypeSystem</name>
  <description>Type System Definition for the tutorial examples - as of Exercise 1</description>
  <vendor>IBM</vendor>
  <version>1.0</version>
  <types>
    <typeDescription>
      <name>com.ibm.uima.tutorial.RoomNumber</name>
      <description></description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>building</name>
          <description>Building containing this room</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>

Generating Java Source Files for CAS Types

When you save a descriptor that you have modified, the Component Descriptor Editor will automatically generate Java classes corresponding to the types that are defined in that descriptor (unless this has been disabled), using a utility called JCasGen. These Java classes will have the same name (including package) as the CAS types, and will have get and set methods for each of the features that you have defined.

This feature is enabled/disabled using the UIMA menu pulldown (or the Eclipse Preferences – UIMA). If automatic running of JCasGen is not happening, please make sure the option is checked:

The Java class for the example com.ibm.uima.tutorial.RoomNumber type can be found in src/com/ibm/uima/tutorial/RoomNumber.java. You will see how to use these generated classes in the next section.

If you are not using the Component Descriptor Editor, you will need to generate these Java classes by using the JCasGen tool. JCasGen reads a Type System Descriptor XML file and generates the corresponding Java classes that you can then use in your annotator code. To launch JCasGen, simply execute the jcasgen shell script located in the bin directory of the UIMA SDK. This should launch a GUI that looks something like this:

Use the "Browse" buttons to select your input file (TutorialTypeSystem.xml) and output directory (the root of the source tree into which you want the generated files placed). Then click the "Go" button. Assuming no errors in the Type System Descriptor, new Java source files should be generated under the specified output directory.

There are some additional options to choose from when running JCasGen; please refer to the Chapter 19, JCasGen User Guide for details.

Developing Your Annotator Code

Annotator implementations all implement a standard interface, having several methods, the most important of which are:

  • initialize,
  • process, and
  • destroy.

initialize is called by the framework once when it first creates the annotator. process is called once per item being processed. destroy may be called by the application when it is done. There is a default implementation of this interface for annotators using the JCas, called JTextAnnotator_ImplBase, which has implementations of all required methods except for the process method.

Our annotator class extends the JTextAnnotator_ImplBase; most annotators that use the JCas will extend from this class, so they only have to implement the process method. Even though this class name has the word "Text" in it, it is not restricted to handling just text; see Chapter 8, Annotations, Artifacts, and Sofas.

Annotators are not required to extend from the JTextAnnotator_ImplBase class; they may instead directly implement the JTextAnnotator interface, and provide all method implementations themselves. This allows you to have your annotator inherit from some other superclass if necessary. If you would like to do this, see the JavaDocs for JTextAnnotator for descriptions of the methods you must implement.

Annotator classes need to be public and have public, 0-argument constructors, so that they can be instantiated by the framework. Although Java classes in which you do not define any constructor will, by default, have a 0-argument constructor that doesn't do anything, a class in which you have defined at least one constructor does not get a default 0-argument constructor.
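The constructor rule can be checked with plain Java reflection. The sketch below assumes that a framework creates components reflectively through a public 0-argument constructor; the class and method names are hypothetical and nothing here is UIMA code.

```java
// Stand-alone illustration of Java's default-constructor rule.
class ConstructorRuleSketch {
    // No constructor declared: the compiler supplies a public 0-argument one.
    public static class ImplicitDefault {
    }

    // One constructor declared: no 0-argument constructor is generated.
    public static class OnlyOneArg {
        public OnlyOneArg(String name) {
        }
    }

    // True if the class exposes a public 0-argument constructor, which is
    // what reflective instantiation without arguments would need.
    static boolean hasPublicNoArgConstructor(Class<?> type) {
        try {
            type.getConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasPublicNoArgConstructor(ImplicitDefault.class)); // true
        System.out.println(hasPublicNoArgConstructor(OnlyOneArg.class));      // false
    }
}
```

If your annotator needs a constructor with arguments, add an explicit public 0-argument constructor alongside it.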

The class definition for our RoomNumberAnnotator implements the process method, and is shown here. You can find the source in uima_examples/src/com/ibm/uima/tutorial/ex1/RoomNumberAnnotator.java. Note: In Eclipse, in the "Package Explorer" view, this file appears by default in the project uima_examples, in the folder src, in the package com.ibm.uima.tutorial.ex1; open it from there to follow along.

package com.ibm.uima.tutorial.ex1;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.ibm.uima.analysis_engine.ResultSpecification;
import com.ibm.uima.analysis_engine.annotator.AnnotatorConfigurationException;
import com.ibm.uima.analysis_engine.annotator.AnnotatorContext;
import com.ibm.uima.analysis_engine.annotator.AnnotatorInitializationException;
import com.ibm.uima.analysis_engine.annotator.AnnotatorProcessException;
import com.ibm.uima.analysis_engine.annotator.JTextAnnotator_ImplBase;
import com.ibm.uima.jcas.impl.JCas;
import com.ibm.uima.tutorial.RoomNumber;

/**
 * Example annotator that detects room numbers using Java 1.4 regular
 * expressions.
 */
public class RoomNumberAnnotator extends JTextAnnotator_ImplBase {
  // Backslashes are doubled because these are Java string literals.
  private Pattern mYorktownPattern = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");

  private Pattern mHawthornePattern = Pattern.compile("\\b[G1-4][NS]-[A-Z]\\d\\d\\b");

  /**
   * @see JTextAnnotator#process(JCas,ResultSpecification)
   */
  public void process(JCas aJCas, ResultSpecification aResultSpec)
      throws AnnotatorProcessException {
    // Discussed Later
  }
}

The two Java class fields, mYorktownPattern and mHawthornePattern, hold regular expressions that will be used in the process method. Note that these two fields are part of the Java implementation of the annotator code, and not a part of the CAS type system. We are using the regular expression facility that is built into Java 1.4. It is not critical that you know the details of how this works, but if you are curious the details can be found in the Java API docs for the java.util.regex package.

The only method that we are required to implement is process. This method is typically called once for each document that is being analyzed. This method takes two arguments. The JCas holds the document to be analyzed and all of the analysis results. We'll ignore the ResultSpecification for now; its use is not required.

/**
 * @see JTextAnnotator#process(JCas,ResultSpecification)
 */
public void process(JCas aJCas, ResultSpecification aResultSpec)
    throws AnnotatorProcessException {
  // get document text
  String docText = aJCas.getDocumentText();

  // search for Yorktown room numbers
  Matcher matcher = mYorktownPattern.matcher(docText);
  int pos = 0;
  while (matcher.find(pos)) {
    // found one - create annotation
    RoomNumber annotation = new RoomNumber(aJCas);
    annotation.setBegin(matcher.start());
    annotation.setEnd(matcher.end());
    annotation.setBuilding("Yorktown");
    annotation.addToIndexes();
    // resume the search just past this match
    pos = matcher.end();
  }

  // search for Hawthorne room numbers
  matcher = mHawthornePattern.matcher(docText);
  pos = 0;
  while (matcher.find(pos)) {
    // found one - create annotation
    RoomNumber annotation = new RoomNumber(aJCas);
    annotation.setBegin(matcher.start());
    annotation.setEnd(matcher.end());
    annotation.setBuilding("Hawthorne");
    annotation.addToIndexes();
    // resume the search just past this match
    pos = matcher.end();
  }
}

The Matcher class is part of the java.util.regex package and is used to find the room numbers in the document text. When we find one, recording the annotation is as simple as creating a new Java object and calling some set methods:

RoomNumber annotation = new RoomNumber(aJCas);
annotation.setBegin(matcher.start());
annotation.setEnd(matcher.end());
annotation.setBuilding("Yorktown");

This RoomNumber class was generated from the type system description by the Component Descriptor Editor or the JCasGen tool, as discussed in the previous section.

Finally, we call annotation.addToIndexes() to add the new annotation to the indexes maintained in the CAS. By default, the CAS implementation used for analysis of text documents keeps an index of all annotations in their order from beginning to end of the document. Subsequent annotators or applications use the indexes to iterate over the annotations. It is also possible to define your own custom indexes in the CAS (see Chapter 26, CAS Reference for details).

  • If you don't add the instance to the indexes, it cannot be retrieved by down-stream annotators, using the indexes.
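The role of the index can be illustrated without the UIMA SDK. The sketch below is a deliberately simplified model, not the CAS implementation: AnnotationIndexSketch and Span are hypothetical names, and the ordering shown uses only the begin offset. It shows that an object created but never added to the index is invisible to later readers, and that indexed objects come back in document order regardless of insertion order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model of an annotation index; NOT the UIMA CAS implementation.
class AnnotationIndexSketch {
    // Hypothetical annotation-like value with offsets and one feature.
    static class Span {
        final int begin;
        final int end;
        final String building;

        Span(int begin, int end, String building) {
            this.begin = begin;
            this.end = end;
            this.building = building;
        }
    }

    private final List<Span> indexed = new ArrayList<>();

    // Registers a span and keeps the index ordered by begin offset.
    void addToIndexes(Span span) {
        indexed.add(span);
        indexed.sort(Comparator.comparingInt((Span s) -> s.begin));
    }

    // What a downstream component would iterate over.
    List<Span> all() {
        return new ArrayList<>(indexed);
    }

    public static void main(String[] args) {
        AnnotationIndexSketch index = new AnnotationIndexSketch();
        Span forgotten = new Span(5, 11, "Yorktown"); // created, never indexed
        index.addToIndexes(new Span(40, 46, "Hawthorne"));
        index.addToIndexes(new Span(20, 26, "Yorktown"));
        for (Span s : index.all()) {
            System.out.println(s.begin + "-" + s.end + " " + s.building);
        }
        System.out.println("indexed: " + index.all().size() + ", forgotten begin: " + forgotten.begin);
    }
}
```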

We're almost ready to test the RoomNumberAnnotator. There is just one more step remaining.

Creating the XML Descriptor

The UIMA architecture requires that descriptive information about an annotator be represented in an XML file and provided along with the annotator class file(s) to the UIMA framework at run time. This XML file is called an Analysis Engine Descriptor. The descriptor includes:

  • Name, description, version, and vendor
  • The annotator’s inputs and outputs, defined in terms of the types in a Type System Descriptor
  • Declaration of the configuration parameters that the annotator accepts

The Component Descriptor Editor plugin, which we previously used to edit the Type System descriptor, can also be used to edit Analysis Engine Descriptors.

A descriptor for our RoomNumberAnnotator is provided with the UIMA distribution under the name descriptors/tutorial/ex1/RoomNumberAnnotator.xml. To edit it in Eclipse, right-click on that file in the navigator and select Open With –> Component Descriptor Editor.

Eclipse tip: You can double click on the tab at the top of the Component Descriptor Editor's window identifying the currently selected editor, and the window will "Maximize". Double click it again to restore the original size.

If you are not using Eclipse, you will need to edit Analysis Engine descriptors manually. See Introduction to Analysis Engine Descriptor XML Syntax for an introduction to the Analysis Engine descriptor XML syntax. The remainder of this section assumes you are using the Component Descriptor Editor plug-in to edit the Analysis Engine descriptor.

The Component Descriptor Editor consists of several tabbed pages; we will only need to use a few of them here. For more information on using this editor, see Chapter 12, Component Descriptor Editor User’s Guide.

The initial page of the Component Descriptor Editor is the Overview page, which appears as follows:

This presents an overview of the RoomNumberAnnotator Analysis Engine (AE). The left side of the page shows that this descriptor is for a Primitive AE (meaning it consists of a single annotator), and that the annotator code is developed in Java. Also, it specifies the Java class that implements our logic (the code which was discussed in the previous section). Finally, on the right side of the page are listed some descriptive attributes of our annotator.

The other two pages that need to be filled out are the Type System page and the Capabilities page. You can switch to these pages using the tabs at the bottom of the Component Descriptor Editor. In the tutorial, these are already filled out for you.

The RoomNumberAnnotator will be using the TutorialTypeSystem we looked at in Section 4.1.1, Defining Types. To specify this, we add this type system to the Analysis Engine's list of Imported Type Systems, using the Type System page's right side panel, as shown here:

On the Capabilities page, we define our annotator's inputs and outputs, in terms of the types in the type system. The Capabilities page is shown below:

Capabilities come in sets; here we're just using one set. The RoomNumberAnnotator is very simple. It requires no input types, as it operates directly on the document text -- which is supplied as a part of the CAS initialization (and which is always assumed to be present). It produces only one output type (RoomNumber), and it sets the value of the building feature on that type. This is all represented on the Capabilities page.

The Capabilities page has two other parts for specifying languages and Sofas. The languages section allows you to specify which languages your Analysis Engine supports. The RoomNumberAnnotator happens to be language-independent, so we can leave this blank. The Sofas section allows you to specify the names of additional subjects of analysis. This capability and the Sofa Mappings at the bottom are advanced topics, described in Chapter 8, Annotations, Artifacts, and Sofas.

This is all of the information we need to provide for a simple annotator. If you want to peek at the XML that this tool saves you from having to write, click on the "Source" tab at the bottom to view the generated XML.

Testing Your Annotator

Having developed an annotator, we need a way to try it out on some example documents. The UIMA SDK includes a tool called the Document Analyzer that will allow us to do this. To run the Document Analyzer, execute the documentAnalyzer shell script that is in the bin directory of your UIMA SDK installation, or, if you are using the example Eclipse project, execute the "UIMA Document Analyzer" run configuration supplied with that project. (To do this, select Run > Run ... from the menu bar and, under Java Applications in the left box, click on UIMA Document Analyzer.)

You should see a screen that looks like this:

There are six options on this screen:

  1. Directory containing documents to analyze
  2. Directory where analysis results will be written
  3. The XML descriptor for the Analysis Engine (TAE) you want to run
  4. (Optional) an XML tag, within the input documents, that contains the text to be analyzed. For example, the value TEXT would cause the TAE to only analyze the portion of the document enclosed within <TEXT>...</TEXT> tags.
  5. Language of the document
  6. Character encoding

Use the Browse button next to the 3rd item to set the "Location of TAE XML Descriptor" field to the descriptor we've just been discussing – uima/docs/examples/descriptors/tutorial/ex1/RoomNumberAnnotator.xml. Set the other fields to the values shown in the screen shot above (which should be the default values if this is the first time you've run the Document Analyzer). Then click the "Run" button to start processing.

When processing completes, an "Analysis Results" window should appear.

Make sure "Java Viewer" is selected as the Results Display Format, and double-click on the document UIMASummerSchool2003.txt to view the annotations that were discovered. The view should look something like this:

You can click the mouse on one of the highlighted annotations to see a list of all its features in the frame on the right.

  • The legend will only show those types which have at least one instance in the CAS, and are declared as outputs in the capabilities section of the descriptor (see Creating the XML Descriptor).

You can use the DocumentAnalyzer to test any UIMA annotator – just make sure that the annotator's classes are in the class path.

Configuration Parameters

The example RoomNumberAnnotator from the previous section used hardcoded regular expressions and location names, which is obviously not very flexible. For example, there is actually a third Watson building – Hawthorne II – whose room numbers are not detected by our annotator. Rather than add a new hardcoded regular expression, a better solution is to use configuration parameters.

UIMA allows annotators to declare configuration parameters in their descriptors. The descriptor also specifies default values for the parameters, though these can be overridden at runtime.

Declaring Parameters in the Descriptor

The example descriptor descriptors/tutorial/ex2/RoomNumberAnnotator.xml is the same as the descriptor from the previous section except that information has been filled in for the Parameters and Parameter Settings pages of the Component Descriptor Editor.

First, in Eclipse, open example 2's RoomNumberAnnotator in the Component Descriptor Editor, and then go to the Parameters page (click on the parameters tab at the bottom of the window), which is shown below:

Two parameters – Patterns and Locations – have been declared. In this screen shot, the mouse (not shown) is hovering over Patterns to show its description in the small popup window. Every parameter has the following information associated with it:

  • name – the name by which the annotator code refers to the parameter
  • description – a natural language description of the intent of the parameter
  • type – the data type of the parameter's value – must be one of String, Integer, Float, or Boolean.
  • multiValued – true if the parameter can take multiple values (an array), false if the parameter takes only a single value. Shown above as Multi.
  • mandatory – true if a value must be provided for the parameter. Shown above as Req (for required).

Both of our parameters are mandatory and accept an array of Strings as their value.

Next, default values are assigned to the parameters on the Parameter Settings page:

Here the "Patterns" parameter is selected, and the right pane shows the list of values for this parameter, in this case the regular expressions that match rooms in each of the IBM T.J. Watson Research Center buildings. Notice the third pattern is new, for matching the style of room numbers in the third building, which has room numbers such as J2-A11.
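The two multi-valued parameters work as parallel arrays: the pattern at position i belongs to the location at position i. The following stand-alone sketch illustrates that pairing with plain Java. The class name and the helper find are hypothetical, and the third regular expression is an assumption inferred only from the single example room number J2-A11; the actual value in the descriptor may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone illustration of parallel Patterns/Locations arrays.
class ConfigurableRoomFinderSketch {
    // Applies each pattern in turn and labels its matches with the location
    // stored at the same array position.
    static List<String> find(String[] patternStrings, String[] locations, String text) {
        List<String> results = new ArrayList<>();
        for (int i = 0; i < patternStrings.length; i++) {
            Matcher matcher = Pattern.compile(patternStrings[i]).matcher(text);
            while (matcher.find()) {
                results.add(locations[i] + ":" + matcher.group());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        String[] patterns = {
            "\\b[0-4]\\d-[0-2]\\d\\d\\b",      // Yorktown
            "\\b[G1-4][NS]-[A-Z]\\d\\d\\b",    // Hawthorne
            "\\bJ[1-2]-[A-Z]\\d\\d\\b"         // Hawthorne II (assumed pattern)
        };
        String[] locations = {"Yorktown", "Hawthorne", "Hawthorne II"};
        System.out.println(find(patterns, locations, "Talk in J2-A11, lunch near 20-001."));
        // [Yorktown:20-001, Hawthorne II:J2-A11]
    }
}
```

Results are grouped by pattern rather than by position in the text, because each pattern scans the whole document before the next one starts.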

Accessing Parameter Values from the Annotator Code

The class com.ibm.uima.tutorial.ex2.RoomNumberAnnotator has overridden the initialize method. The initialize method is called by the UIMA framework when the annotator is instantiated, so it is a good place to read configuration parameter values. The default initialize method does nothing with configuration parameters, so you have to override it. To see the code in Eclipse, switch to the src folder, and open com.ibm.uima.tutorial.ex2. Here is the method body:

/**
 * @see BaseAnnotator#initialize(AnnotatorContext)
 */
public void initialize(AnnotatorContext aContext)
    throws AnnotatorInitializationException, AnnotatorConfigurationException {
  // invoke the standard initialization
  // This saves the value of aContext in a field and makes
  // it available via the getContext() method of the superclass
  super.initialize(aContext);
  try {
    // Get config. parameter values
    String[] patternStrings = (String[]) aContext.getConfigParameterValue("Patterns");
    mLocations = (String[]) aContext.getConfigParameterValue("Locations");

    // compile regular expressions
    mPatterns = new Pattern[patternStrings.length];
    for (int i = 0; i < patternStrings.length; i++) {
      mPatterns[i] = Pattern.compile(patternStrings[i]);
    }
  } catch (AnnotatorContextException e) {
    throw new AnnotatorInitializationException(e);
  }
}

The first two lines inside the try block are where the configuration parameter values are retrieved. Configuration parameter values are accessed through the AnnotatorContext. As you will see in subsequent sections of this chapter, the AnnotatorContext is the annotator's access point for all of the facilities provided by the UIMA framework – for example logging and external resource access.

The AnnotatorContext.getConfigParameterValue method takes the name of the parameter as an argument; this must match one of the parameters declared in the descriptor. The return value of this method is Object, so it is up to the annotator to cast it to the appropriate type, String[] in this case.

If there is a problem retrieving the parameter values, the AnnotatorContext could throw an AnnotatorContextException. Generally annotators would just catch this exception and rethrow it as an AnnotatorInitializationException, which is what our example annotator does.

To see the configuration parameters working, run the Document Analyzer application and select the descriptor docs/examples/descriptors/tutorial/ex2/RoomNumberAnnotator.xml. In the example document WatsonConferenceRooms.txt, you should see some examples of Hawthorne II room numbers that would not have been detected by the ex1 version of RoomNumberAnnotator.

Supporting Reconfiguration

If you take a look at the JavaDocs (located in the docs/api directory) for com.ibm.uima.analysis_engine.annotator.BaseAnnotator (which our annotator implements indirectly through JTextAnnotator_ImplBase), you will see that there is a reconfigure() method, which is called by the containing application through the UIMA framework if the configuration parameter values are changed.

The JTextAnnotator_ImplBase class provides a default implementation that just calls the annotator's destroy method followed by its initialize method. This works fine for our annotator. The only situation in which you might want to override the default reconfigure() is if your annotator has very expensive initialization logic, and you don't want to reinitialize everything if just one configuration parameter has changed. In that case, you can provide a more intelligent implementation of reconfigure() for your annotator.
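The trade-off between the default reconfigure() and a selective override can be sketched in plain Java. Everything below is hypothetical: Base stands in for a framework-provided base class, the settings map stands in for configuration parameters, and the counter simulates an expensive initialization step. It is a sketch of the idea, not of the UIMA classes.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone illustration of default versus selective reconfiguration.
class ReconfigureSketch {
    // Hypothetical base class with a destroy-then-initialize default.
    static class Base {
        final Map<String, String> settings;
        int expensiveLoads = 0; // counts simulated costly initializations
        String threshold;

        Base(Map<String, String> settings) {
            this.settings = settings;
        }

        void initialize() {
            expensiveLoads++; // pretend a large resource is loaded here
            threshold = settings.get("threshold");
        }

        void destroy() {
            threshold = null;
        }

        // Default strategy: tear everything down, then set it up again.
        void reconfigure() {
            destroy();
            initialize();
        }
    }

    // Override that re-reads only the cheap setting and keeps the resource.
    static class Selective extends Base {
        Selective(Map<String, String> settings) {
            super(settings);
        }

        @Override
        void reconfigure() {
            threshold = settings.get("threshold");
        }
    }

    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put("threshold", "5");
        Base plain = new Base(settings);
        Base selective = new Selective(settings);
        plain.initialize();
        selective.initialize();
        settings.put("threshold", "9");
        plain.reconfigure();
        selective.reconfigure();
        System.out.println(plain.expensiveLoads + " vs " + selective.expensiveLoads);
    }
}
```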

Configuration Parameter Groups

For annotators with many sets of configuration parameters, UIMA supports organizing them into groups. It is possible to define a parameter with the same name in multiple groups; one common use for this is for annotators that can process documents in several languages and which want to have different parameter settings for the different languages.

The syntax for defining parameter groups in your descriptor is fairly straightforward – see Chapter 23 for details. Values of parameters defined within groups are accessed through the two-argument version of AnnotatorContext.getConfigParameterValue, which takes both the group name and the parameter name as its arguments.
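As a rough mental model, a two-argument lookup behaves like a map of maps keyed first by group name and then by parameter name. The sketch below is an assumption-laden illustration in plain Java: the class ParameterGroupSketch, the set method, the group names "en" and "de", and the parameter DateFormat are all invented for the example, and any fallback behavior between groups that the framework may provide is not modeled.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone model of grouped parameters; NOT the AnnotatorContext class.
class ParameterGroupSketch {
    private final Map<String, Map<String, Object>> groups = new HashMap<>();

    // Stores a value for the named parameter inside the named group.
    void set(String group, String name, Object value) {
        groups.computeIfAbsent(group, g -> new HashMap<>()).put(name, value);
    }

    // Two-argument lookup: group name first, then parameter name.
    // Returns null when the group or the parameter is unknown.
    Object getConfigParameterValue(String group, String name) {
        Map<String, Object> params = groups.get(group);
        return params == null ? null : params.get(name);
    }

    public static void main(String[] args) {
        ParameterGroupSketch context = new ParameterGroupSketch();
        // The same parameter name appears in two language groups.
        context.set("en", "DateFormat", "MM/dd/yyyy");
        context.set("de", "DateFormat", "dd.MM.yyyy");
        System.out.println((String) context.getConfigParameterValue("de", "DateFormat"));
    }
}
```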

Logging

The UIMA SDK provides a logging facility, which is very similar to the java.util.logging.Logger class that was introduced in Java 1.4.

In the Java architecture, each logger instance is associated with a name. By convention, this name is often the fully qualified class name of the component issuing the logging call. The name can be referenced in a configuration file when specifying which kinds of log messages to actually log, and where they should go.

The UIMA framework supports this convention using the AnnotatorContext object. If you access a logger instance using getContext().getLogger() within an Annotator, the logger name will be the fully qualified name of the Annotator implementation class.

Here is an example from the process method of com.ibm.uima.tutorial.ex2.RoomNumberAnnotator:

getContext().getLogger().log(Level.FINEST,"Found: " + annotation);

The first argument to the log method is the level of the log output. Here, a value of FINEST indicates that this is a highly-detailed tracing message. While useful for debugging, it is likely that real applications will not output log messages at this level, in order to improve their performance. Other defined levels, from lowest to highest importance, are FINER, FINE, CONFIG, INFO, WARNING, and SEVERE.
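Because the UIMA logging facility is described as very similar to java.util.logging, the level filtering can be illustrated with the standard Java logger itself. The sketch below does not use the UIMA Logger; LogLevelSketch and loggerFor are hypothetical names. It names a logger after a class, sets a threshold, and guards a detailed message so that its string is only built when the level is enabled.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Stand-alone illustration using java.util.logging, not the UIMA Logger.
class LogLevelSketch {
    // Creates (or fetches) a logger named after the component's class and
    // sets the lowest level it will let through.
    static Logger loggerFor(Class<?> component, Level threshold) {
        Logger logger = Logger.getLogger(component.getName());
        logger.setLevel(threshold);
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = loggerFor(LogLevelSketch.class, Level.INFO);

        // Guarding avoids building the message when FINEST is disabled.
        if (logger.isLoggable(Level.FINEST)) {
            logger.log(Level.FINEST, "Found: " + "20-001");
        }
        logger.log(Level.WARNING, "This message passes the INFO threshold");
    }
}
```

With the threshold at INFO, the FINEST call is skipped entirely, which is the performance benefit mentioned above.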

If no logging configuration file is provided (see next section), the Java Virtual Machine defaults are used. These typically log messages at level INFO and higher, and direct output to the console.

If you specify the standard UIMA SDK Logger.properties, the output will be directed to a file named uima.log, in the current working directory (often the "project" directory when running from Eclipse, for instance).

Eclipse Note: The uima.log file, if written into the Eclipse workspace in the project uima_examples, for example, may not appear in the Eclipse package explorer view until you right-click the uima_examples project with the mouse, and select "Refresh". This operation refreshes the Eclipse display to conform to what may have changed on the file system.

Specifying the Logging Configuration

The standard UIMA logger uses the underlying Java 1.4 logging mechanism. You can use the APIs that come with that to configure the logging. In addition, the Java 1.4 logging initialization looks for a Java System Property named java.util.logging.config.file and, if found, uses the value of this property as the name of a standard "properties" file for setting the logging level. Please refer to the Java 1.4 documentation for more information on the format and use of this file.

Two sample logging specification property files can be found in the UIMA_HOME directory: Logger.properties, and FileConsoleLogger.properties. These specify the same logging, except the first logs just to a file, while the second logs both to a file and to the console. You can edit these files, or create additional ones, as described below, to change the logging behavior.

When running your own Java application, you can specify the location of the logging configuration file on your Java command line by setting the Java system property java.util.logging.config.file to be the logging configuration filename. This file specification can be either absolute or relative to the working directory. For example:

java "-Djava.util.logging.config.file=C:/Program Files/apache-uima/Logger.properties"

  • In a shell script, you can use environment variables such as UIMA_HOME if convenient.

If you are using Eclipse to launch your application, you can set this property in the VM arguments section of the Arguments tab of the run configuration screen. If you've set an environment variable UIMA_HOME, you could for example, use the string:
"-Djava.util.logging.config.file=${env_var:UIMA_HOME}/Logger.properties".

Setting Logging Levels

Within the logging control file, the default global logging level specifies which kinds of events are logged across all loggers. For any given facility this global level can be overridden by a facility specific level. Multiple handlers are supported. This allows messages to be directed to a log file, as well as to a "console". Note that the ConsoleHandler also has a separate level setting to limit messages printed to the console. For example:

.level= INFO

The properties file can change where the log is written, as well.

Facility specific properties allow different logging for each class, as well. For example, to set the com.xyz.foo logger to only log SEVERE messages:

com.xyz.foo.level = SEVERE

If you have a sample annotator in the package com.ibm.uima.SampleAnnotator you can set the log level by specifying:

com.ibm.uima.SampleAnnotator.level = ALL

There are other logging controls; for a full discussion, please read the contents of the Logger.properties file and the Java specification for logging in Java 1.4.
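To see how a global level and a facility-specific level interact, the same property syntax can be loaded through the standard java.util.logging.LogManager. This is only a sketch under stated assumptions: the class name is invented, the configuration is held in a string rather than in a Logger.properties file, and com.xyz.bar is a made-up second logger name used to show inheritance of the global level.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

class LoggingConfigSketch {
    // Same syntax as a logging properties file, kept in memory for the sketch.
    static final String CONFIG =
        ".level= INFO\n" +
        "com.xyz.foo.level = SEVERE\n";

    // Replaces the current logging configuration with CONFIG.
    static void applyConfig() {
        try {
            LogManager.getLogManager().readConfiguration(
                new ByteArrayInputStream(CONFIG.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if the named logger would emit a message at the given level.
    static boolean isLoggable(String loggerName, Level level) {
        return Logger.getLogger(loggerName).isLoggable(level);
    }

    public static void main(String[] args) {
        applyConfig();
        // com.xyz.foo is restricted to SEVERE by its facility-specific setting.
        System.out.println(isLoggable("com.xyz.foo", Level.WARNING));
        System.out.println(isLoggable("com.xyz.foo", Level.SEVERE));
        // com.xyz.bar has no setting of its own, so the global INFO level applies.
        System.out.println(isLoggable("com.xyz.bar", Level.INFO));
    }
}
```

In an application the same configuration would normally come from the file named by the java.util.logging.config.file system property rather than from a string.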

Format of logging output

The logging output is formatted by handlers specified in the properties file for configuring logging, described above. The default formatter that comes with the UIMA SDK formats logging output as follows:

Timestamp - threadID: sourceInfo: Message level: message

Here's an example:

7/12/04 2:15:35 PM - 10: com.ibm.uima.util.TestClass.main(62): INFO: You are not logged in!

Meaning of the logging severity levels

These levels are defined by the Java logging framework, which was incorporated into Java as of the 1.4 release level. The levels are defined in the JavaDocs for java.util.logging.Level, and include both logging and tracing levels:

  • OFF is a special level that can be used to turn off logging.
  • ALL indicates that all messages should be logged.
  • CONFIG is a message level for configuration messages. These would typically occur once (during configuration) in methods like initialize().
  • INFO is a message level for informational messages, for example, connected to server IP: 192.168.120.12
  • WARNING is a message level indicating a potential problem.
  • SEVERE is a message level indicating a serious failure.

    Tracing levels, typically used for debugging:
  • FINE is a message level providing tracing information, typically at a collection level (messages occurring once per collection).
  • FINER indicates a fairly detailed tracing message, typically at a document level (once per document).
  • FINEST indicates a highly detailed tracing message.

Using the logger outside of an annotator

An application using UIMA may want to log its messages using the same logging framework. This can be done by getting a reference to the UIMA logger, as follows:

Logger logger = UIMAFramework.getLogger(TestClass.class);

The optional class argument allows filtering by class (if the log handler supports this). If not specified, the name of the returned logger instance is "com.ibm.uima".

Combining Annotators

The UIMA SDK makes it very easy to combine any sequence of Analysis Engines to form an Aggregate Analysis Engine. This is done through an XML descriptor; no Java code is required!

If you go to the docs/examples/descriptors/tutorial/ex3 folder (in Eclipse, it's in your uima_examples project, under the descriptors/tutorial/ex3 folder), you will find a descriptor for a TutorialDateTime annotator. This annotator detects dates and times (and also sentences and words). To see what this annotator can do, try it out using the Document Analyzer. If you are curious as to how this annotator works, the source code is included, but it is not necessary to understand the code at this time.

We are going to combine the TutorialDateTime annotator with the RoomNumberAnnotator to create an aggregate Analysis Engine. This is illustrated in Figure nnn.

Figure nnn: Combining Annotators to form an Aggregate Analysis Engine

The descriptor that does this is named RoomNumberAndDateTime.xml, which you can open in the Component Descriptor Editor plug-in. This is in the uima_examples project in the folder descriptors/tutorial/ex3.

The "Aggregate" page of the Component Descriptor Editor is used to define which components make up the aggregate. A screen shot is shown below. (If you are not using Eclipse, see Section 4.8, Introduction to Analysis Engine Descriptor XML Syntax for the actual XML syntax for Aggregate Analysis Engine Descriptors.)

On the left side of the screen is the list of component engines that make up the aggregate – in this case, the TutorialDateTime annotator and the RoomNumberAnnotator. To add a component, you can click the "Add" button and browse to its descriptor. You can also click the "Find AE" button and search for an Analysis Engine in your Eclipse workspace.

  • The "AddRemote" button is used for adding components which run remotely (for example, on another machine using a remote networking connection). This capability is described in section 6.6.3, How to Call a UIMA Service.

The order of the components in the left pane does not imply an order of execution. The order of execution, or "flow" is determined in the "Component Engine Flow" section on the right. UIMA supports different types of algorithms (possibly dynamic) for determining the flow. Here we pick the simplest: FixedFlow. We have chosen to have the RoomNumberAnnotator execute first, although in this case it doesn't really matter, since the RoomNumber and DateTime annotators do not have any dependencies on one another.

If you look at the "Type System" page of the Component Descriptor Editor, you will see that it displays the type system but is not editable. The Type System of an Aggregate Analysis Engine is automatically computed by merging the Type Systems of each of its components.

The Capabilities page is where you explicitly declare the aggregate Analysis Engine's inputs and outputs. Sofas and Languages are described later.

Note that it is not automatically assumed that all outputs of each component Analysis Engine (AE) are passed through as outputs of the aggregate AE. In this case, for example, we have decided to suppress the Word and Sentence annotations that are produced by the TutorialDateTime annotator.

You can run this AE using the Document Analyzer in the same way that you run any other AE. Just select the docs/examples/descriptors/tutorial/ex3/RoomNumberAndDateTime.xml descriptor and click the Run button. You should see that RoomNumbers, Dates, and Times are all shown but that Words and Sentences are not.

Aggregate Engines can also contain CAS Consumers

In addition to aggregating Analysis Engines, Aggregates can also contain CAS Consumers (see Developing CAS Consumers on page 5-122), or even a mixture of these components. The UIMA examples include an Aggregate that contains both an analysis engine and a CAS consumer, in docs/examples/descriptors/MixedAggregate.xml.

Reading the Results of Previous Annotators

So far, we have been looking at annotators that look directly at the document text. However, annotators can also use the results of other annotators. One useful thing we can do at this point is look for the co-occurrence of a Date, a RoomNumber, and two Times – and annotate that as a Meeting.

The JCas maintains indexes of annotations, and from an index you can obtain an iterator that allows you to step through all annotations of a particular type. Here's some example code that would iterate over all of the TimeAnnot annotations in the JCas:

JFSIndexRepository indexes = aJCas.getJFSIndexRepository();
FSIndex timeIndex = indexes.getAnnotationIndex(TimeAnnot.type);
Iterator timeIter = timeIndex.iterator();
while (timeIter.hasNext()) {
  TimeAnnot time = (TimeAnnot) timeIter.next();
  // do something
}

Now that we've explained the basics, let's take a look at the process method for com.ibm.uima.tutorial.ex4.MeetingAnnotator. Since we're looking for a combination of a RoomNumber, a Date, and two Times, there are four nested iterators. (There's surely a better algorithm for doing this, but to keep things simple we're just going to look at every combination of the four items.)

For each combination of the four annotations, we compute the span of text that includes all of them, and then we check to see if that span is smaller than a "window" size, a configuration parameter. There are also some checks to make sure that we don't annotate the same span of text multiple times. If all the checks pass, we create a Meeting annotation over the whole span. There's really nothing to it!
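The combination-and-window check described above can be sketched in plain Java without any UIMA classes. This is not the MeetingAnnotator source: Span is a hypothetical stand-in for an annotation that only carries begin and end offsets, the class and method names are invented, and the sketch assumes the two Times must be distinct annotations, which it enforces by pairing each Time only with later ones.

```java
import java.util.ArrayList;
import java.util.List;

class MeetingWindowSketch {
    /** Hypothetical stand-in for an annotation: a [begin, end) character span. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    // True if a meeting over exactly this span has already been recorded.
    static boolean alreadyFound(List<Span> meetings, int begin, int end) {
        for (Span m : meetings) {
            if (m.begin == begin && m.end == end) {
                return true;
            }
        }
        return false;
    }

    // Tries every combination of one room, one date and two times (four nested loops).
    static List<Span> findMeetings(List<Span> rooms, List<Span> dates,
                                   List<Span> times, int window) {
        List<Span> meetings = new ArrayList<>();
        for (Span room : rooms) {
            for (Span date : dates) {
                for (int i = 0; i < times.size(); i++) {
                    for (int j = i + 1; j < times.size(); j++) {
                        Span t1 = times.get(i);
                        Span t2 = times.get(j);
                        // Smallest span that covers all four annotations.
                        int begin = Math.min(Math.min(room.begin, date.begin),
                                             Math.min(t1.begin, t2.begin));
                        int end = Math.max(Math.max(room.end, date.end),
                                           Math.max(t1.end, t2.end));
                        // Keep it only if it fits in the window and is not a duplicate.
                        if (end - begin < window && !alreadyFound(meetings, begin, end)) {
                            meetings.add(new Span(begin, end));
                        }
                    }
                }
            }
        }
        return meetings;
    }
}
```

The window plays the role of the configuration parameter mentioned above; a larger value accepts combinations that are spread further apart in the text.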

The XML descriptor, located in docs/examples/descriptors/tutorial/ex4/MeetingAnnotator.xml, is also very straightforward. An important difference from previous descriptors is that this is the first annotator we've discussed that has input requirements. This can be seen on the "Capabilities" page of the Component Descriptor Editor:

If we were to run the MeetingAnnotator on its own, it wouldn't detect anything because it wouldn't have any input annotations to work with. The required input annotations can be produced by the RoomNumber and DateTime annotators. So, we create an aggregate Analysis Engine containing these two annotators, followed by the Meeting annotator. This aggregate is illustrated in Figure 9. The descriptor for this is in docs/examples/descriptors/tutorial/ex4/MeetingDetectorTAE.xml. Give it a try in the Document Analyzer.

Figure 9: An Aggregate Analysis Engine where an internal component uses output from previous engines

The UIMA SDK includes several other examples you may find interesting, including:

  • SimpleTokenAndSentenceAnnotator – a simple tokenizer and sentence annotator.
  • PersonTitleDBWriterCasConsumer – a sample CAS Consumer which populates a relational database with some annotations. It uses JDBC and in this example, hooks up with the Open Source Apache Derby database.

Contract for Annotator methods called by the Framework

Every instance of an Annotator is associated with one and only one thread. An instance never has to worry about running some method on one thread, and then asynchronously being called using another thread. This approach simplifies the design of annotators – they do not have to be designed to support multi-threading. When multi-threading is wanted for performance, multiple instances of the Annotator are created, each one running on just one thread.

The following table defines the methods called by the framework, when they are called, and the requirements annotator implementations must follow.

Method: initialize
When Called by Framework: Called once, when the instance is created.
Requirements: Should read configuration parameter information and set up for processing CASes.

Method: typeSystemInit
When Called by Framework: Called before process whenever the type system in the CAS being passed in differs from what was previously passed in a process call (and called for the first CAS passed in, too). The Type System being passed to an annotator only changes for the case of remote annotators that are active as servers, receiving possibly different type systems to operate on.
Requirements: Typically, users of JCas do not implement any method for this. An annotator can use this call to read the CAS type system and set up any instance variables that make accessing the types and features convenient.

Method: process
When Called by Framework: Called once for each CAS. Called by the application if not using Collection Processing Manager (CPM); the application calls the process method on the analysis engine, which is then delegated by the framework to all the annotators in the engine. For a Collection Processing application, the CPM calls the process method. If the application creates and manages its own Collection Processing Engine via API calls (see JavaDocs), the application calls this on the Collection Processing Engine, and it is delegated by the framework to the components.
Requirements: Process the CAS, adding and/or modifying elements in it.

Method: destroy
When Called by Framework: This method is called by the Collection Processing Manager framework when the collection processing completes. It can also be called by an application on the Engine object, in which case it is propagated to all contained annotators.
Requirements: An annotator should release all resources, close files, close database connections, etc., and return to a state where another initialize call could be received to restart. Typically, after a destroy call, no further calls will be made to an annotator instance.

Method: reconfigure
When Called by Framework: This method is never called by the framework, unless an application calls it on the Engine object – in which case the framework propagates it to all annotators contained in the Engine. Its purpose is to signal that the configuration parameters have changed.
Requirements: A default implementation of this calls destroy, followed by initialize. This is the only case where initialize would be called more than once. Users should implement whatever logic is needed to return the annotator to an initialized state, including re-reading the configuration parameter data.
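The call order in this contract, including the default reconfigure behavior of destroy followed by initialize, can be illustrated with a small stand-alone sketch. The classes below are hypothetical and only record which methods were called; they are not the UIMA annotator base classes, and the String argument to process is a simplification standing in for a CAS.

```java
import java.util.ArrayList;
import java.util.List;

class LifecycleSketch {
    /** Hypothetical base class that only mimics the call order of the contract. */
    abstract static class AnnotatorBase {
        final List<String> calls = new ArrayList<>();

        void initialize() { calls.add("initialize"); }

        abstract void process(String documentText);

        void destroy() { calls.add("destroy"); }

        // Default reconfigure: release everything, then set up again from scratch.
        void reconfigure() {
            destroy();
            initialize();
        }
    }

    /** Trivial annotator that just counts the documents it has seen. */
    static final class CountingAnnotator extends AnnotatorBase {
        int processed;

        @Override
        void process(String documentText) {
            calls.add("process");
            processed++;
        }
    }

    // Plays the role of the framework or application driving one annotator instance.
    static List<String> run() {
        CountingAnnotator annotator = new CountingAnnotator();
        annotator.initialize();
        annotator.process("first document");
        annotator.reconfigure();   // the only case where initialize runs a second time
        annotator.process("second document");
        annotator.destroy();
        return annotator.calls;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

An annotator with expensive initialization could override reconfigure in such a design to re-read only the parameters that changed, instead of tearing everything down.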

Reporting errors from Annotators

There are two broad classes of errors that can occur: recoverable and unrecoverable. Because Annotators are often expected to process very large numbers of artifacts (for example, text documents), they should be written to recover where possible.

For example, if an upstream annotator created some input for an annotator which is invalid, the annotator may want to log this event, ignore the bad input and continue. It may include a notification of this event in the CAS, for further downstream annotators to consider. Or, it may throw an exception (see next section) – but in this case, it cannot do any further processing on that document.

  • The choice of what to do can be made configurable, using the configuration parameters.

Throwing Exceptions from Annotators

Let's say an invalid regular expression was passed as a parameter to the RoomNumberAnnotator. Because this is an error related to the overall configuration, and not something we could expect to ignore, we should throw an appropriate exception, and most Java programmers would expect to do so like this:

throw new AnnotatorConfigurationException("The regular expression " + x + " is not valid.");

UIMA, however, does not do it this way. All UIMA exceptions are internationalized, meaning that they support translation into other languages. This is accomplished by eliminating hardcoded message strings and instead using external message digests. Message digests are files containing (key, value) pairs. The key is used in the Java code instead of the actual message string. This allows the message string to be easily translated later by modifying the message digest file, not the Java code. Also, message strings in the digest can contain parameters that are filled in when the exception is thrown. The format of the message digest file is described in the JavaDocs for the Java class java.util.PropertyResourceBundle and in the load method of java.util.Properties.

The first thing an annotator developer must choose is what Exception class to use. There are three to choose from:

  1. AnnotatorConfigurationException should be thrown from the annotator's initialize() method if invalid configuration parameter values have been specified.
  2. AnnotatorInitializationException should be thrown from the annotator's initialize() method if initialization fails for some other reason.
  3. AnnotatorProcessException should be thrown from the annotator's process() method if the processing of a particular document fails for any reason.

Generally you will not need to define your own custom exception classes, but if you do they must extend one of these three classes, which are the only types of Exceptions that the annotator interface permits annotators to throw.

All of the UIMA Exception classes share common Constructor varieties. There are four possible arguments:

  1. The name of the message digest to use (optional – if not specified the default UIMA message digest is used).
  2. The key string used to select the message in the message digest.
  3. An object array containing the parameters to include in the message. Messages can have substitutable parts. When the message is given, the string representation of the objects passed are substituted into the message. The object array is often created using the syntax new Object[]{x, y}.
  4. Another exception which is the "cause" of the exception you are throwing (optional). This feature is commonly used when you catch another exception and rethrow it.

If you look at source file (folder: src in Eclipse) com.ibm.uima.tutorial.ex5.RoomNumberAnnotator, you will see the following code:

try {
  mPatterns[i] = Pattern.compile(patternStrings[i]);
} catch (PatternSyntaxException e) {
  throw new AnnotatorConfigurationException(
      MESSAGE_DIGEST, "regex_syntax_error", new Object[]{patternStrings[i]}, e);
}

where the MESSAGE_DIGEST constant has the value "com.ibm.uima.tutorial.ex5.RoomNumberAnnotator_Messages".

Message digests are specified using a dotted name, just like Java classes. This file, with the .properties extension, must be present in the class path. In Eclipse, you find this file under the src folder, in the package com.ibm.uima.tutorial.ex5, with the name RoomNumberAnnotator_Messages.properties. Outside of Eclipse, you can find this in the uima_examples.jar with the name com/ibm/uima/tutorial/ex5/RoomNumberAnnotator_Messages.properties. If you look in this file you will see the line:

regex_syntax_error = {0} is not a valid regular expression.

which is the error message for the example exception we showed above. The placeholder {0} will be filled by the toString() value of the argument passed to the exception constructor – in this case, the regular expression pattern that didn't compile. If there were additional arguments, their locations in the message would be indicated as {1}, {2}, and so on.
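The key lookup and {0} substitution can be reproduced with the standard java.util.PropertyResourceBundle and java.text.MessageFormat classes. The sketch below is an illustration under stated assumptions, not the UIMA exception code: the class and method names are invented, the digest text is held in a string instead of a .properties file on the class path, and it assumes the placeholders are filled in the way MessageFormat fills them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class MessageDigestSketch {
    // In a real annotator this line would live in a .properties file on the class path.
    static final String DIGEST_TEXT =
        "regex_syntax_error = {0} is not a valid regular expression.\n";

    // Looks up the message for 'key' and substitutes the parameters into it.
    static String formatMessage(String key, Object... params) {
        try {
            ResourceBundle bundle = new PropertyResourceBundle(new StringReader(DIGEST_TEXT));
            return MessageFormat.format(bundle.getString(key), params);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the localized error text for a bad pattern, or null if the pattern compiles.
    static String describeBadPattern(String patternString) {
        try {
            Pattern.compile(patternString);
            return null;
        } catch (PatternSyntaxException e) {
            return formatMessage("regex_syntax_error", patternString);
        }
    }

    public static void main(String[] args) {
        // "[a-z" has an unclosed character class, so it fails to compile.
        System.out.println(describeBadPattern("[a-z"));
    }
}
```

Translating the message then only requires a different properties file with the same key; the Java code stays unchanged.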

If a message digest is not specified in the call to the exception constructor, the default is UIMAException.STANDARD_MESSAGE_CATALOG (whose value is "com.ibm.uima.UIMAException_Messages" in the current release but may change). This message digest is located in the uima_core.jar file at com/ibm/uima/UIMAException_Messages.properties – you can take a look to see if any of these exception messages are useful for your annotator.

To try out the regex_syntax_error exception, just use the Document Analyzer to run docs/examples/descriptors/tutorial/ex5/RoomNumberAnnotator.xml, which happens to have an invalid regular expression in its configuration parameter settings.

To summarize, here are the steps to take if you want to define your own exception message:

  1. Create a file with the .properties extension, where you declare message keys and their associated messages, using the same syntax as shown above for the regex_syntax_error exception. The properties file syntax is more completely described in the JavaDocs for the load method of the java.util.Properties class.
  2. Put your properties file somewhere in your class path (it can be in your annotator’s .jar file).
  3. Define a String constant (called MESSAGE_DIGEST for example) in your annotator code whose value is the dotted name of this properties file. For example, if your properties file is inside your jar file at the location org/myorg/myannotator/Messages.properties, then this String constant should have the value org.myorg.myannotator.Messages. Do not include the .properties extension. In Java Internationalization terminology, this is called the Resource Bundle name. For more information see the JavaDocs for the PropertyResourceBundle class.
  4. In your annotator code, throw an exception like this:

throw new AnnotatorConfigurationException(MESSAGE_DIGEST, "your_message_name",new Object[]{param1,param2,...});

You may also wish to look at the JavaDocs for the UIMAException class.

For more information on Java's internationalization features, see the Java Internationalization Guide at http://java.sun.com/j2se/1.4/docs/guide/intl/index.html.

Accessing External Resource Files

Sometimes you may want an annotator to read from an external file – for example, a long list of keys and values that you are going to build into a HashMap. You could, of course, just introduce a configuration parameter that holds the absolute path to this resource file, and build the HashMap in your annotator's initialize method. However, this is not the best solution for three reasons:

  1. Including an absolute path in your descriptor makes your annotator difficult for others to use. Each user will need to edit this descriptor and set the absolute path to a value appropriate for his or her installation.
  2. You cannot share the HashMap between multiple annotators. Also, in some deployment scenarios there may be more than one instance of your annotator, and you would like to have the option for them to use the same HashMap instance.
  3. Your annotator would become dependent on a particular data representation – the word list would have to come from a file on the local disk and it would have to be in a particular format. It would be better if this were decoupled.

A better way to access external resources is through the ResourceManager component. In this section we are going to show an example of how to use the Resource Manager.

This example annotator will annotate UIMA acronyms (e.g. UIMA, TAE, CAS, JCas) and store the acronym's expanded form as a feature of the annotation. The acronyms and their expanded forms are stored in an external file.

First, look at the docs/examples/descriptors/tutorial/ex6/UimaAcronymAnnotator.xml descriptor.

The values of the rows in the two tables are longer than can be easily shown. You can click the small button at the top right to shift the layout from two side-by-side tables, to a vertically stacked layout. You can also click the small twisty on the "Imports for External Resources and Bindings" to collapse this section, because it's not used. Then the same screen will appear like this:

The top window has a scroll bar allowing you to see the rest of the line.

Declaring Resource Dependencies

The bottom window is where an annotator declares an external resource dependency. The XML for this is as follows:

<externalResourceDependency>
  <key>AcronymTable</key>
  <description>Table of acronyms and their expanded forms.</description>
  <interfaceName>com.ibm.uima.tutorial.ex6.StringMapResource</interfaceName>
</externalResourceDependency>

The <key> value (AcronymTable) is the name by which the annotator identifies this resource. The key must be unique for all resources that this annotator accesses, but the same key could be used by different annotators to mean different things. The interface name (com.ibm.uima.tutorial.ex6.StringMapResource) is the Java interface through which the annotator accesses the data. Specifying an interface name is optional. If you do not specify an interface name, annotators will get direct access to the data file.

Accessing the Resource from the AnnotatorContext

If you look at the com.ibm.uima.tutorial.ex6.UimaAcronymAnnotator source, you will see that the annotator accesses this resource from the AnnotatorContext by calling:

StringMapResource mMap = (StringMapResource)getContext().getResourceObject("AcronymTable");

The object returned from the getResourceObject method will implement the interface declared in the <interfaceName> section of the descriptor, StringMapResource in this case. The annotator code does not need to know the location of the data nor the Java class that is being used to read the data and implement the StringMapResource interface.

Note that if we did not specify a Java interface in our descriptor, our annotator could directly access the resource data as follows:

InputStream stream = getContext().getResourceAsStream("AcronymTable");

If necessary, the annotator could also determine the location of the resource file, by calling:

URL url = getContext().getResourceURL("AcronymTable");

These last two options are only available in the case where the descriptor does not declare a Java interface.

Declaring Resources and Bindings

Refer back to the top window in the Resources page of the Component Descriptor Editor. This is where we specify the location of the resource data, and the Java class used to read the data. For the example, this corresponds to the following section of the descriptor:

<resourceManagerConfiguration>
  <externalResources>
    <externalResource>
      <name>UimaAcronymTableFile</name>
      <description>A table containing UIMA acronyms and their expanded forms.</description>
      <fileResourceSpecifier>
        <fileUrl>file:com/ibm/uima/tutorial/ex6/uimaAcronyms.txt</fileUrl>
      </fileResourceSpecifier>
      <implementationName>com.ibm.uima.tutorial.ex6.StringMapResource_impl</implementationName>
    </externalResource>
  </externalResources>

  <externalResourceBindings>
    <externalResourceBinding>
      <key>AcronymTable</key>
      <resourceName>UimaAcronymTableFile</resourceName>
    </externalResourceBinding>
  </externalResourceBindings>
</resourceManagerConfiguration>

The first section of this XML declares an externalResource, the UimaAcronymTableFile. With this, the fileUrl element specifies the path to the data file. This can be an absolute URL (e.g. one that starts with file:/ or file:///, or file://my.host.org/), but that is not recommended because it makes installation of your component more difficult, as noted earlier. Better is a relative URL, which will be looked up within the classpath (and/or datapath), as used in this example. In this case, the file com/ibm/uima/tutorial/ex6/uimaAcronyms.txt is located in uima_examples.jar, which is in the classpath. If you look in this file you will see the definitions of several UIMA acronyms.

The second section of the XML declares an externalResourceBinding, which connects the key AcronymTable, declared in the annotator’s external resource dependency, to the actual resource name UimaAcronymTableFile. This is rather trivial in this case; for more on bindings see the example UimaMeetingDetectorTAE.xml below. There is no global repository for external resources; it is up to the user to define each resource needed by a particular set of annotators.

In the Component Descriptor Editor, bindings are indicated below the external resource. To create a new binding, you select an external resource (which must have previously been defined), and an external resource dependency, and then click the Bind button, which only enables if you have selected two things to bind together.

When the Analysis Engine is initialized, it creates a single instance of StringMapResource_impl and loads it with the contents of the data file. The UimaAcronymAnnotator then accesses the data through the StringMapResource interface. This single instance could be shared among multiple annotators, as will be explained later.

Note that all resource implementation classes (e.g. StringMapResource_impl in the provided example) need to be public and have public, 0-argument constructors, so that they can be instantiated by the framework. (Although Java classes in which you do not define any constructor will, by default, have a 0-argument constructor that doesn't do anything, a class in which you have defined at least one constructor does not get a default 0-argument constructor.)

This annotator is illustrated in Figure 10. To see it in action, just run it using the Document Analyzer. When it finishes, open up the UIMA_Seminars document in the processed results window (double-click it), and then left-click on one of the highlighted terms to see the expandedForm feature's value.

Figure 10: External Resource Binding

By designing our annotator in this way, we have gained some flexibility. We can freely replace the StringMapResource_impl class with any other implementation that implements the simple StringMapResource interface. (For example, for very large resources we might not be able to have the entire map in memory.) We have also made our external resource dependencies explicit in the descriptor, which will help others to deploy our annotator.

Sharing Resources between Annotators

Another advantage of the Resource Manager is that it allows our data to be shared between annotators. To demonstrate this we have developed another annotator that will use the same acronym table. The UimaMeetingAnnotator will iterate over Meeting annotations discovered by the Meeting Detector we previously developed and attempt to determine whether the topic of the meeting is related to UIMA. It will do this by looking for occurrences of UIMA acronyms in close proximity to the meeting annotation. We could implement this by using the UimaAcronymAnnotator, of course, but for the sake of this example we will have the UimaMeetingAnnotator access the acronym map directly.

The Java code for the UimaMeetingAnnotator in example 6 creates an annotation of a new type, UimaMeeting, if it finds a meeting within 50 characters of a UIMA acronym.

We combine three analysis engines: the UimaAcronymAnnotator to annotate UIMA acronyms, the MeetingDetector from example 4 to find meetings, and finally the UimaMeetingAnnotator to annotate just meetings about UIMA. Together these are assembled to form the new aggregate analysis engine, UimaMeetingDetector. This aggregate and the sharing of a common resource are illustrated in Figure 11.

Figure 11: Component engines of an aggregate share a common resource

The important thing to notice is in the UimaMeetingDetectorTAE.xml aggregate descriptor. It includes both the UimaMeetingAnnotator and the UimaAcronymAnnotator, and contains a single declaration of the UimaAcronymTableFile resource. (The actual example has the order of the first two annotators reversed versus the above picture, which is OK since they do not depend on one another.)

It also binds the resources as follows:

<externalResourceBindings>
  <externalResourceBinding>
    <key>UimaAcronymAnnotator/AcronymTable</key>
    <resourceName>UimaAcronymTableFile</resourceName>
  </externalResourceBinding>
  <externalResourceBinding>
    <key>UimaMeetingAnnotator/UimaTermTable</key>
    <resourceName>UimaAcronymTableFile</resourceName>
  </externalResourceBinding>
</externalResourceBindings>

This binds the resource dependencies of both the UimaAcronymAnnotator (which uses the name AcronymTable) and the UimaMeetingAnnotator (which uses UimaTermTable) to the single declared resource named UimaAcronymTableFile. Therefore they will share the same instance. Resource bindings in the aggregate descriptor override any resource declarations in individual annotator descriptors.

If we wanted to have the annotators use different acronym tables, we could easily do that. We would simply have to change the resourceName elements in the bindings so that they referred to two different resources. The Resource Manager gives us the flexibility to make this decision at deployment time, without changing any Java code.
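
The effect of these bindings can be pictured with a small stand-alone Java sketch. It does not use the UIMA Resource Manager API: the ResourceRegistry class and its methods are hypothetical stand-ins invented for illustration, and the table entry is made up. Only the two keys and the UimaAcronymTableFile resource name come from the descriptor shown above.

```java
import java.util.HashMap;
import java.util.Map;

class SharedResourceSketch {

    /** Hypothetical stand-in for the Resource Manager's bookkeeping; not UIMA API. */
    static final class ResourceRegistry {
        private final Map<String, Object> resourcesByName = new HashMap<>();
        private final Map<String, String> resourceNameByKey = new HashMap<>();

        /** Declares a resource once under its resource name. */
        void declare(String resourceName, Object resource) {
            resourcesByName.put(resourceName, resource);
        }

        /** Binds an annotator's dependency key to a previously declared resource name. */
        void bind(String key, String resourceName) {
            if (!resourcesByName.containsKey(resourceName)) {
                throw new IllegalArgumentException("Undeclared resource: " + resourceName);
            }
            resourceNameByKey.put(key, resourceName);
        }

        /** Returns the resource instance bound to the key, or null if the key is unbound. */
        Object lookup(String key) {
            String resourceName = resourceNameByKey.get(key);
            return resourceName == null ? null : resourcesByName.get(resourceName);
        }
    }

    public static void main(String[] args) {
        ResourceRegistry registry = new ResourceRegistry();

        // One declaration of the table (the entry is illustrative only).
        Map<String, String> acronymTable = new HashMap<>();
        acronymTable.put("AE", "Analysis Engine");
        registry.declare("UimaAcronymTableFile", acronymTable);

        // Two dependency keys bound to the same resource name.
        registry.bind("UimaAcronymAnnotator/AcronymTable", "UimaAcronymTableFile");
        registry.bind("UimaMeetingAnnotator/UimaTermTable", "UimaAcronymTableFile");

        boolean shared = registry.lookup("UimaAcronymAnnotator/AcronymTable")
                == registry.lookup("UimaMeetingAnnotator/UimaTermTable");
        System.out.println("Same instance: " + shared);
    }
}
```

Binding the second key to a different declared resource name would give each annotator its own table without touching either annotator's code, which is the deployment-time flexibility described above.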

Result Specification Setting

The Result Specification is a parameter passed to all annotators, as the second argument in the process(...) call. It is a list of output types and / or type:feature specifications, which are expected to be "output" from the annotator. Annotators may use this to optimize their operations, when possible, for those cases where only particular outputs are wanted. The interface to the Result Specification object (see the JavaDocs) allows querying both types and particular features of types.

Sometimes you can specify the Result Specification; other times you cannot (for instance, inside a Collection Processing Engine). When you cannot specify it, or choose not to (for example, by using the form of the process(...) call on an Analysis Engine that doesn't include the Result Specification), a "Default" Result Specification is used.

Default Result Specification

The default Result Specification is taken from the Engine's output Capability Specification. Remember that a Capability Specification has both inputs and outputs, can specify types and / or features, and there can be more than one Capability Set. If there is more than one set, the logical union of these sets is used. The default Result Specification is exactly what's included in the output Capability Specification.

Passing Result Specifications to Annotators

If you are not using aggregation or collection processing, but instead are instantiating your own primitive analysis engines and calling their process methods, you can pass whatever Result Specification is appropriate in your call to process(CAS, ResultSpecification). For primitive engines, whatever you pass in is passed along as the value of the second argument in the annotator's process() method. If you use the form of the call without the Result Specification, the default Result Specification is created and passed, as above.

Aggregates

For aggregate engines, the value passed to the primitive annotator code depends on the kind of flow.

Fixed Flow

For FixedFlow, any ResultSpecification passed into the aggregate is ignored, and instead, each primitive annotator is passed a result spec that corresponds to the union of its output capability specifications at the primitive descriptor level. If no output capability specification is given, the annotator will still be called, but the result specification will be empty.

CapabilityLanguageFlow

For CapabilityLanguageFlow, each annotator is passed a ResultSpecification that is the intersection of the primitive annotator's output Capability Specification with the ResultSpecification passed to the aggregate. If this intersection is empty (the annotator does not produce any type or feature included in the ResultSpecification), the annotator will not be called at all.

Therefore, if you supply a custom ResultSpecification to an aggregate that uses the CapabilityLanguageFlow, it must include any intermediate types that need to be produced internally in the flow; otherwise the annotators that produce those types will be skipped and later annotators will not find the input they depend on.
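
The intersection rule can be sketched in plain Java, with sets of strings standing in for the type and type:feature entries of a Result Specification. This is only an illustration of the rule as described above, not the framework's implementation; the class name, method names, and the example.* type names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

class ResultSpecIntersectionSketch {

    /** Entries the annotator declares as outputs that the caller also requested. */
    static Set<String> intersect(Set<String> outputCapability, Set<String> requested) {
        Set<String> result = new TreeSet<>(outputCapability);
        result.retainAll(requested);
        return result;
    }

    /** Under the rule above, an annotator with an empty intersection is not called. */
    static boolean isCalled(Set<String> outputCapability, Set<String> requested) {
        return !intersect(outputCapability, requested).isEmpty();
    }

    public static void main(String[] args) {
        // Hypothetical output capability of an annotator that produces an intermediate type.
        Set<String> detectorOutputs = Set.of("example.Meeting");

        // Requesting only the final type leaves the intermediate producer uncalled.
        Set<String> finalOnly = Set.of("example.UimaMeeting");
        // Listing the intermediate type as well keeps it in the flow.
        Set<String> withIntermediate = Set.of("example.Meeting", "example.UimaMeeting");

        System.out.println("final type only:        " + isCalled(detectorOutputs, finalOnly));
        System.out.println("with intermediate type: " + isCalled(detectorOutputs, withIntermediate));
    }
}
```

The first case is the situation the warning describes: the caller asked only for the end result, so the annotator that would have produced the intermediate type has nothing in common with the request.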

Special rule for skipping Analysis Engines

When using the CapabilityLanguageFlow, an annotator will also be skipped if all of its outputs are in the output capability of one or more annotators that have executed previously in the flow. The concept here is that if all of an annotator's output types have already been produced, that annotator will not be called.
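
A minimal sketch of this skipping rule, again using plain Java sets rather than UIMA classes. The class and method names are hypothetical, and the map's iteration order is assumed to be the flow order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SkipRuleSketch {

    /**
     * Returns the keys of the annotators that run, in flow order.
     * An annotator is skipped when every one of its declared outputs is already
     * covered by the outputs of annotators that ran before it.
     */
    static List<String> annotatorsThatRun(LinkedHashMap<String, Set<String>> outputsByAnnotator) {
        Set<String> alreadyProduced = new HashSet<>();
        List<String> ran = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : outputsByAnnotator.entrySet()) {
            if (alreadyProduced.containsAll(entry.getValue())) {
                continue; // nothing new to contribute, so this annotator is skipped
            }
            ran.add(entry.getKey());
            alreadyProduced.addAll(entry.getValue());
        }
        return ran;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Set<String>> flow = new LinkedHashMap<>();
        flow.put("First", Set.of("example.TypeA", "example.TypeB"));
        flow.put("Second", Set.of("example.TypeB"));                  // fully covered by First
        flow.put("Third", Set.of("example.TypeB", "example.TypeC"));  // TypeC is new
        System.out.println(annotatorsThatRun(flow));
    }
}
```

Note that only annotators that actually ran contribute to the "already produced" set, matching the wording "have executed previously in the flow".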

Collection Processing Engines

The Default Result Specification is always used for all components of a Collection Processing Engine.

Class path setup when using JCas

JCas provides Java classes that correspond to each CAS type in an application. These classes are generated by the JCasGen utility (which can be automatically invoked from the Component Descriptor Editor).

The Java source classes generated by the JCasGen utility are typically compiled and packaged into a JAR file. This JAR file must be present in the classpath of the UIMA application.

For more details on setting up this class path, including deployment issues where class loaders are used to isolate multiple UIMA applications inside a single running Java Virtual Machine, see Class Loaders in UIMA.

Using the Shell Scripts

The SDK includes a /bin subdirectory containing shell scripts, for Windows (.bat files) and Linux (.sh files). Many of these scripts invoke sample Java programs which require a class path. The UIMA required files and directories on the class path are set up using the shell script: setUimaClassPath.

If you need to include additional files on the class path, the scripts are set up to add anything you specify in the environment variable UIMA_CLASSPATH to the classpath. So, for example, if you are running the document analyzer and want it to find classes in a JAR file named (on Windows) c:\a\b\c\myProject\myJarFile.jar, you could first issue a set command to set UIMA_CLASSPATH to this file, and then run the documentAnalyzer script:

set UIMA_CLASSPATH=c:\a\b\c\myProject\myJarFile.jar
documentAnalyzer

Other environment variables are used by the shell scripts, as follows:

UIMA_HOME

Path where the UIMA SDK was installed. Set automatically if installing via the InstallShield installer.

JAVA_HOME

(Optional) Path to a Java Runtime Environment. If not set, the JRE that is shipped with the UIMA SDK (InstallShield versions) is used.

UIMA_CLASSPATH

(Optional) if specified, a path specification to use as the default ClassPath.

UIMA_DATAPATH

(Optional) if specified, a path specification to use as the default DataPath (see section 23.2)

VNS_HOST

(Optional) if specified, the network IP name of the host running the Vinci Name Server (VNS) (see The Vinci Naming Service (VNS))

VNS_PORT

(Optional) if specified, the network IP port number of the Vinci Name Server (VNS) (see The Vinci Naming Service (VNS))

ECLIPSE_HOME

(Optional) Needs to be set to the root of your Eclipse installation when using shell scripts that invoke Eclipse (e.g. jcasgen_merge).

Here are some things to avoid doing in your annotator code:

Retaining references to JCas objects between calls to process()

The JCas will be cleared between calls to your annotator's process() method. All of the analysis results related to the previous document will be deleted to make way for analysis of a new document. Therefore, you should never save a reference to a JCas Feature Structure object (i.e. an instance of a class created using JCasGen) and attempt to reuse it in a future invocation of the process() method. If you do so, the results will be undefined.

Careless use of static data

Always keep in mind that an application that uses your annotator may create multiple instances of your annotator class. A multithreaded application may attempt to use two instances of your annotator to process two different documents simultaneously. This will generally not cause any problems as long as your annotator instances do not share static data.

In general, you should not use static variables other than static final constants of primitive or immutable types (int, float, String, etc.). Other types of static variables may allow one annotator instance to set a value that affects another annotator instance, which can lead to unexpected effects. Also, static references to classes that aren't thread-safe are likely to cause errors in multithreaded applications.
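
The difference can be seen in a small stand-alone sketch. The two classes below are hypothetical and deliberately do not implement any UIMA annotator interface; they only contrast a static mutable field with a per-instance field and a static final constant.

```java
class StaticDataSketch {

    /** Risky: the counter is static, so every instance reads and writes the same value. */
    static final class SharedStateAnnotator {
        private static int documentsSeen = 0; // shared by all instances (and not thread-safe)

        int process(String documentText) {
            return ++documentsSeen;
        }
    }

    /** Safer: a static final constant plus state that belongs to one instance only. */
    static final class InstanceStateAnnotator {
        private static final int MAX_DISTANCE = 50; // immutable constant, safe to share
        private int documentsSeen = 0;               // one counter per instance

        int process(String documentText) {
            return ++documentsSeen;
        }

        boolean isNear(int offsetA, int offsetB) {
            return Math.abs(offsetA - offsetB) <= MAX_DISTANCE;
        }
    }

    public static void main(String[] args) {
        SharedStateAnnotator a = new SharedStateAnnotator();
        SharedStateAnnotator b = new SharedStateAnnotator();
        int fromA = a.process("first document");
        int fromB = b.process("second document");
        // b's count continues from a's, although b has processed only one document.
        System.out.println("shared static counter: " + fromA + ", " + fromB);

        InstanceStateAnnotator c = new InstanceStateAnnotator();
        InstanceStateAnnotator d = new InstanceStateAnnotator();
        System.out.println("per-instance counters: "
                + c.process("first document") + ", " + d.process("second document"));
    }
}
```

In a multithreaded application the shared counter would additionally be subject to lost updates, since the increment is not atomic.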

Eclipse (version 3.1 or later) has a feature for viewing Java Logical Structures. When enabled, it lets you see a view of FeatureStructure objects that shows all of their features. For example, here is a view of a FeatureStructure for the RoomNumber annotation from tutorial example 1:

The "annotation" object in Java shows as a two-element object, which is not very convenient for seeing the features. But if you turn on the Java Logical Structure mode by pushing this button:

the features of the FeatureStructure instance will be shown:

This section is an introduction to the syntax used for Analysis Engine Descriptors. Most users do not need to understand these details; they can use the Component Descriptor Editor Eclipse plugin to edit Analysis Engine Descriptors rather than editing the XML directly.

This section walks through the actual XML descriptor for the RoomNumberAnnotator example introduced in section 4.1. The discussion is divided into several logical sections of the descriptor.

The full specification for Analysis Engine Descriptors is defined in Chapter 23, Component Descriptor Reference.

Header and Annotator Class Identification

<?xml version="1.0" encoding="UTF-8" ?>
<!-- Descriptor for the example RoomNumberAnnotator. -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>com.ibm.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.ibm.uima.tutorial.ex1.RoomNumberAnnotator
  </annotatorImplementationName>

The document begins with a standard XML header and a comment. The root element of the document is named <analysisEngineDescription>, and must specify the XML namespace http://uima.apache.org/resourceSpecifier.

The first subelement, <frameworkImplementation>, must contain the value com.ibm.uima.java. The second subelement, <primitive>, contains the Boolean value true, indicating that this XML document describes a Primitive Analysis Engine. A Primitive Analysis Engine consists of a single annotator. It is also possible to construct XML descriptors for non-primitive or Aggregate Analysis Engines; this is covered later.

The next element, <annotatorImplementationName>, contains the fully-qualified class name of our annotator class. This is how the UIMA framework determines which annotator class to instantiate.

Simple Metadata Attributes

<analysisEngineMetaData>
  <name>Room Number Annotator</name>
  <description>An example annotator that searches for room numbers in the IBM Watson research buildings.</description>
  <version>1.0</version>
  <vendor>IBM</vendor>

Shown here are four simple metadata fields: name, description, version, and vendor. Providing values for these fields is optional, but recommended.

Type System Definition

<typeSystemDescription>
  <imports>
    <import location="TutorialTypeSystem.xml"/>
  </imports>
</typeSystemDescription>

This section of the XML descriptor defines which types the annotator works with. The recommended way to do this is to import the type system definition from a separate file, as shown here. The location specified here should be a relative path, and it will be resolved relative to the location of the descriptor that contains the import (here, the RoomNumberAnnotator descriptor). It is also possible to define types directly in the Analysis Engine descriptor, but these types will not be easily shareable by others.

Capabilities

<capabilities>
  <capability>
    <inputs />
    <outputs>
      <type>com.ibm.uima.tutorial.RoomNumber</type>
      <feature>com.ibm.uima.tutorial.RoomNumber:building</feature>
    </outputs>
  </capability>
</capabilities>

The last section of the descriptor describes the Capabilities of the annotator – the Types/Features it consumes (input) and the Types/Features that it produces (output). These must be the names of types and features that exist in the Analysis Engine descriptor’s type system definition.

Our annotator outputs only one type, RoomNumber, and one feature, RoomNumber:building. The fully-qualified names (including namespace) are needed.

The building feature is listed separately here, but clearly specifying every feature for a complex type would be cumbersome. Therefore, a shortcut syntax exists. The <outputs> section above could be replaced with the equivalent section:

<outputs>
  <type allAnnotatorFeatures="true">com.ibm.uima.tutorial.RoomNumber</type>
</outputs>

Configuration Parameters (Optional)

Configuration Parameter Declarations

<configurationParameters>
  <configurationParameter>
    <name>Patterns</name>
    <description>List of room number regular expression patterns.</description>
    <type>String</type>
    <multiValued>true</multiValued>
    <mandatory>true</mandatory>
  </configurationParameter>
  <configurationParameter>
    <name>Locations</name>
    <description>List of locations corresponding to the room number expressions specified by the Patterns parameter.</description>
    <type>String</type>
    <multiValued>true</multiValued>
    <mandatory>true</mandatory>
  </configurationParameter>
</configurationParameters>

The <configurationParameters> element contains the definitions of the configuration parameters that our annotator accepts. We have declared two parameters. For each configuration parameter, the following are specified:

  • name – the name that the annotator code uses to refer to the parameter
  • description – a natural language description of the intent of the parameter
  • type – the data type of the parameter's value – must be one of String, Integer, Float, or Boolean.
  • multiValued – true if the parameter can take multiple values (an array), false if the parameter takes only a single value.
  • mandatory – true if a value must be provided for the parameter

Both of our parameters are mandatory and accept an array of Strings as their value.

Configuration Parameter Settings

<configurationParameterSettings>
  <nameValuePair>
    <name>Patterns</name>
    <value>
      <array>
        <string>\b[0-4]\d-[0-2]\d\d\b</string>
        <string>\b[G1-4][NS]-[A-Z]\d\d\b</string>
        <string>\bJ[12]-[A-Z]\d\d\b</string>
      </array>
    </value>
  </nameValuePair>
  <nameValuePair>
    <name>Locations</name>
    <value>
      <array>
        <string>Watson - Yorktown</string>
        <string>Watson - Hawthorne I</string>
        <string>Watson - Hawthorne II</string>
      </array>
    </value>
  </nameValuePair>
</configurationParameterSettings>
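
The two settings are meant to be read together: per the description of the Locations parameter, each location corresponds to a pattern. A stand-alone Java sketch of how such parallel arrays might be used follows. It hard-codes the values instead of reading them from the UIMA context, it assumes the patterns are ordinary Java regular expressions in which \b and \d carry their backslashes, and it assumes the correspondence is by array index. The class name, method name, and sample sentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RoomPatternSketch {

    // Values copied from the Patterns and Locations settings (backslashes assumed).
    static final String[] PATTERNS = {
        "\\b[0-4]\\d-[0-2]\\d\\d\\b",
        "\\b[G1-4][NS]-[A-Z]\\d\\d\\b",
        "\\bJ[12]-[A-Z]\\d\\d\\b"
    };
    static final String[] LOCATIONS = {
        "Watson - Yorktown",
        "Watson - Hawthorne I",
        "Watson - Hawthorne II"
    };

    /** Returns "room @ location" strings, pairing pattern i with location i. */
    static List<String> findRooms(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < PATTERNS.length; i++) {
            Matcher matcher = Pattern.compile(PATTERNS[i]).matcher(text);
            while (matcher.find()) {
                found.add(matcher.group() + " @ " + LOCATIONS[i]);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (String room : findRooms("Meet in GN-K35, then move to 03-119.")) {
            System.out.println(room);
        }
    }
}
```

Results are grouped by pattern rather than by position in the text, because the outer loop runs over the patterns; an annotator that needs document order would sort by match offset.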

Aggregate Analysis Engine Descriptor

<?xml version="1.0" encoding="UTF-8" ?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>com.ibm.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="RoomNumber">
      <import location="../ex2/RoomNumberAnnotator.xml" />
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="DateTime">
      <import location="TutorialDateTime.xml" />
    </delegateAnalysisEngine>
  </delegateAnalysisEngineSpecifiers>

The first difference between this descriptor and an individual annotator's descriptor is that the <primitive> element contains the value false. This indicates that this Analysis Engine (AE) is an aggregate AE rather than a primitive AE.

Then, instead of a single annotator class name, we have a list of delegateAnalysisEngineSpecifiers. Each specifies one of the components that constitute our Aggregate. We refer to each component by the relative path from this XML descriptor to the component AE's XML descriptor.

This list of component AEs does not imply a fixed ordering. Ordering is done by another section of the descriptor:

<analysisEngineMetaData>
  <name>Aggregate TAE - Room Number and DateTime Annotators</name>
  <description>Detects Room Numbers, Dates, and Times</description>
  <flowConstraints>
    <fixedFlow>
      <node>RoomNumber</node>
      <node>DateTime</node>
    </fixedFlow>
  </flowConstraints>

Currently, a fixedFlow is required, and we must specify the exact ordering in which the AEs will be executed. In this case, it doesn't really matter, since the RoomNumber and DateTime annotators do not have any dependencies on one another.

Finally, the descriptor has a capabilities section, which has exactly the same syntax as a primitive AE's capabilities section:

<capabilities>
  <capability>
    <inputs />
    <outputs>
      <type allAnnotatorFeatures="true">com.ibm.uima.tutorial.RoomNumber</type>
      <type allAnnotatorFeatures="true">com.ibm.uima.tutorial.DateAnnot</type>
      <type allAnnotatorFeatures="true">com.ibm.uima.tutorial.TimeAnnot</type>
    </outputs>
    <languagesSupported>
      <language>en</language>
    </languagesSupported>
  </capability>
</capabilities>