Collection Processing Engine Developer's Guide

The UIMA Analysis Engine interface provides support for developing and integrating algorithms that analyze unstructured data. Analysis Engines are designed to operate on a per-document basis. Their interface handles one CAS at a time. UIMA provides additional support for applying analysis engines to collections of unstructured data with its Collection Processing Architecture. The Collection Processing Architecture defines additional components for reading raw data formats from data collections, preparing the data for processing by Analysis Engines, executing the analysis, extracting analysis results, and deploying the overall flow in a variety of local and distributed configurations.

The functionality defined in the Collection Processing Architecture is implemented by a Collection Processing Engine (CPE). A CPE includes an Analysis Engine and adds a Collection Reader, a CAS Initializer, and CAS Consumers. The part of the UIMA Framework that supports CPEs is called the Collection Processing Manager, or CPM.

A Collection Reader provides the interface to the raw input data and knows how to iterate over the data collection. Collection Readers are discussed in Section 5.4.1. The CAS Initializer prepares an individual data item for analysis and loads it into the CAS. CAS Initializers are discussed in Section 5.4.2. A CAS Consumer extracts analysis results from the CAS and may also perform collection level processing, or analysis over a collection of CASes. CAS Consumers are discussed in Section 5.4.3.

Analysis Engines and CAS Consumers are both instances of CAS Processors. A CPM may contain multiple CAS Processors. An Analysis Engine may be a Primitive or an Aggregate (composed of other Analysis Engines). Aggregates may contain CAS Consumers. While Collection Readers and CAS Initializers always run in the same JVM as the CPM, a CAS Processor may be deployed in a variety of local and distributed modes, providing a number of options for scalability and robustness. The different deployment options are covered in detail in Section 5.5.

Each of the components in a CPE has an interface specified by the UIMA Collection Processing Architecture and is described by a declarative XML descriptor file. Similarly, the CPE itself has a well defined component interface and is described by a declarative XML descriptor file.

A user creates a CPE by assembling the components mentioned above. The UIMA SDK provides a graphical tool, called the CPE Configurator, for assisting in the assembly of CPEs. Use of this tool is summarized in Section 5.2, and more details can be found in Chapter 13, Collection Processing Engine Configurator User's Guide. Alternatively, a CPE can be assembled by writing an XML CPE descriptor. Details on the CPE descriptor, including its syntax and content, can be found in Chapter 24, Collection Processing Engine Descriptor Reference. The individual components have associated XML descriptors, each of which can be created and/or edited using the Component Description Editor.

A CPE is executed by a UIMA infrastructure component called the Collection Processing Manager (CPM). The CPM provides a number of services and deployment options that cover instantiation and execution of CPEs, error recovery, and local and distributed deployment of the CPE components.

Figure 12 illustrates the data flow that occurs between the different types of components that make up a CPE.

CPE Components

The components of a CPE are:

  • Collection Reader – interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.
  • Analysis Engine – takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.
  • CAS Consumer – consumes the enriched CAS that was produced by the sequence of Analysis Engines before it, and produces an application-specific data structure, such as a search engine index or database.

A fourth type of component, the CAS Initializer, may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS. The Collection Processing Manager orchestrates the data flow within a CPE, monitors status, optionally manages the life-cycle of internal components and collects statistics.
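The data flow described above can be pictured with a small sketch. This is not UIMA code: the `SketchProcessor` interface, the `SketchPipeline` class, and the use of a `Map` as a stand-in for a CAS are all hypothetical simplifications chosen for illustration. The sketch only shows the order in which a reader, an analysis step, and a consumer touch each item.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the UIMA types.
// A "CAS" is modeled as a simple map from names to values.
interface SketchProcessor {
    void process(Map<String, Object> cas);
}

final class SketchPipeline {
    // Plays the role of the CPM: pulls items from the reader, then hands each
    // stand-in CAS to every processor in order (analysis first, consumers last).
    static int run(Iterator<String> reader, List<SketchProcessor> processors) {
        int count = 0;
        while (reader.hasNext()) {
            Map<String, Object> cas = new HashMap<>();
            cas.put("text", reader.next());
            for (SketchProcessor p : processors) {
                p.process(cas);
            }
            count++;
        }
        return count;
    }

    static List<Integer> demo() {
        final List<Integer> index = new ArrayList<>();
        List<SketchProcessor> processors = new ArrayList<>();
        // "Analysis Engine": enriches the CAS with a word count.
        processors.add(cas -> cas.put("wordCount",
                ((String) cas.get("text")).trim().split("\\s+").length));
        // "CAS Consumer": extracts the result into an application structure.
        processors.add(cas -> index.add((Integer) cas.get("wordCount")));

        List<String> docs = new ArrayList<>();
        docs.add("one two three");
        docs.add("four five");
        run(docs.iterator(), processors);
        return index;
    }
}
```

In this sketch the consumer's list plays the role of the application-specific data structure (for example, a search index) mentioned in the component list.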

CASes are not saved in a persistent way by the framework. If you want to save CASes, then you have to save each CAS as it comes through (for example) using a CAS Consumer you write to do this, in whatever format you like. The UIMA SDK supplies an example CAS Consumer to save CASes to files, in the externalized XCAS format (an XML version of the CAS). It also supplies an example CAS Consumer to extract information from CASes and store the results into a relational Database, using Java's JDBC APIs.

Using the CPE Configurator

A CPE can be assembled by writing an XML CPE descriptor. Details on the CPE descriptor, including its syntax and content, can be found in Chapter 24, Collection Processing Engine Descriptor Reference. Rather than edit raw XML, you may develop a CPE Descriptor using the CPE Configurator tool. The CPE Configurator tool is described briefly in this section, and in more detail in Chapter 13, Collection Processing Engine Configurator User's Guide.

The CPE Configurator tool can be run from Eclipse (see Running the CPE Configurator from Eclipse ), or using the cpeGui shell script (cpeGui.bat on Windows, cpeGui.sh on Unix), which is located in the bin directory of the UIMA SDK installation. Executing this batch file will display the window shown here:

The window is divided into 4 sections, one each for the Collection Reader, CAS Initializer, Analysis Engines, and CAS Consumers. In each section, you select the component(s) you want to include in the CPE by browsing to their XML descriptors. The configuration parameters present in the XML descriptors will then be displayed in the GUI; these can be modified to override the values present in the descriptor. For example, the screen shot below shows the CPE Configurator after the following components have been chosen:

Collection Reader: %UIMA_HOME%\docs\examples\descriptors\collection_reader\FileSystemCollectionReader.xml

Analysis Engine: %UIMA_HOME%\docs\examples\descriptors\analysis_engine\NamesAndPersonTitles_TAE.xml

CAS Consumer: %UIMA_HOME%\docs\examples\descriptors\cas_consumer\XCasWriterCasConsumer.xml

For the File System Collection Reader, ensure that the Input Directory is set to %UIMA_HOME%\docs\examples\data. The other parameters may be left blank. For the XCAS Writer CAS Consumer, ensure that the Output Directory is set to %UIMA_HOME%\docs\examples\data\processed.

After selecting each of the components and providing configuration settings, click the play (forward arrow) button at the bottom of the screen to begin processing. A progress bar should be displayed in the lower left corner. (Note that the progress bar will not begin to move until all components have completed their initialization, which may take several seconds.) Once processing has begun, the pause and stop buttons become enabled.

If an error occurs, you will be informed by an error dialog. If processing completes successfully, you will be presented with a performance report.

Using the File menu, you can select Save CPE Descriptor to create an .xml descriptor file that defines the CPE you have constructed. Later, you can use Open CPE Descriptor to restore the CPE Configurator to the saved state. CPE descriptors can also be used to run a CPE from a Java program – see Section 5.3. CPE Descriptors allow specifying operational parameters, such as error handling options, that are not currently available for configuration through the CPE Configurator. For more information on manually creating a CPE Descriptor, see Chapter 24, Collection Processing Engine Descriptor Reference.

Note that CPE descriptors identify which components comprise the CPE, but they do not capture the individual configuration settings for these components. That information is kept in the individual component descriptors. If you have made changes to these settings in the CPE Configurator tool and wish to save the settings back to the original descriptor files, use the File –> Save Component Configuration action.

The CPE configured above runs a simple name and title annotator on the sample data provided with the UIMA SDK and stores the results using the XCAS Writer CAS Consumer. To view the results, start the XCAS Annotation Viewer by running the xcasAnnotationViewer batch file (xcasAnnotationViewer.bat on Windows, xcasAnnotationViewer.sh on Unix), which is located in the bin directory of the UIMA SDK installation. Executing this batch file will display the window shown here:

Ensure that the Input Directory is the same as the Output Directory specified for the XCAS Writer CAS Consumer in the CPE configured above (e.g., %UIMA_HOME%\docs\examples\data\processed) and that the TAE Descriptor File is set to the Analysis Engine used in the CPE configured above (e.g., %UIMA_HOME%\docs\examples\descriptors\analysis_engine\NamesAndPersonTitles_TAE.xml).

Click the View button to display the Analyzed Documents window:

Double click on any document in the list to view the analyzed document. Double clicking the first document, IBM_LifeSciences.txt, will bring up the following window:

This window shows the analysis results for the document. Clicking on any highlighted annotation causes the details for that annotation to be displayed in the right-hand pane. Here the annotation spanning "John M. Thompson" has been clicked.

Congratulations! You have successfully configured a CPE, saved its descriptor, run the CPE, and viewed the analysis results.

Running the CPE Configurator from Eclipse

If you have followed the instructions in Chapter 3, UIMA SDK Setup for Eclipse and imported the example Eclipse project, then you should already have a Run configuration for the CPE Configurator tool (called UIMA CPE GUI) configured to run in the example project. Simply run that configuration to start the CPE Configurator.

If you haven’t followed the Eclipse setup instructions and wish to run the CPE Configurator tool from Eclipse, you will need to do the following. As installed, this Eclipse launch configuration is associated with the "uima_examples" project. If you've not already done so, you may wish to import that project into your Eclipse workspace. It's located in %UIMA_HOME%/docs/examples. Doing this will supply the Eclipse launcher with all the class files it needs to run the CPE configurator. If you don't do this, please manually add the JAR files for UIMA to the launch configuration.

Also, you need to add any projects or JAR files for any UIMA components you will be running to the launch class path.

  • A simpler alternative may be to change the CPE launch configuration to be based on your project. If you do that, it will pick up all the files in your project's class path, which you should set up to include all the UIMA framework files. An easy way to do this is to specify in your project's properties' build-path that the uima_examples project is on the build path.

Next, in the Eclipse menu select Run–> Run..., which brings up the Run configuration screen.

In the Main tab, set the main class to com.ibm.uima.reference_impl.application.cpm.CpmFrame

In the arguments tab, add the following to the VM arguments
-Xms128M -Xmx256M -Duima.home="C:\Program Files\IBM\uima" (or wherever you installed the UIMA SDK)

Click the Run button to launch the CPE Configurator, and use it as previously described in this section.

The simplest way to run a CPE from a Java application is to first create a CPE descriptor as described in the previous section. Then the CPE can be instantiated and run using the following code:

    //parse CPE descriptor in file specified on command line
    CpeDescription cpeDesc = UIMAFramework.getXMLParser().
        parseCpeDescription(new XMLInputSource(args[0]));

    //instantiate CPE
    mCPE = UIMAFramework.produceCollectionProcessingEngine(cpeDesc);

    //Create and register a Status Callback Listener
    mCPE.addStatusCallbackListener(new StatusCallbackListenerImpl());

    //Start Processing
    mCPE.process();

This will start the CPE running in a separate thread.

Using Listeners

Updates of the CPM's progress, including any errors that occur, are sent to the callback handler that is registered by the call to addStatusCallbackListener, above. The callback handler is a class that implements the CPM's StatusCallbackListener interface. It responds to events by printing messages to the console. The source code is fairly straightforward and is not included in this chapter – see the com.ibm.uima.examples.cpe.SimpleRunCPE.java in the %UIMA_HOME%\docs\examples\src directory for the complete code.
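Because processing runs on a separate thread, an application that needs to wait for the CPE to finish has to be told when it is done. The following is a minimal sketch of that pattern using only JDK classes. `SketchStatusListener`, `SketchEngine`, and `WaitForCompletion` are hypothetical stand-ins invented for this illustration; the real StatusCallbackListener interface is documented in the JavaDocs and differs from this simplified version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a status listener; NOT the UIMA interface.
interface SketchStatusListener {
    void entityProcessed(String docName);
    void collectionComplete();
}

// Hypothetical stand-in for an engine whose process() returns immediately.
final class SketchEngine {
    private final List<String> docs;
    private final List<SketchStatusListener> listeners = new ArrayList<>();

    SketchEngine(List<String> docs) {
        this.docs = docs;
    }

    void addListener(SketchStatusListener listener) {
        listeners.add(listener);
    }

    // Starts the work on a separate thread and returns without waiting.
    void process() {
        Thread worker = new Thread(() -> {
            for (String d : docs) {
                for (SketchStatusListener l : listeners) {
                    l.entityProcessed(d);
                }
            }
            for (SketchStatusListener l : listeners) {
                l.collectionComplete();
            }
        });
        worker.start();
    }
}

final class WaitForCompletion {
    // Runs the engine and blocks the caller until the completion callback fires.
    static List<String> runAndWait(List<String> docs) throws InterruptedException {
        final List<String> seen = Collections.synchronizedList(new ArrayList<String>());
        final CountDownLatch done = new CountDownLatch(1);

        SketchEngine engine = new SketchEngine(docs);
        engine.addListener(new SketchStatusListener() {
            public void entityProcessed(String docName) {
                seen.add(docName);
            }
            public void collectionComplete() {
                done.countDown();
            }
        });

        engine.process(); // returns immediately
        if (!done.await(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("processing did not finish in time");
        }
        return seen;
    }
}
```

The latch is one possible way to bridge the asynchronous callback and a caller that wants to block; an application with its own event loop might simply react to the completion callback instead.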

If you need more control over the information in the CPE descriptor, you can manually configure it via its API. See the JavaDocs for package com.ibm.uima.collection for more details.

This section is an introduction to the process of developing Collection Readers, CAS Initializers, and CAS Consumers. The code snippets refer to the classes that can be found in %UIMA_HOME%\docs\examples\src example project.

In the following sections, classes you write to represent components need to be public and have public, 0-argument constructors, so that they can be instantiated by the framework. (Although Java classes in which you do not define any constructor will, by default, have a 0-argument constructor that doesn't do anything, a class in which you have defined at least one constructor does not get a default 0-argument constructor.)
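The constructor rule can be checked with a few lines of reflection. The sketch below assumes a loader that instantiates a component through its public zero-argument constructor; this is an assumption about how such a loader might work, not a description of the UIMA internals. All class names are hypothetical.

```java
// Sketch of why a public zero-argument constructor matters when a framework
// creates a component reflectively. Class names here are hypothetical.
final class ConstructorCheck {
    // No constructor declared: the compiler supplies a public zero-argument one.
    public static class ImplicitCtorComponent { }

    // Declaring any constructor suppresses the implicit zero-argument one.
    public static class ArgOnlyComponent {
        public ArgOnlyComponent(String name) { }
    }

    // Tries to instantiate the class the way a reflective loader might.
    static boolean canInstantiate(Class<?> type) {
        try {
            type.getConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // For example NoSuchMethodException: no public no-arg constructor.
            return false;
        }
    }
}
```

Here `canInstantiate(ConstructorCheck.ImplicitCtorComponent.class)` succeeds, while the same call on `ArgOnlyComponent` fails until an explicit public zero-argument constructor is added.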

Developing Collection Readers

A Collection Reader is responsible for obtaining documents from the collection and returning each document as a CAS. Like all UIMA components, a Collection Reader consists of two parts – the code and an XML descriptor.

A simple example of a Collection Reader is the "File System Collection Reader," which simply reads documents from files in a specified directory. The Java code is in the class com.ibm.uima.examples.cpe.FileSystemCollectionReader and the XML descriptor is %UIMA_HOME%\docs\examples\descriptors\collection_reader\FileSystemCollectionReader.xml.

Java Class

The Java class for a Collection Reader must implement the com.ibm.uima.collection.CollectionReader interface. You may build your Collection Reader from scratch and implement this interface, or you may extend the convenience base class com.ibm.uima.collection.CollectionReader_ImplBase.

The convenience base class provides default implementations for many of the methods defined in the CollectionReader interface, and provides abstract definitions for those methods that you are required to implement in your new Collection Reader. Note that if you extend this base class, you do not need to declare that your new Collection Reader implements the CollectionReader interface.

Eclipse tip – if you are using Eclipse, you can quickly create the boiler plate code and stubs for all of the required methods by clicking File –> New –> Class to bring up the "New Java Class" dialogue, specifying com.ibm.uima.collection.CollectionReader_ImplBase as the Superclass, and checking "Inherited abstract methods" in the section "Which method stubs would you like to create?", e.g.,

For the rest of this section we will assume that your new Collection Reader extends the CollectionReader_ImplBase class, and we will show examples from the com.ibm.uima.examples.cpe.FileSystemCollectionReader. If you must inherit from a different super class, you must ensure that your Collection Reader implements the CollectionReader interface – see the JavaDocs for CollectionReader for more details.

Required Methods

The following abstract methods must be implemented:

initialize()

The initialize() method is called by the framework when the Collection Reader is first created. CollectionReader_ImplBase actually provides a default implementation of this method (i.e., it is not abstract), so you are not strictly required to implement this method. However, a typical Collection Reader will implement this method to obtain parameter values and perform various initialization steps.

In this method, the Collection Reader class can access the values of its configuration parameters and perform other initialization logic. The example File System Collection Reader reads its configuration parameters and then builds a list of files in the specified input directory, as follows:

    public void initialize() throws ResourceInitializationException {
      File directory = new File(
          (String)getConfigParameterValue(PARAM_INPUTDIR));
      mEncoding = (String)getConfigParameterValue(PARAM_ENCODING);
      mDocumentTextXmlTagName =
          (String)getConfigParameterValue(PARAM_XMLTAG);
      mLanguage = (String)getConfigParameterValue(PARAM_LANGUAGE);
      mCurrentIndex = 0;

      //get list of files (not subdirectories) in the specified directory
      mFiles = new ArrayList();
      File[] files = directory.listFiles();
      for (int i = 0; i < files.length; i++) {
        if (!files[i].isDirectory()) {
          mFiles.add(files[i]);
        }
      }
    }

  • This is the zero-argument version of the initialize method. There is also a method on the Collection Reader interface called initialize(ResourceSpecifier, Map) but it is not recommended that you override this method in your code. That method performs internal initialization steps and then calls the zero-argument initialize().
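The relationship between the two initialize methods is a template-method arrangement: the framework-facing method does its own setup and then calls the zero-argument hook. A minimal sketch of that calling pattern follows. `SketchReaderBase` and `DirectoryReaderSketch` are hypothetical classes written for this illustration, and the `Map` parameter is a simplification of the real method signature.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in base class; it only imitates the calling pattern
// described above and is not the UIMA implementation.
abstract class SketchReaderBase {
    private Map<String, Object> params = new HashMap<>();

    // "Framework" entry point: stores configuration, then calls the hook.
    // Subclasses are expected to leave this method alone.
    public final void initialize(Map<String, Object> settings) {
        params = new HashMap<>(settings);
        initialize();
    }

    // Hook for subclasses; the default does nothing.
    public void initialize() { }

    protected Object getConfigParameterValue(String name) {
        return params.get(name);
    }
}

final class DirectoryReaderSketch extends SketchReaderBase {
    String inputDir;

    @Override
    public void initialize() {
        // Parameters are already available by the time the hook runs.
        inputDir = (String) getConfigParameterValue("InputDirectory");
    }
}
```

Overriding only the zero-argument hook keeps the subclass independent of whatever internal steps the outer method performs.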

hasNext()

The hasNext() method returns whether or not there are any documents remaining to be read from the collection. The File System Collection Reader's hasNext() method is very simple. It just checks if there are any more files left to be read:

    public boolean hasNext() {
      return mCurrentIndex < mFiles.size();
    }

getNext(CAS)

The getNext() method reads the next document from the collection and populates a CAS. In the simple case, this amounts to reading the file and calling the CAS's setDocumentText method. The example File System Collection Reader is slightly more complex. It first checks for a CAS Initializer. If the CPE includes a CAS Initializer, the CAS Initializer is used to read the document and initialize the CAS. If the CPE does not include a CAS Initializer, the File System Collection Reader reads the document and sets the document text in the CAS.

The File System Collection Reader also stores additional metadata about the document in the CAS. In particular, it sets the document's language in the special built-in feature structure uima.tcas.DocumentAnnotation (see Chapter 26, CAS Reference for details about this built-in type) and creates an instance of com.ibm.uima.examples.SourceDocumentInformation, which stores information about the document’s source location. This information may be useful to downstream components such as CAS Consumers. Note that the type system descriptor for this type can be found in com.ibm.uima.examples.SourceDocumentInformation.xml.

The getNext() method for the File System Collection Reader looks like this:

    public void getNext(CAS aCAS) throws IOException, CollectionException {
      JCas jcas;
      try {
        jcas = aCAS.getJCas();
      } catch (CASException e) {
        throw new CollectionException(e);
      }

      //open input stream to file
      File file = (File)mFiles.get(mCurrentIndex++);
      FileInputStream fis = new FileInputStream(file);
      try {
        //if there's a CAS Initializer, call it
        if (getCasInitializer() != null) {
          getCasInitializer().initializeCas(fis, aCAS);
        } else {
          //No CAS Initializer, so read file and set document text here
          byte[] contents = new byte[(int)file.length()];
          fis.read(contents);
          String text;
          if (mEncoding != null) {
            text = new String(contents, mEncoding);
          } else {
            text = new String(contents);
          }
          //put document in CAS (assume this CAS is a view of a Text CAS)
          jcas.setDocumentText(text);
        }
      } finally {
        if (fis != null)
          fis.close();
      }

      //set language if it was explicitly specified as a
      //configuration parameter
      if (mLanguage != null) {
        ((DocumentAnnotation)jcas.getDocumentAnnotationFs())
            .setLanguage(mLanguage);
      }

      //Also store file location information in CAS metadata.
      //This information is critical if CAS Consumers will need to
      //know where the original document contents are located.
      //For example, the Semantic Search CAS Indexer writes this
      //information into the search index that it creates, which allows
      //applications that use the search index to locate the documents
      //that satisfy their semantic queries.
      SourceDocumentInformation srcDocInfo =
          new SourceDocumentInformation(jcas);
      srcDocInfo.setUri(file.getAbsoluteFile().toURL().toString());
      srcDocInfo.setOffsetInSource(0);
      srcDocInfo.setDocumentSize((int)file.length());
      srcDocInfo.addToIndexes();
    }

The Collection Reader can create additional annotations in the CAS at this point, in the same way that annotators create annotations. However, if you are doing complex initialization of the CAS, it may be better to use a CAS Initializer as described in Section 5.4.2.
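One detail of the getNext() example is worth noting when adapting it: a single call to InputStream.read(byte[]) is not guaranteed to fill the buffer, so it may return before the whole document has been read. The sketch below shows one possible alternative for the "no CAS Initializer" branch, using only JDK classes. The class name `DocumentTextLoader` is hypothetical and is not part of the UIMA examples.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, shown only to illustrate reading a document in full.
final class DocumentTextLoader {
    // Reads the whole file and decodes it. A null encoding name falls back to
    // the platform default, mirroring the branch in the example above.
    static String load(Path file, String encodingName) throws IOException {
        byte[] contents = Files.readAllBytes(file);
        Charset charset = (encodingName != null)
                ? Charset.forName(encodingName)
                : Charset.defaultCharset();
        return new String(contents, charset);
    }
}
```

The returned string could then be passed to setDocumentText in the same way as in the original example.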

getProgress()

The Collection Reader is responsible for returning progress information; that is, how much of the collection has been read thus far and how much remains to be read. The framework defines progress very generally; the Collection Reader simply returns an array of Progress objects, where each object contains three fields – the amount already completed, the total amount (if known), and a unit (e.g. entities (documents), bytes, or files). The method returns an array so that the Collection Reader can report progress in multiple different units, if that information is available. The File System Collection Reader's getProgress() method looks like this:

    public Progress[] getProgress() {
      return new Progress[]{
          new ProgressImpl(mCurrentIndex, mFiles.size(), Progress.ENTITIES)};
    }

In this particular example, the total number of files in the collection is known, but the total size of the collection is not known. As such, a ProgressImpl object for Progress.ENTITIES is returned, but a ProgressImpl object for Progress.BYTES is not.
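To illustrate reporting progress in more than one unit, here is a small sketch in which the total is known for one unit and unknown for another. `SketchProgress` and `MultiUnitProgress` are hypothetical classes, not the UIMA Progress and ProgressImpl types, and the use of -1 to mean "unknown total" is a convention chosen for this sketch only.

```java
// Hypothetical stand-in for a progress record; NOT the UIMA Progress type.
final class SketchProgress {
    static final long UNKNOWN = -1; // convention chosen for this sketch only

    final long completed;
    final long total;
    final String unit;

    SketchProgress(long completed, long total, String unit) {
        this.completed = completed;
        this.total = total;
        this.unit = unit;
    }

    // Human-readable summary of one progress entry.
    String describe() {
        if (total == UNKNOWN) {
            return completed + " " + unit + " (total unknown)";
        }
        return completed + " of " + total + " " + unit;
    }
}

final class MultiUnitProgress {
    // Reports the same position in two units: files (total known) and
    // bytes (total not known in advance).
    static SketchProgress[] report(int filesRead, int fileCount, long bytesRead) {
        return new SketchProgress[] {
            new SketchProgress(filesRead, fileCount, "files"),
            new SketchProgress(bytesRead, SketchProgress.UNKNOWN, "bytes")
        };
    }
}
```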

close()

The close method is called when the Collection Reader is no longer needed. The Collection Reader should then release any resources it may be holding. The FileSystemCollectionReader does not hold resources and so has an empty implementation of this method:

public void close() throws IOException { }

Optional Methods

The following methods may be implemented:

reconfigure()

This method is called if the Collection Reader's configuration parameters change.

typeSystemInit()

If you are only setting the document text in the CAS, or if you are using the JCas (recommended, as in the current example), you do not have to implement this method. If you are directly using the CAS API, this method is used in the same way as it is used for an annotator – see Chapter 4, Annotator and Analysis Engine Developer’s Guide for more information.

Threading considerations

Collection readers do not have to be thread safe; they are run with a single thread per instance, and only one instance per instance of the Collection Processing Manager (CPM) is made.

XML Descriptor

You can use the Component Description Editor to create and / or edit the File System Collection Reader's descriptor. Here is its descriptor (abbreviated somewhat to fit on a page), which is very similar to an Analysis Engine descriptor:

    <collectionReaderDescription xmlns="http://uima.apache.org/resourceSpecifier">
      <frameworkImplementation>com.ibm.uima.java</frameworkImplementation>
      <implementationName>com.ibm.uima.util.FileSystemCollectionReader</implementationName>
      <processingResourceMetaData>
        <name>File System Collection Reader</name>
        <description>Reads text files from the filesystem</description>
        <version>1.0</version>
        <vendor>IBM</vendor>
        <configurationParameters>
          <configurationParameter>
            <name>InputDirectory</name>
            <description>Directory containing input files</description>
            <type>String</type>
            <multiValued>false</multiValued>
            <mandatory>true</mandatory>
          </configurationParameter>
          <!-- Other Configuration Parameters Omitted -->
        </configurationParameters>
        <configurationParameterSettings>
          <nameValuePair>
            <name>InputDirectory</name>
            <value>
              <string>C:\program files\uima\data</string>
            </value>
          </nameValuePair>
        </configurationParameterSettings>
        <!-- Type System of CASes returned by this Collection Reader -->
        <typeSystemDescription>
          <imports>
            <import name="com.ibm.uima.examples.SourceDocumentInformation"/>
          </imports>
        </typeSystemDescription>
        <capabilities>
          <capability>
            <inputs/>
            <outputs>
              <type allAnnotatorFeatures="true">com.ibm.uima.examples.SourceDocumentInformation</type>
            </outputs>
          </capability>
        </capabilities>
      </processingResourceMetaData>
    </collectionReaderDescription>

Developing CAS Initializers

Although Collection Readers can directly write to the CAS, it is best that they do so only for simple cases. If the task of populating the CAS from a raw document is complex and might be reusable with other data collections, then it is worthwhile to encapsulate it in a separate CAS Initializer component.

An example where the use of a CAS Initializer is ideal is a scenario where the documents in the collection contain inline HTML or XML markup. Since Analysis Engines often ingest plain-text documents with stand-off annotations, it is necessary to translate the inline HTML or XML markup into this form. For example, an HTML document with inline <p> and <h1> tags could be translated into a CAS with a plain-text document and stand-off Paragraph and Heading annotations. Since this HTML parsing logic could be used regardless of the source of the HTML documents (e.g. a file system, a web connection, or a relational database), it would be ideal to implement this using a CAS Initializer that could be plugged-in to multiple Collection Readers.

A CAS Initializer Java class must implement the interface com.ibm.uima.collection.CasInitializer, and will also generally extend from the convenience base class com.ibm.uima.collection.CasInitializer_ImplBase. A CAS Initializer also must have an XML descriptor, which has the exact same form as a Collection Reader Descriptor except that the outer tag is <casInitializerDescription>.

CAS Initializers have optional initialize(), reconfigure(), and typeSystemInit() methods, which perform the same functions as they do for Collection Readers. The only required method for a CAS Initializer is initializeCas(Object, CAS). This method takes the raw document (for example, an InputStream object from which the document can be read) and a CAS, and populates the CAS from the document.

An example CAS Initializer is implemented by the class com.ibm.uima.examples.cpe.SimpleXmlCasInitializer. The SimpleXmlCasInitializer shows how a CAS Initializer can invoke an XML Parser on the raw document. In this very simple example the only thing extracted from the XML document is the text to be processed. You can configure the SimpleXmlCasInitializer with the name of an XML tag that contains the text; it will then filter out the rest of the document.

Here is the implementation of the initializeCas() method for this example:

    public void initializeCas(Object aObj, CAS aCAS)
        throws CollectionException, IOException {
      //build SAX InputSource object from InputStream supplied
      //by the CollectionReader
      InputSource inputSource;
      if (aObj instanceof InputStream) {
        inputSource = new InputSource((InputStream)aObj);
      } else {
        throw new CollectionException(
            CollectionException.INCORRECT_INPUT_TO_CAS_INITIALIZER,
            new Object[]{InputStream.class.getName(),
                aObj.getClass().getName()});
      }

      //create SAX ContentHandler that populates CAS
      SaxHandler handler = new SaxHandler(aCAS);

      //parse
      try {
        SAXParser parser = mParserFactory.newSAXParser();
        XMLReader reader = parser.getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(inputSource);
      } catch (Exception e) {
        throw new CollectionException(e);
      }
    }

The SaxHandler class referenced here is an inner class that does the actual work of extracting the text from the specified XML element. For the full implementation, see the example code under docs/examples.
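For readers who want a feel for what such a handler does without opening the example source, the following is a minimal, self-contained sketch using only the JDK's SAX classes. It is not the SaxHandler from the example code: `TagTextHandler` is a hypothetical class that collects the character data found inside one named element and returns it as a string, rather than writing it into a CAS.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative handler (hypothetical name) that collects character data found
// inside one named element and ignores everything else.
final class TagTextHandler extends DefaultHandler {
    private final String tagName;
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // greater than 0 while inside the wanted element

    TagTextHandler(String tagName) {
        this.tagName = tagName;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (qName.equals(tagName)) {
            depth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(tagName)) {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString();
    }

    // Convenience entry point: parse a stream and return the extracted text.
    static String extract(InputStream in, String tagName) throws Exception {
        TagTextHandler handler = new TagTextHandler(tagName);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(in), handler);
        return handler.getText();
    }
}
```

A real CAS Initializer would pass the collected text to the CAS instead of returning it, and could also record annotations for the elements it skips.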

To try out the CAS Initializer, use the CPE Configurator GUI as described in section 13.3. However, in addition to selecting a Collection Reader, Analysis Engine, and CAS Consumer as described in that section, also select a CAS Initializer by using the "Browse" button on the CAS Initializer panel. Browse to the %UIMA_HOME%/docs/examples/descriptors/cas_initializer directory and select the SimpleXmlCasInitializer.xml descriptor file. Then, set the "Xml Tag Containing Text" parameter to the value TEXT. The CPE Configurator should then look like this:

The SimpleXmlCasInitializer only works with XML documents, so you will need to change the "Input Directory" parameter of the Collection Reader by clicking the "Browse" button and selecting the %UIMA_HOME%/docs/examples/data/xml directory. Then click the play button. Once processing has completed, you can use the XCAS Annotation Viewer, as described in Chapter 20, to view the results. Notice that only the contents of the <TEXT> elements in the original source documents appear in the analysis results.

It is important to note that CAS Initializers will only work with Collection Readers that are designed to use them. The Collection Reader needs to call its getCasInitializer() method to see if a CAS Initializer has been supplied, and call the CAS Initializer's initializeCas() method, rather than setting up the CAS itself. Our File System Collection Reader example from section 5.4.1 optionally uses a CAS Initializer as follows:

    //if there is a CAS Initializer, call it
    if (getCasInitializer() != null) {
      getCasInitializer().initializeCas(fis, aCAS);
    } else {
      //No CAS Initializer, so read file and set document text ourselves
      ...
    }

When you write your own Collection Reader, in the description element of your Collection Reader's descriptor you should document whether your Collection Reader supports (or requires) a CAS Initializer, so that users will know how to configure their CPE properly.

Developing CAS Consumers

A CAS Consumer receives each CAS after it has been analyzed by the Analysis Engine. CAS Consumers typically do not update the CAS; instead, they extract data from the CAS and persist selected information to aggregate data structures such as search engine indexes or databases.

A CAS Consumer Java class must implement the interface com.ibm.uima.collection.CasConsumer, and will also generally extend from the convenience base class com.ibm.uima.collection.CasConsumer_ImplBase. A CAS Consumer also must have an XML descriptor, which has the exact same form as a Collection Reader Descriptor except that the outer tag is <casConsumerDescription>.

CAS Consumers have optional initialize(), reconfigure(), and typeSystemInit() methods, which perform the same functions as they do for Collection Readers and CAS Initializers. The only required method for a CAS Consumer is processCas(CAS), which is where the CAS Consumer does the bulk of its work (i.e., consume the CAS).

The CasConsumer interface additionally defines batch and collection level processing methods. The CAS Consumer can implement the batchProcessComplete() method to perform processing that should occur at the end of each batch of CASes. Similarly, the CAS Consumer can implement the collectionProcessComplete() method to perform any collection level processing at the end of the collection.
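The division of work between per-CAS, per-batch, and per-collection methods can be sketched as follows. `CountingConsumerSketch` is a hypothetical class that borrows the method names from the text above but uses plain strings instead of CASes. The driver loop, including the fixed batch size and the flush of a final partial batch, is an assumption made for this illustration rather than a description of how the CPM schedules these calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in consumer showing where per-CAS, per-batch, and
// per-collection work would go. The types are simplified for illustration.
final class CountingConsumerSketch {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> log = new ArrayList<>();
    private int total = 0;

    // Per-item work: buffer the item and count it.
    void processCas(String documentText) {
        buffer.add(documentText);
        total++;
    }

    // Per-batch work: for example, flush buffered results to storage.
    void batchProcessComplete() {
        log.add("flushed " + buffer.size());
        buffer.clear();
    }

    // Collection-level work: for example, write a summary.
    void collectionProcessComplete() {
        log.add("total " + total);
    }

    // Drives the consumer the way a framework might, with a fixed batch size.
    static List<String> demo(List<String> docs, int batchSize) {
        CountingConsumerSketch consumer = new CountingConsumerSketch();
        int inBatch = 0;
        for (String d : docs) {
            consumer.processCas(d);
            if (++inBatch == batchSize) {
                consumer.batchProcessComplete();
                inBatch = 0;
            }
        }
        if (inBatch > 0) {
            consumer.batchProcessComplete(); // flush the final partial batch
        }
        consumer.collectionProcessComplete();
        return consumer.log;
    }
}
```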

A very simple example of a CAS Consumer, which writes an XML representation of the CAS to a file, is the XCAS Writer CAS Consumer. The Java code is in the class com.ibm.uima.examples.cpe.XCasWriterCasConsumer and the descriptor is in %UIMA_HOME%\docs\examples\descriptors\cas_consumer\XCasWriterCasConsumer.xml.

Required Methods

When extending the convenience class com.ibm.uima.collection.CasConsumer_ImplBase, the following methods are normally implemented. Of these, only processCas() is strictly required:

initialize()

The initialize() method is called by the framework when the CAS Consumer is first created. CasConsumer_ImplBase actually provides a default implementation of this method (i.e., it is not abstract), so you are not strictly required to implement this method. However, a typical CAS Consumer will implement this method to obtain parameter values and perform various initialization steps.

The example XCAS Writer CAS Consumer uses this method to read its output directory parameter and to create the directory if it does not already exist:

public void initialize() throws ResourceInitializationException {
  mDocNum = 0;
  mOutputDir = new File((String) getConfigParameterValue(PARAM_OUTPUTDIR));
  if (!mOutputDir.exists()) {
    mOutputDir.mkdirs();
  }
}

processCas()

The processCas() method is where the CAS Consumer does most of its work. In our example, the XCAS Writer CAS Consumer obtains an iterator over the document metadata in the CAS (in the SourceDocumentInformation feature structure, which is created by the File System Collection Reader) and extracts the URI for the current document. From this the output filename is constructed in the output directory and a subroutine (writeXCas) is called to generate the output file. The writeXCas subroutine uses the XCASSerializer class provided with the UIMA SDK to serialize the CAS to the output file (see the example source code for details).

public void processCas(CAS aCAS) throws ResourceProcessException {
  JCas jcas;
  try {
    jcas = aCAS.getJCas();
  } catch (CASException e) {
    throw new ResourceProcessException(e);
  }

  // retrieve the filename of the input file from the CAS
  FSIterator it = jcas.getJFSIndexRepository().getAnnotationIndex(
      SourceDocumentInformation.type).iterator();
  File outFile = null;
  if (it.hasNext()) {
    SourceDocumentInformation fileLoc = (SourceDocumentInformation) it.next();
    File inFile;
    try {
      inFile = new File(new URL(fileLoc.getUri()).getPath());
      outFile = new File(mOutputDir, inFile.getName());
    } catch (MalformedURLException e1) {
      // invalid URL, use default processing below
    }
  }
  if (null == outFile) {
    outFile = new File(mOutputDir, "doc" + mDocNum++);
  }
  // serialize XCAS and write to output file
  try {
    writeXCas(jcas.getCas(), outFile);
  } catch (IOException e) {
    throw new ResourceProcessException(e);
  } catch (SAXException e) {
    throw new ResourceProcessException(e);
  }
}
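The file naming logic in processCas() can be tried in isolation, without a CAS. The sketch below is a simplified, standard-library-only illustration: the class OutputFileNaming and its method outputFileFor are hypothetical names, and the source URI is passed in as a plain string rather than read from a SourceDocumentInformation feature structure. It keeps the same two rules as the example: reuse the input file name when the URI is a valid URL, otherwise fall back to a numbered name. (Recent Java versions mark the URL constructor as deprecated; it is used here only to mirror the example.)

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper mirroring the output file naming of the XCAS Writer example. */
class OutputFileNaming {
    private final File outputDir;
    private int docNum = 0;   // fallback counter, like mDocNum in the example

    OutputFileNaming(File outputDir) {
        this.outputDir = outputDir;
    }

    /** Returns the output file for a source URI, or a numbered fallback file. */
    File outputFileFor(String sourceUri) {
        if (sourceUri != null) {
            try {
                // keep only the last path segment of the input location
                File inFile = new File(new URL(sourceUri).getPath());
                return new File(outputDir, inFile.getName());
            } catch (MalformedURLException e) {
                // invalid URL, use the numbered fallback below
            }
        }
        return new File(outputDir, "doc" + docNum++);
    }

    public static void main(String[] args) {
        OutputFileNaming naming = new OutputFileNaming(new File("out"));
        System.out.println(naming.outputFileFor("file:/data/in/report.txt").getName()); // report.txt
        System.out.println(naming.outputFileFor("not a url").getName());                // doc0
    }
}
```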

Optional Methods

The following methods are optional in a CAS Consumer, though they are often used.

batchProcessComplete()

The framework calls the batchProcessComplete() method at the end of each batch of CASes. This gives the CAS Consumer an opportunity to perform any batch level processing. Our simple XCAS Writer CAS Consumer does not perform any batch level processing, so this method is empty. Batch size is set in the Collection Processing Engine descriptor.

collectionProcessComplete()

The framework calls the collectionProcessComplete() method at the end of the collection (i.e., when all objects in the collection have been processed). At this point in time, no CAS is passed in as a parameter. This gives the CAS Consumer an opportunity to perform collection processing over the entire set of objects in the collection. Our simple XCAS Writer CAS Consumer does not perform any collection level processing, so this method is empty.

The CPM provides a number of service and deployment options that cover instantiation and execution of CPEs, error recovery, and local and distributed deployment of the CPE components. The behavior of the CPM (and correspondingly, the CPE) is controlled by various options and parameters set in the CPE descriptor. The current version of the CPE Configurator tool, however, supports only default error handling and deployment options. To change these options, you must manually edit the CPE descriptor, which can be error-prone.

Eventually the CPE Configurator tool will support configuring these options and a detailed tutorial for these settings will be provided. In the meantime, we provide only a high-level, conceptual overview of these advanced features in the rest of this chapter, and refer the advanced user to Chapter 24, Collection Processing Engine Descriptor Reference for details on setting these options in the CPE Descriptor.

Figure nn shows a logical view of how an application uses the UIMA framework to instantiate a CPE from a CPE descriptor. The CPE descriptor identifies the CPE components (referencing their corresponding descriptors) and specifies the various options for configuring the CPM and deploying the CPE components.

CPE instantiation

There are three deployment modes for CAS Processors (Analysis Engines and CAS Consumers) in a CPE:

  1. Integrated (runs in the same Java instance as the CPM)
  2. Managed (runs in a separate process on the same machine), and
  3. Non-managed (runs in a separate process, perhaps on a different machine).

An integrated CAS Processor runs in the same JVM as the CPE. A managed CAS Processor runs in a separate process from the CPE, but still on the same computer. The CPE controls startup, shutdown, and recovery of a managed CAS Processor. A non-managed CAS Processor runs as a service and may be on the same computer as the CPE or on a remote computer. A non-managed CAS Processor service is started and managed independently from the CPE.

For both managed and non-managed CAS Processors, the CAS must be transmitted between separate processes and possibly between separate computers. This is accomplished using Vinci, a communication protocol used by the CPM that ships with the UIMA SDK. Vinci handles service naming and location and data transport (see 6.6.2, How to Deploy a UIMA Component as a Vinci Service for more information). Service naming and location are provided by a Vinci Naming Service, or VNS. For managed CAS Processors, the CPE uses its own internal VNS. For non-managed CAS Processors, a separate VNS must be running.
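The differences between the three modes can be summarized in a small lookup type. This is an illustrative sketch only: the enum DeploymentMode and its accessor names are hypothetical, not part of the UIMA API. The attribute values "integrated", "local", and "remote" are the ones that appear in the deployment attribute of the casProcessor descriptor fragments in this chapter.

```java
/** Hypothetical summary of the three CAS Processor deployment modes. */
enum DeploymentMode {
    //          attribute     same JVM  started by CPE  external VNS
    INTEGRATED("integrated",  true,     false,          false),
    MANAGED(   "local",       false,    true,           false),
    NON_MANAGED("remote",     false,    false,          true);

    private final String attribute;
    private final boolean sameJvmAsCpe;
    private final boolean separateProcessStartedByCpe;
    private final boolean needsExternalVns;

    DeploymentMode(String attribute, boolean sameJvmAsCpe,
                   boolean separateProcessStartedByCpe, boolean needsExternalVns) {
        this.attribute = attribute;
        this.sameJvmAsCpe = sameJvmAsCpe;
        this.separateProcessStartedByCpe = separateProcessStartedByCpe;
        this.needsExternalVns = needsExternalVns;
    }

    boolean sameJvmAsCpe() { return sameJvmAsCpe; }
    boolean separateProcessStartedByCpe() { return separateProcessStartedByCpe; }
    boolean needsExternalVns() { return needsExternalVns; }

    /** Maps a deployment attribute value to its mode. */
    static DeploymentMode fromAttribute(String value) {
        for (DeploymentMode mode : values()) {
            if (mode.attribute.equals(value)) {
                return mode;
            }
        }
        throw new IllegalArgumentException("unknown deployment value: " + value);
    }

    public static void main(String[] args) {
        for (DeploymentMode mode : values()) {
            System.out.println(mode + ": sameJvm=" + mode.sameJvmAsCpe()
                + ", startedByCpe=" + mode.separateProcessStartedByCpe()
                + ", externalVns=" + mode.needsExternalVns());
        }
    }
}
```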

  • The UIMA SDK also supports using unmanaged remote services via the web-standard SOAP communications protocol (see How to Deploy a UIMA Component as a SOAP Web Service). This approach is based on a proxy implementation, where the proxy essentially runs in integrated mode. To use this approach with the CPM, use the integrated mode, with the component being an Aggregate which, in turn, connects to a remote service.

The CPE Configurator tool currently only supports constructing CPEs that deploy CAS Processors in integrated mode. To deploy CAS Processors in any other mode, the CPE descriptor must be edited by hand (better tooling support is being worked on). Details on the CPE descriptor and the required settings for various CAS Processor deployment modes can be found in Chapter 24, Collection Processing Engine Descriptor Reference. In the following sections we merely summarize the various CAS Processor deployment options.

Deploying Managed CAS Processors

Managed CAS Processor deployment is shown in Figure nn. A managed CAS Processor is deployed by the CPE as a Vinci service. The CPE manages the lifecycle of the CAS Processor including service launch, restart on failures, and service shutdown. A managed CAS Processor runs on the same machine as the CPE, but in a separate process. This provides the necessary fault isolation for the CPE to protect it from non-robust CAS Processors. A fatal failure of a managed CAS Processor does not threaten the stability of the CPE.

CPE with managed CAS Processors

The CPE communicates with managed CAS Processors using the Vinci communication protocol. A CAS Processor is launched as a Vinci service and its process() method is invoked remotely via a Vinci command. The CPE uses its own internal VNS to support managed CAS processors. The VNS, by default, listens on port 9005. If this port is not available, the VNS will increment its listen port until it finds one that is available. All managed CAS Processors are internally configured to "talk" to the CPE managed VNS. This internal VNS is transparent to the end user launching the CPE.
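The port selection described above (start at 9005 and increment until a port is available) can be sketched as follows. This is not the VNS implementation: PortProbe, firstAvailable, canBind, and the attempt limit are hypothetical names and choices made for illustration. The availability check is passed in as a predicate so the search can be exercised without opening sockets.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.util.function.IntPredicate;

/** Hypothetical sketch of an "increment until free" port search. */
class PortProbe {
    static final int DEFAULT_VNS_PORT = 9005;   // default from the text above

    /** Returns the first port at or above startPort that the predicate accepts. */
    static int firstAvailable(int startPort, int maxAttempts, IntPredicate isFree) {
        for (int i = 0; i < maxAttempts; i++) {
            int port = startPort + i;
            if (port > 65535) {
                break;                           // ran out of valid port numbers
            }
            if (isFree.test(port)) {
                return port;
            }
        }
        throw new IllegalStateException("no free port found starting at " + startPort);
    }

    /** Checks availability by briefly binding a server socket to the port. */
    static boolean canBind(int port) {
        try (ServerSocket socket = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            return false;                        // port in use or not permitted
        }
    }

    public static void main(String[] args) {
        int port = firstAvailable(DEFAULT_VNS_PORT, 100, PortProbe::canBind);
        System.out.println("would listen on port " + port);
    }
}
```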

To deploy a managed CAS Processor, the CPE deployer must change the CPE descriptor. The following is a section from the CPE descriptor that shows an example configuration specifying a managed CAS Processor.

<casProcessor deployment="local" name="Meeting Detector TAE">
  <descriptor>
    <include href="deploy/vinci/Deploy_MeetingDetectorTAE.xml"/>
  </descriptor>
  <runInSeparateProcess>
    <exec dir="." executable="java">
      <env key="CLASSPATH" value="src;C:/Program Files/apache-uima/lib/uima_core.jar;C:/Program Files/apache-uima/lib/uima_cpe.jar;C:/Program Files/apache-uima/lib/uima_examples.jar;C:/Program Files/apache-uima/lib/uima_adapter_vinci.jar;C:/Program Files/apache-uima/lib/uima_jcas_builtin_types.jar;C:/Program Files/apache-uima/lib/vinci/jVinci.jar;C:/Program Files/apache-uima/lib/xml.jar"/>
      <arg>-DLOG=C:/Temp/service.log</arg>
      <arg>com.ibm.uima.reference_impl.collection.service.vinci.VinciCasObjectProcessorService_impl</arg>
      <arg>${descriptor}</arg>
    </exec>
  </runInSeparateProcess>
  <deploymentParameters/>
  <filter/>
  <errorHandling>
    <errorRateThreshold action="terminate" value="1/100"/>
    <maxConsecutiveRestarts action="terminate" value="3"/>
    <timeout max="100000"/>
  </errorHandling>
  <checkpoint batch="10000"/>
</casProcessor>

See Chapter 24, Collection Processing Engine Descriptor Reference for details and required settings.

Deploying Non-managed CAS Processors

Non-managed CAS Processor deployment is shown in Figure nn. In non-managed mode, the CPE supports connectivity to CAS Processors running on local or remote computers using Vinci. Non-managed processors are different from managed processors in two aspects:

  1. Non-managed processors are neither started nor stopped by the CPE.
  2. Non-managed processors use an independent VNS, also neither started nor stopped by the CPE.

CPE with non-managed CAS Processors

While non-managed CAS Processors provide the same level of fault isolation and robustness as managed CAS Processors, error recovery support for non-managed CAS Processors is much more limited. In particular, the CPE cannot restart a non-managed CAS Processor after an error.

Non-managed CAS Processors also require a separate Vinci Naming Service running on the network. This VNS must be manually started and monitored by the end user or application. Instructions for running a VNS can be found in Section 6.6.5, Starting VNS.

To deploy a non-managed CAS Processor, the CPE deployer must change the CPE descriptor. The following is a section from the CPE descriptor that shows an example configuration for the non-managed CAS Processor.

<casProcessor deployment="remote" name="Meeting Detector TAE">
  <descriptor>
    <include href="descriptors/vinciService/MeetingDetectorVinciService.xml"/>
  </descriptor>
  <deploymentParameters/>
  <filter/>
  <errorHandling>
    <errorRateThreshold action="terminate" value="1/100"/>
    <maxConsecutiveRestarts action="terminate" value="3"/>
    <timeout max="100000"/>
  </errorHandling>
  <checkpoint batch="10000"/>
</casProcessor>

See Chapter 24, Collection Processing Engine Descriptor Reference for details and required settings.

Deploying Integrated CAS Processors

Integrated CAS Processors are shown in Figure 16. Here the CAS Processors run in the same JVM as the CPE, just like the Collection Reader and CAS Initializer. This deployment method results in minimal CAS communication and transport overhead as the CAS is shared in the same process space of the JVM. However, a CPE running with all integrated CAS Processors is limited in scalability by the capability of the single computer on which the CPE is running. There is also a stability risk associated with integrated processors because a poorly written CAS Processor can cause the JVM, and hence the entire CPE, to abort.

CPE with integrated CAS Processor

The following is a section from a CPE descriptor that shows an example configuration for the integrated CAS Processor.

<casProcessor deployment="integrated" name="Meeting Detector TAE">
  <descriptor>
    <include href="descriptors/tutorial/ex4/MeetingDetectorTAE.xml"/>
  </descriptor>
  <deploymentParameters/>
  <filter/>
  <errorHandling>
    <errorRateThreshold action="terminate" value="100/1000"/>
    <maxConsecutiveRestarts action="terminate" value="30"/>
    <timeout max="100000"/>
  </errorHandling>
  <checkpoint batch="10000"/>
</casProcessor>

See Chapter 24, Collection Processing Engine Descriptor Reference for details and required settings.
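All three descriptor fragments in this chapter carry an errorRateThreshold whose value is written as a fraction ("1/100" or "100/1000"). The sketch below shows one way to split such a value. It assumes the fraction means "maximum errors / sample size in CASes"; that reading is an assumption here and should be confirmed against the CPE Descriptor Reference. ErrorRateThreshold is a hypothetical helper class, not part of the UIMA API.

```java
/** Hypothetical helper that splits an errorRateThreshold value such as "1/100". */
final class ErrorRateThreshold {
    final int maxErrors;    // numerator: assumed to be the allowed number of errors
    final int sampleSize;   // denominator: assumed to be the sample size in CASes

    private ErrorRateThreshold(int maxErrors, int sampleSize) {
        this.maxErrors = maxErrors;
        this.sampleSize = sampleSize;
    }

    /** Parses a value of the form "errors/sample". */
    static ErrorRateThreshold parse(String value) {
        String[] parts = value.trim().split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected errors/sample, got: " + value);
        }
        int errors = Integer.parseInt(parts[0].trim());
        int sample = Integer.parseInt(parts[1].trim());
        if (errors < 0 || sample <= 0) {
            throw new IllegalArgumentException("invalid threshold: " + value);
        }
        return new ErrorRateThreshold(errors, sample);
    }

    /** Tolerated error rate as a plain ratio. */
    double rate() {
        return (double) maxErrors / sampleSize;
    }

    public static void main(String[] args) {
        for (String value : new String[] {"1/100", "100/1000"}) {
            ErrorRateThreshold t = parse(value);
            System.out.println(value + " -> " + t.maxErrors + " per " + t.sampleSize);
        }
    }
}
```

Under this reading, the integrated example ("100/1000") tolerates a higher error rate than the managed and non-managed examples ("1/100").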

The UIMA SDK includes a set of examples illustrating the three modes of deployment: integrated, managed, and non-managed. These are in the /docs/examples/descriptors/collection_processing_engine directory. There are three CPE descriptors that run an example annotator (the Meeting Finder) in these modes.

To run either the integrated or managed examples, use the runCPE script in the /bin directory of the UIMA installation, passing the appropriate CPE descriptor as an argument.

  • The runCPE script must be run from the %UIMA_HOME%\docs\examples directory, because it uses relative path names that are resolved relative to this working directory. For instance,

runCPE descriptors\collection_processing_engine\MeetingFinderCPE_Integrated.xml

If you installed the examples into Eclipse, you can run directly from Eclipse by creating a run configuration. To do this, highlight the SimpleRunCPE.java source file in the examples src/com/ibm/uima/examples/cpe directory, and then

  1. Pick the menu Run -> Run...
  2. Select "Java Application" and press "New".
  3. Click on the Arguments panel, and insert a path to the appropriate CPE descriptor in the "Program Arguments" box by typing, for instance: descriptors/collection_processing_engine/MeetingFinderCPE_Integrated.xml
  4. Press "Run".

To run the non-managed example, there are some additional steps.

  1. Start a VNS service by running the startVNS script in the /bin directory.
  2. Deploy the Meeting Detector Analysis Engine as a Vinci service, by running the startVinciService script in the /bin directory, and passing it the location of the descriptor to deploy, in this case %UIMA_HOME%/docs/examples/deploy/vinci/Deploy_MeetingDetectorTAE.xml
  3. Now, run the runCPE script, passing it the CPE for the non-managed version (%UIMA_HOME%/docs/examples/descriptors/collection_processing_engine/MeetingFinderCPE_NonManaged.xml).

This assumes that the Vinci Naming Service, the runCPE application, and the MeetingDetectorTAE service are all running on the same machine. Most of the scripts that need information about VNS will look for values to use in environment variables VNS_HOST and VNS_PORT; these default to "localhost" and "9000". You may set these to appropriate values before running the scripts, as needed; you can also pass the name of the VNS host as the 2nd argument to the startVinciService script.
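The lookup-with-defaults behavior described for VNS_HOST and VNS_PORT can be mirrored in a few lines of Java. This is a sketch of the idea, not the logic of the shipped scripts: VnsSettings and fromEnvironment are hypothetical names, and the environment is passed in as a map so the defaults can be checked without changing the real environment.

```java
import java.util.Map;

/** Hypothetical holder for VNS connection settings. */
class VnsSettings {
    final String host;
    final int port;

    VnsSettings(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Reads VNS_HOST and VNS_PORT, falling back to "localhost" and 9000. */
    static VnsSettings fromEnvironment(Map<String, String> env) {
        String host = env.getOrDefault("VNS_HOST", "localhost");
        int port = Integer.parseInt(env.getOrDefault("VNS_PORT", "9000"));
        return new VnsSettings(host, port);
    }

    public static void main(String[] args) {
        VnsSettings settings = fromEnvironment(System.getenv());
        System.out.println("VNS at " + settings.host + ":" + settings.port);
    }
}
```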

Alternatively, you can edit the scripts and/or the XML files to specify alternatives for the VNS_HOST and VNS_PORT. For instance, if the runCPE application is running on a different machine from the Vinci Naming Service, you can edit the MeetingFinderCPE_NonManaged.xml and change the vnsHost parameter:

<parameter name="vnsHost" value="localhost" type="string"/>

to specify the VNS host instead of "localhost".
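If the descriptor has to be adjusted for several machines, the edit can also be scripted rather than made by hand. The following sketch uses only the standard Java XML APIs. VnsHostEditor and setVnsHost are hypothetical names, the host vns.example.org is a placeholder, and the input in main is a reduced fragment rather than a complete CPE descriptor.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical utility that rewrites the vnsHost parameter in descriptor XML. */
class VnsHostEditor {
    /** Returns the XML with every vnsHost parameter value set to newHost. */
    static String setVnsHost(String xml, String newHost) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList params = doc.getElementsByTagName("parameter");
            for (int i = 0; i < params.getLength(); i++) {
                Element param = (Element) params.item(i);
                if ("vnsHost".equals(param.getAttribute("name"))) {
                    param.setAttribute("value", newHost);
                }
            }
            // serialize the modified DOM back to text
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not rewrite descriptor", e);
        }
    }

    public static void main(String[] args) {
        String fragment = "<deploymentParameters>"
            + "<parameter name=\"vnsHost\" value=\"localhost\" type=\"string\"/>"
            + "</deploymentParameters>";
        System.out.println(setVnsHost(fragment, "vns.example.org"));
    }
}
```

Serializing through a DOM may change whitespace and attribute order, so keep a copy of the original descriptor when applying this to real files.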