The UIMA Analysis Engine interface provides support for developing and integrating algorithms that analyze unstructured data. Analysis Engines are designed to operate on a per-document basis. Their interface handles one CAS at a time. UIMA provides additional support for applying analysis engines to collections of unstructured data with its Collection Processing Architecture. The Collection Processing Architecture defines additional components for reading raw data formats from data collections, preparing the data for processing by Analysis Engines, executing the analysis, extracting analysis results, and deploying the overall flow in a variety of local and distributed configurations.
The functionality defined in the Collection Processing Architecture is implemented by a Collection Processing Engine (CPE). A CPE includes an Analysis Engine and adds a Collection Reader, a CAS Initializer, and CAS Consumers. The part of the UIMA Framework that supports CPEs is called the Collection Processing Manager, or CPM.
A Collection Reader provides the interface to the raw input data and knows how to iterate over the data collection. Collection Readers are discussed in Section 5.4.1. The CAS Initializer prepares an individual data item for analysis and loads it into the CAS. CAS Initializers are discussed in Section 5.4.2. A CAS Consumer extracts analysis results from the CAS and may also perform collection level processing, or analysis over a collection of CASes. CAS Consumers are discussed in Section 5.4.3.
Analysis Engines and CAS Consumers are both instances of CAS Processors. A CPM may contain multiple CAS Processors. An Analysis Engine may be a Primitive or an Aggregate (composed of other Analysis Engines). Aggregates may contain CAS Consumers. While Collection Readers and CAS Initializers always run in the same JVM as the CPM, a CAS Processor may be deployed in a variety of local and distributed modes, providing a number of options for scalability and robustness. The different deployment options are covered in detail in Section 5.5.
Each of the components in a CPE has an interface specified by the UIMA Collection Processing Architecture and is described by a declarative XML descriptor file. Similarly, the CPE itself has a well defined component interface and is described by a declarative XML descriptor file.
A user creates a CPE by assembling the components mentioned above. The UIMA SDK provides a graphical tool, called the CPE Configurator, for assisting in the assembly of CPEs. Use of this tool is summarized in Section 5.2, and more details can be found in Chapter 13, Collection Processing Engine Configurator User's Guide. Alternatively, a CPE can be assembled by writing an XML CPE descriptor. Details on the CPE descriptor, including its syntax and content, can be found in Chapter 24, Collection Processing Engine Descriptor Reference. The individual components have associated XML descriptors, each of which can be created and/or edited using the Component Description Editor.
A CPE is executed by a UIMA infrastructure component called the Collection Processing Manager (CPM). The CPM provides a number of services and deployment options that cover instantiation and execution of CPEs, error recovery, and local and distributed deployment of the CPE components.
Figure 12 illustrates the data flow that occurs between the different types of components that make up a CPE.
The components of a CPE are:
A fourth type of component, the CAS Initializer, may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS. The Collection Processing Manager orchestrates the data flow within a CPE, monitors status, optionally manages the life-cycle of internal components and collects statistics.
CASes are not saved in a persistent way by the framework. If you want to save CASes, then you have to save each CAS as it comes through (for example) using a CAS Consumer you write to do this, in whatever format you like. The UIMA SDK supplies an example CAS Consumer to save CASes to files, in the externalized XCAS format (an XML version of the CAS). It also supplies an example CAS Consumer to extract information from CASes and store the results into a relational Database, using Java's JDBC APIs.
A CPE can be assembled by writing an XML CPE descriptor. Details on the CPE descriptor, including its syntax and content, can be found in Chapter 24, Collection Processing Engine Descriptor Reference. Rather than edit raw XML, you may develop a CPE Descriptor using the CPE Configurator tool. The CPE Configurator tool is described briefly in this section, and in more detail in Chapter 13, Collection Processing Engine Configurator User's Guide.
The CPE Configurator tool can be run from Eclipse (see Running the CPE Configurator from Eclipse), or using the cpeGui shell script (cpeGui.bat on Windows, cpeGui.sh on Unix), which is located in the bin directory of the UIMA SDK installation. Executing this script will display the window shown here:
The window is divided into 4 sections, one each for the Collection Reader, CAS Initializer, Analysis Engines, and CAS Consumers. In each section, you select the component(s) you want to include in the CPE by browsing to their XML descriptors. The configuration parameters present in the XML descriptors will then be displayed in the GUI; these can be modified to override the values present in the descriptor. For example, the screen shot below shows the CPE Configurator after the following components have been chosen:
Collection Reader: %UIMA_HOME%\docs\examples\descriptors\collection_reader\FileSystemCollectionReader.xml
Analysis Engine: %UIMA_HOME%\docs\examples\descriptors\analysis_engine\NamesAndPersonTitles_TAE.xml
CAS Consumer: %UIMA_HOME%\docs\examples\descriptors\cas_consumer\XCasWriterCasConsumer.xml
For the File System Collection Reader, ensure that the Input Directory is set to %UIMA_HOME%\docs\examples\data. The other parameters may be left blank. For the XCAS Writer CAS Consumer, ensure that the Output Directory is set to %UIMA_HOME%\docs\examples\data\processed.
After selecting each of the components and providing configuration settings, click the play (forward arrow) button at the bottom of the screen to begin processing. A progress bar should be displayed in the lower left corner. (Note that the progress bar will not begin to move until all components have completed their initialization, which may take several seconds.) Once processing has begun, the pause and stop buttons become enabled.
If an error occurs, you will be informed by an error dialog. If processing completes successfully, you will be presented with a performance report.
Using the File menu, you can select Save CPE Descriptor to create an .xml descriptor file that defines the CPE you have constructed. Later, you can use Open CPE Descriptor to restore the CPE Configurator to the saved state. Also, CPE descriptors can be used to run a CPE from a Java program – see Section 5.3. CPE Descriptors allow specifying operational parameters, such as error handling options, that are not currently available for configuration through the CPE Configurator. For more information on manually creating a CPE Descriptor, see Chapter 24, Collection Processing Engine Descriptor Reference.
Note that CPE descriptors identify which components comprise the CPE, but they do not capture the individual configuration settings for these components. That information is kept in the individual component descriptors. If you have made changes to these settings in the CPE Configurator tool and wish to save the settings back to the original descriptor files, use the File –> Save Component Configuration action.
The CPE configured above runs a simple name and title annotator on the sample data provided with the UIMA SDK and stores the results using the XCAS Writer CAS Consumer. To view the results, start the XCAS Annotation Viewer by running the xcasAnnotationViewer script (xcasAnnotationViewer.bat on Windows, xcasAnnotationViewer.sh on Unix), which is located in the bin directory of the UIMA SDK installation. Executing this script will display the window shown here:
Ensure that the Input Directory is the same as the Output Directory specified for the XCAS Writer CAS Consumer in the CPE configured above (e.g., %UIMA_HOME%\docs\examples\data\processed) and that the TAE Descriptor File is set to the Analysis Engine used in the CPE configured above (e.g., %UIMA_HOME%\docs\examples\descriptors\analysis_engine\NamesAndPersonTitles_TAE.xml).
Click the View button to display the Analyzed Documents window:
Double click on any document in the list to view the analyzed document. Double clicking the first document, IBM_LifeSciences.txt, will bring up the following window:
This window shows the analysis results for the document. Clicking on any highlighted annotation causes the details for that annotation to be displayed in the right-hand pane. Here the annotation spanning "John M. Thompson" has been clicked.
Congratulations! You have successfully configured a CPE, saved its descriptor, run the CPE, and viewed the analysis results.
If you have followed the instructions in Chapter 3, UIMA SDK Setup for Eclipse and imported the example Eclipse project, then you should already have a Run configuration for the CPE Configurator tool (called UIMA CPE GUI) configured to run in the example project. Simply run that configuration to start the CPE Configurator.
If you haven’t followed the Eclipse setup instructions and wish to run the CPE Configurator tool from Eclipse, you will need to do the following. As installed, this Eclipse launch configuration is associated with the "uima_examples" project. If you've not already done so, you may wish to import that project into your Eclipse workspace. It's located in %UIMA_HOME%/docs/examples. Doing this will supply the Eclipse launcher with all the class files it needs to run the CPE configurator. If you don't do this, please manually add the JAR files for UIMA to the launch configuration.
Also, you need to add any projects or JAR files for any UIMA components you will be running to the launch class path.
Next, in the Eclipse menu select Run –> Run..., which brings up the Run configuration screen.
In the Main tab, set the main class to com.ibm.uima.reference_impl.application.cpm.CpmFrame
In the Arguments tab, add the following to the VM arguments (using the location where you installed the UIMA SDK):
-Xms128M -Xmx256M -Duima.home="C:\Program Files\IBM\uima"
Click the Run button to launch the CPE Configurator, and use it as previously described in this section.
The simplest way to run a CPE from a Java application is to first create a CPE descriptor as described in the previous section. Then the CPE can be instantiated and run using the following code:
//parse CPE descriptor in file specified on command line
CpeDescription cpeDesc = UIMAFramework.getXMLParser().
    parseCpeDescription(new XMLInputSource(args[0]));

//instantiate CPE
mCPE = UIMAFramework.produceCollectionProcessingEngine(cpeDesc);

//Create and register a Status Callback Listener
mCPE.addStatusCallbackListener(new StatusCallbackListenerImpl());

//Start Processing
mCPE.process();
This will start the CPE running in a separate thread.
Updates of the CPM's progress, including any errors that occur, are sent to the callback handler that is registered by the call to addStatusCallbackListener, above. The callback handler is a class that implements the CPM's StatusCallbackListener interface. The example handler responds to events by printing messages to the console. The source code is fairly straightforward and is not included in this chapter – see com.ibm.uima.examples.cpe.SimpleRunCPE.java in the %UIMA_HOME%\docs\examples\src directory for the complete code.
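Because process() returns while the work continues on another thread, an application that needs the results usually has to wait for a completion event. The following is a minimal sketch of that pattern in plain Java. CompletionListener and FakeEngine are hypothetical stand-ins invented for this illustration; they are not the UIMA StatusCallbackListener interface or the CPE class, whose exact method signatures should be taken from the JavaDocs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class WaitForCompletionSketch {

    /** Hypothetical stand-in for a status callback; not the UIMA interface. */
    interface CompletionListener {
        void entityProcessed(String documentName);

        void collectionComplete();
    }

    /** Hypothetical stand-in for an engine whose process() returns immediately. */
    static final class FakeEngine {
        private final String[] documents;
        private CompletionListener listener;

        FakeEngine(String... documents) {
            this.documents = documents;
        }

        void setListener(CompletionListener listener) {
            this.listener = listener;
        }

        /** Starts a worker thread and returns without waiting for it. */
        void process() {
            Thread worker = new Thread(() -> {
                for (String doc : documents) {
                    listener.entityProcessed(doc);
                }
                listener.collectionComplete();
            });
            worker.start();
        }
    }

    /**
     * Starts the engine and blocks until it signals completion.
     * Returns the number of documents seen, or -1 if the timeout expired.
     */
    static int runAndWait(FakeEngine engine, long timeoutMillis)
            throws InterruptedException {
        final CountDownLatch done = new CountDownLatch(1);
        final AtomicInteger processed = new AtomicInteger();
        engine.setListener(new CompletionListener() {
            @Override
            public void entityProcessed(String documentName) {
                processed.incrementAndGet();
            }

            @Override
            public void collectionComplete() {
                done.countDown(); // wakes up the waiting caller
            }
        });
        engine.process();
        boolean finished = done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        return finished ? processed.get() : -1;
    }
}
```

In a real application the latch would be released from whichever listener callback reports the end of the collection, and also from the callback that reports an abort, so the caller is never left waiting.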
If you need more control over the information in the CPE descriptor, you can manually configure it via its API. See the JavaDocs for package com.ibm.uima.collection for more details.
This section is an introduction to the process of developing Collection Readers, CAS Initializers, and CAS Consumers. The code snippets refer to the classes that can be found in the %UIMA_HOME%\docs\examples\src example project.
In the following sections, classes you write to represent components need to be public and have public, 0-argument constructors, so that they can be instantiated by the framework. (Although Java classes in which you do not define any constructor will, by default, have a 0-argument constructor that doesn't do anything, a class in which you have defined at least one constructor does not get a default 0-argument constructor.)
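The constructor rule can be checked with a few lines of plain Java. The sketch below assumes the framework creates components reflectively through a public no-argument constructor, which is a common approach but is an assumption here; the class names are invented for the illustration.

```java
import java.lang.reflect.Constructor;

final class NoArgConstructorCheck {

    /** Declares no constructor, so the compiler supplies a public no-arg one. */
    public static class ImplicitDefault {
    }

    /** Declares only a one-argument constructor, so it has no no-arg constructor. */
    public static class ArgsOnly {
        public ArgsOnly(String name) {
        }
    }

    /** Declares an extra constructor and restores the no-arg one explicitly. */
    public static class Both {
        public Both() {
        }

        public Both(String name) {
        }
    }

    /** Returns true if the class can be created reflectively with no arguments. */
    static boolean canInstantiate(Class<?> type) {
        try {
            Constructor<?> ctor = type.getConstructor(); // finds public no-arg only
            ctor.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ImplicitDefault: " + canInstantiate(ImplicitDefault.class));
        System.out.println("ArgsOnly: " + canInstantiate(ArgsOnly.class));
        System.out.println("Both: " + canInstantiate(Both.class));
    }
}
```

A component class shaped like ArgsOnly would fail this check, so if you add a constructor with arguments, also declare a public zero-argument constructor as Both does.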
A Collection Reader is responsible for obtaining documents from the collection and returning each document as a CAS. Like all UIMA components, a Collection Reader consists of two parts – the code and an XML descriptor.
A simple example of a Collection Reader is the "File System Collection Reader," which simply reads documents from files in a specified directory. The Java code is in the class com.ibm.uima.examples.cpe.FileSystemCollectionReader and the XML descriptor is %UIMA_HOME%\docs\examples\descriptors\collection_reader\FileSystemCollectionReader.xml.
The Java class for a Collection Reader must implement the com.ibm.uima.collection.CollectionReader interface. You may build your Collection Reader from scratch and implement this interface, or you may extend the convenience base class com.ibm.uima.collection.CollectionReader_ImplBase. The convenience base class provides default implementations for many of the methods defined in the CollectionReader interface, and provides abstract definitions for those methods that you are required to implement in your new Collection Reader. Note that if you extend this base class, you do not need to declare that your new Collection Reader implements the CollectionReader interface.
Eclipse tip – if you are using Eclipse, you can quickly create the boilerplate code and stubs for all of the required methods by clicking File –> New –> Class to bring up the "New Java Class" dialogue, specifying com.ibm.uima.collection.CollectionReader_ImplBase as the Superclass, and checking "Inherited abstract methods" in the section "Which method stubs would you like to create?", e.g.,
For the rest of this section we will assume that your new Collection Reader extends the CollectionReader_ImplBase class, and we will show examples from the com.ibm.uima.examples.cpe.FileSystemCollectionReader. If you must inherit from a different superclass, you must ensure that your Collection Reader implements the CollectionReader interface – see the JavaDocs for CollectionReader for more details.
The following abstract methods must be implemented:
The initialize() method is called by the framework when the Collection Reader is first created. CollectionReader_ImplBase actually provides a default implementation of this method (i.e., it is not abstract), so you are not strictly required to implement this method. However, a typical Collection Reader will implement this method to obtain parameter values and perform various initialization steps.
In this method, the Collection Reader class can access the values of its configuration parameters and perform other initialization logic. The example File System Collection Reader reads its configuration parameters and then builds a list of files in the specified input directory, as follows:
public void initialize() throws ResourceInitializationException {
  File directory = new File(
      (String)getConfigParameterValue(PARAM_INPUTDIR));
  mEncoding = (String)getConfigParameterValue(PARAM_ENCODING);
  mDocumentTextXmlTagName = (String)getConfigParameterValue(PARAM_XMLTAG);
  mLanguage = (String)getConfigParameterValue(PARAM_LANGUAGE);
  mCurrentIndex = 0;

  //get list of files (not subdirectories) in the specified directory
  mFiles = new ArrayList();
  File[] files = directory.listFiles();
  for (int i = 0; i < files.length; i++) {
    if (!files[i].isDirectory()) {
      mFiles.add(files[i]);
    }
  }
}
Note: there is also an initialize(ResourceSpecifier, Map) method, but it is not recommended that you override this method in your code. That method performs internal initialization steps and then calls the zero-argument initialize().
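One detail worth handling in your own reader is that File.listFiles() returns null when the path does not exist or is not a directory, which the loop in the example initialize() would turn into a NullPointerException. The following standalone sketch shows one defensive way to build the file list. The class and method names are invented for this illustration, and the sorting step is an optional addition that gives a repeatable processing order.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class InputDirectoryLister {

    /** Returns the non-directory entries directly inside dir, sorted by path. */
    static List<File> listInputFiles(File dir) throws IOException {
        File[] entries = dir.listFiles();
        if (entries == null) {
            // listFiles() returns null if dir is missing or is not a directory
            throw new IOException("Not a readable directory: " + dir);
        }
        Arrays.sort(entries); // File is Comparable and orders by path name
        List<File> files = new ArrayList<>();
        for (File entry : entries) {
            if (!entry.isDirectory()) {
                files.add(entry);
            }
        }
        return files;
    }
}
```

Inside a real initialize() method, the IOException would typically be wrapped in the ResourceInitializationException that the method is declared to throw.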
The hasNext() method returns whether or not there are any documents remaining to be read from the collection. The File System Collection Reader's hasNext() method is very simple. It just checks if there are any more files left to be read:
public boolean hasNext() {
  return mCurrentIndex < mFiles.size();
}
The getNext() method reads the next document from the collection and populates a CAS. In the simple case, this amounts to reading the file and calling the CAS's setDocumentText method. The example File System Collection Reader is slightly more complex. It first checks for a CAS Initializer. If the CPE includes a CAS Initializer, the CAS Initializer is used to read the document and initialize the CAS. If the CPE does not include a CAS Initializer, the File System Collection Reader reads the document and sets the document text in the CAS.
The File System Collection Reader also stores additional metadata about the document in the CAS. In particular, it sets the document's language in the special built-in feature structure uima.tcas.DocumentAnnotation (see Chapter 26, CAS Reference for details about this built-in type) and creates an instance of com.ibm.uima.examples.SourceDocumentInformation, which stores information about the document's source location. This information may be useful to downstream components such as CAS Consumers. Note that the type system descriptor for this type can be found in com.ibm.uima.examples.SourceDocumentInformation.xml.
The getNext() method for the File System Collection Reader looks like this:
public void getNext(CAS aCAS) throws IOException, CollectionException {
  JCas jcas;
  try {
    jcas = aCAS.getJCas();
  } catch (CASException e) {
    throw new CollectionException(e);
  }

  //open input stream to file
  File file = (File)mFiles.get(mCurrentIndex++);
  FileInputStream fis = new FileInputStream(file);
  try {
    //if there's a CAS Initializer, call it
    if (getCasInitializer() != null) {
      getCasInitializer().initializeCas(fis, aCAS);
    }
    else {
      //No CAS Initializer, so read file and set document text here
      byte[] contents = new byte[(int)file.length()];
      fis.read(contents);
      String text;
      if (mEncoding != null) {
        text = new String(contents, mEncoding);
      } else {
        text = new String(contents);
      }
      //put document in CAS (assume this CAS is a view of a Text CAS)
      jcas.setDocumentText(text);
    }
  } finally {
    if (fis != null)
      fis.close();
  }

  //set language if it was explicitly specified as a
  //configuration parameter
  if (mLanguage != null) {
    ((DocumentAnnotation)jcas.getDocumentAnnotationFs())
        .setLanguage(mLanguage);
  }

  //Also store file location information in CAS metadata.
  //This information is critical if CAS Consumers will need to know
  //where the original document contents are located.
  //For example, the Semantic Search CAS Indexer writes this
  //information into the search index that it creates, which allows
  //applications that use the search index to
  //locate the documents that satisfy their semantic queries.
  SourceDocumentInformation srcDocInfo = new SourceDocumentInformation(jcas);
  srcDocInfo.setUri(file.getAbsoluteFile().toURL().toString());
  srcDocInfo.setOffsetInSource(0);
  srcDocInfo.setDocumentSize((int)file.length());
  srcDocInfo.addToIndexes();
}
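The getNext() listing fills its buffer with a single read call. The Java API does not guarantee that one call to InputStream.read(byte[]) fills the whole array, so a reader you write yourself may prefer a call that reads to the end of the file. The following standalone sketch shows one way to do that with java.nio.file.Files. The class and method names are invented for this illustration and are not part of the SDK example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;

final class DocumentTextReader {

    /**
     * Reads the whole file and decodes it. A null encoding selects the
     * platform default charset, mirroring the reader's optional parameter.
     */
    static String readText(File file, String encoding) throws IOException {
        // readAllBytes keeps reading until end of file is reached
        byte[] contents = Files.readAllBytes(file.toPath());
        Charset charset = (encoding != null)
                ? Charset.forName(encoding)
                : Charset.defaultCharset();
        return new String(contents, charset);
    }
}
```

The returned string is what would be passed to setDocumentText in the branch that runs when no CAS Initializer is configured.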
The Collection Reader can create additional annotations in the CAS at this point, in the same way that annotators create annotations. However, if you are doing complex initialization of the CAS, it may be better to use a CAS Initializer as described in Section 5.4.2.
The Collection Reader is responsible for returning progress information; that is, how much of the collection has been read thus far and how much remains to be read. The framework defines progress very generally; the Collection Reader simply returns an array of Progress objects, where each object contains three fields – the amount already completed, the total amount (if known), and a unit (e.g. entities (documents), bytes, or files). The method returns an array so that the Collection Reader can report progress in multiple different units, if that information is available. The File System Collection Reader's getProgress() method looks like this:
public Progress[] getProgress() {
  return new Progress[]{
      new ProgressImpl(mCurrentIndex, mFiles.size(), Progress.ENTITIES)};
}
In this particular example, the total number of files in the collection is known, but the total size of the collection is not known. As such, a ProgressImpl object for Progress.ENTITIES is returned, but a ProgressImpl object for Progress.BYTES is not.
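If a reader did record the size of each file during initialization, it could report progress in two units at once. The sketch below illustrates the arithmetic only. UnitProgress is a hypothetical stand-in invented for this example, not the framework's Progress or ProgressImpl type, and the idea of caching file sizes is an assumption rather than something the example reader does.

```java
import java.util.ArrayList;
import java.util.List;

final class ProgressSketch {

    /** Hypothetical stand-in for one progress entry: completed, total, unit. */
    static final class UnitProgress {
        final long completed;
        final long total;
        final String unit;

        UnitProgress(long completed, long total, String unit) {
            this.completed = completed;
            this.total = total;
            this.unit = unit;
        }

        @Override
        public String toString() {
            return completed + " of " + total + " " + unit;
        }
    }

    /**
     * Builds progress entries for a reader that has finished the first
     * filesRead files. fileSizes holds the length of every file in bytes,
     * assumed to have been recorded during initialization.
     */
    static List<UnitProgress> progress(long[] fileSizes, int filesRead) {
        long bytesRead = 0;
        long bytesTotal = 0;
        for (int i = 0; i < fileSizes.length; i++) {
            bytesTotal += fileSizes[i];
            if (i < filesRead) {
                bytesRead += fileSizes[i];
            }
        }
        List<UnitProgress> result = new ArrayList<>();
        result.add(new UnitProgress(filesRead, fileSizes.length, "entities"));
        result.add(new UnitProgress(bytesRead, bytesTotal, "bytes"));
        return result;
    }
}
```

In an actual getProgress() implementation, each entry would become one element of the returned Progress array, with the unit taken from the constants on the Progress interface.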
The close method is called when the Collection Reader is no longer needed. The Collection Reader should then release any resources it may be holding. The FileSystemCollectionReader does not hold resources and so has an empty implementation of this method:
public void close() throws IOException {
}
The following methods may be implemented:
The reconfigure() method is called if the Collection Reader's configuration parameters change.
The typeSystemInit() method is needed only in some cases. If you are only setting the document text in the CAS, or if you are using the JCas (recommended, as in the current example), you do not have to implement this method. If you are directly using the CAS API, this method is used in the same way as it is used for an annotator – see Chapter 4, Annotator and Analysis Engine Developer's Guide, for more information.
Collection readers do not have to be thread safe; they are run with a single thread per instance, and only one instance per instance of the Collection Processing Manager (CPM) is made.
You can use the Component Description Editor to create and / or edit the File System Collection Reader's descriptor. Here is its descriptor (abbreviated somewhat to fit on a page), which is very similar to an Analysis Engine descriptor:
<collectionReaderDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>com.ibm.uima.java</frameworkImplementation>
  <implementationName>
    com.ibm.uima.util.FileSystemCollectionReader
  </implementationName>
  <processingResourceMetaData>
    <name>File System Collection Reader</name>
    <description>Reads text files from the filesystem</description>
    <version>1.0</version>
    <vendor>IBM</vendor>
    <configurationParameters>
      <configurationParameter>
        <name>InputDirectory</name>
        <description>Directory containing input files</description>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <!-- Other Configuration Parameters Omitted -->
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>InputDirectory</name>
        <value>
          <string>C:\program files\uima\data</string>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <!-- Type System of CASes returned by this Collection Reader -->
    <typeSystemDescription>
      <imports>
        <import name="com.ibm.uima.examples.SourceDocumentInformation"/>
      </imports>
    </typeSystemDescription>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type allAnnotatorFeatures="true">
            com.ibm.uima.examples.SourceDocumentInformation
          </type>
        </outputs>
      </capability>
    </capabilities>
  </processingResourceMetaData>
</collectionReaderDescription>
Although Collection Readers can directly write to the CAS, it is best that they do so only for simple cases. If the task of populating the CAS from a raw document is complex and might be reusable with other data collections, then it is worthwhile to encapsulate it in a separate CAS Initializer component.
An example where the use of a CAS Initializer is ideal is a scenario where the documents in the collection contain inline HTML or XML markup. Since Analysis Engines often ingest plain-text documents with stand-off annotations, it is necessary to translate the inline HTML or XML markup into this form. For example, an HTML document with inline <p> and <h1> tags could be translated into a CAS with a plain-text document and stand-off Paragraph and Heading annotations. Since this HTML parsing logic could be used regardless of the source of the HTML documents (e.g. a file system, a web connection, or a relational database), it would be ideal to implement this using a CAS Initializer that could be plugged in to multiple Collection Readers.
A CAS Initializer Java class must implement the interface com.ibm.uima.collection.CasInitializer, and will also generally extend from the convenience base class com.ibm.uima.collection.CasInitializer_ImplBase. A CAS Initializer also must have an XML descriptor, which has the exact same form as a Collection Reader Descriptor except that the outer tag is <casInitializerDescription>.
CAS Initializers have optional initialize(), reconfigure(), and typeSystemInit() methods, which perform the same functions as they do for Collection Readers. The only required method for a CAS Initializer is initializeCas(Object, CAS). This method takes the raw document (for example, an InputStream object from which the document can be read) and a CAS, and populates the CAS from the document.
An example CAS Initializer is implemented by the class com.ibm.uima.examples.cpe.SimpleXmlCasInitializer. The SimpleXmlCasInitializer shows how a CAS Initializer can invoke an XML Parser on the raw document. In this very simple example the only thing extracted from the XML document is the text to be processed. You can configure the SimpleXmlCasInitializer with the name of an XML tag that contains the text; it will then filter out the rest of the document.
Here is the implementation of the initializeCas() method for this example:
public void initializeCas(Object aObj, CAS aCAS)
    throws CollectionException, IOException {
  //build SAX InputSource object from InputStream supplied
  //by the CollectionReader
  InputSource inputSource;
  if (aObj instanceof InputStream) {
    inputSource = new InputSource((InputStream)aObj);
  } else {
    throw new CollectionException(
        CollectionException.INCORRECT_INPUT_TO_CAS_INITIALIZER,
        new Object[]{InputStream.class.getName(),
                     aObj.getClass().getName()});
  }

  //create SAX ContentHandler that populates CAS
  SaxHandler handler = new SaxHandler(aCAS);

  //parse
  try {
    SAXParser parser = mParserFactory.newSAXParser();
    XMLReader reader = parser.getXMLReader();
    reader.setContentHandler(handler);
    reader.parse(inputSource);
  } catch (Exception e) {
    throw new CollectionException(e);
  }
}
The SaxHandler class referenced here is an inner class that does the actual work of extracting the text from the specified XML element. For the full implementation, see the example code under docs/examples.
To try out the CAS Initializer, use the CPE Configurator GUI as described in section 13.3. However, in addition to selecting a Collection Reader, Analysis Engine, and CAS Consumer as described in that section, also select a CAS Initializer by using the "Browse" button on the CAS Initializer panel. Browse to the %UIMA_HOME%/docs/examples/descriptors/cas_initializer directory and select the SimpleXmlCasInitializer.xml descriptor file. Then, set the "Xml Tag Containing Text" parameter to the value TEXT. The CPE Configurator should then look like this:
The SimpleXmlCasInitializer only works with XML documents, so you will need to change the "Input Directory" parameter of the Collection Reader by clicking the "Browse" button and selecting the %UIMA_HOME%/docs/examples/data/xml directory. Then click the play button. Once processing has completed, you can use the XCAS Annotation Viewer, as described in Chapter 20, to view the results. Notice that only the contents of the <TEXT> elements in the original source documents appear in the analysis results.
It is important to note that CAS Initializers will only work with Collection Readers that are designed to use them. The Collection Reader needs to call its getCasInitializer() method to see if a CAS Initializer has been supplied, and call the CAS Initializer's initializeCas() method, rather than setting up the CAS itself. Our File System Collection Reader example from section 5.4.1 optionally uses a CAS Initializer as follows:
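To give a feel for what such a handler involves, here is a minimal standalone sketch of a SAX handler that keeps only the character data found inside one named element. It is not the SDK's SaxHandler: the class names are invented for this illustration, and it returns the text as a String where the real handler would set the document text in the CAS.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class TagTextExtractor {

    /** Collects character data that appears inside elements named tagName. */
    static final class TextHandler extends DefaultHandler {
        private final String tagName;
        private final StringBuilder text = new StringBuilder();
        private int depth = 0; // greater than zero while inside a matching element

        TextHandler(String tagName) {
            this.tagName = tagName;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes attributes) {
            if (qName.equals(tagName)) {
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals(tagName)) {
                depth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Parses the stream and returns the text of all tagName elements. */
    static String extract(InputStream xml, String tagName) throws Exception {
        TextHandler handler = new TextHandler(tagName);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(xml), handler);
        return handler.getText();
    }
}
```

The characters() callback may be invoked several times for one element, which is why the handler appends to a buffer instead of assigning the text directly.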
//if there is a CAS Initializer, call it
if (getCasInitializer() != null) {
  getCasInitializer().initializeCas(fis, aCAS);
}
else {
  //No CAS Initializer, so read file and set document text ourselves
  ...
}
When you write your own Collection Reader, in the description element of your Collection Reader's descriptor you should document whether your Collection Reader supports (or requires) a CAS Initializer, so that users will know how to configure their CPE properly.
A CAS Consumer receives each CAS after it has been analyzed by the Analysis Engine. CAS Consumers typically do not update the CAS; they typically extract data from the CAS and persist selected information to aggregate data structures such as search engine indexes or databases.
A CAS Consumer Java class must implement the interface com.ibm.uima.collection.CasConsumer, and will also generally extend from the convenience base class com.ibm.uima.collection.CasConsumer_ImplBase. A CAS Consumer also must have an XML descriptor, which has the exact same form as a Collection Reader Descriptor except that the outer tag is <casConsumerDescription>.
CAS Consumers have optional initialize(), reconfigure(), and typeSystemInit() methods, which perform the same functions as they do for Collection Readers and CAS Initializers. The only required method for a CAS Consumer is processCas(CAS), which is where the CAS Consumer does the bulk of its work (i.e., consume the CAS).
The CasConsumer interface additionally defines batch and collection level processing methods. The CAS Consumer can implement the batchProcessComplete() method to perform processing that should occur at the end of each batch of CASes. Similarly, the CAS Consumer can implement the collectionProcessComplete() method to perform any collection level processing at the end of the collection.
A very simple example of a CAS Consumer, which writes an XML representation of the CAS to a file, is the XCAS Writer CAS Consumer. The Java code is in the class com.ibm.uima.examples.cpe.XCasWriterCasConsumer and the descriptor is in %UIMA_HOME%\docs\examples\descriptors\cas_consumer\XCasWriterCasConsumer.xml.
When extending the convenience class com.ibm.uima.collection.CasConsumer_ImplBase, the following abstract methods must be implemented:
The initialize() method is called by the framework when the CAS Consumer is first created. CasConsumer_ImplBase actually provides a default implementation of this method (i.e., it is not abstract), so you are not strictly required to implement this method. However, a typical CAS Consumer will implement this method to obtain parameter values and perform various initialization steps.
In this method, the CAS Consumer can access the values of its configuration parameters and perform other initialization logic. The example XCAS Writer CAS Consumer reads its configuration parameters and sets up the output directory:
public void initialize() throws ResourceInitializationException {
  mDocNum = 0;
  mOutputDir = new File((String)getConfigParameterValue(PARAM_OUTPUTDIR));
  if (!mOutputDir.exists()) {
    mOutputDir.mkdirs();
  }
}
The processCas() method is where the CAS Consumer does most of its work. In our example, the XCAS Writer CAS Consumer obtains an iterator over the document metadata in the CAS (in the SourceDocumentInformation feature structure, which is created by the File System Collection Reader) and extracts the URI for the current document. From this the output filename is constructed in the output directory and a subroutine (writeXCas) is called to generate the output file. The writeXCas subroutine uses the XCASSerializer class provided with the UIMA SDK to serialize the CAS to the output file (see the example source code for details).
  public void processCas(CAS aCAS) throws ResourceProcessException {
    JCas jcas;
    try {
      jcas = aCAS.getJCas();
    } catch (CASException e) {
      throw new ResourceProcessException(e);
    }

    // retrieve the filename of the input file from the CAS
    FSIterator it = jcas.getJFSIndexRepository()
        .getAnnotationIndex(SourceDocumentInformation.type).iterator();
    File outFile = null;
    if (it.hasNext()) {
      SourceDocumentInformation fileLoc = (SourceDocumentInformation) it.next();
      File inFile;
      try {
        inFile = new File(new URL(fileLoc.getUri()).getPath());
        outFile = new File(mOutputDir, inFile.getName());
      } catch (MalformedURLException e1) {
        // invalid URL, use default processing below
      }
    }
    if (null == outFile) {
      outFile = new File(mOutputDir, "doc" + mDocNum++);
    }

    // serialize XCAS and write to output file
    try {
      writeXCas(jcas.getCas(), outFile);
    } catch (IOException e) {
      throw new ResourceProcessException(e);
    } catch (SAXException e) {
      throw new ResourceProcessException(e);
    }
  }
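The file-naming rule used by processCas() can be tried in isolation. The following is a minimal sketch that depends only on the JDK; the class OutputNameSketch and its method outputFileFor are hypothetical names introduced for illustration and are not part of the UIMA SDK. It keeps the last path segment of the document URI and falls back to a generated "docN" name when the URI is missing or malformed.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

/** Stand-alone sketch of the output-file naming rule; hypothetical, not a UIMA class. */
final class OutputNameSketch {
    private int docNum = 0; // counter for fallback names, playing the role of mDocNum

    /** Returns the output file for a document URI, or a generated "docN" file. */
    File outputFileFor(File outputDir, String uri) {
        if (uri != null) {
            try {
                // keep only the file name (last path segment) of the source URL
                File inFile = new File(new URL(uri).getPath());
                return new File(outputDir, inFile.getName());
            } catch (MalformedURLException e) {
                // invalid URL: fall through to the generated name
            }
        }
        return new File(outputDir, "doc" + docNum++);
    }
}
```

With this sketch, a URI such as file:/data/docs/report.txt maps to report.txt in the output directory, while unusable URIs produce doc0, doc1, and so on.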
The following methods are optional in a CAS Consumer, though they are often used.
The framework calls the batchProcessComplete() method at the end of each batch of CASes. This gives the CAS Consumer an opportunity to perform any batch level processing. Our simple XCAS Writer CAS Consumer does not perform any batch level processing, so this method is empty. Batch size is set in the Collection Processing Engine descriptor.
The framework calls the collectionProcessComplete() method at the end of the collection (i.e., when all objects in the collection have been processed). At this point in time, no CAS is passed in as a parameter. This gives the CAS Consumer an opportunity to perform collection processing over the entire set of objects in the collection. Our simple XCAS Writer CAS Consumer does not perform any collection level processing, so this method is empty.
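The order in which the three callbacks fire can be illustrated without the framework. The sketch below is a toy model only: ConsumerCallbacks, CountingConsumer, and LifecycleSketch are hypothetical stand-ins, a String stands in for the CAS, and the driver assumes for illustration that a batch closes after every batchSize items. How the real CPM treats a final partial batch is not covered here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the consumer callbacks described above; not a UIMA type. */
interface ConsumerCallbacks {
    void processCas(String cas);
    void batchProcessComplete();
    void collectionProcessComplete();
}

/** Example consumer that counts items per batch and over the whole collection. */
final class CountingConsumer implements ConsumerCallbacks {
    final List<String> log = new ArrayList<>();
    private int inBatch = 0;
    private int total = 0;

    public void processCas(String cas) {
        inBatch++;
        total++;
    }

    public void batchProcessComplete() {
        log.add("batch of " + inBatch); // batch-level work goes here
        inBatch = 0;
    }

    public void collectionProcessComplete() {
        log.add("collection of " + total); // collection-level work; no CAS is passed in
    }
}

final class LifecycleSketch {
    /** Toy driver: closes a batch after every batchSize items (an assumption of this sketch). */
    static void run(ConsumerCallbacks consumer, List<String> docs, int batchSize) {
        int count = 0;
        for (String doc : docs) {
            consumer.processCas(doc);
            if (++count % batchSize == 0) {
                consumer.batchProcessComplete();
            }
        }
        consumer.collectionProcessComplete();
    }
}
```

Running the driver over four items with a batch size of two records two batch entries followed by one collection entry in the consumer's log.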
The CPM provides a number of service and deployment options that cover instantiation and execution of CPEs, error recovery, and local and distributed deployment of the CPE components. The behavior of the CPM (and correspondingly, the CPE) is controlled by various options and parameters set in the CPE descriptor. The current version of the CPE Configurator tool, however, supports only default error handling and deployment options. To change these options, you must manually edit the CPE descriptor, which is a potentially error-prone task.
Eventually the CPE Configurator tool will support configuring these options and a detailed tutorial for these settings will be provided. In the meantime, we provide only a high-level, conceptual overview of these advanced features in the rest of this chapter, and refer the advanced user to Chapter 24, Collection Processing Engine Descriptor Reference for details on setting these options in the CPE Descriptor.
Figure nn shows a logical view of how an application uses the UIMA framework to instantiate a CPE from a CPE descriptor. The CPE descriptor identifies the CPE components (referencing their corresponding descriptors) and specifies the various options for configuring the CPM and deploying the CPE components.
There are three deployment modes for CAS Processors (Analysis Engines and CAS Consumers) in a CPE:
An integrated CAS Processor runs in the same JVM as the CPE.
A managed CAS Processor runs in a separate process from the CPE, but still on the same computer. The CPE controls startup, shutdown, and recovery of a managed CAS Processor.
A non-managed CAS Processor runs as a service and may be on the same computer as the CPE or on a remote computer. A non-managed CAS Processor service is started and managed independently from the CPE.
For both managed and non-managed CAS Processors, the CAS must be transmitted between separate processes and possibly between separate computers. This is accomplished using Vinci, a communication protocol used by the CPM that ships with the UIMA SDK. Vinci handles service naming and location and data transport (see 6.6.2, How to Deploy a UIMA Component as a Vinci Service for more information). Service naming and location are provided by a Vinci Naming Service, or VNS. For managed CAS Processors, the CPE uses its own internal VNS. For non-managed CAS Processors, a separate VNS must be running.
The CPE Configurator tool currently only supports constructing CPEs that deploy CAS Processors in integrated mode. To deploy CAS Processors in any other mode, the CPE descriptor must be edited by hand (better tooling support is being worked on). Details on the CPE descriptor and the required settings for various CAS Processor deployment modes can be found in Chapter 24, Collection Processing Engine Descriptor Reference. In the following sections we merely summarize the various CAS Processor deployment options.
Managed CAS Processor deployment is shown in Figure nn. A managed CAS Processor is deployed by the CPE as a Vinci service. The CPE manages the lifecycle of the CAS Processor including service launch, restart on failures, and service shutdown. A managed CAS Processor runs on the same machine as the CPE, but in a separate process. This provides the necessary fault isolation for the CPE to protect it from non-robust CAS Processors. A fatal failure of a managed CAS Processor does not threaten the stability of the CPE.
The CPE communicates with managed CAS Processors using the Vinci communication protocol. A CAS Processor is launched as a Vinci service and its process() method is invoked remotely via a Vinci command. The CPE uses its own internal VNS to support managed CAS Processors. The VNS, by default, listens on port 9005. If this port is not available, the VNS will increment its listen port until it finds one that is available. All managed CAS Processors are internally configured to "talk" to the CPE-managed VNS. This internal VNS is transparent to the end user launching the CPE.
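The "increment until a port is available" behavior can be pictured with a short probe loop. This is only a sketch of the general technique using the JDK's ServerSocket; it is not the Vinci implementation, and the class PortProbeSketch and its method firstFreePort are hypothetical names.

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Sketch of probing upward from a start port; hypothetical, not Vinci code. */
final class PortProbeSketch {
    /** Returns the first port in [startPort, maxPort] that can be bound, or -1 if none. */
    static int firstFreePort(int startPort, int maxPort) {
        int limit = Math.min(maxPort, 65535); // highest valid TCP port
        for (int port = startPort; port <= limit; port++) {
            try (ServerSocket probe = new ServerSocket(port)) {
                return port; // bind succeeded; the probe socket is closed on return
            } catch (IOException busy) {
                // port is in use or cannot be bound: try the next one
            }
        }
        return -1;
    }
}
```

Calling firstFreePort(9005, 9105) would start at the documented default and stop at the first bindable port. A real service would keep the socket open rather than close it, since another process could claim the port between the probe and the later bind.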
To deploy a managed CAS Processor, the CPE deployer must change the CPE descriptor. The following is a section from the CPE descriptor that shows an example configuration specifying a managed CAS Processor.
  <casProcessor deployment="local" name="Meeting Detector TAE">
    <descriptor>
      <include href="deploy/vinci/Deploy_MeetingDetectorTAE.xml"/>
    </descriptor>
    <runInSeparateProcess>
      <exec dir="." executable="java">
        <env key="CLASSPATH" value="src;C:/Program Files/apache-uima/lib/uima_core.jar;C:/Program Files/apache-uima/lib/uima_cpe.jar;C:/Program Files/apache-uima/lib/uima_examples.jar;C:/Program Files/apache-uima/lib/uima_adapter_vinci.jar;C:/Program Files/apache-uima/lib/uima_jcas_builtin_types.jar;C:/Program Files/apache-uima/lib/vinci/jVinci.jar;C:/Program Files/apache-uima/lib/xml.jar"/>
        <arg>-DLOG=C:/Temp/service.log</arg>
        <arg>com.ibm.uima.reference_impl.collection.service.vinci.VinciCasObjectProcessorService_impl</arg>
        <arg>${descriptor}</arg>
      </exec>
    </runInSeparateProcess>
    <deploymentParameters/>
    <filter/>
    <errorHandling>
      <errorRateThreshold action="terminate" value="1/100"/>
      <maxConsecutiveRestarts action="terminate" value="3"/>
      <timeout max="100000"/>
    </errorHandling>
    <checkpoint batch="10000"/>
  </casProcessor>
See Chapter 24, Collection Processing Engine Descriptor Reference for details and required settings.
Non-managed CAS Processor deployment is shown in Figure nn. In non-managed mode, the CPE supports connectivity to CAS Processors running on local or remote computers using Vinci. Non-managed processors are different from managed processors in two aspects:
While non-managed CAS Processors provide the same level of fault isolation and robustness as managed CAS Processors, error recovery support for non-managed CAS Processors is much more limited. In particular, the CPE cannot restart a non-managed CAS Processor after an error.
Non-managed CAS Processors also require a separate Vinci Naming Service running on the network. This VNS must be manually started and monitored by the end user or application. Instructions for running a VNS can be found in Section 6.6.5, Starting VNS.
To deploy a non-managed CAS Processor, the CPE deployer must change the CPE descriptor. The following is a section from the CPE descriptor that shows an example configuration for the non-managed CAS Processor.
  <casProcessor deployment="remote" name="Meeting Detector TAE">
    <descriptor>
      <include href="descriptors/vinciService/MeetingDetectorVinciService.xml"/>
    </descriptor>
    <deploymentParameters/>
    <filter/>
    <errorHandling>
      <errorRateThreshold action="terminate" value="1/100"/>
      <maxConsecutiveRestarts action="terminate" value="3"/>
      <timeout max="100000"/>
    </errorHandling>
    <checkpoint batch="10000"/>
  </casProcessor>
See Chapter 24, Collection Processing Engine Descriptor Reference for details and required settings.
Integrated CAS Processors are shown in Figure 16. Here the CAS Processors run in the same JVM as the CPE, just like the Collection Reader and CAS Initializer. This deployment method results in minimal CAS communication and transport overhead as the CAS is shared in the same process space of the JVM. However, a CPE running with all integrated CAS Processors is limited in scalability by the capability of the single computer on which the CPE is running. There is also a stability risk associated with integrated processors because a poorly written CAS Processor can cause the JVM, and hence the entire CPE, to abort.
The following is a section from a CPE descriptor that shows an example configuration for the integrated CAS Processor.
  <casProcessor deployment="integrated" name="Meeting Detector TAE">
    <descriptor>
      <include href="descriptors/tutorial/ex4/MeetingDetectorTAE.xml"/>
    </descriptor>
    <deploymentParameters/>
    <filter/>
    <errorHandling>
      <errorRateThreshold action="terminate" value="100/1000"/>
      <maxConsecutiveRestarts action="terminate" value="30"/>
      <timeout max="100000"/>
    </errorHandling>
    <checkpoint batch="10000"/>
  </casProcessor>
See Chapter 24, Collection Processing Engine Descriptor Reference for details and required settings.
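In the three descriptor fragments shown in this chapter, the deployment attribute takes the values "integrated", "local" (managed), and "remote" (non-managed). As a rough aid when hand-editing descriptors, the following sketch reads one casProcessor fragment with the JDK's DOM parser and reports which mode it selects. The class CasProcessorModeSketch and its methods are hypothetical and not part of the UIMA SDK, and the mapping covers only the values that appear in these examples.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Sketch that summarizes a <casProcessor> fragment; hypothetical, not a UIMA class. */
final class CasProcessorModeSketch {
    /** Maps the deployment attribute values seen in the examples to mode names. */
    static String modeName(String deployment) {
        switch (deployment) {
            case "integrated": return "integrated";
            case "local":      return "managed";
            case "remote":     return "non-managed";
            default:           return "unknown";
        }
    }

    /** Parses one fragment and returns "name: mode, maxConsecutiveRestarts=value". */
    static String summarize(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element cp = doc.getDocumentElement();
        Element restarts =
                (Element) cp.getElementsByTagName("maxConsecutiveRestarts").item(0);
        String value = (restarts == null) ? "n/a" : restarts.getAttribute("value");
        return cp.getAttribute("name") + ": " + modeName(cp.getAttribute("deployment"))
                + ", maxConsecutiveRestarts=" + value;
    }
}
```

A check like this only confirms which mode a fragment selects; the required settings for each mode are those given in the descriptor reference.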
The UIMA SDK includes a set of examples illustrating the three modes of deployment: integrated, managed, and non-managed. These are in the /docs/examples/descriptors/collection_processing_engine directory. There are three CPE descriptors that run an example annotator (the Meeting Finder) in these modes.
To run either the integrated or managed examples, use the runCPE script in the /bin directory of the UIMA installation, passing the appropriate CPE descriptor as an argument.
The runCPE script must be run from the %UIMA_HOME%\docs\examples directory, because it uses relative path names that are resolved relative to this working directory. For instance,
  runCPE descriptors\collection_processing_engine\MeetingFinderCPE_Integrated.xml
If you installed the examples into Eclipse, you can run directly from Eclipse by creating a run configuration. To do this, highlight the SimpleRunCPE.java source file in the examples src/com/ibm/uima/examples/cpe directory, and then create a run configuration for it that passes the CPE descriptor as the program argument:
  descriptors/collection_processing_engine/MeetingFinderCPE_Integrated.xml
To run the non-managed example, there are some additional steps.
1. Start a VNS by running the startVNS script in the /bin directory.
2. Deploy the Meeting Detector TAE as a Vinci service by running the startVinciService script in the /bin directory, passing it the location of the descriptor to deploy, in this case %UIMA_HOME%/docs/examples/deploy/vinci/Deploy_MeetingDetectorTAE.xml
3. Run the runCPE script, passing it the CPE descriptor for the non-managed version (%UIMA_HOME%/docs/examples/descriptors/collection_processing_engine/MeetingFinderCPE_NonManaged.xml).
This assumes that the Vinci Naming Service, the runCPE application, and the MeetingDetectorTAE service are all running on the same machine. Most of the scripts that need information about the VNS look for values in the environment variables VNS_HOST and VNS_PORT; these default to "localhost" and "9000". You may set these to appropriate values before running the scripts, as needed; you can also pass the name of the VNS host as the second argument to the startVinciService script.
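An application that launches these pieces itself may want to resolve the VNS location the same way. The following sketch mirrors the documented defaults in plain Java; it is an illustration only, not how the shipped scripts are implemented, and the class VnsSettingsSketch and its method resolve are hypothetical names.

```java
import java.util.Map;

/** Sketch of resolving VNS settings with the documented defaults; hypothetical class. */
final class VnsSettingsSketch {
    /** Returns {host, port}, falling back to "localhost" and "9000" when unset. */
    static String[] resolve(Map<String, String> env) {
        String host = env.getOrDefault("VNS_HOST", "localhost");
        String port = env.getOrDefault("VNS_PORT", "9000");
        return new String[] { host, port };
    }

    public static void main(String[] args) {
        // read the real process environment
        String[] hostAndPort = resolve(System.getenv());
        System.out.println("VNS at " + hostAndPort[0] + ":" + hostAndPort[1]);
    }
}
```

Passing the environment in as a map keeps the lookup easy to exercise with test values instead of the real process environment.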
Alternatively, you can edit the scripts and/or the XML files to specify alternatives for the VNS_HOST and VNS_PORT. For instance, if the runCPE application is running on a different machine from the Vinci Naming Service, you can edit the MeetingFinderCPE_NonManaged.xml and change the vnsHost parameter:
  <parameter name="vnsHost" value="localhost" type="string"/>
to specify the VNS host instead of "localhost".