CAS Multiplier Developer's Guide

The UIMA analysis components (Annotators and CAS Consumers) described previously in this manual all take a single CAS as input, optionally make modifications to it, and output that same CAS. This chapter describes an advanced feature that became available in the UIMA SDK v2.0: a new type of analysis component called a CAS Multiplier, which can create new CASes during processing.

CAS Multipliers are often used to split a large artifact into manageable pieces. This is a common requirement of audio and video analysis applications, but can also occur in text analysis on very large documents. A CAS Multiplier would take as input a single CAS representing the large artifact (perhaps by a remote reference to the actual data -- see Formats of Sofa Data) and produce as output a series of new CASes, each of which contains only a small portion of the original artifact.

CAS Multipliers are not limited to dividing an artifact into smaller pieces, however. A CAS Multiplier can also be used to combine smaller segments together to form larger segments. In general, a CAS Multiplier is used to change the segmentation of a series of CASes; that is, to change how a stream of data is divided among discrete CAS objects.

CAS Multiplier Interface Overview

CAS Multiplier implementations should extend from the JCasMultiplier_ImplBase or CasMultiplier_ImplBase classes, depending on which CAS interface they prefer to use. As with other types of analysis components, the CAS Multiplier ImplBase classes define optional initialize, destroy, and reconfigure methods. There are then three required methods: process, hasNext, and next. The framework interacts with these methods as follows:

1. The framework calls the CAS Multiplier's process method, passing it an input CAS. The process method returns, but may hold on to a reference to the input CAS.
2. The framework then calls the CAS Multiplier's hasNext method. The CAS Multiplier should return true from this method if it intends to output one or more new CASes (for instance, segments of this CAS), and false if not.
3. If hasNext returned true, the framework will call the CAS Multiplier's next method. The CAS Multiplier creates a new CAS (we will see how in a moment), populates it, and returns it from the next method.
4. Steps 2 and 3 repeat until hasNext returns false.

From the time when process is called until the hasNext method returns false, the CAS Multiplier "owns" the CAS that was passed to its process method. The CAS Multiplier can store a reference to this CAS in a local field and can read from it or write to it during this time. Once hasNext returns false, the CAS Multiplier gives up ownership of the input CAS and should no longer retain a reference to it.
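
The call sequence above can be modeled in plain Java. The following is only an illustrative sketch: ToyMultiplier, WordSplitter, and drive are hypothetical names that do not exist in the UIMA SDK, and Strings stand in for CASes. It shows the order in which a framework-like driver would call process, hasNext, and next.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the call sequence; none of these types are UIMA classes. */
public class MultiplierProtocolSketch {

    /** Hypothetical stand-in for the three required CAS Multiplier methods. */
    interface ToyMultiplier {
        void process(String input);
        boolean hasNext();
        String next();
    }

    /** Toy multiplier that emits one output per whitespace-separated word. */
    static class WordSplitter implements ToyMultiplier {
        private String[] words = new String[0];
        private int pos;

        public void process(String input) {
            String trimmed = input.trim();
            // "".split(...) would yield one empty string, so handle empty input explicitly
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
            pos = 0;
        }

        public boolean hasNext() {
            return pos < words.length;
        }

        public String next() {
            return words[pos++];
        }
    }

    /** Plays the role of the framework: process once, then poll hasNext/next. */
    static List<String> drive(ToyMultiplier multiplier, String input) {
        List<String> outputs = new ArrayList<>();
        multiplier.process(input);            // step 1
        while (multiplier.hasNext()) {        // step 2
            outputs.add(multiplier.next());   // step 3
        }
        // hasNext returned false: the multiplier no longer owns the input
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(drive(new WordSplitter(), "one two three"));
    }
}
```

In this model the multiplier keeps state derived from the input between calls, which mirrors the ownership rule: that state is only valid until hasNext returns false.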

How to Get an Empty CAS Instance

The CAS Multiplier's next method must return a CAS instance that holds a new representation of the input artifact. Since CAS instances are managed by the framework, the CAS Multiplier cannot actually create a new CAS; instead it should request an empty CAS by calling one of the methods:

CAS getEmptyCAS()

JCas getEmptyJCas()

which are defined on the CasMultiplier_ImplBase and JCasMultiplier_ImplBase classes, respectively.

Note that if it is more convenient you can request an empty CAS during the process or hasNext methods, not just during the next method.

By default, a CAS Multiplier is only allowed to hold one output CAS instance at a time. You must return the CAS from the next method before you can request a second CAS. If you call getEmptyCAS a second time before doing so, you will get an Exception. You can change this default behavior by overriding the method getCasInstancesRequired to return the number of CAS instances that you need. Be aware that CAS instances consume a significant amount of memory, so setting this to a large value will cause your application to use a lot of RAM. For this reason it is not good practice to generate a large number of new CASes in the CAS Multiplier's process method. Instead, you should spread your processing out across the calls to the hasNext and next methods.
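
The one-instance-at-a-time rule can be pictured with a small bounded pool. This is a plain-Java model under stated assumptions, not the framework's implementation: BoundedPoolSketch and its methods are hypothetical, StringBuilder stands in for a CAS, and IllegalStateException is an arbitrary choice because the text above does not name the exception type. The constructor argument plays the role of the value returned by getCasInstancesRequired.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Plain-Java model of a bounded instance pool; not a UIMA class. */
public class BoundedPoolSketch {
    private final Deque<StringBuilder> free = new ArrayDeque<>();

    /** instancesRequired models the value returned by getCasInstancesRequired. */
    BoundedPoolSketch(int instancesRequired) {
        for (int i = 0; i < instancesRequired; i++) {
            free.push(new StringBuilder());
        }
    }

    /** Hands out an empty instance, or fails if every instance is checked out. */
    StringBuilder getEmpty() {
        if (free.isEmpty()) {
            // Assumption: the exception type is illustrative only
            throw new IllegalStateException("no free instance; return the previous one first");
        }
        return free.pop();
    }

    /** Clears the instance and makes it available again. */
    void release(StringBuilder instance) {
        instance.setLength(0);
        free.push(instance);
    }

    int available() {
        return free.size();
    }

    public static void main(String[] args) {
        BoundedPoolSketch pool = new BoundedPoolSketch(1);
        StringBuilder first = pool.getEmpty();
        try {
            pool.getEmpty();   // second request while the first is still held
        } catch (IllegalStateException e) {
            System.out.println("second request rejected: " + e.getMessage());
        }
        pool.release(first);
        System.out.println("available after release: " + pool.available());
    }
}
```

Raising the pool size removes the failure but keeps more instances alive at once, which is the memory trade-off described above.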

Example Code

This section walks through the source code of an example CAS Multiplier that breaks text documents into smaller pieces. The Java class for the example is com.ibm.uima.examples.casMultiplier.SimpleTextSegmenter and the source code is included in the UIMA SDK under the docs/examples/src directory.

Overall Structure

public class SimpleTextSegmenter extends JCasMultiplier_ImplBase {
  private String mDoc;
  private int mPos;
  private int mSegmentSize;
  private String mDocUri;

  public void initialize(UimaContext aContext)
      throws ResourceInitializationException { ... }

  public void process(JCas aJCas) throws AnalysisEngineProcessException { ... }

  public boolean hasNext() throws AnalysisEngineProcessException { ... }

  public AbstractCas next() throws AnalysisEngineProcessException { ... }
}

The SimpleTextSegmenter class extends JCasMultiplier_ImplBase and implements the optional initialize method as well as the required process, hasNext, and next methods. Each method is described below.

Initialize Method

public void initialize(UimaContext aContext)
    throws ResourceInitializationException {
  super.initialize(aContext);
  mSegmentSize = ((Integer) aContext.getConfigParameterValue(
      "SegmentSize")).intValue();
}

Like an Annotator, a CAS Multiplier can override the initialize method and read configuration parameter values from the UimaContext. The SimpleTextSegmenter defines one parameter, "SegmentSize", which determines the approximate size (in characters) of each segment that it will produce.

Process Method

public void process(JCas aJCas) throws AnalysisEngineProcessException {
  mDoc = aJCas.getDocumentText();
  mPos = 0;
  // retrieve the filename of the input file from the CAS so that it can
  // be added to each segment
  FSIterator it = aJCas.getJFSIndexRepository()
      .getAnnotationIndex(SourceDocumentInformation.type).iterator();
  if (it.hasNext()) {
    SourceDocumentInformation fileLoc = (SourceDocumentInformation) it.next();
    mDocUri = fileLoc.getUri();
  } else {
    mDocUri = null;
  }
}

The process method receives a new JCas to be processed (segmented) by this CAS Multiplier. The SimpleTextSegmenter extracts some information from this JCas and stores it in fields (the document text is stored in the field mDoc and the source URI in the field mDocUri). Recall that the CAS Multiplier is considered to "own" the JCas from the time when process is called until the time when hasNext returns false. Therefore it is acceptable to retain references to objects from the JCas in a CAS Multiplier, whereas this should never be done in an Annotator. The CAS Multiplier could have chosen to store a reference to the JCas itself, but that was not necessary for this example.

The CAS Multiplier also initializes the mPos variable to 0. This variable is a position into the document text and will be incremented as each new segment is produced.

HasNext Method

public boolean hasNext() throws AnalysisEngineProcessException {
  return mPos < mDoc.length();
}

The job of the hasNext method is to report whether there are any additional output CASes to produce. For this example, the CAS Multiplier will break the entire input document into segments, so we know there will always be a next segment until the very end of the document has been reached.

Next Method

public AbstractCas next() throws AnalysisEngineProcessException {
  int breakAt = mPos + mSegmentSize;
  if (breakAt > mDoc.length())
    breakAt = mDoc.length();
  // Search for the next newline character. Note: this example
  // segmenter implementation assumes that the document contains many
  // newlines. In the worst case, if this segmenter is run on a
  // document with no newlines, it will produce only one segment
  // containing the entire document text. A better implementation
  // might specify a maximum segment size as well as a minimum.
  while (breakAt < mDoc.length() && mDoc.charAt(breakAt - 1) != '\n')
    breakAt++;

  JCas jcas = getEmptyJCas();
  try {
    jcas.setDocumentText(mDoc.substring(mPos, breakAt));
    // if original CAS had SourceDocumentInformation,
    // also add SourceDocumentInformation to each segment
    if (mDocUri != null) {
      SourceDocumentInformation sdi = new SourceDocumentInformation(jcas);
      sdi.setUri(mDocUri);
      sdi.setOffsetInSource(mPos);
      sdi.setDocumentSize(breakAt - mPos);
      sdi.addToIndexes();
    }
    mPos = breakAt;
    return jcas;
  } catch (Exception e) {
    jcas.release();
    throw new AnalysisEngineProcessException(e);
  }
}

The next method actually produces the next segment and returns it. The framework guarantees that it will not call next unless hasNext has returned true since the last call to process or next.

Note that in order to produce a segment, the CAS Multiplier must get an empty JCas to populate. This is done by the line:

JCas jcas = getEmptyJCas();

This requests an empty JCas from the framework, which maintains a pool of JCas instances to draw from.

Also, note the use of the try...catch block to ensure that a JCas is released back to the pool if an exception occurs. This is very important to allow a CAS Multiplier to recover from errors.
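
To see what segments the break-point search produces, it can help to run the same logic outside the framework. The sketch below ports the hasNext condition and the newline search from the next method to a plain Java method operating on a String. SegmentationSketch and segment are hypothetical names, and the guard against a non-positive segment size is an addition that the example class does not contain.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java port of the example's segmentation logic; not part of the UIMA SDK. */
public class SegmentationSketch {

    /** Splits doc into segments of at least segmentSize characters, each ending at a newline where possible. */
    static List<String> segment(String doc, int segmentSize) {
        if (segmentSize < 1) {
            // Added guard (assumption): a size below 1 would make no progress
            throw new IllegalArgumentException("segmentSize must be at least 1");
        }
        List<String> segments = new ArrayList<>();
        int pos = 0;
        while (pos < doc.length()) {                        // same test as hasNext
            int breakAt = Math.min(pos + segmentSize, doc.length());
            // advance until the segment ends just after a newline, or the text ends
            while (breakAt < doc.length() && doc.charAt(breakAt - 1) != '\n') {
                breakAt++;
            }
            segments.add(doc.substring(pos, breakAt));
            pos = breakAt;
        }
        return segments;
    }

    public static void main(String[] args) {
        for (String s : segment("ab\ncd\nef\n", 2)) {
            System.out.println("[" + s.replace("\n", "\\n") + "]");
        }
    }
}
```

With the input "ab\ncd\nef\n" and a segment size of 2, each line becomes its own segment. With an input that contains no newline, the inner loop runs to the end of the text and a single segment is produced, which is the worst case mentioned in the code comment above.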

Creating the CAS Multiplier Descriptor

There is not a separate type of descriptor for a CAS Multiplier. CAS Multipliers are considered a type of Analysis Engine, and so their descriptors use the same syntax as any other Analysis Engine Descriptor.

The Analysis Engine Descriptor, in its "Operational Properties" section, now contains a new "outputsNewCASes" property which takes a Boolean value. If the Analysis Engine is a CAS Multiplier, this property should be set to true.

If you use the CDE, be sure to check the "Outputs new CASes" box in the Runtime Information section on the Overview page.

If you edit the Analysis Engine Descriptor by hand, you need to add a <outputsNewCASes> element to your descriptor as shown here:

<operationalProperties>
  <modifiesCas>false</modifiesCas>
  <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
  <outputsNewCASes>true</outputsNewCASes>
</operationalProperties>

The "modifiesCas" operational property refers to the input CAS, not the new output CASes produced. So our example SimpleTextSegmenter has modifiesCas set to false, since it doesn't modify the input CAS.

You can include a CAS Multiplier as a component in an Aggregate Analysis Engine. For example, this allows you to construct an Aggregate Analysis Engine that takes each input CAS, breaks it up into segments, and runs a series of Annotators on each segment.

Adding the CAS Multiplier to the Aggregate

Since CAS Multipliers are considered a type of Analysis Engine, adding them to an aggregate works the same way as for other Analysis Engines. Using the CDE, you just click the "Add..." button in the Component Engines view and browse to the Analysis Engine Descriptor of your CAS Multiplier. If editing the aggregate descriptor directly, just import the Analysis Engine Descriptor of your CAS Multiplier as usual.

CAS Multipliers and Flow Control

CAS Multipliers are only supported in the context of Fixed Flow or custom Flow Control. If you use the built-in "Fixed Flow" for your Aggregate Analysis Engine, you can position the CAS Multiplier anywhere in that flow. Processing then works as follows: When a CAS is input to the Aggregate AE, that CAS is routed to the components in the order specified by the Fixed Flow, until that CAS reaches a CAS Multiplier. Once the CAS reaches a CAS Multiplier, it will not complete the rest of the flow. Instead, each new output CAS from that CAS Multiplier will continue through the flow, starting at the node immediately after the CAS Multiplier in the Fixed Flow. No further processing will be done on the original input CAS after it has reached a CAS Multiplier.

It is possible to put more than one CAS Multiplier in your flow. In this case, when a new CAS output from the first CAS Multiplier reaches the second CAS Multiplier, no further processing will occur on that CAS, and any new output CASes produced by the second CAS Multiplier will continue the flow starting at the node after the second CAS Multiplier.
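
The routing rule for a Fixed Flow can be simulated in a few lines. This is a sketch of the behavior described above, not the framework's flow controller: FixedFlowSketch, Step, route, and demo are hypothetical names, the component names are made up, and Strings stand in for CASes. Each log line shows one CAS and the components that processed it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Plain-Java simulation of Fixed Flow routing around a multiplier; not UIMA code. */
public class FixedFlowSketch {

    /** One node of the flow; split is null for an ordinary annotator. */
    static class Step {
        final String name;
        final Function<String, List<String>> split;

        Step(String name, Function<String, List<String>> split) {
            this.name = name;
            this.split = split;
        }
    }

    /** Routes one CAS through steps[from..] and logs which steps it visited. */
    static void route(String cas, List<Step> steps, int from, List<String> log) {
        List<String> visited = new ArrayList<>();
        for (int i = from; i < steps.size(); i++) {
            Step step = steps.get(i);
            visited.add(step.name);
            if (step.split != null) {
                // The input CAS stops here; it does not complete the rest of the flow
                log.add(cas + " " + visited);
                for (String child : step.split.apply(cas)) {
                    // Each new CAS continues at the node after the multiplier
                    route(child, steps, i + 1, log);
                }
                return;
            }
        }
        log.add(cas + " " + visited);
    }

    /** Example flow: annotator, multiplier that splits on commas, annotator. */
    static List<String> demo() {
        List<Step> steps = Arrays.asList(
            new Step("Tokenizer", null),
            new Step("Segmenter", text -> Arrays.asList(text.split(","))),
            new Step("Tagger", null));
        List<String> log = new ArrayList<>();
        route("x,y", steps, 0, log);
        return log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because route calls itself for every new CAS, a second multiplier later in the list is handled the same way: the CAS that reaches it stops, and its children continue at the following node.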

If you would like to have a different kind of flow, you will need to implement a custom FlowController as described in the Flow Controller Developer's Guide. For example, you could implement a flow where a CAS that is input to a CAS Multiplier will continue to be processed by other components after the CAS Multiplier is finished with it.

Aggregate CAS Multipliers

An important consideration when you put a CAS Multiplier inside an Aggregate Analysis Engine is whether you want the Aggregate to also function as a CAS Multiplier – that is, whether you want the new output CASes produced within the Aggregate to be output from the Aggregate. This is controlled by the <outputsNewCASes> element in the Operational Properties of your Aggregate Analysis Engine descriptor. The syntax is the same as what was described in the section Creating the CAS Multiplier Descriptor.

If you set this property to true, then any new output CASes produced by a CAS Multiplier inside this Aggregate will be output from the Aggregate. Thus the Aggregate will function as a CAS Multiplier and can be used in any of the ways in which a primitive CAS Multiplier can be used.

If you set the <outputsNewCASes> property to false, then any new output CASes produced by a CAS Multiplier inside the Aggregate will be dropped (i.e. the CASes will be released back to the pool) once they have finished being processed. Such an Aggregate Analysis Engine functions just like a "normal" non-CAS-Multiplier Analysis Engine; the fact that CAS Multiplication is occurring inside it is hidden from users of that Analysis Engine.

If you want to output some new output CASes and not others, you need to implement a custom Flow Controller that makes this decision -- see the Flow Controller Developer's Guide.

It is currently a limitation that a CAS Multiplier cannot be deployed directly in a Collection Processing Engine. The only way that you can use a CAS Multiplier in a CPE is to first wrap it in an Aggregate Analysis Engine whose outputsNewCASes property is set to false, which in effect hides the existence of the CAS Multiplier from the CPE.

Note that you can build an Aggregate Analysis Engine that consists of CAS Multipliers and Annotators, followed by CAS Consumers. This can simulate what a CPE would do, but without the deployment and error handling options that the CPE provides.

The AnalysisEngine interface has the following methods that allow you to interact with a CAS Multiplier:

CasIterator processAndOutputNewCASes(CAS)

JCasIterator processAndOutputNewCASes(JCas)

From your application, you call processAndOutputNewCASes and pass it the input CAS. An iterator is returned that allows you to step through each of the new output CASes that are produced by the Analysis Engine.

It is very important to realize that CASes are pooled objects and so your application must release each CAS (by calling the CAS.release() method) that it obtains from the CasIterator before it calls the CasIterator.next method again. Otherwise, the CAS pool will be exhausted and a deadlock will occur.

The example code in the class com.ibm.uima.examples.casMultiplier.CasMultiplierExampleApplication illustrates this. Here is the main processing loop:

CasIterator casIterator = ae.processAndOutputNewCASes(initialCas);
while (casIterator.hasNext()) {
  CAS outCas = casIterator.next();

  // dump the document text and annotations for this segment
  System.out.println("********* NEW SEGMENT *********");
  System.out.println(outCas.getDocumentText());
  PrintAnnotations.printAnnotations(outCas, System.out);

  // release the CAS (important)
  outCas.release();
}

Note that as defined by the CAS Multiplier contract in section CAS Multiplier Interface Overview, the CAS Multiplier owns the input CAS (initialCas in the example) until the last new output CAS has been produced. This means that the application should not try to make changes to initialCas until after the CasIterator.hasNext method has returned false, indicating that the segmenter has finished.

Note that the processing time of the Analysis Engine is spread out over the calls to the CasIterator's hasNext and next methods. That is, the next output CAS may not actually be produced and annotated until the application asks for it. So the application should not expect calls to the CasIterator to necessarily complete quickly.

Also, calls to the CasIterator may throw Exceptions indicating an error has occurred during processing. If an Exception is thrown, all processing of the input CAS will stop, and no more output CASes will be produced. There is currently no error recovery mechanism that will allow processing to continue after an exception.
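
One way to make sure each output CAS is released even when the application's own handling code fails is to put the release in a finally block. The sketch below models this with plain Java and is not UIMA code: PooledSegment, SegmentIterator, and consume are hypothetical names, and the pool is a one-permit Semaphore. Where the real CAS pool would deadlock when exhausted, this model throws an IllegalStateException so that the failure is visible.

```java
import java.util.concurrent.Semaphore;

/** Plain-Java model of release-before-next; not a UIMA class. */
public class ReleaseBeforeNextSketch {

    /** Hypothetical pooled output; release() gives its pool slot back. */
    static class PooledSegment {
        final String text;
        private final Semaphore pool;
        private boolean released;

        PooledSegment(String text, Semaphore pool) {
            this.text = text;
            this.pool = pool;
        }

        void release() {
            if (!released) {          // releasing twice must not add permits
                released = true;
                pool.release();
            }
        }
    }

    /** Hypothetical iterator that needs a free pool slot to build each output. */
    static class SegmentIterator {
        private final String[] parts;
        private final Semaphore pool = new Semaphore(1);
        private int pos;

        SegmentIterator(String... parts) {
            this.parts = parts;
        }

        boolean hasNext() {
            return pos < parts.length;
        }

        PooledSegment next() {
            // The model fails fast where a real pool would wait forever
            if (!pool.tryAcquire()) {
                throw new IllegalStateException("pool exhausted: previous segment not released");
            }
            return new PooledSegment(parts[pos++], pool);
        }
    }

    /** Consumes every segment, releasing each one before asking for the next. */
    static int consume(SegmentIterator it) {
        int totalChars = 0;
        while (it.hasNext()) {
            PooledSegment segment = it.next();
            try {
                totalChars += segment.text.length();   // application work goes here
            } finally {
                segment.release();                     // runs even if the work throws
            }
        }
        return totalChars;
    }

    public static void main(String[] args) {
        System.out.println(consume(new SegmentIterator("ab", "cde")));
    }
}
```

The finally block only protects against errors in the application's handling of a segment. An exception thrown by the iterator itself still ends processing of the input, as described above.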