The UIMA analysis components (Annotators and CAS Consumers) described previously in this manual all take a single CAS as input, optionally make modifications to it, and output that same CAS. This chapter describes an advanced feature that became available in the UIMA SDK v2.0: a new type of analysis component called a CAS Multiplier, which can create new CASes during processing.
CAS Multipliers are often used to split a large artifact into manageable pieces. This is a common requirement of audio and video analysis applications, but can also occur in text analysis on very large documents. A CAS Multiplier would take as input a single CAS representing the large artifact (perhaps by a remote reference to the actual data; see Formats of Sofa Data) and produce as output a series of new CASes, each of which contains only a small portion of the original artifact.
CAS Multipliers are not limited to dividing an artifact into smaller pieces, however. A CAS Multiplier can also be used to combine smaller segments together to form larger segments. In general, a CAS Multiplier is used to change the segmentation of a series of CASes; that is, to change how a stream of data is divided among discrete CAS objects.
CAS Multiplier implementations should extend either the JCasMultiplier_ImplBase or the CasMultiplier_ImplBase class, depending on which CAS interface they prefer to use. As with other types of analysis components, the CAS Multiplier ImplBase classes define optional initialize, destroy, and reconfigure methods. There are then three required methods: process, hasNext, and next. The framework interacts with these methods as follows:
1. The framework calls the CAS Multiplier's process method, passing it an input CAS. The process method returns, but may hold on to a reference to the input CAS.
2. The framework then calls the CAS Multiplier's hasNext method. The CAS Multiplier should return true from this method if it intends to output one or more new CASes (for instance, segments of this CAS), and false if not.
3. If hasNext returned true, the framework will call the CAS Multiplier's next method. The CAS Multiplier creates a new CAS (we will see how in a moment), populates it, and returns it from the next method.
4. Steps 2 and 3 repeat until hasNext returns false.
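The call sequence described above can be modeled in plain Java. The sketch below is not UIMA code: ToyMultiplier, WordSplitter, and ProtocolDemo are hypothetical stand-ins that use strings in place of CASes, purely to show the order in which a framework would call process, hasNext, and next.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CAS Multiplier; strings play the role of CASes.
interface ToyMultiplier {
    void process(String input);
    boolean hasNext();
    String next();
}

// Toy multiplier that emits one output per whitespace-separated word.
class WordSplitter implements ToyMultiplier {
    private String[] words = new String[0];
    private int pos;

    public void process(String input) {
        String trimmed = input.trim();
        // "".split(...) would yield one empty string, so handle blank input explicitly.
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        pos = 0;
    }

    public boolean hasNext() {
        return pos < words.length;
    }

    public String next() {
        return words[pos++];
    }
}

class ProtocolDemo {
    // Mimics the framework: one process call, then hasNext/next until hasNext is false.
    static List<String> drive(ToyMultiplier m, String input) {
        List<String> outputs = new ArrayList<>();
        m.process(input);            // hand over the input
        while (m.hasNext()) {        // ask whether another output is coming
            outputs.add(m.next());   // collect the next output
        }
        return outputs;              // hasNext returned false: the multiplier is done
    }

    public static void main(String[] args) {
        System.out.println(drive(new WordSplitter(), "one two three"));
    }
}
```

The driver never calls next without a preceding hasNext that returned true, which is the guarantee a multiplier implementation can rely on.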
From the time when process
is
called until the hasNext
method returns false, the
CAS Multiplier "owns" the CAS that was passed to its process
method. The CAS Multiplier can store a reference to this CAS in a local field
and can read from it or write to it during this time. Once hasNext
returns false, the CAS Multiplier gives up ownership of the input CAS and
should no longer retain a reference to it.
The CAS Multiplier's next method must return a CAS instance that holds a new representation of the input artifact (for example, one segment of it). Since CAS instances are managed by the framework, the CAS Multiplier cannot actually create a new CAS; instead it should request an empty CAS by calling the method:
CAS getEmptyCAS()
or
JCas getEmptyJCas()
which are defined on the CasMultiplier_ImplBase and JCasMultiplier_ImplBase classes, respectively.
Note that if it is more convenient you can request an
empty CAS during the process
or hasNext
methods, not just during the next
method.
By default, a CAS Multiplier is only allowed to hold one output CAS instance at a time. You must return the CAS from the next method before you can request a second CAS. If you try to call getEmptyCAS a second time before returning the first CAS, you will get an Exception. You can change this default behavior by overriding the method getCasInstancesRequired to return the number of CAS instances that you need. Be aware that CAS instances consume a significant amount of memory, so setting this to a large value will cause your application to use a lot of RAM. For example, it is not good practice to generate a large number of new CASes in the CAS Multiplier's process method. Instead, you should spread your processing out across the calls to the hasNext or next methods.
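The one-instance-at-a-time rule can be pictured with a small bounded pool. The sketch below is a plain-Java model under stated assumptions, not the framework's implementation: ToyCasPool and PoolDemo are hypothetical names, a StringBuilder stands in for a CAS, and the capacity argument plays the role of the value returned by getCasInstancesRequired.

```java
// Hypothetical model of a bounded CAS pool; not part of the UIMA API.
class ToyCasPool {
    private final int capacity;
    private int checkedOut;

    ToyCasPool(int capacity) {
        this.capacity = capacity;
    }

    // Hands out an "empty CAS" (here a StringBuilder), or fails if the limit is reached.
    StringBuilder getEmpty() {
        if (checkedOut >= capacity) {
            throw new IllegalStateException("all " + capacity + " instance(s) are in use");
        }
        checkedOut++;
        return new StringBuilder();
    }

    // Models returning the CAS from next(), which frees the slot again.
    void giveBack() {
        if (checkedOut > 0) {
            checkedOut--;
        }
    }
}

class PoolDemo {
    // Returns true if two instances can be held at the same time with the given capacity.
    static boolean canHoldTwo(int capacity) {
        ToyCasPool pool = new ToyCasPool(capacity);
        pool.getEmpty();
        try {
            pool.getEmpty();   // second request while the first is still held
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("capacity 1: " + canHoldTwo(1));
        System.out.println("capacity 2: " + canHoldTwo(2));
    }
}
```

Raising the capacity makes the second request succeed, at the cost of keeping one more instance in memory for the lifetime of the pool.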
This section walks through the source code of an example
CAS Multiplier that breaks text documents into smaller pieces. The Java class for the example is com.ibm.uima.examples.casMultiplier.SimpleTextSegmenter
and the source code is included in the UIMA SDK under the docs/examples/src
directory.
public class SimpleTextSegmenter extends JCasMultiplier_ImplBase {
  private String mDoc;
  private int mPos;
  private int mSegmentSize;
  private String mDocUri;

  public void initialize(UimaContext aContext) throws ResourceInitializationException { ... }

  public void process(JCas aJCas) throws AnalysisEngineProcessException { ... }

  public boolean hasNext() throws AnalysisEngineProcessException { ... }

  public AbstractCas next() throws AnalysisEngineProcessException { ... }
}
The SimpleTextSegmenter class extends JCasMultiplier_ImplBase and implements the optional initialize method as well as the required process, hasNext, and next methods. Each method is described below.
public void initialize(UimaContext aContext) throws ResourceInitializationException {
  super.initialize(aContext);
  mSegmentSize = ((Integer) aContext.getConfigParameterValue("SegmentSize")).intValue();
}
Like an Annotator, a CAS Multiplier can override the initialize method and read configuration parameter values from the UimaContext. The SimpleTextSegmenter defines one parameter, "SegmentSize", which determines the approximate size (in characters) of each segment that it will produce.
public void process(JCas aJCas) throws AnalysisEngineProcessException
{
  mDoc = aJCas.getDocumentText();
  mPos = 0;
  // retrieve the filename of the input file from the CAS so that it can
  // be added to each segment
  FSIterator it = aJCas.getJFSIndexRepository()
      .getAnnotationIndex(SourceDocumentInformation.type).iterator();
  if (it.hasNext())
  {
    SourceDocumentInformation fileLoc = (SourceDocumentInformation) it.next();
    mDocUri = fileLoc.getUri();
  }
  else
  {
    mDocUri = null;
  }
}
The process method receives a new JCas to be processed (segmented) by this CAS Multiplier. The SimpleTextSegmenter extracts some information from this JCas and stores it in fields (the document text is stored in the field mDoc and the source URI in the field mDocUri). Recall that the CAS Multiplier is considered to "own" the JCas from the time when process is called until the time when hasNext returns false. Therefore it is acceptable to retain references to objects from the JCas in a CAS Multiplier, whereas this should never be done in an Annotator. The CAS Multiplier could have chosen to store a reference to the JCas itself, but that was not necessary for this example.
The CAS Multiplier also initializes the mPos variable to 0. This variable is a position into the document text and will be incremented as each new segment is produced.
public boolean hasNext() throws AnalysisEngineProcessException
{
return mPos < mDoc.length();
}
The job of the hasNext method is to report whether there are any additional output CASes to produce. For this example, the CAS Multiplier will break the entire input document into segments, so we know there will always be a next segment until the very end of the document has been reached.
public AbstractCas next() throws AnalysisEngineProcessException {
  int breakAt = mPos + mSegmentSize;
  if (breakAt > mDoc.length())
    breakAt = mDoc.length();
  // Search for the next newline character. Note: this example
  // segmenter implementation assumes that the document contains many
  // newlines. In the worst case, if this segmenter is run on a
  // document with no newlines, it will produce only one segment
  // containing the entire document text. A better implementation
  // might specify a maximum segment size as well as a minimum.
  while (breakAt < mDoc.length() && mDoc.charAt(breakAt - 1) != '\n')
    breakAt++;

  JCas jcas = getEmptyJCas();
  try {
    jcas.setDocumentText(mDoc.substring(mPos, breakAt));
    // if original CAS had SourceDocumentInformation,
    // also add SourceDocumentInformation to each segment
    if (mDocUri != null) {
      SourceDocumentInformation sdi = new SourceDocumentInformation(jcas);
      sdi.setUri(mDocUri);
      sdi.setOffsetInSource(mPos);
      sdi.setDocumentSize(breakAt - mPos);
      sdi.addToIndexes();
    }
    mPos = breakAt;
    return jcas;
  } catch (Exception e) {
    jcas.release();
    throw new AnalysisEngineProcessException(e);
  }
}
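To see how the segment boundaries fall, the break-at-newline search used by next can be run on its own, without the framework. The following is a minimal sketch with no UIMA dependency: SegmentBoundaries and its split method are hypothetical names, the loop condition plays the role of hasNext, and the check that segmentSize is at least 1 is an added assumption (the original reads the value from configuration without validating it).

```java
import java.util.ArrayList;
import java.util.List;

class SegmentBoundaries {
    // Splits doc into segments of at least segmentSize characters, extending each
    // segment until it ends just after a newline or reaches the end of the document.
    static List<String> split(String doc, int segmentSize) {
        if (segmentSize < 1) {
            throw new IllegalArgumentException("segmentSize must be at least 1");
        }
        List<String> segments = new ArrayList<>();
        int pos = 0;
        while (pos < doc.length()) {                 // same test as hasNext
            int breakAt = Math.min(pos + segmentSize, doc.length());
            // Advance until the previous character is a newline.
            while (breakAt < doc.length() && doc.charAt(breakAt - 1) != '\n') {
                breakAt++;
            }
            segments.add(doc.substring(pos, breakAt));
            pos = breakAt;
        }
        return segments;
    }

    public static void main(String[] args) {
        for (String s : split("ab\ncd\nef", 1)) {
            System.out.println("[" + s.replace("\n", "\\n") + "]");
        }
    }
}
```

With the input "ab\ncd\nef" and a segment size of 1, the three segments are "ab\n", "cd\n", and "ef". A document with no newlines comes back as a single segment, which is the worst case the comment in next warns about.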
The next method actually produces the next segment and returns it. The framework guarantees that it will not call next unless hasNext has returned true since the last call to process or next.
Note that in order to produce a segment, the CAS Multiplier must get an empty JCas to populate. This is done by the line:
JCas jcas = getEmptyJCas();
This requests an empty JCas from the framework, which maintains a pool of JCas instances to draw from.
Also, note the use of the try...catch
block to ensure that a JCas is released back to the pool if an exception
occurs. This is very important to allow
a CAS Multiplier to recover from errors.
There is not a separate type of descriptor for a CAS Multiplier. CAS Multipliers are considered a type of Analysis Engine, and so their descriptors use the same syntax as any other Analysis Engine Descriptor.
The Analysis Engine Description, in its "Operational Properties" section, now contains a new "outputsNewCASes" property which takes a Boolean value. If the Analysis Engine is a CAS Multiplier, this property should be set to true.
If you use the CDE, be sure to check the "Outputs new CASes" box in the Runtime Information section on the Overview page.
If you edit the Analysis Engine Descriptor by hand, you need to add an <outputsNewCASes> element to your descriptor as shown here:
<operationalProperties>
  <modifiesCas>false</modifiesCas>
  <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
  <outputsNewCASes>true</outputsNewCASes>
</operationalProperties>
You can include a CAS Multiplier as a component in an Aggregate Analysis Engine. For example, this allows you to construct an Aggregate Analysis Engine that takes each input CAS, breaks it up into segments, and runs a series of Annotators on each segment.
Since CAS Multipliers are considered a type of Analysis Engine, adding them to an aggregate works the same way as for other Analysis Engines. Using the CDE, you just click the "Add..." button in the Component Engines view and browse to the Analysis Engine Descriptor of your CAS Multiplier. If editing the aggregate descriptor directly, just import the Analysis Engine Descriptor of your CAS Multiplier as usual.
CAS Multipliers are only supported in the context of Fixed Flow or custom Flow Control. If you use the built-in "Fixed Flow" for your Aggregate Analysis Engine, you can position the CAS Multiplier anywhere in that flow. Processing then works as follows: When a CAS is input to the Aggregate AE, that CAS is routed to the components in the order specified by the Fixed Flow, until that CAS reaches a CAS Multiplier. Once the CAS reaches a CAS Multiplier, it will not complete the rest of the flow. Instead, each new output CAS from that CAS Multiplier will continue through the flow, starting at the node immediately after the CAS Multiplier in the Fixed Flow. No further processing will be done on the original input CAS after it has reached a CAS Multiplier.
It is possible to put more than one CAS Multiplier in your flow. In this case, when a new CAS output from the first CAS Multiplier reaches the second CAS Multiplier, no further processing will occur on that CAS, and any new output CASes produced by the second CAS Multiplier will continue the flow starting at the node after the second CAS Multiplier.
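The routing rule described above can be simulated in a few lines of plain Java. This is a toy model under stated assumptions, not the framework's flow controller: Step and FixedFlowDemo are hypothetical names, strings stand in for CASes, and each node is either an annotator (the same CAS continues) or a multiplier (the input stops and its children continue at the next node).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical model of one node in a fixed flow; strings stand in for CASes.
class Step {
    final Function<String, String> annotator;         // non-null for an annotator node
    final Function<String, List<String>> multiplier;  // non-null for a multiplier node

    private Step(Function<String, String> a, Function<String, List<String>> m) {
        annotator = a;
        multiplier = m;
    }

    static Step annotate(Function<String, String> f) {
        return new Step(f, null);
    }

    static Step multiply(Function<String, List<String>> f) {
        return new Step(null, f);
    }
}

class FixedFlowDemo {
    // Routes one CAS through the flow, starting at node index 'start'.
    static void route(List<Step> flow, int start, String cas, List<String> finished) {
        for (int i = start; i < flow.size(); i++) {
            Step step = flow.get(i);
            if (step.multiplier != null) {
                // Each child continues at the node after the multiplier.
                for (String child : step.multiplier.apply(cas)) {
                    route(flow, i + 1, child, finished);
                }
                return;  // the input CAS does not complete the rest of the flow
            }
            cas = step.annotator.apply(cas);
        }
        finished.add(cas);  // this CAS reached the end of the flow
    }

    // Example flow: trim, split on ';', then upper-case each piece.
    static List<String> demo(String input) {
        List<Step> flow = Arrays.asList(
            Step.annotate(String::trim),
            Step.multiply(s -> Arrays.asList(s.split(";"))),
            Step.annotate(String::toUpperCase));
        List<String> finished = new ArrayList<>();
        route(flow, 0, input, finished);
        return finished;
    }

    public static void main(String[] args) {
        System.out.println(demo(" a;b "));
    }
}
```

Because route recurses at every multiplier node, the same model covers a flow with a second multiplier: the children of the first become the inputs of the second and stop there in turn.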
If you would like to have a different kind of flow, you will need to implement a custom FlowController as described in the Flow Controller Developer's Guide. For example, you could implement a flow where a CAS that is input to a CAS Multiplier will continue to be processed by other components after the CAS Multiplier is finished with it.
An important consideration when you put a CAS Multiplier inside an Aggregate Analysis Engine is whether you want the Aggregate to also function as a CAS Multiplier, that is, whether you want the new output CASes produced within the Aggregate to be output from the Aggregate. This is controlled by the <outputsNewCASes> element in the Operational Properties of your Aggregate Analysis Engine descriptor. The syntax is the same as what was described in section Creating the CAS Multiplier Descriptor.
If you set this property to true, then any new output CASes produced by a CAS Multiplier inside this Aggregate will be output from the Aggregate. Thus the Aggregate will function as a CAS Multiplier and can be used in any of the ways in which a primitive CAS Multiplier can be used.
If you set the <outputsNewCASes> property to false, then any new output CASes produced by a CAS Multiplier inside the Aggregate will be dropped (i.e. the CASes will be released back to the pool) once they have finished being processed. Such an Aggregate Analysis Engine functions just like a "normal" non-CAS-Multiplier Analysis Engine; the fact that CAS Multiplication is occurring inside it is hidden from users of that Analysis Engine.
It is currently a limitation that CAS Multipliers cannot be deployed directly in a Collection Processing Engine. The only way that you can use a CAS Multiplier in a CPE is to first wrap it in an Aggregate Analysis Engine whose outputsNewCASes property is set to false, which in effect hides the existence of the CAS Multiplier from the CPE.
Note that you can build an Aggregate Analysis Engine that consists of CAS Multipliers and Annotators, followed by CAS Consumers. This can simulate what a CPE would do, but without the deployment and error handling options that the CPE provides.
The AnalysisEngine interface has the following methods that allow you to interact with CAS Multipliers:
CasIterator processAndOutputNewCASes(CAS)
JCasIterator processAndOutputNewCASes(JCas)
From your application, you call processAndOutputNewCASes
and pass it the input CAS. An iterator
is returned that allows you to step through each of the new output CASes that
are produced by the Analysis Engine.
It is very important to realize that CASes are pooled
objects and so your application must release each CAS (by calling the CAS.release()
method) that it obtains from the
CasIterator before it calls the CasIterator.next
method again. Otherwise, the CAS pool
will be exhausted and a deadlock will occur.
The example code in the class com.ibm.uima.examples.casMultiplier.CasMultiplierExampleApplication illustrates this. Here is the main processing loop:
CasIterator casIterator = ae.processAndOutputNewCASes(initialCas);
while (casIterator.hasNext())
{
  CAS outCas = casIterator.next();

  // dump the document text and annotations for this segment
  System.out.println("********* NEW SEGMENT *********");
  System.out.println(outCas.getDocumentText());
  PrintAnnotations.printAnnotations(outCas, System.out);

  // release the CAS (important)
  outCas.release();
}
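One way to make sure every output CAS is released, even when the work on a segment throws, is to put the release call in a finally block. The sketch below illustrates the pattern with hypothetical stand-ins (ToyCas, ToyCasIterator, ReleaseDemo) rather than the real CasIterator and CAS classes. The toy pool holds a single instance and fails fast where a real pool would block, so that a forgotten release is easy to observe.

```java
import java.util.concurrent.Semaphore;

// Hypothetical stand-ins; not the UIMA CAS or CasIterator classes.
class ToyCas {
    private final Semaphore pool;
    final String text;

    ToyCas(Semaphore pool, String text) {
        this.pool = pool;
        this.text = text;
    }

    // Returns this instance's slot to the pool.
    void release() {
        pool.release();
    }
}

class ToyCasIterator {
    private final Semaphore pool = new Semaphore(1);  // pool of one instance
    private final String[] segments;
    private int pos;

    ToyCasIterator(String... segments) {
        this.segments = segments;
    }

    boolean hasNext() {
        return pos < segments.length;
    }

    // A real pool would block here; the toy throws so the problem is visible.
    ToyCas next() {
        if (!pool.tryAcquire()) {
            throw new IllegalStateException("previous CAS was not released");
        }
        return new ToyCas(pool, segments[pos++]);
    }
}

class ReleaseDemo {
    // Consumes every segment and returns the total number of characters seen.
    static int consume(ToyCasIterator it) {
        int total = 0;
        while (it.hasNext()) {
            ToyCas cas = it.next();
            try {
                total += cas.text.length();  // stand-in for real work on the segment
            } finally {
                cas.release();               // always return the CAS to the pool
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(consume(new ToyCasIterator("ab", "cde")));
    }
}
```

Calling next twice on a ToyCasIterator without releasing in between raises the IllegalStateException, which is the toy equivalent of the pool exhaustion described above.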
Note that as defined by the CAS Multiplier contract in section CAS Multiplier Interface Overview, the CAS Multiplier owns the input CAS (initialCas in the example) until the last new output CAS has been produced. This means that the application should not try to make changes to initialCas until after the CasIterator.hasNext method has returned false, indicating that the segmenter has finished.
Note that the processing time of the Analysis Engine is spread out over the calls to the CasIterator's hasNext and next methods. That is, the next output CAS may not actually be produced and annotated until the application asks for it. So the application should not expect calls to the CasIterator to necessarily complete quickly.
Also, calls to the CasIterator
may throw Exceptions indicating an error has occurred during processing. If an
Exception is thrown, all processing of the input CAS will stop, and no more
output CASes will be produced. There is
currently no error recovery mechanism that will allow processing to continue
after an exception.