Annotations, Artifacts, and Sofas

Up to this point, the documentation has focused on analyzing strings of Unicode text, producing subtypes of Annotations which reference offsets in those strings. This chapter generalizes this concept and shows how other kinds of artifacts can be handled, including non-text things like audio and images, and how you can define your own kinds of "annotations" for these.

Artifact

The Artifact is the unstructured thing being analyzed by an annotator. It could be an HTML web page, an image, a video stream, a recorded audio conversation, an MPEG-4 stream, etc. Artifacts are often restructured in the course of processing to facilitate particular kinds of analysis. For instance, an HTML page may be converted into a "de-tagged" version. Annotators at different places in the pipeline may be analyzing different versions of the artifact.

Subject of Analysis -- Sofa

Each representation of an Artifact is called a Subject of Analysis, abbreviated using the acronym "Sofa" which stands for Subject OF Analysis. Annotation metadata, which have explicit designations of sub-regions of the artifact to which they apply, are always associated with a particular Sofa. For instance, an annotation over text specifies two features, the begin and end, which represent the character offsets into the text string Sofa being analyzed.

Other examples of Artifact representations that could be Sofas include an HTML web page, a detagged web page, the translated text of that document, an audio or video stream, and closed-caption text from a video stream.

Often, there is one Sofa being analyzed in a CAS. The next chapter will show how UIMA facilitates working with multiple representations of an artifact at the same time, in the same CAS.

Sofa data can be Java Unicode Strings, Feature Structure arrays of primitive types, or a URI which references remote data available via a network connection.

The arrays of primitive types can be things like byte arrays or float arrays, and are intended to be used for artifacts like audio data, image data, etc.

The URI form holds a URI specification String.

Setting Sofa Data

When a CAS is created, you can set its Sofa Data exactly once; this restriction ensures that metadata describing regions of the Sofa remains valid. As a consequence, each of the following methods that set data for a given Sofa can be called only once for that Sofa.

The following methods on the CAS set the Sofa Data to one of the 3 formats. Assume that the variable "aCas" holds a reference to a CAS:

aCas.setSofaDataString(document_text_string, mime_type_string);
aCas.setSofaDataArray(feature_structure_primitive_array, mime_type_string);
aCas.setSofaDataURI(uri_string, mime_type_string);

In addition, the method aCas.setDocumentText(document_text_string) may still be used, and is equivalent to setSofaDataString(string, "text"). The mime type is currently not used by the UIMA framework, but may be set and retrieved by user code.

Feature Structure primitive arrays are all the UIMA Array types except arrays of Feature Structures, Strings, and Booleans. Typically, these are arrays of bytes, but can be other types, such as floats, longs, etc.

The URI string should conform to the standard URI format.
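The set-once rule stated at the start of this section can be pictured with a small stand-alone Java sketch. It does not use the UIMA API: the class name WriteOnceSofaData and the choice of IllegalStateException are assumptions made for illustration only, and the framework's own behavior on a second call is not shown here.

```java
import java.util.Objects;

/** Stand-alone illustration of a set-once data holder; not a UIMA class. */
final class WriteOnceSofaData {
    private String data;      // stays null until the data is set
    private String mimeType;

    /** Stores the data on the first call and rejects any later call. */
    void set(String newData, String newMimeType) {
        Objects.requireNonNull(newData, "newData");
        if (data != null) {
            // Assumption of this sketch: a second call is an error.
            throw new IllegalStateException("Sofa data has already been set");
        }
        data = newData;
        mimeType = newMimeType;
    }

    String get() {
        return data;
    }

    String mimeType() {
        return mimeType;
    }

    public static void main(String[] args) {
        WriteOnceSofaData sofa = new WriteOnceSofaData();
        sofa.set("this beer is good", "text");
        try {
            sofa.set("different text", "text");
        } catch (IllegalStateException e) {
            System.out.println("second set rejected: " + e.getMessage());
        }
        System.out.println(sofa.get());
    }
}
```

Because the data cannot change after the first call, any offsets recorded against it keep pointing at the same characters.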

Accessing Sofa Data

The analysis algorithms typically work with the Sofa data. The following methods on the CAS access the Sofa Data:

String aCas.getDocumentText();
String aCas.getSofaDataString();
FeatureStructure aCas.getSofaDataArray();
String aCas.getSofaDataURI();
String aCas.getSofaMimeType();

The getDocumentText and getSofaDataString methods return the same text string. The getSofaDataURI method returns the URI itself, not the data the URI points to. You can use standard Java I/O capabilities to get the data associated with the URI, or use the UIMA Framework Streaming method described next.
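As a rough illustration of the "standard Java I/O" route, the following stand-alone sketch reads the bytes behind a URI string. It uses only JDK classes; the class and method names are hypothetical, and the temporary file merely stands in for whatever resource a real Sofa URI would name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Stand-alone sketch: fetch the data a URI string points to. Not a UIMA class. */
final class SofaUriReadSketch {

    /** Opens the resource named by the URI string and returns all of its bytes. */
    static byte[] readAll(String uriString) {
        try (InputStream in = URI.create(uriString).toURL().openStream()) {
            return in.readAllBytes(); // requires Java 9 or later
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a temporary file, then reads it back through its file: URI. */
    static String roundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("sofa", ".txt");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                // In an annotator, this string would come from getSofaDataURI().
                String uriString = tmp.toUri().toString();
                return new String(readAll(uriString), StandardCharsets.UTF_8);
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("remote sofa data"));
    }
}
```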

Accessing Sofa Data using a Java Stream

The framework provides a consistent method for accessing the Sofa data, independent of it being stored locally, or accessed remotely using the URI. Get a Java InputStream instance from the Sofa data using:

InputStream inputStream = aCas.getSofaDataStream();

If the data is local, this method returns a ByteArrayInputStream, which provides the data as bytes:
- If the Sofa data was set using setDocumentText or setSofaDataString, the String is converted to bytes using the UTF-8 encoding.
- If the Sofa data was set as a DataArray, the array elements are serialized to bytes, high-byte first.

If the Sofa data was specified as a URI, this method returns the stream from url.openStream(). Java offers built-in support for several URI schemes, including "FILE:", "HTTP:", and "FTP:", and has an extensible mechanism, URLStreamHandlerFactory, for customizing access to an arbitrary URI. See more details at http://java.sun.com/j2se/1.4.2/docs/api/java/net/URLStreamHandlerFactory.html.
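The two local encodings described above (UTF-8 for text, high-byte-first for array data) can be reproduced with plain JDK classes. The sketch below is only an approximation of what a consumer of the stream should expect; it is not the framework's code, the names are hypothetical, and an int array is used as one example of a primitive array.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Stand-alone sketch of the two local byte encodings; not part of UIMA. */
final class SofaStreamSketch {

    /** Text Sofa data: the string is encoded as UTF-8 bytes. */
    static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Array Sofa data (here an int array): each element is written high byte first. */
    static byte[] highByteFirst(int[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Integer.BYTES)
                                      .order(ByteOrder.BIG_ENDIAN);
        for (int value : values) {
            buffer.putInt(value);
        }
        return buffer.array();
    }

    /** Wraps local bytes in the stream class the text names. */
    static InputStream asStream(byte[] bytes) {
        return new ByteArrayInputStream(bytes);
    }

    public static void main(String[] args) {
        // "gr\u00fcn" has four characters, but the u-umlaut needs two bytes in UTF-8.
        System.out.println(utf8Bytes("gr\u00fcn").length);   // 5
        byte[] b = highByteFirst(new int[] {0x01020304});
        System.out.println(b[0] + " " + b[3]);               // 1 4
    }
}
```

A practical consequence is that byte offsets in the stream are not character offsets: annotation begin and end values refer to the String, not to the UTF-8 bytes.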

Information about a Sofa is contained in a special built-in Feature Structure of type uima.cas.Sofa. This feature structure is created and managed by the UIMA Framework; users must not create it directly. Although these Sofa type instances are implemented as standard feature structures, generic CAS APIs cannot be used to create Sofas or set their features. Instead, Sofas are created implicitly by the creation of new CAS views. Similarly, Sofa features are set by CAS methods such as cas.setDocumentText().

Features of the Sofa type include:

SofaID: Every Sofa in a CAS has a unique SofaID. SofaIDs are the primary handle for access. This ID is often the same as the name string given to the Sofa by the developer, but it can be mapped to a different name (see Sofa Name Mapping).
Mime type: This string feature can be used to describe the type of the data represented by a Sofa. It is not used by the framework; the framework provides APIs to set and get its value.
Sofa Data: The Sofa data itself. This data can be resident in the CAS or it can be a reference to data outside the CAS.

Annotators add metadata about a Sofa to the CAS. It is often useful to have this metadata denote a region of the Sofa to which it applies. For instance, assuming the Sofa is a String, the metadata might describe a particular substring as the name of a person. The built-in UIMA type, uima.tcas.Annotation, has two extra features that enable this, the begin and end features, which denote character offsets into the text string being analyzed.

The concept of "annotations" can be generalized for non-string kinds of Sofas. For instance, an audio stream might have an audio annotation which describes sound regions in terms of floating point time offsets in the Sofa. An image annotation might use two pairs of x,y coordinates to define the region the annotation applies to.

Built-in Annotation types

The built-in CAS type, uima.tcas.Annotation, is just one kind of definition of an Annotation. It was designed for annotating text strings, and has begin and end features which describe which substring of the Sofa is being annotated.

For applications which have other kinds of Sofas, the UIMA developer will design their own kinds of Annotation types, as needed to describe an annotation, by declaring new types which are subtypes of uima.cas.AnnotationBase. For instance, for images, you might have the concept of a rectangular region to which the annotation applies. In this case, you might describe the region with 2 pairs of x, y coordinates.
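To make the rectangular-region idea concrete, here is a minimal plain-Java sketch of the data such an annotation would carry. It is not a UIMA type: in a real application the four coordinates would be features of a user-declared subtype of uima.cas.AnnotationBase, and the class name and the contains helper are hypothetical.

```java
/** Hypothetical rectangular image region; a plain Java class, not a UIMA type. */
final class ImageRegionSketch {
    final int left;
    final int top;
    final int right;
    final int bottom;

    /** Builds the region from two corner points given in any order. */
    ImageRegionSketch(int x1, int y1, int x2, int y2) {
        left = Math.min(x1, x2);
        right = Math.max(x1, x2);
        top = Math.min(y1, y2);
        bottom = Math.max(y1, y2);
    }

    /** True if the pixel (x, y) lies inside the region, edges included. */
    boolean contains(int x, int y) {
        return x >= left && x <= right && y >= top && y <= bottom;
    }

    public static void main(String[] args) {
        ImageRegionSketch face = new ImageRegionSketch(40, 10, 20, 30);
        System.out.println(face.contains(25, 15)); // true
        System.out.println(face.contains(5, 5));   // false
    }
}
```

The two coordinate pairs play the same role for an image Sofa that begin and end play for a text Sofa.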

Annotations have an associated Sofa

Annotations are always associated with a particular Sofa. In subsequent chapters, you will learn how there can be multiple Sofas associated with an artifact; which Sofa an annotation refers to is described by the Annotation feature structure itself.

All annotation types extend from the built-in type uima.cas.AnnotationBase. This type has one feature, a reference to the Sofa associated with the annotation. This value is currently used by the Framework to support the getCoveredText() method on the annotation instance, which returns the portion of a text Sofa that the annotation spans. It is also used to ensure that the Annotation is indexed only in the CAS View associated with this Sofa.

A built-in type, uima.cas.AnnotationBase, is provided by UIMA to allow users to extend the Annotation capabilities to different kinds of Annotations. The AnnotationBase type has one feature, the SofaRef, which holds a pointer to the SofaFS feature structure, another built-in type that is used to represent a Sofa in the framework. The SofaRef feature is automatically set when creating an annotation (any type derived from the built-in uima.cas.AnnotationBase type); it should not be set by the user.

There is one method, getView(), provided by AnnotationBase that returns the CAS View for the Sofa the annotation is pointing at. Note that this method always returns a CAS, even when applied to JCas annotation instances.

The built-in type uima.tcas.Annotation extends uima.cas.AnnotationBase and adds two features, a begin and an end feature, which are suitable for identifying a span in a text string that the annotation applies to. Users may define other extensions to AnnotationBase with alternative specifications that can denote a particular region within the subject of analysis, as appropriate to their application.

UIMA provides an extension to the basic model of the CAS which supports analysis of multiple views of the same artifact, all contained within the CAS. This chapter describes the concepts, terminology, and the API and XML extensions that enable this.

Multiple CAS Views can simplify things when different versions of the artifact are needed at different stages of the analysis. They are also key to enabling multimodal analysis where the initial artifact is transformed from one modality to another, or where the artifact itself is multimodal, such as the audio, video and closed-captioned text associated with an MPEG object. Each representation of the artifact can be analyzed independently with the standard UIMA programming model; in addition, multi-view components and applications can be constructed.

UIMA supports this by augmenting the CAS with additional light-weight CAS objects, one for each view, where these objects share most of the same underlying CAS, except for two things: each view has its own set of indexed Feature Structures, and each view has its own subject of analysis (Sofa) - its own version of the artifact being analyzed. The Feature Structure instances themselves are in the shared part of the CAS; only the entries in the indexes are unique for each CAS view.

All of these CAS view objects are kept together with the CAS, and passed as a unit between components in a UIMA application. APIs exist which allow components and applications to switch among the various view objects, as needed.

Feature Structures may be indexed in multiple views, if necessary. New methods on CAS Views facilitate adding Feature Structures to, or removing them from, their index repositories:

aView.addFsToIndexes(aFeatureStructure)
aView.removeFsFromIndexes(aFeatureStructure)

The view on which the method is called determines the view in which the Feature Structure is added to or removed from the indexes.
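The split between shared Feature Structure instances and per-view indexes can be modeled in a few lines of plain Java. This is a conceptual sketch only: Fs and View are hypothetical stand-ins, not the UIMA classes, and only the method names addFsToIndexes and removeFsFromIndexes are borrowed from the text.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Conceptual model: shared objects, one index per view. Not the UIMA implementation. */
final class ViewIndexSketch {

    /** Stand-in for a feature structure living in the shared part of the CAS. */
    static final class Fs {
        final String label;

        Fs(String label) {
            this.label = label;
        }
    }

    /** Stand-in for a CAS view: only the index entries belong to the view. */
    static final class View {
        final String name;
        private final Set<Fs> index =
                Collections.newSetFromMap(new IdentityHashMap<Fs, Boolean>());

        View(String name) {
            this.name = name;
        }

        void addFsToIndexes(Fs fs) {
            index.add(fs);
        }

        void removeFsFromIndexes(Fs fs) {
            index.remove(fs);
        }

        boolean isIndexed(Fs fs) {
            return index.contains(fs);
        }
    }

    public static void main(String[] args) {
        Fs shared = new Fs("beer");
        View english = new View("EnglishDocument");
        View german = new View("GermanDocument");
        english.addFsToIndexes(shared);
        german.addFsToIndexes(shared);      // same instance, second index
        english.removeFsFromIndexes(shared);
        System.out.println(english.isIndexed(shared)); // false
        System.out.println(german.isIndexed(shared));  // true
    }
}
```

Removing the entry from one view's index leaves the object itself, and its entry in the other view, untouched.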

Sofas (see Subject of Analysis -- Sofa) and CAS Views are linked. In this implementation, every CAS view has one associated Sofa, and every Sofa has one associated CAS View.

Naming CAS Views and Sofas

The developer assigns a name to the View / Sofa, which is a simple string (following the rules for Java identifiers, usually without periods, but see special exception below). These names are declared in the component XML metadata, and are used during assembly and by the runtime to enable switching among multiple Views of the CAS at the same time.

The name is called the Sofa name, for historical reasons, but it applies equally to the View. In the rest of this chapter, we'll refer to it as the Sofa name.

Some applications contain components that expect a variable number of Sofas as input or output. An example of a component that takes a variable number of input Sofas could be one that takes several translations of a document and merges them, where each translation was in a separate Sofa. You can specify a variable number of input or output sofa names, where each name has the same base part, by writing the base part of the name (with no periods), followed by a period character and an asterisk character (.*). These denote sofas that have names matching the base part up to the period; for example, names such as base_name_part.TTX_3d would match a specification of base_name_part.*.
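The matching rule for a base_name_part.* specification can be sketched as a small stand-alone function. This is one reading of the rule rather than the framework's code: the class and method names are hypothetical, and treating a bare base name (one with no period at all) as a match is an assumption of the sketch.

```java
/** Stand-alone sketch of wildcard Sofa name matching; not the UIMA implementation. */
final class SofaNameWildcardSketch {

    /**
     * Returns true if sofaName satisfies spec. A spec ending in ".*" matches any
     * name whose part before the first period equals the spec's base part;
     * any other spec must match the name exactly.
     */
    static boolean matches(String spec, String sofaName) {
        if (!spec.endsWith(".*")) {
            return spec.equals(sofaName);
        }
        String base = spec.substring(0, spec.length() - 2);
        int dot = sofaName.indexOf('.');
        // Assumption: a name without a period is compared as a whole.
        String nameBase = dot < 0 ? sofaName : sofaName.substring(0, dot);
        return base.equals(nameBase);
    }

    public static void main(String[] args) {
        System.out.println(matches("base_name_part.*", "base_name_part.TTX_3d")); // true
        System.out.println(matches("base_name_part.*", "other_part.TTX_3d"));     // false
    }
}
```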

Multi-View and Single-View components and applications

Components and applications can be written to be Multi-View or Single-View. Most components used as primitive building blocks are expected to be Single-View. UIMA provides capabilities to combine these kinds of components with Multi-View components when assembling analysis aggregates or applications.

Single-View components and applications use only one subject of analysis, and one CAS View. The code and descriptors for these components do not use the facilities described in this chapter.

Conversely, Multi-View components and applications are aware of the possibility of multiple Views and Sofas, and have code and XML descriptors that create and manipulate them.

How UIMA decides if a component is Multi-View

Every UIMA component has an associated XML Component Descriptor. Multi-View components are identified simply as those whose descriptors declare one or more Sofa names in their Capability sections, as inputs or outputs. If a Component Descriptor does not mention any input or output Sofa names, the framework treats that component as a Single-View component.

A Multi-View component is passed a special kind of a CAS object, called a base CAS, which it must use to switch to the particular view it wishes to process. The base CAS object itself has no Sofa and no ability to use Indexes; only the views have that capability.

Multi-View: additional capabilities

Additional capabilities provided for components and applications aware of the possibilities of multiple Views and Sofas include:

Creating new Views, and for each, setting up the associated Sofa data
Getting a reference to an existing View and its associated Sofa, by name
Specifying a view in which to index a particular Feature Structure instance

Component XML metadata

Each Multi-View component that creates a Sofa or wants to switch to a specific previously created Sofa must declare the name for the Sofa in the capabilities section. For example, a component expecting as input a web document in HTML format and creating a plain text document for further processing might declare:

<capabilities>
  <capability>
    <inputSofas>
      <sofaName>rawContent</sofaName>
    </inputSofas>
    <outputSofas>
      <sofaName>detagContent</sofaName>
    </outputSofas>
  </capability>
</capabilities>

Details on this specification are found in the reference chapter section on Capabilities. The Component Descriptor Editor supports Sofa declarations on the Capabilities Page.

In addition to components, applications can make use of these capabilities. When an application creates a new CAS, it also creates the initial view of that CAS - and this view is the object that is returned from the create call. Additional views beyond this first one can be dynamically created at any time. The application can use the Sofa APIs described in Chapter 8 to specify the data to be analyzed.

If an Application creates a new CAS, the initial CAS that is created will be a view named "_InitialView". This name can be used in the application and in Sofa Mapping (see Sofa Name Mapping) to refer to this otherwise unnamed view.

Sofa Name mapping is the mechanism which enables UIMA component developers to choose locally meaningful Sofa names in their source code and lets aggregate developers, collection processing engine developers, and application developers connect output Sofas created in one component to input Sofas required in another.

At a given aggregation level, the assembler or application developer defines names for all the Sofas, and then specifies how these names map to the contained components, using the Sofa Map.

Consider annotator code to create a new CAS view:

CAS viewX = cas.createView("X");

Or code to get an existing CAS view:

CAS viewX = cas.getView("X");

Without Sofa name mapping the SofaID for the new Sofa will be “X”. However, if a name mapping for “X” has been specified by the aggregate or CPE calling this annotator, the actual SofaID in the CAS can be different.

All Sofas in a CAS must have unique names. This is accomplished by mapping all declared Sofas as described in the following sections. An attempt to create a Sofa with a SofaID already in use will throw an exception.

Name mapping must not use the "." (period) character. Runtime Sofa mapping maps names up to the "." and appends the period and the following characters to the mapped name.
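The period rule can be illustrated with a stand-alone function that maps only the part of a name before the first period and carries the rest over unchanged. This is a sketch of the rule as worded above, not the framework's implementation; the class name, method name, and map contents are hypothetical.

```java
import java.util.Map;

/** Stand-alone sketch of Sofa name mapping with the period rule; not UIMA code. */
final class SofaNameMapperSketch {

    /**
     * Maps the part of localName before the first period through the mapping
     * table and appends the period and everything after it unchanged. Names
     * without a table entry are returned as they are.
     */
    static String mapName(Map<String, String> mapping, String localName) {
        int dot = localName.indexOf('.');
        String base = dot < 0 ? localName : localName.substring(0, dot);
        String suffix = dot < 0 ? "" : localName.substring(dot);
        return mapping.getOrDefault(base, base) + suffix;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = Map.of("X", "Y"); // Map.of requires Java 9 or later
        System.out.println(mapName(mapping, "X"));      // Y
        System.out.println(mapName(mapping, "X.1"));    // Y.1
        System.out.println(mapName(mapping, "Z.q"));    // Z.q
    }
}
```

This is why a mapped name itself must not contain a period: the period is reserved for separating the mapped base from the suffix that is passed through.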

Name Mapping in an Aggregate Descriptor

For each component of an Aggregate, name mapping specifies the conversion between component Sofa names and names at the aggregate level.

Here's an example. Consider two Multi-View annotators to be assembled into an aggregate which takes an audio segment consisting of spoken English and produces a German text translation.

The first annotator takes an audio segment as input Sofa and produces a text transcript as output Sofa. The annotator designer might choose these Sofa names to be "AudioInput" and "TranscribedText".

The second annotator is designed to translate text from English to German. This developer might choose the input and output Sofa names to be "EnglishDocument" and "GermanDocument", respectively.

In order to hook these two annotators together, the following section would be added to the top level of the aggregate descriptor:

<sofaMappings>
  <sofaMapping>
    <componentKey>SpeechToText</componentKey>
    <componentSofaName>AudioInput</componentSofaName>
    <aggregateSofaName>SegementedAudio</aggregateSofaName>
  </sofaMapping>
  <sofaMapping>
    <componentKey>SpeechToText</componentKey>
    <componentSofaName>TranscribedText</componentSofaName>
    <aggregateSofaName>EnglishTranscript</aggregateSofaName>
  </sofaMapping>
  <sofaMapping>
    <componentKey>EnglishToGermanTranslator</componentKey>
    <componentSofaName>EnglishDocument</componentSofaName>
    <aggregateSofaName>EnglishTranscript</aggregateSofaName>
  </sofaMapping>
  <sofaMapping>
    <componentKey>EnglishToGermanTranslator</componentKey>
    <componentSofaName>GermanDocument</componentSofaName>
    <aggregateSofaName>GermanTranslation</aggregateSofaName>
  </sofaMapping>
</sofaMappings>

The Component Descriptor Editor supports Sofa name mapping in aggregates and simplifies the task. See Sofa name mappings for details.

Name Mapping in a CPE Descriptor

The CPE descriptor aggregates together a Collection Reader and CAS Processors (Annotators and CAS Consumers). Sofa mappings can be added to the following elements of CPE descriptors: <collectionIterator>, <casInitializer>, and <casProcessor>. To be consistent with the organization of CPE descriptors, the maps for the CPE descriptor are distributed among the XML markup for each of the parts (collectionIterator, casInitializer, casProcessor). Because of this, the <componentKey> element is not needed. Finally, rather than sub-elements for the parts, the XML markup for these uses attributes. See <sofaNameMappings> Element.

Here's an example. Let's use the aggregate from the previous section in a collection processing engine. Here we will add a Collection Reader that outputs audio segments in an output Sofa named "nextSegment". Remember to declare an output Sofa nextSegment in the collection reader description. We'll add a CAS Consumer in the next section.

At this point the CAS Processor section for the aggregate does not need any Sofa mapping because the aggregate input Sofa has the same name, "SegementedAudio", as is being produced by the Collection Reader.

Specifying the CAS View for a Single-View Component

Single-View components receive a Sofa named "_InitialView", or a Sofa that is mapped to this name.

For example, assume that the CAS Consumer to be used in our CPE is a Single-View component that expects the analysis results associated with the input CAS, and that we want it to use the results from the translated German text Sofa. The following mapping added to the CAS Processor section for the CPE will instruct the CPE to get the CAS view for the German text Sofa and pass it to the CAS Consumer:

An alternative syntax for this kind of mapping is to simply leave out the component sofa name in this case.

Name Mapping in a UIMA Application

Applications which instantiate UIMA components directly using the UIMAFramework methods can also create a top level Sofa mapping using the "additional parameters" capability.

//create a "root" UIMA context for your whole application
UimaContextAdmin rootContext = UIMAFramework.newUimaContext(
    UIMAFramework.getLogger(),
    UIMAFramework.newDefaultResourceManager(),
    UIMAFramework.newConfigurationManager());

//parse the descriptor of the AE to be created
XMLInputSource input = new XMLInputSource("test.xml");
AnalysisEngineDescription desc = UIMAFramework.getXMLParser().parseAnalysisEngineDescription(input);

//setup sofa name mappings using the api
HashMap sofamappings = new HashMap();
sofamappings.put("localName1", "globalName1");
sofamappings.put("localName2", "globalName2");

//create a UIMA Context for the new AE we are about to create
//first argument is unique key among all AEs used in the application
UimaContextAdmin childContext = rootContext.createChild("myAE", sofamappings);

//instantiate AE, passing the UIMA Context through the additional
//parameters map
Map additionalParams = new HashMap();
additionalParams.put(Resource.PARAM_UIMA_CONTEXT, childContext);

AnalysisEngine anAnnotator = UIMAFramework.produceAnalysisEngine(desc, additionalParams);

Sofa mappings are applied from the inside out, i.e., local to global. First, any aggregate mappings are applied, then any CPE mappings, and finally, any specified using this "additional parameters" capability.
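The inside-out order can be pictured as a chain of lookup tables, each applied to the result of the previous one. The sketch below is a stand-alone illustration of that reading, not framework code; the class name is hypothetical, and the CPE-level and application-level names are made up for the example.

```java
import java.util.List;
import java.util.Map;

/** Stand-alone sketch of layered (local to global) name mapping; not UIMA code. */
final class LayeredSofaMappingSketch {

    /** Applies each mapping layer in turn, innermost first; unmapped names pass through. */
    static String resolve(String localName, List<Map<String, String>> innerToOuter) {
        String name = localName;
        for (Map<String, String> layer : innerToOuter) {
            name = layer.getOrDefault(name, name);
        }
        return name;
    }

    public static void main(String[] args) {
        // Aggregate-level mapping taken from the aggregate example in this chapter.
        Map<String, String> aggregate = Map.of("EnglishDocument", "EnglishTranscript");
        // Hypothetical CPE-level and application-level mappings.
        Map<String, String> cpe = Map.of("EnglishTranscript", "cpeTranscript");
        Map<String, String> application = Map.of("cpeTranscript", "globalTranscript");

        String resolved = resolve("EnglishDocument", List.of(aggregate, cpe, application));
        System.out.println(resolved); // globalTranscript
    }
}
```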

Name Mapping for Remote Services

Currently, no client-side Sofa mapping information is passed from a UIMA client to a remote service. This can cause complications for UIMA services in a Multi-View application.

Remote services using the SOAP transport will work only if the service is Single-View, or if the Sofa names expected by the service match the Sofa names produced by the client.

Sofa name mapping is supported when running in "integrated" CPM mode (or without the CPM) using Vinci, by renaming the Sofas specified by the service descriptor, if necessary.

The JCas interface to the CAS can be used with any / all views, as well as the base CAS sent to Multi-View components. You can always get a JCas object from an existing CAS object by using the method getJCas(); this call will create the JCas if it doesn't already exist. If it does exist, it just returns the existing JCas that corresponds to the CAS.

JCas implements the getView(...) method, enabling switching to other named views, just like the corresponding method on the CAS. The JCas version, however, returns JCas objects, instead of CAS objects, corresponding to the view.

The UIMA SDK contains a simple Sofa example application which demonstrates many Sofa specific concepts and methods. The source code for the application driver is in docs/examples/src/com/ibm/uima/examples/SofaExampleApplication.java and the Multi-View annotator is given in SofaExampleAnnotator.java in the same directory.

This sample application demonstrates a language translator annotator which expects an input text Sofa with an English document and creates an output text Sofa containing a German translation. Some of the key Sofa concepts illustrated here include:

Sofa creation.
Access of multiple CAS views.
Unique feature structure index space for each view.
Feature structures containing cross references between annotations in different CAS views.
The strong affinity of annotations with a specific Sofa.

Annotator Descriptor

The annotator descriptor in docs/examples/descriptors/analysis_engine/SofaExampleAnnotator.xml declares an input Sofa named "EnglishDocument" and an output Sofa named "GermanDocument". A custom type "CrossAnnotation" is also defined:

<typeDescription>
  <name>sofa.test.CrossAnnotation</name>
  <description/>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>otherAnnotation</name>
      <description/>
      <rangeTypeName>uima.tcas.Annotation</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>

The CrossAnnotation type is derived from uima.tcas.Annotation and includes one new feature: a reference to another annotation.

Application Setup

The application driver instantiates an analysis engine, seAnnotator, from the annotator descriptor, obtains a new base CAS using that engine’s CAS definition, and creates the expected input Sofa using:

CAS cas = seAnnotator.newCAS();
CAS aView = cas.createView("EnglishDocument");

Since seAnnotator is a primitive component, and no Sofa mapping has been defined, the SofaID will be "EnglishDocument". Local Sofa data is set using:

aView.setDocumentText("this beer is good");

At this point the CAS contains all necessary inputs for the translation annotator and its process method is called.

Annotator Processing

Annotator processing consists of parsing the English document into individual words, doing word-by-word translation and concatenating the translations into a German translation. Analysis metadata on the English Sofa will be an annotation for each English word. Analysis metadata on the German Sofa will be a CrossAnnotation for each German word, where the otherAnnotation feature will be a reference to the associated English annotation.

Code of interest includes two CAS views:

// get View of the English text Sofa
engView = aCas.getView("EnglishDocument");
// Create the output German text Sofa
germView = aCas.createView("GermanDocument");

the indexing of annotations with the appropriate view:

engView.addFsToIndexes(engAnnot);
. . .
germView.addFsToIndexes(germAnnot);

and the combining of metadata belonging to different Sofas in the same feature structure:

// add link to English text
germAnnot.setFeatureValue(other, engAnnot);

Back in the Application, accessing the results of analysis

Analysis results for each Sofa are dumped independently by iterating over all annotations for each associated CAS view. For the English Sofa:

//get annotation iterator for this CAS
FSIndex anIndex = aView.getAnnotationIndex();
FSIterator anIter = anIndex.iterator();
while (anIter.isValid()) {
  AnnotationFS annot = (AnnotationFS) anIter.get();
  System.out.println(" " + annot.getType().getName() + ": " + annot.getCoveredText());
  anIter.moveToNext();
}

Iterating over all German annotations looks the same, except for the following:

if (annot.getType() == cross) {
  AnnotationFS crossAnnot = (AnnotationFS) annot.getFeatureValue(other);
  System.out.println(" other annotation feature: " + crossAnnot.getCoveredText());
}

Of particular interest here is the built-in Annotation type method getCoveredText(). This method uses the "begin" and "end" features of the annotation to create a substring from the CAS document. The SofaRef feature of the annotation is used to identify the correct Sofa's data from which to create the substring.
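The way begin, end, and the Sofa reference combine can be shown with a stand-alone sketch. Sofa and Span here are hypothetical plain-Java stand-ins for the framework types, and coveredText only mimics the behavior described above; the offsets are computed for the example sentences used in this chapter.

```java
/** Stand-alone sketch of covered-text lookup through a Sofa reference; not UIMA code. */
final class CoveredTextSketch {

    /** Stand-in for a text Sofa: an ID plus the document string. */
    static final class Sofa {
        final String id;
        final String text;

        Sofa(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Stand-in for a text annotation: a Sofa reference plus begin and end offsets. */
    static final class Span {
        final Sofa sofa;
        final int begin;
        final int end;

        Span(Sofa sofa, int begin, int end) {
            this.sofa = sofa;
            this.begin = begin;
            this.end = end;
        }

        /** The substring is taken from the Sofa this span refers to, not from any other. */
        String coveredText() {
            return sofa.text.substring(begin, end);
        }
    }

    public static void main(String[] args) {
        Sofa english = new Sofa("EnglishDocument", "this beer is good");
        Sofa german = new Sofa("GermanDocument", "das bier ist gut");
        Span engWord = new Span(english, 5, 9);
        Span gerWord = new Span(german, 4, 8);
        System.out.println(engWord.coveredText()); // beer
        System.out.println(gerWord.coveredText()); // bier
    }
}
```

Because each span carries its own Sofa reference, a German annotation that links to an English one can still print the English word correctly.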

The example program output is:

---Printing all annotations for English Sofa---
uima.tcas.DocumentAnnotation: this beer is good
uima.tcas.Annotation: this
uima.tcas.Annotation: beer
uima.tcas.Annotation: is
uima.tcas.Annotation: good

---Printing all annotations for German Sofa---
uima.tcas.DocumentAnnotation: das bier ist gut
sofa.test.CrossAnnotation: das
 other annotation feature: this
sofa.test.CrossAnnotation: bier
 other annotation feature: beer
sofa.test.CrossAnnotation: ist
 other annotation feature: is
sofa.test.CrossAnnotation: gut
 other annotation feature: good

The recommended way to deliver a particular CAS view to a Single-View component is to use Sofa mapping in the CPE and/or aggregate descriptors.

For Multi-View components or applications, the following methods are used to create or get a reference to a CAS view for a particular Sofa:

Creating a new View:

JCas newView = aJCas.createView(String localNameOfTheViewBeforeMapping);
CAS newView = aCAS.createView(String localNameOfTheViewBeforeMapping);

Getting a View from a CAS or JCas:

JCas myView = aJCas.getView(String localNameOfTheViewBeforeMapping);
CAS myView = aCAS.getView(String localNameOfTheViewBeforeMapping);

The following methods are useful for all annotators and applications:

Setting Sofa data for a CAS or JCas:

aCasOrJCas.setDocumentText(String docText);

aCasOrJCas.setSofaDataString(String docText, String mimeType);

aCasOrJCas.setSofaDataArray(FeatureStructure array, String mimeType);

aCasOrJCas.setSofaDataURI(String uri, String mimeType);

Getting Sofa data for a particular CAS or JCas:

String doc = aCasOrJCas.getDocumentText();
String doc = aCasOrJCas.getSofaDataString();

FeatureStructure array = aCasOrJCas.getSofaDataArray();

String uri = aCasOrJCas.getSofaDataURI();

InputStream is = aCasOrJCas.getSofaDataStream();

The major change for v2.0 is related to the support of Single-View components and applications. Given an analysis engine, ae, the API

CAS cas = ae.newCAS();

used to return the base CAS. Now it returns a view of the Sofa named “_InitialView”. This Sofa will actually only be created if any Sofa data is set for this view. The initial view is used for Single-View applications and Multi-View annotators with no Sofa mapping.

The process method of Multi-View annotators receives the base CAS; however, the base CAS no longer has an index repository to hold "global" data. Global data needs to be put in a specific named CAS view of your choice.

Because of these changes, the following scenarios will break with v2.0 clients:

Any version 1.x services (you must migrate the services to version 2).
Applications or components explicitly referencing "_DefaultTextSofaName" in code or descriptors.
Multi-View applications using the Base CAS index repository.