Apache UIMA C++ (Unstructured Information Management Architecture) v2.2.2 Release Notes

1. What is UIMA?
2. Major Changes in this Release
3. Migrating from IBM UIMA C++ to Apache UIMA C++
4. How to Get Involved
5. How to Report Issues
6. More Documentation on Apache UIMA C++

1. What is UIMA?

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. UIMA is a framework and SDK for developing such applications. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at. UIMA enables such an application to be decomposed into components, for example "language identification" -> "language specific segmentation" -> "sentence boundary detection" -> "entity detection (person/place names etc.)". Each component must implement interfaces defined by the framework and must provide self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.

Apache UIMA is an Apache-licensed open source implementation of the UIMA specification, which is being developed concurrently by a technical committee within OASIS, a standards organization. We invite and encourage you to participate in both the implementation and specification efforts.

UIMA is a component framework for analysing unstructured content such as text, audio and video. It comprises an SDK and tooling for composing and running analytic components written in Java and C++, with some support for Perl, Python and Tcl.

2. Major Changes in this Release

This section describes what has changed between version 1.4.4 and version 2.2.2 of UIMA C++. A migration guide is provided below that describes the required updates to your C++ code and descriptors. See Section 3, "Migrating from IBM UIMA C++ to Apache UIMA C++".

2.1. Complete Content for Build, Test and Package

This release includes a test suite for the uimacpp library. Also included are the tools to build both source and binary distribution packages.

2.2. Extended Platform Support

On 64-bit Unix platforms the Apache UIMA C++ framework can be built as a 64-bit library. This enables C++, Perl, Python and Tcl analytics to fully utilize a 64-bit address space. Both XML and binary CAS serialization formats are compatible between 32 and 64-bit builds.

MacOSX is now fully supported for SDK build and use.

2.3. Better Integration with Java SDK

The Apache UIMA SDK shell scripts and Eclipse run configurations set native environment paths assuming the UIMA C++ SDK is installed directly under $UIMA_HOME. This enables the standard UIMA SDK tools to work seamlessly with C++ based annotators.

On Unix platforms, the UIMA C++ examples directory can be loaded as an Eclipse CDT project, supporting development of both UIMA C++ and Java components in the same Eclipse IDE.

By default, when a uimacpp annotator is instantiated from Java, the annotator runs in the JVM process and communicates with it via JNI. Multiple uimacpp annotators instantiated in the same JVM must share the same native environment, and therefore they must all use the same version of the UIMA C++ framework. As before, a uimacpp annotator can be isolated by wrapping it as a Vinci service.

A new approach is provided in this release which allows process isolation of uimacpp annotators without wrapping each one in a JVM. When deployed from Java as a UIMA-AS service, a uimacpp annotator is spawned by the JVM as a native process. The native UIMA-AS service communicates with clients via JMS messaging, completely independently of the JVM. However, the native service connects back to the JVM to enable JMX monitoring and logfile integration with other UIMA annotators running in the same JVM.

2.4. C++ Namespace and Module Name Changes

The UIMA C++ namespace and shared library name have changed from "taf" to "uima". The environment variable TAFROOT has changed to UIMACPP_HOME. All of the source files have dropped the prefix "taf_". SDK header files have moved from $TAFROOT/include/ to $UIMACPP_HOME/include/uima/.

2.5. XML Descriptor Changes

The XML namespace in UIMA component descriptors has changed from http://uima.watson.ibm.com/resourceSpecifier to http://uima.apache.org/resourceSpecifier. The value of the <frameworkImplementation> element for C++ components must now be org.apache.uima.cpp. Although taeDescription is still supported, the use of analysisEngineDescription is recommended.
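The namespace and element-name changes above are plain text substitutions, so they can be scripted. The following is a minimal sketch using only the C++ standard library, not a tool shipped with the SDK; the helper names replaceEvery and migrateDescriptor are hypothetical. The <frameworkImplementation> value is left for hand editing, because the previous value is not given in these notes.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Replace every case-sensitive occurrence of `from` in `text` with `to`.
std::string replaceEvery(std::string text, const std::string& from,
                         const std::string& to) {
    assert(!from.empty());  // an empty search string would never advance
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();   // continue after the inserted text
    }
    return text;
}

// Hypothetical helper (not part of UIMA C++): updates descriptor text to the
// Apache XML namespace and to the recommended analysisEngineDescription type.
std::string migrateDescriptor(std::string xml) {
    xml = replaceEvery(std::move(xml),
                       "http://uima.watson.ibm.com/resourceSpecifier",
                       "http://uima.apache.org/resourceSpecifier");
    // Renames both the opening and the closing tag.
    xml = replaceEvery(std::move(xml),
                       "taeDescription", "analysisEngineDescription");
    return xml;
}
```

A caller would read the descriptor file into a std::string, pass it to migrateDescriptor, and write the result back. Because this is a textual rewrite rather than an XML-aware one, the result should be reviewed before use.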

2.6. TCAS replaced by CAS

In Apache UIMA the TCAS interface has been removed. All uses of it must now be replaced by the CAS interface. All methods that used to be defined on TCAS were moved to CAS. All annotators should now derive from class Annotator, although for backwards compatibility C++ annotators can still derive from the class TextAnnotator. For all C++ component types, the CAS delivered to the process method will be a base CAS if Sofa capabilities are declared in the component descriptor; otherwise it will be the selected CAS view.

The method

CAS.getTCAS(getSofa(getAnnotatorContext().mapToSofaID("SofaName")))

has been replaced with

CAS->getView("SofaName")

as the Sofa mapping code has been integrated into the CAS.

2.7. Support added for XMI Serialization

The proposed standard for XML interchange of CAS data, XMI serialization, is now supported by UIMA C++. The C++ application driver, runAECpp, has a new option to specify XMI format input files, and the output format is now XMI.

XMI serialization is also key to implementing the UIMA-AS service wrapper for uimacpp-based annotators.

2.8. Building the SDK on Unix is Simplified

The Unix build is simplified by redistributing GNU automake output files in the source tarball. When building from an SVN checkout, up-to-date versions of GNU automake, autoconf and libtool are still required.

3. Migrating from IBM UIMA C++ to Apache UIMA C++

Although not required, C++ component descriptors of type taeDescription should be changed to type analysisEngineDescription.

3.1. Migrating C++ Source Code

This section describes what source code changes are required to migrate from UIMA C++ version 1.4.4 to Apache UIMA C++ v2.2.2. Please note that the first two changes are order dependent.

Replace [case sensitive] all occurrences of getTCAS with getView
Replace [case sensitive] all occurrences of TCAS with CAS
Replace [case sensitive] all occurrences of TAF_ with UIMA_
Replace [case sensitive] all occurrences of taf_ with uima/
Replace "tafapi.hpp" with "uima/api.hpp"
Replace TextAnnotator with Annotator
Replace the generic C API wrapper, usually at the bottom of a C++ component, with the MAKE_AE() macro. See sample code in $UIMACPP_HOME/examples/src
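
The textual replacements listed above can be applied mechanically, provided the order is respected. The following is a minimal sketch using only the C++ standard library, not a tool shipped with the SDK; the helper names replaceAll and migrateSource are hypothetical. It does not perform the MAKE_AE() step, and calls of the form getTCAS(getSofa(...)) still need hand editing after the rename, since getView takes the Sofa name directly.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Replace every case-sensitive occurrence of `from` in `text` with `to`.
std::string replaceAll(std::string text, const std::string& from,
                       const std::string& to) {
    assert(!from.empty());  // an empty search string would never advance
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();   // continue after the inserted text
    }
    return text;
}

// Hypothetical helper (not part of UIMA C++): applies the replacements from
// the list above, in the listed order, to the text of one source file.
std::string migrateSource(std::string text) {
    static const std::vector<std::pair<std::string, std::string>> rules = {
        {"getTCAS", "getView"},        // must run before the TCAS rule
        {"TCAS", "CAS"},
        {"TAF_", "UIMA_"},
        {"taf_", "uima/"},
        {"tafapi.hpp", "uima/api.hpp"},
        {"TextAnnotator", "Annotator"},
    };
    for (const auto& rule : rules) {
        text = replaceAll(std::move(text), rule.first, rule.second);
    }
    return text;
}
```

The first two rules show why the order matters: if TCAS were replaced first, getTCAS would become getCAS and the getTCAS rule would no longer match anything. As with any blind textual rewrite, the output should be reviewed and compiled before it is committed.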

3.2. Migrating Scriptator Source Code

Tcl source code using variables of type TCAS should use CAS instead. No changes should be necessary for Perl or Python source.