1. What is UIMA?
2. Major Changes in this Release
3. Migrating from IBM UIMA C++ to Apache UIMA C++
4. How to Get Involved
5. How to Report Issues
6. More Documentation on Apache UIMA C++
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. UIMA is a framework and SDK for developing such applications. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at. UIMA enables such an application to be decomposed into components, for example "language identification" -> "language specific segmentation" -> "sentence boundary detection" -> "entity detection (person/place names etc.)". Each component must implement interfaces defined by the framework and must provide self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.
Apache UIMA is an Apache-licensed open source implementation of the UIMA specification. That specification is, in turn, being developed concurrently by a technical committee within OASIS, a standards organization. We invite and encourage you to participate in both the implementation and specification efforts.
UIMA is a component framework for analysing unstructured content such as text, audio and video. It comprises an SDK and tooling for composing and running analytic components written in Java and C++, with some support for Perl, Python and TCL.
This section describes what has changed between version 1.4.4 and version 2.2.2 of UIMA C++. A migration guide is provided below that describes the required updates to your C++ code and descriptors. See Section 3, "Migrating from IBM UIMA C++ to Apache UIMA C++".
This release includes a test suite for the uimacpp library. Also included are the tools to build both source and binary distribution packages.
On 64-bit Unix platforms the Apache UIMA C++ framework can be built as a 64-bit library. This enables C++, Perl, Python and Tcl analytics to fully utilize a 64-bit address space. Both XML and binary CAS serialization formats are compatible between 32 and 64-bit builds.
MacOSX is now fully supported for SDK build and use.
The Apache UIMA SDK shell scripts and Eclipse run configurations set native environment paths assuming the UIMA C++ SDK is installed directly under $UIMA_HOME. This enables the standard UIMA SDK tools to work seamlessly with C++ based annotators.
On Unix platforms, the UIMA C++ examples directory can be loaded as an Eclipse CDT project, supporting development of both UIMA C++ and Java components in the same Eclipse IDE.
By default, when a uimacpp annotator is instantiated from Java, the annotator runs in the JVM process and communicates via JNI. Multiple uimacpp annotators instantiated in the same JVM must share the same native environment, and therefore must share the same version of the UIMA C++ framework. As before, a uimacpp annotator can be isolated by wrapping it as a Vinci service.
This release provides a new approach that allows process isolation of uimacpp annotators without wrapping each one in a JVM. When deployed from Java as a UIMA-AS service, a uimacpp annotator is spawned by the JVM as a native process. The native UIMA-AS service communicates with clients via JMS messaging, completely independently of the JVM. However, the native service connects back to the JVM to enable JMX monitoring and logfile integration with other UIMA annotators running in the same JVM.
The UIMA C++ namespace and shared library have changed from "taf" to "uima". Environment variable TAFROOT has changed to UIMACPP_HOME. All of the source files have dropped the prefix "taf_". SDK header files have moved from $TAFROOT/include/ to $UIMACPP_HOME/include/uima/.
The XML namespace in UIMA component descriptors has changed from http://uima.watson.ibm.com/resourceSpecifier to http://uima.apache.org/resourceSpecifier. The value of the <frameworkImplementation> for C++ components must now be org.apache.uima.cpp. Although taeDescription is still supported, the use of analysisEngineDescription is recommended.
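The namespace and element-name changes above are plain textual substitutions, so they can be scripted. The following is a minimal C++ sketch and not part of the UIMA C++ SDK: `replaceAll` and `migrateDescriptor` are hypothetical helper names, and the descriptor is assumed to be held in memory as a string. The <frameworkImplementation> value is left untouched because its previous value is not stated here.

```cpp
#include <cassert>
#include <string>

// Replace every occurrence of `from` in `text` with `to`.
std::string replaceAll(std::string text, const std::string& from, const std::string& to) {
    assert(!from.empty());  // an empty pattern would never advance
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}

// Hypothetical helper: apply the descriptor changes described above
// to descriptor XML held in a string.
std::string migrateDescriptor(std::string xml) {
    // Old IBM namespace -> Apache namespace.
    xml = replaceAll(xml,
                     "http://uima.watson.ibm.com/resourceSpecifier",
                     "http://uima.apache.org/resourceSpecifier");
    // Recommended (not required): taeDescription -> analysisEngineDescription.
    xml = replaceAll(xml, "<taeDescription", "<analysisEngineDescription");
    xml = replaceAll(xml, "</taeDescription>", "</analysisEngineDescription>");
    return xml;
}
```

Because this is a purely textual rewrite, the result should still be checked by loading the descriptor with the UIMA tools.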
In Apache UIMA the TCAS interface has been removed. All uses of it must now be
replaced by the CAS interface. All methods that used to be defined on TCAS
were moved to CAS.
All annotators should now derive from class Annotator, although for backwards compatibility C++ annotators can still derive from the class TextAnnotator.
For all C++ component types, the CAS delivered to the process method will be a base CAS if Sofa capabilities are
declared in the component descriptor, else the selected CAS view.
The method call
CAS.getTCAS(getSofa(getAnnotatorContext().mapToSofaID("SofaName")))
is replaced by
CAS->getView("SofaName")
The proposed standard for XML interchange of CAS data, XMI serialization, is now supported by UIMA C++. The C++ application driver, runAECpp, has a new option to specify XMI format input files, and the output format is now XMI.
XMI serialization is also key to implementing the UIMA-AS service wrapper for uimacpp-based annotators.
The Unix build is simplified by redistributing GNU automake output files in the source tarball. When building from an SVN checkout, up-to-date versions of GNU automake, autoconf and libtool are still required.
Although not required, C++ component descriptors of type taeDescription should be changed to type analysisEngineDescription.
This section describes the source code changes required to migrate from UIMA C++ version 1.4.4 to Apache UIMA C++ v2.2.2. Please note that the first two changes are order-dependent.
Replace getTCAS with getView
Replace TCAS with CAS
Replace TAF_ with UIMA_
Replace taf_ with uima/
Replace "tafapi.hpp" with "uima/api.hpp"
Replace TextAnnotator with Annotator
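The substitutions above can be sketched as an ordered list of textual rules. This is a minimal illustration, not a tool shipped with UIMA C++: `replaceAll` and `migrateSource` are hypothetical names, and the sketch assumes a naive, case-sensitive search and replace over source text held in a string.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Replace every occurrence of `from` in `text` with `to`.
std::string replaceAll(std::string text, const std::string& from, const std::string& to) {
    assert(!from.empty());  // an empty pattern would never advance
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}

// Hypothetical helper: apply the migration substitutions in the listed order.
std::string migrateSource(std::string code) {
    const std::vector<std::pair<std::string, std::string>> rules = {
        {"getTCAS", "getView"},  // must run before the TCAS rule
        {"TCAS", "CAS"},
        {"TAF_", "UIMA_"},
        {"taf_", "uima/"},
        {"\"tafapi.hpp\"", "\"uima/api.hpp\""},
        {"TextAnnotator", "Annotator"},
    };
    for (const auto& rule : rules) {
        code = replaceAll(code, rule.first, rule.second);
    }
    return code;
}
```

The order of the first two rules matters because "TCAS" is a substring of "getTCAS": applying the TCAS rule first would turn getTCAS into getCAS, which the getTCAS rule can no longer match. Since the rules are purely textual, identifiers that merely contain one of the patterns are rewritten as well, so the output should be reviewed before compiling.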
Tcl source code using variables of type TCAS should use CAS instead. No changes should be necessary for Perl or Python source.
The Apache UIMA project really needs and appreciates any contributions, including documentation help, source code and feedback. If you are interested in contributing, please visit http://incubator.apache.org/uima/get-involved.html.
The Apache UIMA project uses JIRA for issue tracking. Please report any issues you find at http://issues.apache.org/jira/browse/uima
Please see Overview and Setup for a high level overview of UIMA C++, and Doxygen for details on the UIMA C++ APIs.