Apache UIMA C++ Overview and Setup

Authors: The Apache UIMA Development Community

Version 2.3.0

Incubation Notice and Disclaimer. Apache UIMA is an effort undergoing incubation at the Apache Software Foundation (ASF). Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

License and Disclaimer. The ASF licenses this documentation to you under the Apache License, Version 2.0 (the "License"); you may not use this documentation except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, this documentation and its contents are distributed under the License on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Trademarks. All terms mentioned in the text that are known to be trademarks or service marks have been appropriately capitalized. Use of such terms in this book should not be regarded as affecting the validity of the trademark or service mark.

August, 2008


1.0 Apache UIMA C++ Overview

The Apache UIMA C++ framework allows the creation of UIMA-compliant analysis engines from analytics written in C++ and in several scripting languages that can use C++ libraries. A rich set of standard UIMA interface methods minimizes the effort needed to extract input data from a CAS and then update the CAS with the analytic results. The UIMA framework transparently moves the CAS between Java and C++ components and between UIMA components running in different processes.

A UIMA C++ component is identified as such in its component descriptor:

UIMA C++ annotators can be used from C++ applications and from Java applications, and can be aggregated with other UIMA-compliant annotators. For C++ applications, the UIMA C++ framework has APIs to parse component descriptors and then instantiate and call analysis engines. A C++ test driver is available so that a UIMA C++ analytic can be developed and tested with standard native programming tools; no programming in Java is required. Alternatively, for a more consistent development environment, Eclipse with the CDT can provide a single IDE for both Java and C++ components.

For Java applications, there are two approaches to integrating UIMA C++ analytics: using the Java Native Interface (JNI), and using a C++ service wrapper to create a UIMA AS compatible service. Using the JNI, a C++ analysis engine can be used anywhere a Java analysis engine is used; in this case a Java proxy instantiates the uimacpp framework through the JNI. Note that if more than one C++ component is used in the same JVM, they must share the same native environment. Using UIMA AS, a C++ component can be started as a separate process, and therefore each component can have its own native environment, if desired. When C++ is launched automatically from Java, logging and JMX monitoring of the annotator are done via the JVM.

1.1 UIMA C++ Functionality

The UIMA C++ framework implements a subset of the functionality of the Java framework. Major functionality consists of:

Major UIMA functionality missing in the C++ framework:

UIMA-compliant annotators can be written in Perl, Python and Tcl using C++ annotators included in this package. For further details see Perl, Python and Tcl.

The UIMA C++ framework depends on Unicode support from the ICU (see http://www.ibm.com/software/globalization/icu), XML parsing support from xerces (see http://xml.apache.org/xerces-c/) and platform portability from APR (see http://apr.apache.org/).

API documentation for the C++ framework is available here.

1.2 Supported Platforms

Linux® Intel® 32- and 64-bit platforms, MacOSX and Windows® 2000/XP.

1.3 Binary Distribution

Binary distributions are in compressed tarfiles for Linux and zipfiles for Windows.

1.4 Source Distribution

The source code distributions are in a compressed tarfile for Unix builds and a zipfile for Windows builds.

2.0 Installing and Testing UIMA C++

The binary distribution can be installed anywhere. However, when installed on a system with the Apache UIMA Java distribution, unpacking directly underneath $UIMA_HOME provides better interoperability.

2.1 Setting Environment Variables

Set UIMACPP_HOME to the installed location of the UIMA C++ SDK.

Both the UIMA C++ framework and users' C++ components are implemented as shared libraries and must be available to the native library loader. The directories containing these libraries must be listed in LD_LIBRARY_PATH on Linux, in DYLD_LIBRARY_PATH on MacOSX, and in the system PATH on Windows. The directory containing the UIMA C++ executables should also be added to the system PATH.

On Linux
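
As a minimal sketch of these settings for a POSIX shell: the install location below is hypothetical, and the lib subdirectory is an assumption about the SDK layout (bin is the directory this guide names as the home of runAECpp). Adjust both to match the actual installation.

```shell
# Hypothetical install location; substitute the directory where the SDK was unpacked.
export UIMACPP_HOME="$HOME/apache-uima/uimacpp"

# Make the framework's shared libraries visible to the native loader.
# The "lib" subdirectory is an assumption about the SDK layout.
# The ${VAR:+...} form avoids a trailing ":" when LD_LIBRARY_PATH was empty.
export LD_LIBRARY_PATH="$UIMACPP_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Make the UIMA C++ executables (for example runAECpp) available on the command line.
export PATH="$UIMACPP_HOME/bin:$PATH"
```

Directories holding your own annotator libraries can be prepended to LD_LIBRARY_PATH in the same way.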

On Windows

2.2 Verifying Your Installation

To test the installation, set the environment variables as described above and follow these directions:

2.2.1 On Linux

The build should create a shared library, DaveDetector.so, which must be placed in a directory listed in LD_LIBRARY_PATH.
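
One way to confirm that the library is visible to the loader is to search each LD_LIBRARY_PATH entry for it. The helper below is a hypothetical convenience function, not part of the UIMA C++ distribution, and is only a sketch for a POSIX shell.

```shell
# find_in_path_list FILE LIST
# Print the first directory in the colon-separated LIST that contains FILE;
# return non-zero if no directory does.
find_in_path_list() {
    file=$1
    list=$2
    old_ifs=$IFS
    IFS=:
    for dir in $list; do          # split LIST on ":" only
        IFS=$old_ifs
        if [ -n "$dir" ] && [ -f "$dir/$file" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Report whether DaveDetector.so can be found through LD_LIBRARY_PATH.
find_in_path_list DaveDetector.so "${LD_LIBRARY_PATH:-}" \
    || echo "DaveDetector.so is not in any LD_LIBRARY_PATH directory"
```

If the message is printed, either copy the library into a listed directory or add its build directory to LD_LIBRARY_PATH.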

Run this C++ annotator as follows:

The runAECpp driver will process the input text file and DaveDetector should find a Dave in it.

2.2.2 On Windows

The build should create a shared library, DaveDetector.dll, which must be placed in a directory listed in the PATH.

Run this C++ annotator as follows:

The runAECpp driver will process the input text file and DaveDetector should find a Dave in it.

2.2.3 More Info on UIMA C++ Examples

For further details about how to build and run other examples, see C++ Examples.

2.3 Testing Interoperability with UIMA on Java

To test interoperability with the UIMA Java SDK, make sure UIMA_HOME is set to the location of the UIMA SDK, and that its bin directory is in the PATH. Run DaveDetector as follows:

2.3.1 On Linux

2.3.2 On Windows

The runAE driver will process all files in the data directory and DaveDetector should find Dave in some of them.

2.4 Testing Interoperability with UIMA AS

UIMA AS is an add-on to the core UIMA package; it must be downloaded and installed separately. To test interoperability with UIMA AS, make sure UIMA_HOME is set to the location of the combined SDK, and that its bin directory is in the PATH. Run the C++ MeetingAnnotator as follows:

2.4.1 Adjust Example Deployment Descriptor

The UIMA AS example C++ deployment descriptor includes specifications for the UIMACPP environment. Following the convention for UIMA examples, this descriptor specifies paths starting with "C:/Program Files/apache-uima". This path prefix must be changed in the descriptor to be the parent directory of UIMACPP.
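
The prefix change can be scripted rather than edited by hand. The sketch below assumes a POSIX shell with sed; the function name and the file names in the usage comment are hypothetical, and the replacement prefix must not itself contain the characters | or &, which sed would interpret.

```shell
# rewrite_prefix INPUT OUTPUT NEW_PREFIX
# Copy INPUT to OUTPUT, replacing every occurrence of the example prefix
# "C:/Program Files/apache-uima" with NEW_PREFIX. INPUT is left unchanged.
rewrite_prefix() {
    sed "s|C:/Program Files/apache-uima|$3|g" "$1" > "$2"
}

# Usage (hypothetical file names), using the parent directory of UIMACPP_HOME:
#   rewrite_prefix example_deploy.xml my_deploy.xml "$(dirname "$UIMACPP_HOME")"
```

Writing to a separate output file keeps the shipped descriptor intact, so the edit can be repeated if the installation moves.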

2.4.2 On Linux

2.4.3 On Windows

The runRemoteAsyncAE driver will deploy the MeetingAnnotator binary from the distribution package as a UIMA AS service and send it an empty CAS to process.

3.0 Developing a C++ Component

3.1 Developing UIMA C++ Components without Java

It is advantageous to develop C++ components as stand-alone C++ applications. The program runAECpp found in $UIMACPP_HOME/bin is a native utility that instantiates the specified C++ annotator, imports input files into CAS objects, and for each input calls the annotator's process method. runAECpp supports input files in plain text (the default), as well as files in XMI and XCAS format. The output CASes are optionally saved.

The options -r, -rand and -rdelay are quite useful for detecting threading problems with annotators intended for multi-threaded deployments.

Sample XMI and XCAS format CAS files are included with the UIMA C++ examples. After building the SofaExampleAnnotator example as described above for DaveDetector, try:

For further details about these and other examples, see C++ Examples.

3.2 Debugging UIMA C++ Components

The component driver, runAECpp, simplifies running the C++ component under a native debugger.

3.2.1 Debugging on Windows

The UIMA C++ framework has special provisions for debugging on Windows. UIMA C++ components built in debug mode should link to the debug version of the framework, uimaD.dll. The debug framework automatically appends "D" to the name of a C++ component before trying to load it. This applies to annotators and URI scheme handlers. All UIMA C++ example code follows this convention.

Note also that the runAECppD version of the component driver should be used with debug components.

3.3 Running C++ Components under Java

Native components running under Java may operate differently than when run from a native application. This is because the JVM uses different default process limits than those used for a native application. For example, the maximum stack size for a thread running under a JVM may be 100KB versus 1MB for a native command line application. Use "java -X" to get more information on non-standard JVM options.
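
As a rough illustration of comparing and adjusting these limits from a POSIX-style shell: ulimit -s reports the stack limit inherited by native programs, and -Xss is one of the non-standard options listed by "java -X" for setting the Java thread stack size. The 1 MB value is only an example, and passing the option through UIMA_JVM_OPTS assumes the UIMA launch scripts forward that variable to the JVM.

```shell
# Show the stack size limit (in KB, or "unlimited") inherited by native
# programs started from this shell.
ulimit -s

# Example value only: request 1 MB stacks for Java threads, which is the stack
# native code uses when it is called through the JNI on those threads.
# Any options already present in UIMA_JVM_OPTS are preserved.
export UIMA_JVM_OPTS="${UIMA_JVM_OPTS:+$UIMA_JVM_OPTS }-Xss1m"
```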

3.3.1 Running Debug Modules under Java on Windows

To run UIMA C++ components built in debug mode, Java must load the debug version of the framework. Define the Java system property DEBUG_UIMACPP to select the debug framework. A convenient way to pass JVM properties to UIMA's Java command-line utilities, such as runAE.sh, is to define them in the environment variable UIMA_JVM_OPTS. For example, to run a debug version of DaveDetector from Java:
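
A hedged sketch of this setup in a POSIX-style shell, as would be used with runAE.sh. The property is given no value here, which assumes the framework only checks that DEBUG_UIMACPP is defined; the descriptor and data arguments in the comment are hypothetical.

```shell
# Define the DEBUG_UIMACPP system property for the JVM started by UIMA's
# launch scripts. Any options already present in UIMA_JVM_OPTS are preserved.
export UIMA_JVM_OPTS="${UIMA_JVM_OPTS:+$UIMA_JVM_OPTS }-DDEBUG_UIMACPP"

# Then run the Java driver as usual, for example (hypothetical arguments):
#   runAE.sh descriptors/DaveDetector.xml data
```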

3.4 Message Logging from a C++ Component

For formal integration with UIMA applications, a logfile interface is available. When a C++ annotator is called from Java, logging messages are integrated into the Java log. If the C++ annotator is called from a native C++ application, such as runAECpp, a local logfile may be created. The name of the logfile is taken from the environment variable UIMACPP_LOGFILE, and the file is opened in append mode. By default, logging is disabled.
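
For native runs, enabling the logfile therefore only requires setting the variable before starting the driver. A minimal sketch for a POSIX shell, with a hypothetical log location:

```shell
# Hypothetical log location; any writable path will do.
export UIMACPP_LOGFILE="${TMPDIR:-/tmp}/uimacpp.log"

# After running a native driver such as runAECpp, new messages are appended
# to the file, so output from earlier runs remains visible:
#   tail "$UIMACPP_LOGFILE"
```

Because the file is appended to rather than truncated, delete or rotate it between runs if only the latest messages are of interest.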

Three levels of message logging can be used: Message, Warning and Error. When called from Java, the UIMA log level is used to control output. When called from a C++ application, an API is available to set the log level; the default level is Error. When called from runAECpp, the values of these levels are 0, 1, and 2, respectively.