Apache UIMA (Unstructured Information Management Architecture) v2.7.0 Release Notes

What is UIMA?
Major Changes in this Release
How to Get Involved
How to Report Issues
List of JIRA Issues Fixed in this Release

What is UIMA?

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. UIMA is a framework and SDK for developing such applications. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at. UIMA enables such an application to be decomposed into components, for example "language identification" -> "language specific segmentation" -> "sentence boundary detection" -> "entity detection (person/place names etc.)". Each component must implement interfaces defined by the framework and must provide self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.

Apache UIMA is an Apache-licensed open source implementation of the UIMA specification (that specification is, in turn, being developed concurrently by a technical committee within OASIS, a standards organization). We invite and encourage you to participate in both the implementation and specification efforts.

UIMA is a component framework for analysing unstructured content such as text, audio and video. It comprises an SDK and tooling for composing and running analytic components written in Java and C++, with some support for Perl, Python and TCL.

Major Changes in this Release

Java 7 minimum level

Java 7 is now the minimum level of Java required.

Several JVM properties support backwards compatibility

See the README and the Reference chapter for a description of these.

JSON serialization support

JSON serialization support has been added for Type System Descriptions and for CASs. Several formats for JSON CAS serialization are provided; see the chapter in the UIMA reference documentation for details.

Sorted and Bag indexes no longer store multiple instances of identical FSs

The meaning of "bag" and "sorted" indexes has been made consistent with how these are handled when sending CASes to remote services. Adding the identical FS to the indexes multiple times no longer adds duplicate index entries, and removing an FS from an index now guarantees that that particular FS is no longer in any index. (Previously, if you had added a particular FS to a sorted or bag index multiple times, a remove would take out just one of the instances.) Deserialization of CASes sent to remote services has never added Feature Structures to the index multiple times, so this change makes the behavior consistent. For more details, see Jira issue UIMA-3399.

Because some users may need the previous behavior, which permitted duplicates of identical Feature Structures in the sorted and bag indexes, this change can be disabled by running the JVM with the property "uima.allow_duplicate_add_to_indexes" defined.
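The difference between the previous and the new behavior can be modeled with plain Java collections. The sketch below does not use the UIMA API: the class name DuplicateAddSketch and its methods are hypothetical, a list stands in for the old bag index, and an identity-based set stands in for the new one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class DuplicateAddSketch {

    /** Previous behavior, modeled as a list: every add creates another entry. */
    public static boolean stillIndexedAfterOneRemoveOld() {
        Object fs = new Object();          // hypothetical stand-in for a Feature Structure
        List<Object> bag = new ArrayList<>();
        bag.add(fs);
        bag.add(fs);                       // second entry for the same instance
        bag.remove(fs);                    // takes out only one of the two entries
        return bag.contains(fs);
    }

    /** New behavior, modeled as an identity set: re-adding the same instance is a no-op. */
    public static boolean stillIndexedAfterOneRemoveNew() {
        Object fs = new Object();
        Set<Object> bag = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        bag.add(fs);
        bag.add(fs);                       // ignored: this instance is already present
        bag.remove(fs);                    // the instance is now gone entirely
        return bag.contains(fs);
    }

    public static void main(String[] args) {
        System.out.println(stillIndexedAfterOneRemoveOld());  // true
        System.out.println(stillIndexedAfterOneRemoveNew());  // false
    }
}
```

In this model, code that relied on one remove leaving a second entry behind is the kind of code that might need the compatibility property.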

Index corruption avoidance

To prevent potential index corruption, UIMA now recovers (unless disabled by "-Duima.disable_auto_protect_indexes") from illegal modifications of features.

These are modifications to features used as index keys, done while the Feature Structure being modified is currently in one or more indexes (see Jira issue UIMA-4135).

Corruption is prevented by first removing the feature structure being updated from the indexes, then doing the update, and then adding the feature structure back to the indexes. Because these actions can affect performance, it is recommended that you run with the JVM property "-Duima.report_fs_update_corrupts_index" to see whether any user code has this problem. Fix any such code either by redesign or by wrapping the affected areas with a form of protectIndexes(), which does the needed removes and add-backs under your control, so that you can do several feature updates at once before the feature structure is added back. protectIndexes is described in the CAS Reference chapter and the CAS Javadocs; you can also use it with the JCas.
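Why the remove, update, add-back sequence matters can be seen in a small model built on java.util.TreeSet. This is only an illustration of the general problem with mutating a sort key in place, not UIMA code: the class ProtectIndexSketch, the Item type, and its begin field are hypothetical stand-ins for a Feature Structure and a key feature.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class ProtectIndexSketch {

    /** Hypothetical stand-in for a Feature Structure with one key feature. */
    static final class Item {
        int begin;
        Item(int begin) { this.begin = begin; }
    }

    /** Hypothetical stand-in for a sorted index keyed on Item.begin. */
    static TreeSet<Item> newIndex() {
        return new TreeSet<>(Comparator.comparingInt((Item i) -> i.begin));
    }

    /** Changes the key in place while the item is still indexed (the illegal case). */
    public static boolean findableAfterInPlaceUpdate() {
        TreeSet<Item> index = newIndex();
        Item a = new Item(10);
        index.add(a);
        index.add(new Item(20));
        index.add(new Item(30));
        a.begin = 40;                 // the tree still files 'a' at its old position
        return index.contains(a);     // the lookup follows the new key and misses it
    }

    /** Remove, update, add back: the sequence described above. */
    public static List<Integer> keysAfterProtectedUpdate() {
        TreeSet<Item> index = newIndex();
        Item a = new Item(10);
        index.add(a);
        index.add(new Item(20));
        index.add(new Item(30));
        index.remove(a);              // 1. remove while the old key is still valid
        a.begin = 40;                 // 2. update the key
        index.add(a);                 // 3. add back under the new key
        List<Integer> keys = new ArrayList<>();
        for (Item i : index) {
            keys.add(i.begin);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(findableAfterInPlaceUpdate());   // false
        System.out.println(keysAfterProtectedUpdate());     // [20, 30, 40]
    }
}
```

Batching several key updates between one remove and one add-back, as protectIndexes() allows, avoids paying the remove and add cost once per feature.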

Because this protection is automatic and hidden, if you are iterating over sorted or set indexes, the automatic recovery may cause unexpected ConcurrentModificationExceptions to be thrown by the iterator when advancing. To work around this, either stop modifying features which are used as keys in the index being iterated over, or use Snapshot iterators (see following).

New Snapshot iterators won't throw ConcurrentModificationException

A long-standing difficulty with Feature Structure iterators is that adding to or removing from the underlying index is not allowed while iterating (unless you use a moveToXXX kind of operation to "reset" the iterator). This is addressed with a new class of "snapshot" iterators. These take a snapshot of the state of the index when the iterator is created; subsequent modifications to the index are then permitted while the iterator continues to iterate over the snapshot it created, and these iterators do not throw ConcurrentModificationException. The feature is implemented via a new method on FSIndex, withSnapshotIterators(), which creates a light-weight copy of the FSIndex instance whose iterator methods return the Snapshot kind of iterator. This approach allows the new index to be used in Java's "extended for" statement.

The current implementation of the snapshot iterators makes a snapshot of the index being iterated over, at creation time, which has a cost in space and time.
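The trade-off can be illustrated with standard Java collections, whose fail-fast iterators behave in a comparable way. The sketch below is an analogy only and does not call withSnapshotIterators(); the class SnapshotIterationSketch and its methods are hypothetical, and an ArrayList stands in for the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

public class SnapshotIterationSketch {

    /** Removing from the collection being iterated over trips the fail-fast iterator. */
    public static boolean liveIterationThrows() {
        List<String> index = new ArrayList<>(Arrays.asList("a", "b", "c"));
        try {
            for (String s : index) {
                if (s.equals("a")) {
                    index.remove(s);   // modifies the collection under the iterator
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Iterating over a copy taken up front leaves the live collection free to change. */
    public static List<String> snapshotIterationResult() {
        List<String> index = new ArrayList<>(Arrays.asList("a", "b", "c"));
        List<String> snapshot = new ArrayList<>(index);  // costs O(n) time and space
        for (String s : snapshot) {
            if (!s.equals("b")) {
                index.remove(s);       // safe: the loop walks the snapshot, not 'index'
            }
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(liveIterationThrows());       // true
        System.out.println(snapshotIterationResult());   // [b]
    }
}
```

As in the model, the copy is made once per iterator, so snapshot iteration is best reserved for loops that really do modify the index they walk.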

Other changes

Some of the other major changes are listed here; for the complete list, see the Issues Fixed report.

The JCasGen Eclipse plugin now works with more varieties of class path specifications. Jira issues: UIMA-4080 and UIMA-4081.
moveTo(a_Feature_Structure), or creating a new iterator to start at a given Feature Structure, sometimes went to the wrong place. Jira issues: UIMA-4094 and UIMA-4105.
Deserialization of a delta CAS that modified existing indexed Feature Structures could corrupt the indexes. Jira issue: UIMA-4100.
Default bag indexes are now created if there are only Set indexes. Jira issue: UIMA-4111.
XMI CAS serialization now checks that list and array feature values marked as multipleReferencesAllowed=false (or not marked at all, which defaults to multipleReferencesAllowed=false) are not multiply referenced. If they are, they continue to be serialized as if they were independent objects (as was previously done), but a new warning message is now issued. Because there can be a huge number of these messages, they are automatically throttled to prevent running out of room in the error logs. The message strings look like "Feature [some-feature-name] is marked multipleReferencesAllowed=false, but it has multiple references. These will be serialized in duplicate."
The CasCopier now checks to ensure that the range type of the target feature has the same name as the range type of the source feature, which catches errors when two different type systems are used for the source and target. For example, this now prevents a feature with range type "uima.cas.Float" from being copied into one with range type "uima.cas.String".

Performance change highlights

Iterators obtained for Bag indexes and from the method getAllIndexedFS( ... type ... ) are, by definition, unordered. The implementation of these iterators is now much faster: it takes advantage of the lack of ordering to return items in a more efficient sequence. Jira issue: UIMA-4166.
CasCopier has been reimplemented and is approximately 5-10 times faster.

The complete list of fixes is here.