What is UIMA?
Major Changes in this Release
How to Get Involved
How to Report Issues
List of JIRA Issues Fixed in this Release
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. UIMA is a framework and SDK for developing such applications. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at. UIMA enables such an application to be decomposed into components, for example "language identification" -> "language specific segmentation" -> "sentence boundary detection" -> "entity detection (person/place names etc.)". Each component must implement interfaces defined by the framework and must provide self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.
Apache UIMA is an Apache-licensed open source implementation of the UIMA specification (that specification is, in turn, being developed concurrently by a technical committee within OASIS, a standards organization). We invite and encourage you to participate in both the implementation and specification efforts.
UIMA is a component framework for analysing unstructured content such as text, audio and video. It comprises an SDK and tooling for composing and running analytic components written in Java and C++, with some support for Perl, Python and TCL.
Java 7 is now the minimum level of Java required.
See the README and the Reference chapter for a description of these.
JSON serialization support has been added for Type System Descriptions and for CASes. Several formats for JSON CAS serialization are provided; see the chapter in the UIMA reference documentation for details.
The meaning of "bag" and "sorted" indexes has been made consistent with how these are handled when sending CASes to remote services. Adding the same identical FS to the indexes multiple times no longer adds duplicate index entries, and removing an FS from the indexes now guarantees that that particular FS is no longer in any index. (Previously, if you had added a particular FS to a sorted or bag index multiple times, a remove would remove just one of the instances.) Deserialization of CASes sent to remote services has never added Feature Structures to the indexes multiple times, so this change makes the behavior consistent. For more details, see Jira issue UIMA-3399.
Because some users may need the previous behavior, which permitted duplicates of identical Feature Structures in the sorted and bag indexes, this change can be disabled by running the JVM with the defined property "uima.allow_duplicate_add_to_indexes".
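The new default can be pictured with a small plain-Java model. This is only a sketch of the semantics described above, not UIMA code: the class ToyBagIndex and its methods are hypothetical names, and Strings stand in for Feature Structures. Membership is decided by object identity, so a repeated add of the identical object is ignored, while a distinct object with equal content still gets its own entry.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Toy model of an index that ignores repeated adds of the same object. Not a UIMA class. */
final class ToyBagIndex<T> {
    // Membership is decided by object identity (==), not by equals().
    private final Set<T> entries =
            Collections.newSetFromMap(new IdentityHashMap<T, Boolean>());

    /** Adding the identical object a second time leaves the index unchanged. */
    void addToIndexes(T fs) {
        entries.add(fs);
    }

    /** After this call the object is guaranteed to be absent from the index. */
    void removeFromIndexes(T fs) {
        entries.remove(fs);
    }

    boolean contains(T fs) {
        return entries.contains(fs);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        ToyBagIndex<String> index = new ToyBagIndex<String>();
        String a = new String("token"); // two distinct objects with equal content
        String b = new String("token");

        index.addToIndexes(a);
        index.addToIndexes(a); // repeated add of the identical object: ignored
        index.addToIndexes(b); // different object, so it gets its own entry
        System.out.println("entries after adds: " + index.size());

        index.removeFromIndexes(a); // a single remove is enough
        System.out.println("a still indexed: " + index.contains(a));
    }
}
```

Under the previous behavior, the second add of the same object would have created a second entry, and one remove would have left the other entry behind.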
To prevent potential index corruption, UIMA now recovers (unless disabled by "-Duima.disable_auto_protect_indexes") from illegal modifications of features. These are modifications to features used as index keys, made while the Feature Structure being modified is in one or more indexes (see Jira issue UIMA-4135). Corruption is prevented by first removing the feature structure being updated from the indexes, then doing the update, and then adding the feature structure back to the indexes.
Because these actions can affect performance, it is recommended that you run with the JVM property "-Duima.report_fs_update_corrupts_index" to see whether any user code has this problem, and fix it either by redesign or by wrapping the affected areas with a form of protectIndexes(). This does the needed removes and add-backs under your control, so you can do several feature updates at once before adding the feature structure back. protectIndexes is described in the CAS Reference Chapter and the CAS Javadocs; you can also use it with the JCas.
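The remove, update, add-back sequence can be illustrated with a plain-Java analogy. The sketch below does not use the UIMA API: a java.util.TreeSet stands in for a sorted index, and the class KeyUpdateSketch, the type Annot and the method updateBegin are hypothetical names chosen for illustration. The same hazard applies to any sorted collection, because an element's position is derived from its key at insertion time.

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Plain-Java analogy for updating a key feature of an indexed item. Not UIMA code. */
final class KeyUpdateSketch {

    /** Hypothetical stand-in for a Feature Structure whose "begin" is an index key. */
    static final class Annot {
        int begin;
        final String label;

        Annot(int begin, String label) {
            this.begin = begin;
            this.label = label;
        }
    }

    /** Orders items by the key feature, like a sorted index keyed on "begin". */
    static final Comparator<Annot> BY_BEGIN = new Comparator<Annot>() {
        @Override
        public int compare(Annot x, Annot y) {
            return Integer.compare(x.begin, y.begin);
        }
    };

    /**
     * Remove, update, add back. Changing fs.begin while fs is still in the
     * TreeSet would leave it at a position that no longer matches its key.
     */
    static void updateBegin(TreeSet<Annot> index, Annot fs, int newBegin) {
        boolean wasIndexed = index.remove(fs); // 1. take it out using the old key
        fs.begin = newBegin;                   // 2. change the key feature
        if (wasIndexed) {
            index.add(fs);                     // 3. re-insert at the new position
        }
    }

    public static void main(String[] args) {
        TreeSet<Annot> index = new TreeSet<Annot>(BY_BEGIN);
        Annot a = new Annot(10, "a");
        index.add(a);
        index.add(new Annot(20, "b"));
        index.add(new Annot(30, "c"));

        updateBegin(index, a, 40);
        for (Annot fs : index) {
            System.out.println(fs.begin + " " + fs.label);
        }
    }
}
```

Doing the remove and add-back once around a group of updates, rather than once per update, is the saving that explicit protection is meant to offer over the automatic recovery.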
Because this protection is automatic and hidden, if you are iterating over sorted or set indexes, the automatic recovery may cause unexpected ConcurrentModificationExceptions to be thrown by the iterator when advancing. To work around this, either stop modifying features which are used as keys in the index being iterated over, or use Snapshot iterators (see following).
A long-standing difficulty with Feature Structure iterators, namely, that adding to or removing from the underlying index being iterated over is not allowed while iterating (unless you use a moveToXXX kind of operation to "reset" the iterator), is addressed with a new class of "snapshot" iterators. These take a snapshot of the state of the index when the iterator is created. Subsequent modifications to the index are then permitted, while the iterator continues to iterate over the snapshot it created; these iterators do not throw ConcurrentModificationException. The feature is implemented via a new method on FSIndex, withSnapshotIterators(), which creates a light-weight copy of the FSIndex instance whose iterator methods return snapshot iterators. This approach allows using the new index in Java's "extended for" statement.
The current implementation of the snapshot iterators makes a snapshot of the index being iterated over, at creation time, which has a cost in space and time.
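The trade-off can be sketched with standard Java collections. This is an analogy only, assuming nothing about UIMA internals: an ArrayList stands in for the index, and the class and method names are hypothetical. Iterating over the live list while modifying it fails fast, whereas iterating over a copy taken up front lets the original be modified freely, at the price of the copy.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

/** Plain-Java analogy for snapshot iteration. Not UIMA code. */
final class SnapshotIterationSketch {

    /**
     * Modifies the list while iterating over it directly.
     * Returns true if the iterator detected the modification and threw.
     */
    static boolean liveModificationFails(List<Integer> index) {
        try {
            for (Integer fs : index) {
                index.add(fs + 100); // structural change during iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /**
     * Iterates over a copy taken up front, so the original may be modified.
     * The copy costs time and space proportional to the size of the list.
     */
    static void dropEvensViaSnapshot(List<Integer> index) {
        List<Integer> snapshot = new ArrayList<Integer>(index);
        for (Integer fs : snapshot) {
            if (fs % 2 == 0) {
                index.remove(fs); // remove(Object): safe, we iterate the snapshot
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> live = new ArrayList<Integer>();
        live.add(1);
        live.add(2);
        System.out.println("live iteration failed: " + liveModificationFails(live));

        List<Integer> index = new ArrayList<Integer>();
        for (int i = 1; i <= 4; i++) {
            index.add(i);
        }
        dropEvensViaSnapshot(index);
        System.out.println("after snapshot pass: " + index);
    }
}
```

For large indexes that are not modified during iteration, the ordinary iterators avoid the copying cost.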
Some of the other major changes are listed here; for the complete list, see the Issues Fixed report.
XMI CAS serialization now checks that list and array feature values marked as multipleReferencesAllowed=false (or not marked at all, which defaults to multipleReferencesAllowed=false) are not multiply referenced. If they are, they continue to be serialized as if they were independent objects (as was previously done), but a warning message is now issued. Because there can be a huge number of these messages, they are automatically throttled to prevent running out of room in the error logs.
The message strings for these look like "Feature [some-feature-name] is marked multipleReferencesAllowed=false, but it has multiple references. These will be serialized in duplicate."
The complete list of fixes is here.
The Apache UIMA project really needs and appreciates any contributions, including documentation help, source code and feedback. If you are interested in contributing, please visit http://uima.apache.org/get-involved.html.
The Apache UIMA project uses JIRA for issue tracking. Please report any issues you find at http://issues.apache.org/jira/browse/uima