IBM’s Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies.
The UIMA framework provides a run-time environment in which developers can plug in and run their UIMA component implementations and with which they can build and deploy UIM applications. The framework is not specific to any IDE or platform.
The UIMA Software Development Kit (SDK) includes an all-Java implementation of the UIMA framework for the development, description, composition and deployment of UIMA components and applications. It also provides the developer with an Eclipse-based (www.eclipse.org) development environment that includes a set of tools and utilities for using UIMA.
This chapter is the intended starting point for readers who are new to the UIMA SDK. It includes this introduction and the following sections:
Chapter |
Description |
Overviews |
|
UIMA SDK Overview (This Chapter) |
Lists the documents provided in the UIMA SDK documentation set. Provides a recommended path through the documentation for getting started using UIMA. Includes release notes. Provides a brief high-level description of the different software modules included in the UIMA SDK. |
UIMA Conceptual Overview |
Provides a broad conceptual overview of the UIMA component architecture making contextual references to the other documents in the UIMA SDK documentation set that provide more detail. |
Setting up |
|
UIMA Eclipse Tooling Installation and Setup |
Provides step-by-step instructions for installing the UIMA SDK in the Eclipse Integrated Development Environment. |
Developer's Guides |
|
Annotator and AE Developer's Guide |
Tutorial-style guide for building UIMA annotators and analysis engines. This chapter introduces the developer to creating type systems and using UIMA’s common data structure, the CAS or Common Analysis Structure. It demonstrates how to use built-in tools to specify and create basic UIMA analysis components. |
CPE Developer's Guide |
Tutorial-style guide for building UIMA collection processing engines. These manage the analysis of collections of documents from source to sink. |
Application Developer's Guide |
Tutorial-style guide for using the UIMA SDK to create, run and manage UIMA components from your application. Includes integration with the semantic search engine and a description of a simple GUI provided for submitting and running Semantic Search queries that can exploit UIMA analysis. Also describes APIs for saving and restoring the contents of a CAS using an XML format called XCAS. |
Flow Controller Developer's Guide |
When multiple components are combined in an Aggregate, each CAS flows among the various components. UIMA provides two built-in flows, and also allows custom flows to be implemented. |
Developing Applications using Multiple Subjects of Analysis (Sofas) |
A single CAS may be associated with multiple subjects of analysis (Sofas). These are useful for representing and analyzing different formats or translations of the same document. For multi-modal analysis, Sofas are good for different modal representations of the same stream (e.g., audio and closed captions). This chapter provides the developer with details on how to use multiple Sofas in an application. |
CAS Multiplier Developer's Guide |
A component may add additional CASes into the workflow. This may be useful to break up a large artifact into smaller units, or to create a new CAS that collects information from multiple other CASes. |
XMI® and EMF Interoperability |
The UIMA Type system and the contents of the CAS itself can be externalized using the XMI standard for XML MetaData. Eclipse Modeling Framework (EMF) tooling can be used to develop applications that use this information. |
Tool User Guides |
|
Component Descriptor Editor |
Describes the features of the Component Descriptor Editor Tool. This tool provides a GUI for specifying the details of UIMA component descriptors, including those for Analysis Engines (primitive and aggregate), Collection Readers, CAS Consumers and Type Systems. |
CPE Configurator |
Describes the User Interfaces and features of the CPE Configurator tool. This tool allows the user to select and configure the components of a Collection Processing Engine and then to run the engine. |
PEAR Packager |
Describes how to use the PEAR Packager utility. This utility enables developers to produce an archive file for an analysis engine that includes all required resources for installing that analysis engine in another UIMA environment. |
PEAR Installer |
Describes how to use the PEAR Installer utility. This utility installs and verifies an analysis engine from an archive file (PEAR) with all its resources in the right place so it is ready to run. |
PEAR Merger User's Guide |
Merges multiple PEAR packages into one. |
Document Analyzer |
Describes the features of a tool for applying a UIMA analysis engine to a set of documents and viewing the results. |
CAS Visual Debugger |
Describes the features of a tool for viewing the detailed structure and contents of a CAS. Good for debugging. |
JCasGen |
Describes how to run the JCasGen utility, which automatically builds Java classes that correspond to a particular CAS Type System. |
XCAS Viewer |
Describes how to run the supplied viewer for XCASes, used in the examples. |
References |
|
UIMA FAQs |
Frequently Asked Questions about general UIMA concepts. (Not a programming resource.) |
Glossary |
Main UIMA concepts and their basic definitions. |
Component Descriptor Reference |
Provides the detailed XML format for all the UIMA component descriptors, except the CPE (see next). |
CPE Descriptor Reference |
Provides detailed XML format for the Collection Processing Engine descriptor. |
JavaDocs |
JavaDocs detailing the UIMA SDK programming interfaces. |
CAS Reference |
Provides detailed description of the principal CAS interface. |
JCas Reference |
Provides details on the JCas, a native Java interface to the CAS. |
Semantic Search Engine Reference |
Describes how to write applications that query a semantic search engine index built using the UIMA SDK. |
PEAR Reference |
Provides detailed description of the deployable archive format for UIMA components. |
XMI CAS Serialization Reference |
Provides details about the XMI CAS Serialization. |
Learn how to package up an analysis engine for easy installation into another UIMA environment. Chapter 14, PEAR Packager and Chapter 15, PEAR Installer User's Guide will teach you how to create UIMA analysis engine archives so that you can easily share your components with a broader community.
Version 2.0 provides new capabilities and refines several areas of the UIMA architecture.
UIMA now supports Boolean (bit), Byte, Short (16 bit integers), Long (64 bit integers), and Double (64 bit floating point) primitive types, and arrays of these. These types can be used like all the other primitive types.
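The widths quoted above line up with Java's own primitive types. The following is a minimal sketch, assuming the natural Java counterparts (`byte`, `short`, `long`, `double`) for the new types; the class name `PrimitiveTypeWidths` is hypothetical and the code does not touch the UIMA API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative only: pairs each new primitive type name with the bit width
 * of the Java type assumed to be its counterpart. Not part of the UIMA SDK.
 */
final class PrimitiveTypeWidths {

    /** Returns the bit width of the assumed Java counterpart for each type name. */
    static Map<String, Integer> bitWidths() {
        Map<String, Integer> widths = new LinkedHashMap<>();
        widths.put("Byte", Byte.SIZE);
        widths.put("Short", Short.SIZE);   // 16 bit integers
        widths.put("Long", Long.SIZE);     // 64 bit integers
        widths.put("Double", Double.SIZE); // 64 bit floating point
        // Boolean is omitted: Java defines no bit width constant for boolean.
        return widths;
    }

    public static void main(String[] args) {
        bitWidths().forEach((name, bits) -> System.out.println(name + ": " + bits + " bits"));
    }
}
```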
Version 1.x made a distinction between Analysis Engines and Text Analysis Engines. This distinction has been eliminated in Version 2 - new code should just refer to Analysis Engines. Analysis Engines can operate on multiple kinds of artifacts, including text.
Version 1.x made a distinction between CASes and TCASes. TCASes are now deprecated; new code should just refer to CASes. The JCas capability to have a Java-friendly way to work with CAS types remains; we clarify that the JCas is just one (of potentially several) interfaces to the CAS.
The APIs for manipulating multiple subjects of analysis (Sofas) and their corresponding CAS Views have been simplified.
Analysis Components, in general, can make use of new capabilities to return multiple new CASes, in addition to returning the original CAS that is passed in. This allows components to have Collection Reader-like capabilities, but be placed anywhere in the flow. See CAS Multiplier Developer's Guide.
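The idea of one incoming artifact producing several new ones can be pictured with a small stand-alone sketch. It is an assumption-laden model, not the UIMA API: plain strings stand in for CASes, and the class `ParagraphSplitter` with its `process`, `hasNext` and `next` methods is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical stand-in for a component that receives one artifact and
 * emits several smaller ones. Strings play the role of CASes here.
 */
final class ParagraphSplitter {
    private final List<String> pending = new ArrayList<>();
    private int cursor = 0;

    /** Accepts the original artifact and queues one segment per non-blank paragraph. */
    void process(String document) {
        pending.clear();
        cursor = 0;
        // Paragraphs are separated by one or more blank lines.
        for (String paragraph : document.split("\\n\\s*\\n")) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                pending.add(trimmed);
            }
        }
    }

    /** True while there are queued segments that have not been handed out yet. */
    boolean hasNext() {
        return cursor < pending.size();
    }

    /** Hands out the next queued segment. */
    String next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no more segments");
        }
        return pending.get(cursor++);
    }

    public static void main(String[] args) {
        ParagraphSplitter splitter = new ParagraphSplitter();
        splitter.process("First paragraph.\n\nSecond paragraph.\n\n\nThird.");
        while (splitter.hasNext()) {
            System.out.println("new segment: " + splitter.next());
        }
    }
}
```

Because the splitter only consumes and produces artifacts, a component of this shape could sit at any position in a pipeline rather than only at the start.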
A new component, the Flow Controller, can be supplied by the user to implement arbitrary flow control for CASes within an Aggregate. This is in addition to the two built-in flow control choices of linear and language-capability flow. See Flow Controller Developer's Guide.
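To make the difference between a fixed and a custom flow concrete, here is a minimal sketch in plain Java. It contrasts a linear order with a data-driven order similar in spirit to the whiteboard sample listed among the example components. Everything in it is hypothetical (the `FlowSketch` and `Component` classes, the type names), and it does not use the UIMA Flow Controller interfaces.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of flow control inside an aggregate; not the UIMA API. */
final class FlowSketch {

    /** A component that needs some types to be present and produces others. */
    static final class Component {
        final String name;
        final Set<String> inputs;
        final Set<String> outputs;

        Component(String name, Set<String> inputs, Set<String> outputs) {
            this.name = name;
            this.inputs = inputs;
            this.outputs = outputs;
        }
    }

    /** Linear flow: visit components in the order they were declared. */
    static List<String> linearOrder(List<Component> components) {
        List<String> order = new ArrayList<>();
        for (Component c : components) {
            order.add(c.name);
        }
        return order;
    }

    /**
     * Custom flow: repeatedly pick the first unvisited component whose inputs
     * are already available. Components whose inputs never appear are skipped.
     */
    static List<String> dataDrivenOrder(List<Component> components, Set<String> initiallyAvailable) {
        Set<String> available = new HashSet<>(initiallyAvailable);
        List<Component> remaining = new ArrayList<>(components);
        List<String> order = new ArrayList<>();
        boolean progressed = true;
        while (progressed && !remaining.isEmpty()) {
            progressed = false;
            for (int i = 0; i < remaining.size(); i++) {
                Component c = remaining.get(i);
                if (available.containsAll(c.inputs)) {
                    order.add(c.name);
                    available.addAll(c.outputs);
                    remaining.remove(i);
                    progressed = true;
                    break;
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<Component> declared = List.of(
            new Component("MeetingFinder", Set.of("Date", "Room"), Set.of("Meeting")),
            new Component("DateAnnotator", Set.of(), Set.of("Date")),
            new Component("RoomAnnotator", Set.of(), Set.of("Room")));
        System.out.println("linear:      " + linearOrder(declared));
        System.out.println("data-driven: " + dataDrivenOrder(declared, Set.of()));
    }
}
```

In this sketch the linear order follows the declaration order even though the first component's inputs are missing, while the data-driven order defers it until the two annotators it depends on have run.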
The search engine that is provided with the UIMA SDK has been upgraded to a later release; it is more scalable and now has the ability to index additional information from Annotations. The SIAPI.pdf reference documentation for this has been updated. The SemanticSearchCasIndexer now supports indexing individual features of annotations in addition to their types.
For the most part, applications and components should work unchanged under version 2.0. However, please note the following incompatible changes:
TextAnalysisEngine has been deprecated - it is now no different than AnalysisEngine. Previous code that uses this should still continue to work, however.
Methods that were defined on the TCAS interface have been moved to the base CAS interface; the TCAS interface is no longer needed.
The DocumentAnalyzer tool saves outputs in the new XMI serialization format. The XCasAnnotationViewer and SemanticSearchGUI tools can read both the new XMI format and the previous XCAS format.
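A tool that accepts both serializations has to tell them apart somehow. The sketch below shows one possible approach, looking at the root element of the XML. It is not how the SDK tools are known to work. It rests on two assumptions: that an XMI document has a root element with local name `XMI`, and that an XCAS document has a root element named `CAS`. The class `CasFormatSniffer` is hypothetical and uses only the JDK's StAX parser.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper that guesses a serialization format from the XML root element. */
final class CasFormatSniffer {

    enum Format { XMI, XCAS, UNKNOWN }

    /** Returns the guessed format, or UNKNOWN for unrecognized or malformed input. */
    static Format detect(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String root = reader.getLocalName();
                        if ("XMI".equals(root)) {
                            return Format.XMI;   // assumed root of an XMI document
                        }
                        if ("CAS".equals(root)) {
                            return Format.XCAS;  // assumed root of an XCAS document
                        }
                        return Format.UNKNOWN;
                    }
                }
                return Format.UNKNOWN;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return Format.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        System.out.println(detect("<xmi:XMI xmlns:xmi=\"http://www.omg.org/XMI\"/>"));
        System.out.println(detect("<CAS/>"));
    }
}
```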
The UIMA SDK supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies.
It includes APIs and tools for creating analysis components. Examples of analysis components include tokenizers, summarizers, categorizers, parsers, named-entity detectors etc. Tutorial examples are provided with the SDK; additional components are available from the community.
The UIMA SDK also includes a semantic search engine for indexing the results of analysis and for using this semantic index to perform more advanced search.
UIMA supports the development and integration of analysis algorithms developed in different programming languages.
The SDK is principally focussed on Java development. It also includes facilities for C++ Enablement for UIMA Components which allow UIMA components to be written in C++ and have access to a C++ version of the CAS. When used in this manner, the Java UIMA framework can incorporate analytic functions written in C++. Optional files included with the UIMA SDK describe this functionality and provide example code. See the Quick Start manual for more information on this.
Other languages, including Python, Perl, and TCL, are being added to the list.
The UIMA architecture supports the development, discovery, composition and deployment of multi-modal analytics, including text, audio and video. Annotations, Artifacts, and S discuss this in more detail.
The SDK is available from IBM's alphaWorks (http://www.alphaworks.ibm.com/tech/uima). The source code for the main UIMA framework is available on SourceForge (http://uima-framework.sourceforge.net).
Module |
Description |
UIMA Framework Core |
A framework integrating core functions for creating, deploying, running and managing UIMA components, including analysis engines and Collection Processing Engines in collocated and/or distributed configurations. The framework includes an implementation of core components for transport layer adaptation, CAS management, workflow management based on declarative specifications, resource management, configuration management, logging, and other functions. |
C++ and other programming language Interoperability |
Includes a C++ CAS and supports the creation of UIMA-compliant C++ components that can be deployed in the UIMA run-time through a built-in JNI adapter. This includes high-speed binary serialization. Includes support for creating service-based UIMA engines outside of the SDK. This is ideal for wrapping existing code written in different languages. |
Externalized Framework Plug-ins |
Note that interfaces of these components are available to the developer but different implementations are possible in different implementations of the UIMA framework. |
CAS |
These classes provide the developer with typed access to the Common Analysis Structure (CAS), including type system schema, elements, subjects of analysis and indices. The multiple subjects of analysis (Sofas) mechanism supports the independent or simultaneous analysis of multiple views of the same artifacts (e.g. documents), supporting multi-lingual and multi-modal analysis. |
JCas |
An alternative interface to the CAS, providing Java-based UIMA Analysis components with native Java object access to CAS types and their attributes or features, using the JavaBeans conventions of getters and setters. |
Collection Processing Management (CPM) |
Core functions for running UIMA collection processing engines in collocated and/or distributed configurations. The CPM provides scalability across parallel processing pipelines, check-pointing, performance monitoring and recoverability. |
Resource Manager |
Provides UIMA components with run-time access to external resource handling capabilities such as resource naming, sharing, and caching. |
Configuration Manager |
Provides UIMA components with run-time access to their configuration parameter settings. |
Logger |
Provides access to a common logging facility. |
Tools and Utilities |
|
JCasGen |
Utility for generating a Java object model for CAS types from a UIMA XML type system definition. |
Saving and Restoring CAS contents |
APIs in the core framework support saving and restoring the contents of a CAS to streams using an XMI format. |
PEAR packager for Eclipse |
Tool for building a UIMA component archive to facilitate porting, registering, installing and testing components. |
PEAR Installer |
Tool for installing and verifying a UIMA component archive in a UIMA installation. |
PEAR Merger |
Utility that combines multiple PEARs into one. |
Component Descriptor Editor |
Eclipse Plug-in for specifying and configuring component descriptors for UIMA analysis engines as well as other UIMA component types including Collection Readers and CAS Consumers. |
CPE Configurator |
Graphical tool for configuring Collection Processing Engines and applying them to collections of documents. |
Java Annotation viewer |
Viewer for exploring annotations and related CAS data. |
CAS Visual Debugger |
Provides the developer with a detailed visual view of the contents of a CAS. |
Document Analyzer |
Graphical tool for applying analysis engines to sets of documents and viewing results. |
Example Analysis Components |
|
Semantic Search CAS Indexer |
CAS Consumer that uses the semantic search engine indexer to build an index from a stream of CASes. Requires the semantic search engine (included). |
Database Writer |
CAS Consumer that writes the content of selected CAS types into a relational database, using JDBC. This code is in the doc/examples/src/com/ibm/uima/examples/ |
Annotators |
Set of simple annotators meant for pedagogical purposes. Includes: Date/time, Room-number, Regular expression, Tokenizer, and Meeting-finder annotator. There are also sample Annotators in C++ and Python. There are sample CAS Multipliers as well. |
Flow Controllers |
There is a sample flow-controller based on the whiteboard concept of sending the CAS to whatever annotator hasn't yet processed it, when that annotator's inputs are available in the CAS. |
File System Collection Reader |
Simple Collection Reader for pulling documents from the file system and initializing CASes. |
XMI Collection Reader, |
Reads and writes the CAS in XMI format. |
Search Components |
|
Semantic Search Engine |
Search Engine that supports searching over results of analysis including annotations and nested annotations using the "XML Fragment" query language. |
Components not currently available in this release of the UIMA SDK. |
If you are interested in these extensions, please contact the UIMA team at the IBM T.J. Watson Research Center via www.ibm.com/research/uima |
Semantic search and Analysis Workbench (SAW) |
Graphical User Interface for applying analysis to build search indices and DBs and query interfaces for searching/exploring analysis results. Uses the semantic search engine and the EKDB (see below). |
Extracted Knowledge Database (EKDB) |
Database schema and APIs for creating and populating a relational database with analysis results including entity and relation annotations. Includes a CAS Consumer that populates the database. Semantic Analysis Workbench provides a front-end to this database and to the Semantic Search Engine’s query processor. |