UIMA FAQs

What is UIMA? UIMA stands for Unstructured Information Management Architecture. It is a component software architecture for the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search and knowledge management technologies.

UIMA processing occurs through a series of modules called analysis engines. The result of analysis is an assignment of semantics to the elements of unstructured data, for example, the indication that the phrase "Washington" refers to a person’s name or that it refers to a place.

UIMA supports the rendering of these results in conventional structures, for example, relational databases or search engine indices, where the content of the original unstructured information may be efficiently accessed according to its inferred semantics.

UIMA is specifically designed to support the developer in creating, integrating, and deploying components across platforms and among dispersed teams working to develop unstructured information management applications.

What's the difference between UIMA and the UIMA SDK? UIMA is an architecture which specifies component interfaces, design patterns, data representations and development roles.

The UIMA Software Development Kit (SDK) is a software system which includes a run-time framework, APIs and tools for implementing, composing, packaging and deploying UIMA components. It comes with a semantic search engine for indexing and querying over the results of analysis.

The UIMA run-time framework allows developers to plug in their components and applications and run them on different platforms and according to different deployment options that range from tightly coupled (running in the same process space) to loosely coupled (distributed across different processes or machines for greater scale, flexibility and recoverability).

What is an Annotation? An annotation is metadata that is associated with a region of a document. It is often a label, typically represented as a string of characters. The region may be the whole document.

An example is the label "Person" associated with the span of text "George Washington". We say that "Person" annotates "George Washington" in the sentence "George Washington was the first president of the United States". The association of the label "Person" with a particular span of text is an annotation. In another example, an annotation may represent a topic, such as "American Presidents", and be used to label an entire document.

Annotations are not limited to regions of text. An annotation may annotate a region of an image or a segment of audio. The same concepts apply.
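
The "Person" example above can be sketched in a few lines of Java. This is a minimal illustration only: the names SpanAnnotation and AnnotationDemo are hypothetical and are not part of the UIMA API. It assumes an annotation is just a label plus a pair of character offsets into the document text.

```java
// Hypothetical model, NOT the UIMA API: a label attached to a character span.
final class SpanAnnotation {
    final String label;
    final int begin; // inclusive character offset
    final int end;   // exclusive character offset

    SpanAnnotation(String label, int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span");
        }
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}

final class AnnotationDemo {
    static final String SENTENCE =
        "George Washington was the first president of the United States";

    // Annotates the first occurrence of a phrase, or returns null if it is absent.
    static SpanAnnotation annotate(String document, String phrase, String label) {
        int begin = document.indexOf(phrase);
        return begin < 0 ? null : new SpanAnnotation(label, begin, begin + phrase.length());
    }

    public static void main(String[] args) {
        SpanAnnotation person = annotate(SENTENCE, "George Washington", "Person");
        // An annotation may also cover the whole document, for example a topic label.
        SpanAnnotation topic = new SpanAnnotation("American Presidents", 0, SENTENCE.length());
        System.out.println(person.label + " annotates \"" + person.coveredText(SENTENCE) + "\"");
        System.out.println(topic.label + " covers " + (topic.end - topic.begin) + " characters");
    }
}
```

For an image or an audio segment, the same idea would apply with a different kind of region (for instance a bounding box or a time interval) in place of the character offsets.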

What is the CAS? The CAS stands for Common Analysis Structure. It provides cooperating UIMA components with a common representation and mechanism for shared access to the artifact being analyzed (e.g., a document, audio file, video stream etc.) and the current analysis results.

What does the CAS contain? The CAS is a data structure for which UIMA provides multiple interfaces. It contains and provides the analysis algorithm or application developer with access to

  • the subject of analysis (the artifact being analyzed, like the document),
  • the analysis results or metadata (e.g., annotations, parse trees, relations, entities etc.),
  • indices to the analysis results, and
  • the type system (a schema for the analysis results).

A CAS can hold multiple versions of the artifact being analyzed (for instance, a raw HTML document and a detagged version, or an English version and a corresponding German version, or an audio sample and the text that corresponds to it). For each version there is a separate instance of the results indices.

Does the CAS only contain Annotations? No. The CAS contains the artifact being analyzed plus the analysis results. Analysis results are those metadata recorded by analysis engines in the CAS. The most common form of analysis result is the addition of an annotation. But an analysis engine may write any structure that conforms to the CAS’s type system into the CAS. These may not be annotations but may be other things, for example links between annotations and properties of objects associated with annotations.

Is the CAS just XML? No, in fact there are many possible representations of the CAS. If all of the analysis engines are running in the same process, an efficient, in-memory data object is used. If a CAS must be sent to an analysis engine on a remote machine, it can be done via an XML or a binary serialization of the CAS.

The UIMA framework provides serialization and de-serialization methods for a particular XML representation of the CAS named the XMI.

What is a Type System? Think of a type system as a schema or class model for the CAS. It defines the types of objects and their properties (or features) that may be instantiated in a CAS. A specific CAS conforms to a particular type system. UIMA components declare their input and output with respect to a type system.

Type Systems include the definitions of types, their properties, range types (these can restrict the value of properties to other types) and a single-inheritance hierarchy of types.
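
As a rough illustration of these ideas, the following Java sketch models types with named features, range types, and a single parent. All names here (TypeDef, TypeSystemDemo, and the example types and features) are hypothetical and chosen for illustration; this is not how the UIMA framework represents a type system.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model, NOT the UIMA API: a type with at most one parent type.
final class TypeDef {
    final String name;
    final TypeDef parent; // null for a root type (single inheritance)
    // Maps a feature name to the name of its range type.
    private final Map<String, String> features = new LinkedHashMap<>();

    TypeDef(String name, TypeDef parent) {
        this.name = name;
        this.parent = parent;
    }

    // Declares a feature and returns this type so declarations can be chained.
    TypeDef feature(String featureName, String rangeType) {
        features.put(featureName, rangeType);
        return this;
    }

    // Looks up a feature's range type, searching inherited features as well.
    String rangeOf(String featureName) {
        for (TypeDef t = this; t != null; t = t.parent) {
            String range = t.features.get(featureName);
            if (range != null) {
                return range;
            }
        }
        return null;
    }

    // True if this type is the other type or one of its descendants.
    boolean isSubtypeOf(TypeDef other) {
        for (TypeDef t = this; t != null; t = t.parent) {
            if (t == other) {
                return true;
            }
        }
        return false;
    }
}

final class TypeSystemDemo {
    static final TypeDef ANNOTATION = new TypeDef("Annotation", null)
        .feature("begin", "Integer")
        .feature("end", "Integer");
    static final TypeDef PERSON = new TypeDef("Person", ANNOTATION)
        .feature("fullName", "String");
    // The range type of "founder" restricts its value to another type, Person.
    static final TypeDef ORGANIZATION = new TypeDef("Organization", ANNOTATION)
        .feature("founder", "Person");

    public static void main(String[] args) {
        System.out.println("Person is an Annotation: " + PERSON.isSubtypeOf(ANNOTATION));
        System.out.println("Person.begin has range " + PERSON.rangeOf("begin"));
        System.out.println("Organization.founder has range " + ORGANIZATION.rangeOf("founder"));
    }
}
```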

What is a Sofa? Sofa stands for "Subject of Analysis". A CAS is associated with a single artifact being analyzed by a collection of UIMA analysis engines. But a single artifact may have multiple independent views, each of which may be analyzed separately by a different set of analysis engines. For example, a document may have different translations, each of which is associated with the original document but potentially analyzed by different engines. A CAS may have multiple Views, each containing a different Subject of Analysis corresponding to some version of the original artifact. This feature is ideal for multi-modal analysis, where, for example, one view of a video stream may be the video frames and the other the closed captions.
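
The relationship between a CAS, its views, and their Sofas can be pictured with the following Java sketch. It is a simplified stand-in, not the UIMA API: the classes ViewSketch, MultiViewCas and SofaDemo, the view names, and the use of plain strings for Sofas and results are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, NOT the UIMA API: one view holds one Sofa plus its own results.
final class ViewSketch {
    final String name;
    final String sofa; // the subject of analysis for this view
    final List<String> results = new ArrayList<>(); // results recorded for this view only

    ViewSketch(String name, String sofa) {
        this.name = name;
        this.sofa = sofa;
    }
}

// A CAS stand-in that keeps several named views of the same artifact.
final class MultiViewCas {
    private final Map<String, ViewSketch> views = new LinkedHashMap<>();

    ViewSketch createView(String name, String sofa) {
        if (views.containsKey(name)) {
            throw new IllegalArgumentException("view already exists: " + name);
        }
        ViewSketch view = new ViewSketch(name, sofa);
        views.put(name, view);
        return view;
    }

    ViewSketch getView(String name) {
        return views.get(name);
    }

    int viewCount() {
        return views.size();
    }
}

final class SofaDemo {
    // Builds a CAS stand-in with a raw HTML view and a detagged view of one document.
    static MultiViewCas build() {
        MultiViewCas cas = new MultiViewCas();
        cas.createView("raw", "<p>George Washington</p>");
        cas.createView("detagged", "George Washington");
        // Only the detagged view is analyzed here; its results stay separate.
        cas.getView("detagged").results.add("Person[0,17]");
        return cas;
    }

    public static void main(String[] args) {
        MultiViewCas cas = build();
        System.out.println("views: " + cas.viewCount());
        System.out.println("detagged results: " + cas.getView("detagged").results);
        System.out.println("raw results: " + cas.getView("raw").results);
    }
}
```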

What's the difference between an Annotator and an Analysis Engine? In the terminology of UIMA, an annotator is simply some code that analyzes documents and outputs annotations on the content of the documents. The UIMA framework takes the annotator, together with metadata describing such things as the input requirements and output types of the annotator, and produces an analysis engine.

Analysis Engines contain the framework-provided infrastructure that allows them to be easily combined with other analysis engines in different flows and according to different deployment options (collocated or as web services, for example).

Analysis Engines are the framework-generated objects that an Application interacts with. An Annotator is a user-written class that implements one of the supported Annotator interfaces.

Are UIMA analysis engines web services? They can be deployed as such. Deploying an analysis engine as a web service is one of the deployment options supported by the UIMA framework.

How do you scale a UIMA application? The UIMA framework allows components such as analysis engines and CAS Consumers to be easily deployed as services or in other containers and managed by systems middleware designed to scale. UIMA applications tend to scale out naturally across documents, allowing many documents to be analyzed in parallel.

A component in the UIMA framework called the CPM (Collection Processing Manager) has a host of features and configuration settings for scaling an application to increase its throughput and recoverability.
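
To illustrate what scaling out across documents means, here is a minimal Java sketch that analyzes several documents in parallel with a standard thread pool. It does not use the CPM or any UIMA API; the class ParallelAnalysisSketch and the token-counting "analysis" are hypothetical stand-ins for a real analysis engine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, NOT the UIMA CPM: per-document analysis on a thread pool.
final class ParallelAnalysisSketch {
    // Stand-in for an analysis engine: a function of one document with no shared state.
    static int countTokens(String document) {
        String trimmed = document.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Analyzes every document on a pool of worker threads, keeping input order.
    static List<Integer> analyzeAll(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String doc : documents) {
                Callable<Integer> task = () -> countTokens(doc);
                pending.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pending) {
                results.add(future.get()); // blocks until that document is done
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a b c", "", "one  two");
        System.out.println(analyzeAll(docs, 2));
    }
}
```

Because each document is handled independently, adding worker threads (or, in a real deployment, processes or machines) raises throughput without changing the analysis code. This only holds when the analysis keeps no state between documents.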

What does it mean to embed UIMA in systems middleware? An example of an embedding would be the deployment of a UIMA analysis engine as an Enterprise Java Bean inside an application server such as IBM WebSphere. Such an embedding allows the deployer to take advantage of the features and tools provided by WebSphere for achieving scalability, service management, recoverability etc. UIMA is independent of any particular systems middleware, so analysis engines could be deployed on other application servers as well.

Do Analysis Engines have to be "stateless"? This is a user-specifiable option. The XML metadata for the component includes an operationalProperties element which can specify whether multiple deployment is allowed. If true, then a particular instance of an Engine might not see all the CASes being processed. If false, then that component will see all of the CASes being processed. In this case, it can accumulate state information across all the CASes. Typically, Analysis Engines in the main analysis pipeline are marked multipleDeploymentAllowed = true. The CAS Consumer component, on the other hand, defaults to having this property set to false, and is typically associated with some resource like a database or search engine that aggregates analysis results across an entire collection.

Analysis Engine developers are encouraged not to maintain state between documents that would prevent their engine from working as advertised if operated in a parallelized environment.

Is engine meta-data compatible with web services and UDDI? All UIMA component implementations are associated with Component Descriptors, which represent metadata describing various properties of the component to support discovery, reuse, validation, automatic composition and development tooling. In principle, UIMA component descriptors are compatible with web services and UDDI. However, the UIMA framework currently uses its own XML representation for component metadata. It would not be difficult to convert between UIMA’s XML representation and the WSDL and UDDI standards.

How is the CPM different from a CPE? These name complementary aspects of collection processing. The CPM is the part of the UIMA framework that manages the execution of a workflow of UIMA components orchestrated to analyze a large collection of documents. The UIMA developer does not implement or describe a CPM. It is a piece of infrastructure code that handles CAS transport, instance management, batching, check-pointing, statistics collection and failure recovery in the execution of a collection processing workflow.

A Collection Processing Engine (CPE) is a component created by the framework from a specific CPE descriptor. A CPE descriptor refers to a series of UIMA components including a Collection Reader, CAS Initializer, Analysis Engine(s) and CAS Consumers. These components are organized in a workflow and define a collection analysis job or CPE. A CPE acquires documents from a source collection, initializes CASes with document content, performs document analysis and then produces collection-level results (e.g., search engine index, database etc.). The CPM is the execution engine for a CPE.
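
The workflow described above (read a document, initialize a CAS, run the analysis engines, hand the CAS to a consumer) can be sketched as a simple loop. The Java below is an illustration under stated assumptions, not the UIMA API: DocCas, EngineStep, ConsumerStep and CpeSketch are hypothetical names, and the loop omits everything the real CPM adds, such as batching, check-pointing and failure recovery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a CAS: the document text plus recorded results.
final class DocCas {
    final String text;
    final List<String> results = new ArrayList<>();

    DocCas(String text) {
        this.text = text;
    }
}

// Hypothetical stand-ins for an analysis engine and a CAS consumer.
interface EngineStep {
    void process(DocCas cas);
}

interface ConsumerStep {
    void consume(DocCas cas);
}

final class CpeSketch {
    // Runs the workflow over a collection and returns the number of documents processed.
    static int run(Iterable<String> collection, List<EngineStep> engines, ConsumerStep consumer) {
        int processed = 0;
        for (String document : collection) {      // collection reader
            DocCas cas = new DocCas(document);     // CAS initialization
            for (EngineStep engine : engines) {    // document-level analysis
                engine.process(cas);
            }
            consumer.consume(cas);                 // collection-level aggregation
            processed++;
        }
        return processed;
    }

    // Builds a tiny collection-level index: result label -> number of documents.
    static Map<String, Integer> demo() {
        List<String> docs = Arrays.asList(
            "George Washington was a president",
            "Washington is a state",
            "No match here");
        EngineStep mention = cas -> {
            if (cas.text.contains("Washington")) {
                cas.results.add("mentions:Washington");
            }
        };
        Map<String, Integer> index = new TreeMap<>();
        ConsumerStep indexer = cas -> {
            for (String result : cas.results) {
                index.merge(result, 1, Integer::sum);
            }
        };
        run(docs, Collections.singletonList(mention), indexer);
        return index;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that the consumer in this sketch accumulates state across the whole collection, while the engine step looks at one document at a time.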

What is Semantic Search and what is its relationship to UIMA? Semantic Search refers to a document search paradigm that allows users to search based not just on the keywords contained in the documents, but also on the semantics associated with the text by analysis engines. UIMA applications perform analysis on text documents and generate semantics in the form of annotations on regions of text. For example, a UIMA analysis engine may discover that the text "First Financial Bank" refers to an organization and annotate it as such. With traditional keyword search, the query "first" will return all documents that contain that word. "First" is a frequent and ambiguous term; it occurs a lot and can mean different things in different places. If the user is looking for organizations that contain the word "first" in their names, s/he will likely have to sift through lots of documents containing the word "first" used in different ways. Semantic Search exploits the results of analysis to allow more precise queries. For example, the semantic search query <organization> first </organization> will rank first documents that contain the word "first" as part of the name of an organization. The UIMA SDK documentation demonstrates how UIMA applications can be built using semantic search. It provides details about the XML Fragment Query language. This is the particular query language used by the semantic search engine that comes with the SDK.

Is an XML Fragment Query valid XML? Not necessarily. The XML Fragment Query syntax is used to formulate queries interpreted by the semantic search engine that ships with the UIMA SDK. This query language relies on basic XML syntax as an intuitive way to describe hierarchical patterns of annotations that may occur in a CAS. The language deviates from valid XML in order to support queries over "overlapping" or "cross-over" annotations and other features that affect the interpretation of the query by the query processor. For example, it admits notations in the query to indicate whether a keyword or an annotation is optional or required to match a document.

Does UIMA support modalities other than text? The UIMA architecture supports the development, discovery, composition and deployment of multi-modal analytics including text, audio and video. Applications that process text, speech and video have been developed using UIMA. This release of the SDK, however, does not include examples of these multi-modal applications.

It does, however, include documentation and programming examples for using the key feature required for building multi-modal applications. UIMA supports multiple subjects of analysis or Sofas. These allow multiple views of a single artifact to be associated with a CAS. For example, if an artifact is a video stream, one Sofa could be associated with the video frames and another with the closed-caption text. UIMA’s multiple Sofa feature is included and described in this release of the SDK.

How does UIMA compare to other similar work? A number of different frameworks for NLP have preceded UIMA. Two of them were developed at IBM Research and represent UIMA’s early roots. For details please refer to the UIMA article that appears in the IBM Systems Journal Vol. 43, No. 3 (http://www.research.ibm.com/journal/sj/433/ferrucci.html).

UIMA has advanced the state of the art along a number of dimensions, including: support for distributed deployments in different middleware environments, easy framework embedding in different software product platforms (key for commercial applications), broader architectural coverage with its collection processing architecture, support for multiple modalities, support for efficient integration across programming languages, support for a modern software engineering discipline calling out different roles in the use of UIMA to develop applications, and the extensive use of descriptive component metadata to support development tooling, component discovery and composition. (Please note that not all of these features are available in this release of the SDK.)

How does UIMA relate to IBM Products? UIMA analysis engines and annotators are already used within several IBM products, including IBM's new enterprise search offering, WebSphere Information Integrator OmniFind Edition (http://www.ibm.com/software/data/integration/search.html), and IBM's WebSphere Portal Server offering. All new analysis technology deployed into IBM products is based on the UIMA architecture.

Is UIMA Open Source? Yes. The UIMA SDK is freely available on the IBM alphaWorks site (http://www.alphaworks.ibm.com/tech/uima) and the source code for the UIMA framework is available on SourceForge (http://uima-framework.sourceforge.net).

What Java level and OS are required for the UIMA SDK? The UIMA SDK requires Java 1.4; it will not run on Java 1.3 or earlier levels. It has been tested with IBM Java SDK v1.4.2, which is included as part of the UIMA SDK. It has been tested on Windows 2000, Windows XP and Linux Intel 32-bit platforms. Other platforms and JDK implementations, including Java 1.5, may work, but have not been significantly tested.

Can I build my UIM application on top of UIMA? Yes. The UIMA SDK license does not restrict its usage to specific scenarios, and we are of course very interested in your feedback to help us make UIMA the right platform for building UIMA applications. UIMA is officially supported inside IBM's WebSphere Information Integrator OmniFind Edition product (http://www.ibm.com/developerworks/db2/zones/db2ii or http://www.ibm.com/software/data/integration/db2ii/editions_womnifind.html). The UIMA SDK on IBM's alphaWorks is supported on a "best can do" basis. If you are interested in a more formal support agreement, or would like to include UIMA in a commercial solution beyond using the officially supported product, please contact IBM for additional options.