XMI CAS Serialization Reference

This is the specification for the mapping of the UIMA CAS into the XMI (XML Metadata Interchange) format. XMI is an OMG standard for expressing object graphs in XML. For details on XMI see Grose et al., Mastering XMI: Java Programming with XMI, XML, and UML. John Wiley & Sons, Inc., 2002. The UIMA SDK provides support for XMI through the classes com.ibm.uima.cas.impl.XmiCasSerializer and com.ibm.uima.cas.impl.XmiCasDeserializer.

The outermost tag is <xmi:XMI>, which must include a version number and an XML namespace declaration for the "xmi" prefix:

<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI">
    <!-- CAS Contents here -->
</xmi:XMI>

XML namespaces (http://www.w3.org/TR/xml-names11/) are used throughout. The "xmi" namespace prefix is used to identify elements and attributes that are defined by the XMI specification. The XMI document will also define one namespace prefix for each CAS namespace, as described in the next section.

UIMA Feature Structures are mapped to XML elements. The name of the element is formed from the CAS type name, making use of XML namespaces as follows.

The CAS type namespace is converted to an XML namespace URI by the following rule: replace all dots with slashes, prepend http:///, and append .ecore.

This mapping was chosen because it is the default mapping used by the Eclipse Modeling Framework (EMF) to create namespace URIs from Java package names. For details on EMF and Ecore see Budinsky et al., Eclipse Modeling Framework 2.0. Addison-Wesley, 2006. The use of the http scheme is a common convention and does not imply any HTTP communication. The .ecore suffix is used because the recommended type system definition for a namespace is an Ecore model; see XMI and EMF Interoperability.

Consider the CAS type name "org.myproj.Foo". The CAS namespace ("org.myproj.") is converted to the XML namespace URI http:///org/myproj.ecore.
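The conversion rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the UIMA implementation; the function names cas_namespace_to_uri and split_type_name are made up for this example, and type names without any namespace are not handled.

```python
def cas_namespace_to_uri(cas_namespace):
    """Map a CAS type namespace to an XML namespace URI (hypothetical helper)."""
    # Drop a trailing dot so "org.myproj." and "org.myproj" behave the same,
    # then replace dots with slashes, prepend http:/// and append .ecore.
    path = cas_namespace.rstrip(".").replace(".", "/")
    return "http:///" + path + ".ecore"


def split_type_name(cas_type_name):
    """Split "org.myproj.Foo" into the pair ("org.myproj", "Foo")."""
    namespace, _, short_name = cas_type_name.rpartition(".")
    return namespace, short_name


namespace, short_name = split_type_name("org.myproj.Foo")
print(cas_namespace_to_uri(namespace))  # http:///org/myproj.ecore
print(short_name)                       # Foo
```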

The XML element name is then formed by joining the XML namespace prefix (which is an arbitrary token, but typically we use the last component of the CAS namespace) and the type name (excluding the namespace) with a colon.

So the example "org.myproj.Foo" FeatureStructure is written to XMI as:

<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
    xmlns:myproj="http:///org/myproj.ecore">
    ...
    <myproj:Foo xmi:id="1"/>
    ...
</xmi:XMI>

The xmi:id attribute is only required if this object will be referred to from elsewhere in the XMI document. If provided, the xmi:id must be unique within the XMI document.

All namespace prefixes (e.g. "myproj") in this example must be bound to URIs using the "xmlns..." attribute, as defined by the XML namespaces specification.
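As an illustration of why the prefix itself is arbitrary, the following sketch parses the example with Python's standard xml.etree.ElementTree module, which reports each element under its namespace URI rather than its prefix. The helper cas_type_name is hypothetical and assumes the URI was produced by the rule described above.

```python
import xml.etree.ElementTree as ET

XMI_NS = "http://www.omg.org/XMI"

XMI = """<xmi:XMI xmi:version="2.0"
    xmlns:xmi="http://www.omg.org/XMI"
    xmlns:myproj="http:///org/myproj.ecore">
  <myproj:Foo xmi:id="1"/>
</xmi:XMI>"""


def cas_type_name(element):
    """Recover the CAS type name from a namespace-qualified element."""
    # ElementTree reports tags in the form "{namespace-uri}LocalName".
    uri, _, local_name = element.tag[1:].partition("}")
    # Undo the URI rule: strip "http:///" and ".ecore", turn slashes into dots.
    # Assumes the URI actually follows that rule.
    namespace = uri[len("http:///"):-len(".ecore")].replace("/", ".")
    return namespace + "." + local_name


root = ET.fromstring(XMI)
foo = root[0]
print(cas_type_name(foo))           # org.myproj.Foo
print(foo.get("{%s}id" % XMI_NS))   # 1
```

Changing the prefix "myproj" to any other token bound to the same URI would give the same result.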

CAS features of primitive types (currently String, Integer, or Float, but others are possible) can be mapped either to XML attributes or XML elements. For example, a CAS FeatureStructure of type org.myproj.Foo, with features:

    begin = 14
    end = 19
    myFeature = "bar"

could be mapped to:

<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
    xmlns:myproj="http:///org/myproj.ecore">
    ...
    <myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/>
    ...
</xmi:XMI>

or equivalently:

<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
    xmlns:myproj="http:///org/myproj.ecore">
    ...
    <myproj:Foo xmi:id="1">
        <begin>14</begin>
        <end>19</end>
        <myFeature>bar</myFeature>
    </myproj:Foo>
    ...
</xmi:XMI>

The attribute serialization is preferred for compactness, but either representation is allowable. Mixing the two styles is allowed; some features can be represented as attributes and others as elements.
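A reader therefore has to accept both styles, possibly mixed on the same element. The sketch below shows one way to do that with Python's standard xml.etree.ElementTree; the helper primitive_features is hypothetical, returns all values as strings, and only covers primitive-valued features.

```python
import xml.etree.ElementTree as ET

AS_ATTRIBUTES = """<xmi:XMI xmi:version="2.0"
    xmlns:xmi="http://www.omg.org/XMI"
    xmlns:myproj="http:///org/myproj.ecore">
  <myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/>
</xmi:XMI>"""

AS_ELEMENTS = """<xmi:XMI xmi:version="2.0"
    xmlns:xmi="http://www.omg.org/XMI"
    xmlns:myproj="http:///org/myproj.ecore">
  <myproj:Foo xmi:id="1">
    <begin>14</begin>
    <end>19</end>
    <myFeature>bar</myFeature>
  </myproj:Foo>
</xmi:XMI>"""


def primitive_features(element):
    """Collect primitive feature values from attributes and child elements."""
    features = {}
    for name, value in element.attrib.items():
        # Namespaced attributes such as xmi:id appear as "{uri}id"; skip them.
        if not name.startswith("{"):
            features[name] = value
    for child in element:
        # Feature elements carry no namespace, so the tag is the feature name.
        features[child.tag] = child.text
    return features


print(primitive_features(ET.fromstring(AS_ATTRIBUTES)[0]))
print(primitive_features(ET.fromstring(AS_ELEMENTS)[0]))
```

Both calls yield the same dictionary, which is the sense in which the two serializations are equivalent.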

CAS features that are references to other feature structures (excluding arrays and lists, which are handled separately) are serialized as ID references.

If we add to the previous CAS example a feature structure of type org.myproj.Baz, with feature "myFoo" that is a reference to the Foo object, the serialization would be:

<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
    xmlns:myproj="http:///org/myproj.ecore">
    ...
    <myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/>
    <myproj:Baz xmi:id="2" myFoo="1"/>
    ...
</xmi:XMI>

As with primitive-valued features, it is permitted to use an element rather than an attribute. However, the syntax is slightly different:

<myproj:Baz xmi:id="2">
    <myFoo href="#1"/>
</myproj:Baz>

Note that in the attribute representation, a reference feature is indistinguishable from an integer-valued feature, so the meaning cannot be determined without prior knowledge of the type system. The element representation is unambiguous.
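The following sketch resolves a reference feature in either form using Python's standard xml.etree.ElementTree. It assumes the caller already knows from the type system that "myFoo" is a reference feature, which is exactly the prior knowledge the attribute form requires. The helper names and the second Baz object (xmi:id="3") are added for illustration only.

```python
import xml.etree.ElementTree as ET

XMI_ID = "{http://www.omg.org/XMI}id"

DOC = """<xmi:XMI xmi:version="2.0"
    xmlns:xmi="http://www.omg.org/XMI"
    xmlns:myproj="http:///org/myproj.ecore">
  <myproj:Foo xmi:id="1" begin="14" end="19" myFeature="bar"/>
  <myproj:Baz xmi:id="2" myFoo="1"/>
  <myproj:Baz xmi:id="3">
    <myFoo href="#1"/>
  </myproj:Baz>
</xmi:XMI>"""


def index_by_id(root):
    """Map each xmi:id to its top-level element."""
    return {el.get(XMI_ID): el for el in root if el.get(XMI_ID) is not None}


def resolve_reference(owner, feature, by_id):
    """Return the element that a reference feature points to, or None."""
    target_id = owner.get(feature)          # attribute form: myFoo="1"
    if target_id is None:
        child = owner.find(feature)         # element form: <myFoo href="#1"/>
        if child is not None and child.get("href", "").startswith("#"):
            target_id = child.get("href")[1:]
    return by_id.get(target_id)


root = ET.fromstring(DOC)
by_id = index_by_id(root)
print(resolve_reference(by_id["2"], "myFoo", by_id).get("myFeature"))  # bar
print(resolve_reference(by_id["3"], "myFoo", by_id).get("myFeature"))  # bar
```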

For a CAS feature whose range type is one of the CAS array or list types, the XMI serialization depends on the setting of the "multipleReferencesAllowed" attribute for that feature in the UIMA Type System Description (see Features).

An array or list with multipleReferencesAllowed = false (the default) is serialized as a "multi-valued" property in XMI. An array or list with multipleReferencesAllowed = true is serialized as a first-class object. Details are described below.

Arrays and Lists as Multi-Valued Properties

A multi-valued property is the most natural XMI representation of an array or list in most cases. Consider the example where the FeatureStructure of type org.myproj.Baz has a feature myIntArray whose value is the integer array {2,4,6}. This can be mapped to:

<myproj:Baz xmi:id="3" myIntArray="2 4 6"/>

or equivalently:

<myproj:Baz xmi:id="3">
    <myIntArray>2</myIntArray>
    <myIntArray>4</myIntArray>
    <myIntArray>6</myIntArray>
</myproj:Baz>

Note that String arrays whose elements contain embedded spaces MUST use the latter mapping.

FSArray or FSList features are serialized in a similar way. For example, an FSArray feature that contains references to the elements with xmi:id values "13" and "42" could be serialized as:

<myproj:Baz xmi:id="3" myFsArray="13 42"/>

or:

<myproj:Baz xmi:id="3">
    <myFsArray href="#13"/>
    <myFsArray href="#42"/>
</myproj:Baz>
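A minimal sketch of reading a multi-valued property in either form, using Python's standard xml.etree.ElementTree. The helper multi_valued and the feature name myStrArray are hypothetical; values are returned as strings, and reference values are returned as xmi:id strings without the leading "#".

```python
import xml.etree.ElementTree as ET

NS = 'xmlns:xmi="http://www.omg.org/XMI" xmlns:myproj="http:///org/myproj.ecore"'


def multi_valued(element, feature):
    """Return the values of a multi-valued property as a list of strings."""
    attr = element.get(feature)
    if attr is not None:
        # Attribute form: values are separated by whitespace.
        return attr.split()
    values = []
    for child in element.findall(feature):
        href = child.get("href")
        # Element form: either <f href="#id"/> or <f>value</f>.
        values.append(href[1:] if href is not None else (child.text or ""))
    return values


compact = ET.fromstring('<myproj:Baz %s xmi:id="3" myIntArray="2 4 6"/>' % NS)
expanded = ET.fromstring(
    '<myproj:Baz %s xmi:id="3">'
    "<myIntArray>2</myIntArray><myIntArray>4</myIntArray><myIntArray>6</myIntArray>"
    "</myproj:Baz>" % NS
)
strings = ET.fromstring(
    '<myproj:Baz %s xmi:id="3">'
    "<myStrArray>hello world</myStrArray><myStrArray>bye</myStrArray>"
    "</myproj:Baz>" % NS
)

print(multi_valued(compact, "myIntArray"))    # ['2', '4', '6']
print(multi_valued(expanded, "myIntArray"))   # ['2', '4', '6']
print(multi_valued(strings, "myStrArray"))    # ['hello world', 'bye']
```

The last case shows why strings with embedded spaces need the element form: written as an attribute, "hello world" would be split into two values.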

Arrays and Lists as First-Class Objects

The multi-valued-property representation described in the previous section does not allow multiple references to an array or list object. Therefore, it cannot be used for features that are defined to allow multiple references (i.e. features for which multipleReferencesAllowed = true in the Type System Description).

When multipleReferencesAllowed is set to true, array and list features are serialized as references, and the array or list objects are serialized as separate objects in the XMI. Consider again the example where the FeatureStructure of type org.myproj.Baz has a feature myIntArray whose value is the integer array {2,4,6}. If myIntArray is defined with multipleReferencesAllowed=true, the serialization will be as follows:

<myproj:Baz xmi:id="3" myIntArray="4"/>

or:

<myproj:Baz xmi:id="3">
    <myIntArray href="#4"/>
</myproj:Baz>

with the array object serialized as:

<cas:IntegerArray xmi:id="4" elements="2 4 6"/>

or:

<cas:IntegerArray xmi:id="4">
    <elements>2</elements>
    <elements>4</elements>
    <elements>6</elements>
</cas:IntegerArray>

Note that in this case, the XML element name is formed from the CAS type name (e.g. "uima.cas.IntegerArray") in the same way as for other FeatureStructures. The elements of the array are serialized either as a space-separated attribute named "elements" or as a series of child elements named "elements".
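The sketch below follows such a reference to a first-class array object with Python's standard xml.etree.ElementTree. The helper int_array is hypothetical and handles only the attribute form of the reference. The second Baz object (xmi:id="5") is added for illustration, to show two features sharing one array.

```python
import xml.etree.ElementTree as ET

XMI_ID = "{http://www.omg.org/XMI}id"

DOC = """<xmi:XMI xmi:version="2.0"
    xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore"
    xmlns:myproj="http:///org/myproj.ecore">
  <myproj:Baz xmi:id="3" myIntArray="4"/>
  <myproj:Baz xmi:id="5" myIntArray="4"/>
  <cas:IntegerArray xmi:id="4" elements="2 4 6"/>
</xmi:XMI>"""


def int_array(owner, feature, by_id):
    """Follow a reference feature to an IntegerArray and return its values."""
    array = by_id[owner.get(feature)]
    attr = array.get("elements")
    if attr is not None:
        # Space-separated attribute form.
        return [int(value) for value in attr.split()]
    # Child-element form: one <elements> child per array entry.
    return [int(child.text) for child in array.findall("elements")]


root = ET.fromstring(DOC)
by_id = {el.get(XMI_ID): el for el in root}
print(int_array(by_id["3"], "myIntArray", by_id))  # [2, 4, 6]
print(int_array(by_id["5"], "myIntArray", by_id))  # [2, 4, 6]
```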

List nodes are just standard FeatureStructures with "head" and "tail" features, and are serialized using the normal FeatureStructure serialization. For example, an IntegerList with the values 2, 4, and 6 would be serialized as the four objects:

<cas:NonEmptyIntegerList xmi:id="10" head="2" tail="11"/>
<cas:NonEmptyIntegerList xmi:id="11" head="4" tail="12"/>
<cas:NonEmptyIntegerList xmi:id="12" head="6" tail="13"/>
<cas:EmptyIntegerList xmi:id="13"/>

This representation of arrays and lists allows multiple references to an array or list. It also allows a feature with range type TOP to refer to an array or list. However, it is a very unnatural representation in XMI and does not support interoperability with other XMI-based systems, so we instead recommend using the multi-valued-property representation described in the previous section whenever it is possible.
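To make the cost of the first-class representation concrete, the following sketch rebuilds the integer list from its four node objects by following "tail" references, using Python's standard xml.etree.ElementTree. The helper integer_list is hypothetical; the cycle check is a precaution added for this example.

```python
import xml.etree.ElementTree as ET

XMI_ID = "{http://www.omg.org/XMI}id"
NON_EMPTY = "{http:///uima/cas.ecore}NonEmptyIntegerList"

DOC = """<xmi:XMI xmi:version="2.0"
    xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore">
  <cas:NonEmptyIntegerList xmi:id="10" head="2" tail="11"/>
  <cas:NonEmptyIntegerList xmi:id="11" head="4" tail="12"/>
  <cas:NonEmptyIntegerList xmi:id="12" head="6" tail="13"/>
  <cas:EmptyIntegerList xmi:id="13"/>
</xmi:XMI>"""


def integer_list(first_id, by_id):
    """Walk head/tail nodes starting at first_id and collect the values."""
    values = []
    seen = set()
    node = by_id[first_id]
    # Stop at the first node that is not a NonEmptyIntegerList.
    while node.tag == NON_EMPTY:
        node_id = node.get(XMI_ID)
        if node_id in seen:
            raise ValueError("list contains a cycle at xmi:id " + node_id)
        seen.add(node_id)
        values.append(int(node.get("head")))
        node = by_id[node.get("tail")]
    return values


root = ET.fromstring(DOC)
by_id = {el.get(XMI_ID): el for el in root}
print(integer_list("10", by_id))  # [2, 4, 6]
```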

In UIMA, an element of an FSArray or FSList may be null. In XMI, multi-valued properties do not permit null values. As a workaround, we use a dummy instance of the special type cas:NULL, which has xmi:id 0. In the following example, the "myFsArray" feature refers to an FSArray whose second element is null:

<cas:NULL xmi:id="0"/>
<myproj:Baz xmi:id="3">
    <myFsArray href="#13"/>
    <myFsArray href="#0"/>
    <myFsArray href="#42"/>
</myproj:Baz>
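A reader can map the dummy object back to a null value when it resolves the references. The sketch below does this with Python's standard xml.etree.ElementTree. The helper fs_array is hypothetical, and the two myproj:Foo objects with ids 13 and 42 are assumed here only so that the references have targets.

```python
import xml.etree.ElementTree as ET

XMI_ID = "{http://www.omg.org/XMI}id"
NULL_TAG = "{http:///uima/cas.ecore}NULL"

DOC = """<xmi:XMI xmi:version="2.0"
    xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore"
    xmlns:myproj="http:///org/myproj.ecore">
  <cas:NULL xmi:id="0"/>
  <myproj:Foo xmi:id="13"/>
  <myproj:Foo xmi:id="42"/>
  <myproj:Baz xmi:id="3">
    <myFsArray href="#13"/>
    <myFsArray href="#0"/>
    <myFsArray href="#42"/>
  </myproj:Baz>
</xmi:XMI>"""


def fs_array(owner, feature, by_id):
    """Resolve an element-form FSArray, mapping cas:NULL to None."""
    members = []
    for child in owner.findall(feature):
        target = by_id[child.get("href")[1:]]   # strip the leading "#"
        members.append(None if target.tag == NULL_TAG else target)
    return members


root = ET.fromstring(DOC)
by_id = {el.get(XMI_ID): el for el in root}
members = fs_array(by_id["3"], "myFsArray", by_id)
print([m.get(XMI_ID) if m is not None else None for m in members])
# ['13', None, '42']
```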

A UIMA CAS contains one or more subjects of analysis (Sofas). These are serialized no differently from any other feature structure. For example:

<?xml version="1.0" encoding="ASCII"?>
<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore">
    <cas:Sofa xmi:id="1" sofaNum="1"
        text="the quick brown fox jumps over the lazy dog."/>
</xmi:XMI>

Each Sofa defines a separate View. Feature Structures in the CAS can be members of one or more views. (A Feature Structure that is a member of a view is indexed in its IndexRepository, but that is an implementation detail.)

In the XMI serialization, views are represented as first-class objects. Each View has an (optional) "sofa" feature, which references a Sofa, and a multi-valued reference to the members of the View. For example:

<cas:View sofa="1" members="3 7 21 39 61"/>

Here the integers 3, 7, 21, 39, and 61 refer to the xmi:id fields of the objects that are members of this view.

If the sofa feature is omitted, then this is interpreted as the "base" view, whose members pertain to the artifact as a whole rather than any individual Sofa.
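A short sketch of reading view membership with Python's standard xml.etree.ElementTree follows. The helper views is hypothetical and handles only the attribute form of "members". The second View, which omits the sofa feature, is added for illustration and is stored under the key None to stand for the base view.

```python
import xml.etree.ElementTree as ET

VIEW_TAG = "{http:///uima/cas.ecore}View"

DOC = """<xmi:XMI xmi:version="2.0"
    xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore">
  <cas:Sofa xmi:id="1" sofaNum="1"
      text="the quick brown fox jumps over the lazy dog."/>
  <cas:View sofa="1" members="3 7 21 39 61"/>
  <cas:View members="3 45"/>
</xmi:XMI>"""


def views(root):
    """Map each view's sofa id (None for the base view) to its member ids."""
    result = {}
    for view in root.findall(VIEW_TAG):
        # A missing "sofa" attribute yields None, which marks the base view.
        result[view.get("sofa")] = view.get("members", "").split()
    return result


root = ET.fromstring(DOC)
for sofa_id, members in views(root).items():
    print(sofa_id, members)
```

Note that the object with xmi:id 3 appears in both views, which reflects the rule that a feature structure can be a member of more than one view.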

If the CAS Type System has been saved to an Ecore file (which is the subject of a different spec), it is possible to store a link from an XMI document to that Ecore type system. This is done using an xsi:schemaLocation attribute on the root XMI element.

The xsi:schemaLocation attribute is a space-separated list that represents a mapping from namespace URI (e.g. http:///org/myproj.ecore) to the physical URI of the .ecore file containing the type system for that namespace. For example:

xsi:schemaLocation="http:///org/myproj.ecore file:/c:/typesystems/myproj.ecore"

would indicate that the definition for the org.myproj CAS types is contained in the file c:/typesystems/myproj.ecore. You can specify a different mapping for each of your CAS namespaces by adding further URI pairs to the space-separated list. For details see Budinsky et al., Eclipse Modeling Framework.
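The pairing of namespace URIs with file locations can be sketched as follows, using Python's standard xml.etree.ElementTree. The helper schema_locations is hypothetical. The example binds the "xsi" prefix to the standard XML Schema instance namespace, http://www.w3.org/2001/XMLSchema-instance, which is assumed here because the text above does not show the declaration.

```python
import xml.etree.ElementTree as ET

XSI_SCHEMA_LOCATION = "{http://www.w3.org/2001/XMLSchema-instance}schemaLocation"

DOC = """<xmi:XMI xmi:version="2.0"
    xmlns:xmi="http://www.omg.org/XMI"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:myproj="http:///org/myproj.ecore"
    xsi:schemaLocation="http:///org/myproj.ecore file:/c:/typesystems/myproj.ecore"/>"""


def schema_locations(root):
    """Return a dict from namespace URI to the location of its .ecore file."""
    tokens = root.get(XSI_SCHEMA_LOCATION, "").split()
    if len(tokens) % 2 != 0:
        raise ValueError("xsi:schemaLocation must hold namespace/location pairs")
    # Tokens alternate: namespace URI, location, namespace URI, location, ...
    return dict(zip(tokens[0::2], tokens[1::2]))


root = ET.fromstring(DOC)
print(schema_locations(root))
# {'http:///org/myproj.ecore': 'file:/c:/typesystems/myproj.ecore'}
```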