Component Descriptor Reference

This chapter is the reference guide for the UIMA SDK's Component Descriptor XML schema. A Component Descriptor (also sometimes called a Resource Specifier in the code) is an XML file that either (a) completely describes a component, including all information needed to construct the component and interact with it, or (b) specifies how to connect to and interact with an existing component that has been published as a remote service. Component (also called Resource) is a general term for modules produced by UIMA developers and used by UIMA applications. The types of Components are: Analysis Engines, Collection Readers, CAS Initializers, CAS Consumers, and Collection Processing Engines. However, Collection Processing Engine Descriptors are significantly different in format and are covered in a separate chapter, UIMA Collection Processing Engine Descriptor Reference.

Section 23.1 describes the notation used in this chapter.

Section 23.2 describes the UIMA SDK’s import syntax, which lets XML descriptors import information from other XML files so that several descriptors can share it.

Section 23.4 describes the XML format for Analysis Engine Descriptors. These are descriptors that completely describe Analysis Engines, including all information needed to construct and interact with them.

Section 23.6 describes the XML format for Collection Processing Component Descriptors. This includes Collection Reader, CAS Initializer, and CAS Consumer Descriptors.

Section 23.7 describes the XML format for Service Client Descriptors, which specify how to connect to and interact with resources deployed as remote services.

This chapter uses an informal notation to specify the syntax of Component Descriptors. The formal syntax is defined by an XML schema definition, which is contained in two files – resourceSpecifierSchema.xsd and TaeSpecifierSchema.xsd, both of which are in the uima_core.jar file.

The notation used in this chapter is:

An ellipsis (...) inside an element body indicates that the substructure of that element has been omitted (to be described in another section of this chapter). An example of this would be:

<analysisEngineMetaData>
  ...
</analysisEngineMetaData>

An ellipsis immediately after an element indicates that the element type may be repeated arbitrarily many times. For example:

<parameter>[String]</parameter> <parameter>[String]</parameter> ...

indicates that there may be arbitrarily many parameter elements in this context.

Bracketed expressions (e.g. [String]) indicate the type of value that may be used at that location.
A vertical bar, as in true|false, indicates alternatives. This can be applied to literal values, bracketed type names, and elements.
Which elements are optional and which are required is specified in prose, not in the syntax definition.

The UIMA SDK defines a particular syntax for XML descriptors to import information from other XML files. When one of the following appears in an XML descriptor:

<import location="[URL]" />

or

<import name="[Name]" />

it indicates that information from a separate XML file is being imported. Imports are allowed only in certain places in a descriptor; the remainder of this chapter indicates where.

If an import specifies a location attribute, the value of that attribute specifies the URL at which the XML file to import will be found. This can be a relative URL, which will be resolved relative to the descriptor containing the import element, or an absolute URL. Relative URLs can be written without a protocol/scheme (e.g., "file:"), and without a host machine name. In this case the relative URL might look something like com/ibm/myproj/MyTypeSystem.xml.

An absolute URL is written with one of the following prefixes, followed by a path such as com/ibm/myproj/MyTypeSystem.xml:

file:/ << has no network address
file:/// << has an empty network address
file://some.network.address/

For more information about URLs, please read the javadoc information for the Java class "URL".

If an import specifies a name attribute, the value of that attribute should take the form of a Java-style dotted name (e.g. com.ibm.myproj.MyTypeSystem). An .xml file with this name will be searched for in the classpath or datapath (described below). As in Java, the dots in the name will be converted to file path separators. So an import specifying the example name in this paragraph will result in a search for com/ibm/myproj/MyTypeSystem.xml in the classpath or datapath.
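As a brief illustration of the two import forms, the fragment below references the example file from this section by location (first as a relative URL, then as an absolute one) and by name. In practice only one of the three would be used. The enclosing <imports> element is assumed here to be one of the places where imports are allowed.

```xml
<imports>
  <!-- Relative URL, resolved against the descriptor that contains this import -->
  <import location="com/ibm/myproj/MyTypeSystem.xml"/>

  <!-- Absolute URL with an empty network address -->
  <import location="file:///com/ibm/myproj/MyTypeSystem.xml"/>

  <!-- Import by name: com/ibm/myproj/MyTypeSystem.xml is searched for
       in the classpath or datapath -->
  <import name="com.ibm.myproj.MyTypeSystem"/>
</imports>
```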

The datapath works similarly to the classpath but can be set programmatically through the resource manager API. Application developers can specify a datapath during initialization, using the following code:

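ResourceManager resMgr = UIMAFramework.newDefaultResourceManager();
// Set the datapath before producing the engine, so that imports by name can be resolved against it.
resMgr.setDataPath(yourPathString);
AnalysisEngine ae = UIMAFramework.produceAE(desc, resMgr, null);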

The default datapath for the entire JVM can be set via the uima.datapath Java system property, but this feature should only be used for standalone applications that don't need to run in the same JVM as other code that may need a different datapath.

The UIMA SDK also supports XInclude, a W3C Recommendation, to include XML files within other XML files. However, it is recommended that the import syntax be used instead, as it is more flexible and better supports tool developers.

UIMA tools for editing XML descriptors do not support the use of xi:include because they cannot correctly determine what parts of a descriptor are updatable, and what parts are included from other files. They do support the use of <import>.

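To use XInclude, you must first declare the XInclude namespace in your document’s root element, e.g.:

<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier" xmlns:xi="http://www.w3.org/2001/XInclude">

Then, you can include a file using the syntax:

<xi:include href="[URL]"/>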

where [URL] can be any relative or absolute URL referring to another XML document. The referred-to document must be a valid XML document, meaning that it must consist of exactly one root element and must define all of the namespace prefixes that it uses. The default namespace (generally http://uima.apache.org/resourceSpecifier) will be inherited from the parent document. When UIMA parses the XML document, it will automatically replace the <xi:include> element with the entire XML document referred to by the href. For more information on XInclude see http://www.w3.org/TR/xinclude/.

A Type System Descriptor is used to define the types and features that can be represented in the CAS. A Type System Descriptor can be imported into an Analysis Engine or Collection Processing Component Descriptor.

The basic structure of a Type System Descriptor is as follows:

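<typeSystemDescription>
  <name> [String] </name>
  <description>[String]</description>
  <version>[String]</version>
  <vendor>[String]</vendor>

  <imports>
    <import ...>
    ...
  </imports>

  <types>
    <typeDescription>
      ...
    </typeDescription>
    ...
  </types>
</typeSystemDescription>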

All of the subelements are optional.

Imports

The imports section allows this descriptor to import types from other type system descriptors. The import syntax is described in section 23.2 of this chapter. A type system may import any number of other type systems and then define additional types which refer to imported types. Circular imports are allowed.
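The following sketch shows a type system that imports another one and then defines a type whose supertype comes from the imported file. All names here (DerivedTypeSystem, org.myorg.BaseTypeSystem, org.myorg.NamedEntity, org.myorg.PersonName) are hypothetical, and it is assumed that the imported type system defines org.myorg.NamedEntity.

```xml
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>DerivedTypeSystem</name>
  <imports>
    <!-- Hypothetical type system, looked up by name in the classpath or datapath -->
    <import name="org.myorg.BaseTypeSystem"/>
  </imports>
  <types>
    <typeDescription>
      <name>org.myorg.PersonName</name>
      <description>A person name; refines a type from the imported type system.</description>
      <!-- Assumed to be defined in org/myorg/BaseTypeSystem.xml -->
      <supertypeName>org.myorg.NamedEntity</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>
```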

Types

The types element contains zero or more typeDescription elements. Each typeDescription has the form:

<typeDescription>
  <name>[TypeName]</name>
  <description>[String]</description>
  <supertypeName>[TypeName]</supertypeName>
  <features>
    ...
  </features>
</typeDescription>

The name element contains the name of the type. A [TypeName] is a dot-separated list of names, where each name consists of a letter followed by any number of letters, digits, or underscores. TypeNames are case sensitive. Letter and digit are as defined by Java; therefore, any Unicode letter or digit may be used (subject to the character encoding defined by the descriptor file's XML header). The name following the final dot is considered to be the "short name" of the type; the preceding portion is the namespace (analogous to the package.class syntax used in Java). Namespaces beginning with uima are reserved and should not be used. Examples of valid type names are:

test.TokenAnnotation
org.myorg.tae.TokenAnnotation
com.my_company.proj123.TokenAnnotation

These would all be considered distinct types since they have different namespaces. Best practice here is to follow the normal Java naming conventions of having namespaces be all lowercase, with the short type names having an initial capital, but this is not mandated, so ABC.mYtyPE is an allowed type name. While type names without namespaces (e.g. TokenAnnotation alone) are allowed, the JCas does not support them and so their use is strongly discouraged.

The description element contains a textual description of the type. The supertypeName element contains the name of the type from which it inherits (this can be the name of another user-defined type, or any built-in type that may be subclassed, such as "uima.tcas.Annotation" for a new annotation type or "uima.cas.TOP" for a new type that is not an annotation). All three of these elements are required.

Features

The features element of a typeDescription is required only if the type we are specifying introduces new features. If the features element is present, it contains zero or more featureDescription elements, each of which has the form:

<featureDescription>
  <name>[Name]</name>
  <description>[String]</description>
  <rangeTypeName>[Name]</rangeTypeName>
  <elementType>[Name]</elementType>
  <multipleReferencesAllowed>true|false</multipleReferencesAllowed>
</featureDescription>

A feature’s name follows the same rules as a type short name – a letter followed by any number of letters, digits, or underscores. Feature names are case sensitive.

The feature’s rangeTypeName specifies the type of value that the feature can take. This may be the name of any type defined in your type system, or one of the predefined types. All of the predefined types have names that are prefixed with uima.cas or uima.tcas, for example:

uima.cas.TOP
uima.cas.String
uima.cas.Boolean
uima.cas.Byte
uima.cas.Short
uima.cas.Long
uima.cas.Float
uima.cas.Double
uima.cas.FSArray
uima.cas.StringArray
uima.cas.BooleanArray
uima.cas.ByteArray
uima.cas.ShortArray
uima.cas.IntegerArray
uima.cas.LongArray
uima.cas.FloatArray
uima.cas.DoubleArray
uima.cas.FSList
uima.cas.StringList
uima.cas.IntegerList
uima.cas.FloatList
uima.tcas.Annotation

For a complete list of predefined types, see the CAS API documentation.

The elementType of a feature is optional, and applies only when the rangeTypeName is uima.cas.FSArray or uima.cas.FSList. The elementType specifies what type of value can be assigned as an element of the array or list. This must be the name of a non-primitive type. If omitted, it defaults to uima.cas.TOP, meaning that any FeatureStructure can be assigned as an element of the array or list. Note: depending on the CAS Interface that you use in your code, this constraint may or may not be enforced.

The multipleReferencesAllowed element is optional, and applies only when the rangeTypeName is an array or list type (it applies to arrays and lists of primitive as well as non-primitive types). Setting this to false (the default) indicates that this feature has exclusive ownership of the array or list, so changes to the array or list are localized. Setting this to true indicates that the array or list may be shared, so changes to it may affect other objects in the CAS. Note: there is currently no guarantee that the framework will enforce this restriction. However, this setting may affect how the CAS is serialized.
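To tie the feature elements together, here is a sketch of a complete typeDescription. The type org.myorg.tae.Sentence and its features tokens and confidence are hypothetical; org.myorg.tae.TokenAnnotation is assumed to be defined elsewhere in the same type system.

```xml
<typeDescription>
  <name>org.myorg.tae.Sentence</name>
  <description>A sentence together with its tokens.</description>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>tokens</name>
      <description>The tokens that make up this sentence.</description>
      <rangeTypeName>uima.cas.FSArray</rangeTypeName>
      <!-- Restricts the array elements; defaults to uima.cas.TOP if omitted -->
      <elementType>org.myorg.tae.TokenAnnotation</elementType>
      <!-- false (the default): this feature exclusively owns the array -->
      <multipleReferencesAllowed>false</multipleReferencesAllowed>
    </featureDescription>
    <featureDescription>
      <name>confidence</name>
      <description>Confidence of the sentence boundary detection.</description>
      <!-- Primitive range type, so elementType does not apply -->
      <rangeTypeName>uima.cas.Float</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>
```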

String Subtypes

There is one other special type that you can declare – a subset of the String type that specifies a restricted set of allowed values. This is useful for features that can have only certain String values, such as parts of speech. Here is an example of how to declare such a type:

<typeDescription>
  <name>PartOfSpeech</name>
  <description>A part of speech.</description>
  <supertypeName>uima.cas.String</supertypeName>
  <allowedValues>
    <value>
      <string>NN</string>
      <description>Noun, singular or mass.</description>
    </value>
    <value>
      <string>NNS</string>
      <description>Noun, plural.</description>
    </value>
    <value>
      <string>VB</string>
      <description>Verb, base form.</description>
    </value>
    ...
  </allowedValues>
</typeDescription>
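A String subtype declared this way would typically be used as the range type of a feature. The sketch below assumes a hypothetical token type with a partOfSpeech feature whose values are restricted to those allowed by the PartOfSpeech type above.

```xml
<typeDescription>
  <name>org.myorg.tae.TokenAnnotation</name>
  <description>A token.</description>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>partOfSpeech</name>
      <description>Part of speech of this token.</description>
      <!-- Refers to the String subtype by its type name -->
      <rangeTypeName>PartOfSpeech</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>
```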

Analysis Engine (AE) descriptors completely describe Analysis Engines. There are two basic types of Analysis Engines – Primitive and Aggregate. A Primitive Analysis Engine is a container for a single annotator, whereas an Aggregate Analysis Engine is composed of a collection of other Analysis Engines. (For more information on this and other terminology, see Chapter 2, UIMA Conceptual Overview.)

Both Primitive and Aggregate Analysis Engines have descriptors, and the two types of descriptors have some similarities and some differences. Primitive Analysis Engine descriptors are discussed first, in Section 23.4.1. Section 23.4.2 then describes how Aggregate Analysis Engine descriptors are different.

Primitive Analysis Engine Descriptors

Basic Structure

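<?xml version="1.0" encoding="UTF-8" ?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>com.ibm.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName> [String] </annotatorImplementationName>

  <analysisEngineMetaData>
    ...
  </analysisEngineMetaData>

  <externalResourceDependencies>
    ...
  </externalResourceDependencies>

  <resourceManagerConfiguration>
    ...
  </resourceManagerConfiguration>
</analysisEngineDescription>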

The document begins with a standard XML header. The recommended root tag is <analysisEngineDescription>, although <taeDescription> is also allowed for backwards compatibility.

Within the root element we declare that we are using the XML namespace http://uima.apache.org/resourceSpecifier. It is required that this namespace be used; otherwise, the descriptor will not be able to be validated for errors.

The first subelement, <frameworkImplementation>, currently must have the value com.ibm.uima.java, or com.ibm.uima.cpp. In future versions, there may be other framework implementations, or perhaps implementations produced by other vendors.

The second subelement, <primitive>, contains the Boolean value true, indicating that this XML document describes a Primitive Analysis Engine.

The next subelement, <annotatorImplementationName> is how the UIMA framework determines which annotator class to use. This should contain a fully-qualified Java class name for Java implementations, or the name of a .dll or .so file for C++ implementations.

The <analysisEngineMetaData> object contains descriptive information about the analysis engine and what it does. It is described in the section Analysis Engine Metadata.

The <externalResourceDependencies> and <resourceManagerConfiguration> elements declare the external resource files that the analysis engine relies upon. They are optional and are described in the section External Resource Dependencies and Resource Manager Configuration.
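As a rough sketch of how these pieces fit together, a small primitive descriptor might look like the following. The annotator class, metadata values, type name, and type system file are all hypothetical; the frameworkImplementation value is the one given in the text above. The optional resource elements are omitted.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>com.ibm.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <!-- Hypothetical fully-qualified Java class name of the annotator -->
  <annotatorImplementationName>org.myorg.tae.TokenAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>Token Annotator</name>
    <description>Creates token annotations.</description>
    <version>1.0</version>
    <vendor>My Org</vendor>
    <typeSystemDescription>
      <imports>
        <!-- Hypothetical shared type system that defines the token type -->
        <import location="MySharedTypeSystem.xml"/>
      </imports>
    </typeSystemDescription>
    <capabilities>
      <capability>
        <!-- inputs is required but may be empty -->
        <inputs/>
        <outputs>
          <type allAnnotatorFeatures="true">org.myorg.tae.TokenAnnotation</type>
        </outputs>
      </capability>
    </capabilities>
  </analysisEngineMetaData>
</analysisEngineDescription>
```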

Analysis Engine Metadata

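<analysisEngineMetaData>
  <name> [String] </name>
  <description>[String]</description>
  <version>[String]</version>
  <vendor>[String]</vendor>

  <configurationParameters> ... </configurationParameters>
  <configurationParameterSettings> ... </configurationParameterSettings>
  <typeSystemDescription> ... </typeSystemDescription>
  <typePriorities> ... </typePriorities>
  <fsIndexCollection> ... </fsIndexCollection>
  <capabilities> ... </capabilities>
  <operationalProperties> ... </operationalProperties>
</analysisEngineMetaData>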

The analysisEngineMetaData element contains four simple string fields – name, description, version, and vendor. Only the name field is required, but providing values for the other fields is recommended. The name field is just a descriptive name meant to be read by users; it does not need to be unique across all Analysis Engines.

The other sub-elements – configurationParameters, configurationParameterSettings, typeSystemDescription, typePriorities, fsIndexCollection, capabilities, and operationalProperties – are described in the following sections. The only one of these that is required is capabilities; the others are all technically optional but generally necessary for an analysis engine of any complexity.

Configuration Parameter Declaration

Configuration Parameters are made available to annotator implementations and applications by the following interfaces: AnnotatorContext (passed as an argument to the initialize() method of an annotator), ConfigurableResource (every Analysis Engine implements this interface), and the UimaContext (you can get this from any resource, including Analysis Engines, using the method getUimaContext()).

Use AnnotatorContext within annotators and UimaContext outside of annotators (for instance, in CasConsumers, or the containing application) to access configuration parameters.

Configuration parameters are set from the corresponding elements in the XML descriptor for the application. If you need to programmatically change parameter settings within an application, you can use methods in ConfigurableResource; if you do this, you need to call reconfigure() afterwards to have the UIMA framework notify all the contained analysis components that the parameter configuration has changed (the analysis engine's reinitialize() methods will be called). Note that in the current implementation, only integrated deployment components have configuration parameters passed to them; remote components obtain their parameters from their remote startup environment. This will likely change in the future.

There are two ways to specify the <configurationParameters> section – as a list of configuration parameters or a list of groups. A list of parameters, which are not part of any group, looks like this:

<configurationParameters>
  <configurationParameter>
    <name>[String]</name>
    <description>[String]</description>
    <type>String|Integer|Float|Boolean</type>
    <multiValued>true|false</multiValued>
    <mandatory>true|false</mandatory>
    <overrides>
      <parameter>[String]</parameter>
      <parameter>[String]</parameter>
      ...
    </overrides>
  </configurationParameter>
  <configurationParameter>
    ...
  </configurationParameter>
  ...
</configurationParameters>

For each configuration parameter, the following are specified:

name – the name by which the annotator code refers to the parameter (required). The name is composed of normal Java identifier characters. All parameters declared in an analysis engine descriptor must have distinct names.
description – a natural language description of the intent of the parameter (optional).
type – the data type of the parameter's value; must be one of String, Integer, Float, or Boolean (required).
multiValued – true if the parameter can take multiple values (an array), false if the parameter takes only a single value (optional, defaults to false).
mandatory – true if a value must be provided for the parameter (optional, defaults to false).
overrides – used only in aggregate Analysis Engines, but included here for completeness (optional). See Configuration Parameter Overrides for a discussion of configuration parameter overriding in aggregate Analysis Engines.
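For instance, a declaration of three ungrouped parameters might be sketched as follows. The parameter names are hypothetical; the third one relies on the documented defaults for multiValued and mandatory.

```xml
<configurationParameters>
  <configurationParameter>
    <name>DictionaryFile</name>
    <description>Location of the dictionary file.</description>
    <type>String</type>
    <multiValued>false</multiValued>
    <!-- A value must be supplied for this parameter -->
    <mandatory>true</mandatory>
  </configurationParameter>
  <configurationParameter>
    <name>StopWords</name>
    <description>Words that the annotator should ignore.</description>
    <type>String</type>
    <!-- The value is an array of strings -->
    <multiValued>true</multiValued>
    <mandatory>false</mandatory>
  </configurationParameter>
  <configurationParameter>
    <!-- description, multiValued and mandatory omitted:
         single-valued and not mandatory by default -->
    <name>MaxTokenLength</name>
    <type>Integer</type>
  </configurationParameter>
</configurationParameters>
```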

A list of groups looks like this:

<configurationParameters defaultGroup="[String]" searchStrategy="none|default_fallback|language_fallback">
  <commonParameters>
    [zero or more parameters]
  </commonParameters>

  <configurationGroup names="name1 name2 name3 ...">
    [zero or more parameters]
  </configurationGroup>

  <configurationGroup names="name4 name5 ...">
    [zero or more parameters]
  </configurationGroup>

  ...
</configurationParameters>

Both the <commonParameters> and <configurationGroup> elements contain zero or more <configurationParameter> elements, with the same syntax described above.

The <commonParameters> element declares parameters that exist in all groups. Each <configurationGroup> element has a names attribute, which contains a list of group names separated by whitespace (space or tab characters). Names consist of any number of non-whitespace characters; however the Component Description Editor tool restricts this to be normal Java identifiers, including the period (.) and the dash (-). One configuration group will be created for each name, and all of the groups will contain the same set of parameters.

The defaultGroup attribute specifies the name of the group to be used in the case where an annotator does a lookup for a configuration parameter without specifying a group name. It may also be used as a fallback if the annotator specifies a group that does not exist – see below.

The searchStrategy attribute determines the action to be taken when the context is queried for the value of a parameter belonging to a particular configuration group, if that group does not exist or does not contain a value for the requested parameter. There are currently three possible values:

none – there is no fallback; return null if there is no value in the exact group specified by the user.
default_fallback – if there is no value found in the specified group, look in the default group (as defined by the defaultGroup attribute).
language_fallback – this setting allows for a specific use of configuration parameter groups where the group names correspond to ISO language and country codes (for an example, see below). The fallback sequence is: <lang>_<country>_<region> -> <lang>_<country> -> <lang> -> <default>.

Example

<configurationParameters defaultGroup="en" searchStrategy="language_fallback">
  <commonParameters>
    <configurationParameter>
      <name>DictionaryFile</name>
      <description>Location of dictionary for this language</description>
      <type>String</type>
      <multiValued>false</multiValued>
      <mandatory>false</mandatory>
    </configurationParameter>
  </commonParameters>
  <configurationGroup names="en de en-US"/>
  <configurationGroup names="zh">
    <configurationParameter>
      <name>DBC_Strategy</name>
      <description>Strategy for dealing with double-byte characters.</description>
      <type>String</type>
      <multiValued>false</multiValued>
      <mandatory>false</mandatory>
    </configurationParameter>
  </configurationGroup>
</configurationParameters>

In this example, we are declaring a DictionaryFile parameter that can have a different value for each of the languages that our TAE supports – English (general), German, U.S. English, and Chinese. For Chinese only, we also declare a DBC_Strategy parameter.

We are using the language_fallback search strategy, so if an annotator requests the dictionary file for the en-GB (British English) group, we will fall back to the more general en group.

Since we have defined en as the default group, this value will be returned if the context is queried for the DictionaryFile parameter without specifying any group name, or if a nonexistent group name is specified.

Configuration Parameter Settings

If no configuration groups were declared, the <configurationParameterSettings> element looks like this:

<configurationParameterSettings>
  <nameValuePair>
    <name>[String]</name>
    <value>
      <string>[String]</string> |
      <integer>[Integer]</integer> |
      <float>[Float]</float> |
      <boolean>true|false</boolean> |
      <array> ... </array>
    </value>
  </nameValuePair>
  <nameValuePair>
    ...
  </nameValuePair>
  ...
</configurationParameterSettings>

There are zero or more nameValuePair elements. Each nameValuePair contains a name (which refers to one of the configuration parameters) and a value for that parameter.

The value element contains an element that matches the type of the parameter. For single-valued parameters, this is either <string>, <integer>, <float>, or <boolean>. For multi-valued parameters, this is an <array> element, which then contains zero or more instances of the appropriate type of primitive value, e.g.:

<array> <string>[String]</string> <string>[String]</string> ... </array>
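As a concrete sketch, assume a descriptor that declares (without groups) a single-valued String parameter DictionaryFile, a multi-valued String parameter StopWords, and a single-valued Integer parameter MaxTokenLength. These parameter names and values are hypothetical. The settings could then look like this:

```xml
<configurationParameterSettings>
  <nameValuePair>
    <name>DictionaryFile</name>
    <value><string>dictionary.dat</string></value>
  </nameValuePair>
  <nameValuePair>
    <name>StopWords</name>
    <value>
      <!-- Multi-valued parameter: an array of the primitive element type -->
      <array>
        <string>the</string>
        <string>a</string>
        <string>of</string>
      </array>
    </value>
  </nameValuePair>
  <nameValuePair>
    <name>MaxTokenLength</name>
    <value><integer>64</integer></value>
  </nameValuePair>
</configurationParameterSettings>
```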

If configuration groups were declared, then the <configurationParameterSettings> element looks like this:

<configurationParameterSettings>
  <settingsForGroup name="[String]">
    [one or more <nameValuePair> elements]
  </settingsForGroup>

  ...
</configurationParameterSettings>

where each <settingsForGroup> element has a name that matches one of the configuration groups declared under the <configurationParameters> element and contains the parameter settings for that group.

Example

Here are the settings that correspond to the parameter declarations in the previous example:

<configurationParameterSettings>
  <settingsForGroup name="en">
    <nameValuePair>
      <name>DictionaryFile</name>
      <value><string>resourcesEnglishdictionary.dat</string></value>
    </nameValuePair>
  </settingsForGroup>

  <settingsForGroup name="en-US">
    <nameValuePair>
      <name>DictionaryFile</name>
      <value><string>resourcesEnglish_USdictionary.dat</string></value>
    </nameValuePair>
  </settingsForGroup>

  <settingsForGroup name="de">
    <nameValuePair>
      <name>DictionaryFile</name>
      <value><string>resourcesDeutschdictionary.dat</string></value>
    </nameValuePair>
  </settingsForGroup>

  <settingsForGroup name="zh">
    <nameValuePair>
      <name>DictionaryFile</name>
      <value><string>resourcesChinesedictionary.dat</string></value>
    </nameValuePair>
    <nameValuePair>
      <name>DBC_Strategy</name>
      <value><string>default</string></value>
    </nameValuePair>
  </settingsForGroup>
</configurationParameterSettings>

Type System Definition

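<typeSystemDescription>
  <name> [String] </name>
  <description>[String]</description>
  <version>[String]</version>
  <vendor>[String]</vendor>

  <imports>
    <import ...>
    ...
  </imports>

  <types>
    <typeDescription>
      ...
    </typeDescription>
    ...
  </types>
</typeSystemDescription>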

A typeSystemDescription element defines a type system for an Analysis Engine. The syntax for the element is described in section 23.3 of this chapter.

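The recommended usage is to import an external type system, using the import syntax described in section 23.2 of this chapter. For example:

<typeSystemDescription>
  <imports>
    <import location="MySharedTypeSystem.xml"/>
  </imports>
</typeSystemDescription>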

This allows several AEs to share a single type system definition. The file MySharedTypeSystem.xml would then contain the full type system information, including the name, description, vendor, version, and types.

Type Priority Definition

<typePriorities>
  <name> [String] </name>
  <description>[String]</description>
  <version>[String]</version>
  <vendor>[String]</vendor>

  <priorityLists>
    <priorityList>
      <type>[TypeName]</type>
      <type>[TypeName]</type>
      ...
    </priorityList>
    ...
  </priorityLists>
</typePriorities>

The <typePriorities> element contains zero or more <priorityList> elements; each <priorityList> contains zero or more types. Like a type system, a type priorities definition may also declare a name, description, version, and vendor, and may import other type priorities. The import syntax is described in section 23.2 of this chapter.

Type priority is used when iterating over feature structures in the CAS. For example, if the CAS contains a Sentence annotation and a Paragraph annotation with the same span of text (i.e. a one-sentence paragraph), which annotation should be returned first by an iterator? Probably the Paragraph, since it is conceptually "bigger," but the framework does not know that and must be explicitly told that the Paragraph annotation has priority over the Sentence annotation, like this:

<typePriorities> <priorityList> <type>org.myorg.Paragraph</type> <type>org.myorg.Sentence</type> </priorityList> </typePriorities>

All of the <priorityList> elements defined in the descriptor (and in all component descriptors of an aggregate analysis engine descriptor) are merged to produce a single priority list.

Subtypes of types specified here are also ordered, unless overridden by another user-specified type ordering. For example, if you specify type A comes before type B, then subtypes of A will come before subtypes of B, unless there is an overriding specification which declares some subtype of B comes before some subtype of A.

If the priority lists are inconsistent (type A declared before type B in one priority list, and type B declared before type A in another), the framework will throw an exception.

User-defined indexes may declare whether or not they use the type priority; see the next section.

Index Definition

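<fsIndexCollection>
  <name>[String]</name>
  <description>[String]</description>
  <version>[String]</version>
  <vendor>[String]</vendor>

  <imports>
    <import ...>
    ...
  </imports>

  <fsIndexes>
    <fsIndexDescription>
      ...
    </fsIndexDescription>
    ...
  </fsIndexes>
</fsIndexCollection>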

The fsIndexCollection element declares Feature Structure Indexes, which define an index that holds feature structures of a given type. Information in the CAS is always accessed through an index. There is a built-in default annotation index declared which can be used to access instances of type Annotation (or its subtypes), but if there is a need for a specialized index it must be declared in this element. See Chapter 26, CAS Reference for details on FS indexes.

Like type systems and type priorities, an fsIndexCollection can declare a name, description, vendor, and version, and may import other fsIndexCollections. The import syntax is described in section 23.2 of this chapter.

An fsIndexCollection may also define zero or more fsIndexDescription elements, each of which defines a single index. Each fsIndexDescription has the form:

<fsIndexDescription>
  <label>[String]</label>
  <typeName>[TypeName]</typeName>
  <kind>sorted|bag|set</kind>
  <keys>
    <fsIndexKey>
      <featureName>[Name]</featureName>
      <comparator>standard|reverse</comparator>
    </fsIndexKey>
    ...
  </keys>
</fsIndexDescription>

The label element defines the name by which applications and annotators refer to this index. The typeName element contains the name of the type that will be contained in this index. This must match one of the type names defined in the <typeSystemDescription>.

There are three possible values for the <kind> of index. Sorted indexes enforce an ordering of feature structures, and may contain duplicates. Bag indexes do not enforce ordering, and also may contain duplicates. Set indexes do not enforce ordering and may not contain duplicates. If the <kind> element is omitted, it will default to sorted, which is the most common type of index.

An index may define one or more keys. These keys determine the sort order of the feature structures within a sorted index, and determine equality for set indexes. Bag indexes do not use keys. Keys are ordered by precedence – the first key is evaluated first, and subsequent keys are evaluated only if necessary.

Each key is represented by an fsIndexKey element. Most fsIndexKeys contain a featureName and a comparator. The featureName must match the name of one of the features for the type specified in the <typeName> element for this index. The comparator defines how the features will be compared – a value of standard means that features will be compared using the standard comparison for their data type (e.g. for numerical types, smaller values precede larger values, and for string types, Unicode string comparison is performed). A value of reverse means that features will be compared using the reverse of the standard comparison (e.g. for numerical types, larger values precede smaller values, etc.). For Set indexes, the comparator direction is ignored – the keys are only used for the equality testing.

Each key used in comparisons must refer to a feature whose range type is String, Float, or Integer.

There is a second type of key, one which contains only a <typePriority/> element. When this key is used, it indicates that Feature Structures will be compared using the type priorities declared in the <typePriorities> section of the descriptor.
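The following sketch combines both kinds of key in one sorted index. The label TokensByLength and the features length (assumed to have range type Integer) and lemma (assumed to have range type String) are hypothetical; the type is assumed to be declared in the type system.

```xml
<fsIndexCollection>
  <fsIndexes>
    <fsIndexDescription>
      <label>TokensByLength</label>
      <typeName>org.myorg.tae.TokenAnnotation</typeName>
      <kind>sorted</kind>
      <keys>
        <!-- First key: longer tokens come first -->
        <fsIndexKey>
          <featureName>length</featureName>
          <comparator>reverse</comparator>
        </fsIndexKey>
        <!-- Second key, used only when lengths are equal -->
        <fsIndexKey>
          <featureName>lemma</featureName>
          <comparator>standard</comparator>
        </fsIndexKey>
        <!-- Last key: fall back to the declared type priorities -->
        <fsIndexKey>
          <typePriority/>
        </fsIndexKey>
      </keys>
    </fsIndexDescription>
  </fsIndexes>
</fsIndexCollection>
```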

Capabilities

<capabilities>
  <capability>
    <inputs>
      <type allAnnotatorFeatures="true|false">[TypeName]</type>
      ...
      <feature>[TypeName]:[Name]</feature>
      ...
    </inputs>

    <outputs>
      <type allAnnotatorFeatures="true|false">[TypeName]</type>
      ...
      <feature>[TypeName]:[Name]</feature>
      ...
    </outputs>

    <languagesSupported>
      <language>[ISO Language ID]</language>
      ...
    </languagesSupported>
  </capability>

  ...
</capabilities>

The capabilities definition is used by the UIMA Framework in several ways, including setting up the Results Specification for process calls, routing control for aggregates based on language, and as part of the Sofa mapping function.

The capabilities element contains one or more capability elements. Because multiple capability sets can be declared, you can use them to model a component that produces a particular set of outputs for each given set of inputs.

Each capability contains inputs, outputs, languagesSupported, inputSofas, and outputSofas. The inputs and outputs elements are required (though they may be empty); <languagesSupported>, <inputSofas>, and <outputSofas> are optional and are used only for TAEs.

Both inputs and outputs may contain a mixture of type and feature elements.

<type...> elements contain the name of one of the types defined in the type system or one of the built in types. Declaring a type as an input means that this component expects instances of this type to be in the CAS when it receives it to process. Declaring a type as an output means that this component creates new instances of this type in the CAS.

There is an optional attribute allAnnotatorFeatures, which defaults to false if omitted. The Component Descriptor Editor tool defaults this to true when a new type is added to the list of inputs and/or outputs. When this attribute is true, it specifies that all of the type’s features are also declared as input or output. Otherwise, the features that are required as inputs or populated as outputs must be explicitly specified in feature elements.

<feature...> elements contain the "fully-qualified" feature name, which is the type name followed by a colon, followed by the feature name, e.g. org.myorg.tae.TokenAnnotation:lemma. <feature...> elements in the <inputs> section must also have a corresponding type declared as an input. In output sections, this is not required. If the type is not specified as an output, but a feature for that type is, this means that existing instances of the type have the values of the specified features updated. Any type mentioned in a <feature> element must be either specified as an input or an output or both.

language elements contain one of the ISO language identifiers, such as en for English, or en-US for the United States dialect of English.

The list of language codes can be found here:

http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt

and the country codes here:

http://www.chemie.fu-berlin.de/diverse/doc/ISO_3166.html

<inputSofas> and <outputSofas> declare sofa names used by this component. All Sofa names must be unique within a particular capability set. A Sofa name must be an input or an output, and cannot be both. It is an error to have a Sofa name declared as an input in one capability set, and also have it declared as an output in another capability set.

A <sofaName> is written as a simple Java-style identifier, without any periods in the name, except that it may be written to end in .* . If written in this manner, it specifies a set of Sofa names, all of which start with the base name (the part before the .*) followed by a period and then an arbitrary Java identifier (without periods). This form is used to specify in the descriptor that the component could generate an arbitrary number of Sofas, the exact names and numbers of which are unknown before the component is run.
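The rules above can be put together in a single capability set. The sketch below is illustrative only: the type org.myorg.tae.PersonAnnotation, its name feature, and the Sofa names are hypothetical. TokenAnnotation is declared as an input type, so declaring its lemma feature as an output means that existing instances are updated rather than new ones created.

```xml
<capabilities>
  <capability>
    <inputs>
      <type>org.myorg.tae.TokenAnnotation</type>
    </inputs>
    <outputs>
      <!-- New instances of this type are created, with all features set. -->
      <type allAnnotatorFeatures="true">org.myorg.tae.PersonAnnotation</type>
      <!-- Existing TokenAnnotation instances get their lemma feature updated. -->
      <feature>org.myorg.tae.TokenAnnotation:lemma</feature>
    </outputs>
    <languagesSupported>
      <language>en</language>
      <language>en-US</language>
    </languagesSupported>
    <inputSofas>
      <sofaName>InputText</sofaName>
    </inputSofas>
    <outputSofas>
      <!-- Matches Translation.German, Translation.French, and so on. -->
      <sofaName>Translation.*</sofaName>
    </outputSofas>
  </capability>
</capabilities>
```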

OperationalProperties

Components can declare operational properties that are useful in deployment. The following are available:

<operationalProperties> <modifiesCas>true|false</modifiesCas> <multipleDeploymentAllowed>true|false</multipleDeploymentAllowed> <outputsNewCASes>true|false</outputsNewCASes> </operationalProperties>

modifiesCas, if false, indicates that this component does not modify the CAS. If it is not specified, the default value is true except for CAS Consumer components.

If you wrap one or more CAS Consumers inside an aggregate as the only components, you must explicitly set the modifiesCas property to false in the aggregate (assuming the CAS Consumer components take the default here); otherwise the framework will complain about inconsistent settings for these.

multipleDeploymentAllowed, if true, allows the component to be deployed multiple times to increase performance through scale-out techniques. If it is not specified, the default value is true, except for CAS Consumer and Collection Reader components.

outputsNewCASes, if true, allows the component to create new CASes during processing, for example to break a large artifact into smaller pieces. See CAS Multiplier Developer's Guide for details.
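For the case described above, an aggregate whose only delegates are CAS Consumers, the aggregate's operational properties might look like the following sketch. The values for multipleDeploymentAllowed and outputsNewCASes are illustrative choices, not requirements stated in this chapter.

```xml
<operationalProperties>
  <!-- Must be stated explicitly so that it agrees with the
       CAS Consumer delegates, which default to false. -->
  <modifiesCas>false</modifiesCas>
  <multipleDeploymentAllowed>false</multipleDeploymentAllowed>
  <outputsNewCASes>false</outputsNewCASes>
</operationalProperties>
```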

External Resource Dependencies

<externalResourceDependencies> <externalResourceDependency> <key>[String]</key> <description>[String] </description> <interfaceName>[String]</interfaceName> <optional>true|false</optional> </externalResourceDependency>

...

</externalResourceDependencies>

A primitive annotator may declare zero or more <externalResourceDependency> elements. Each dependency has the following elements:

key – the string by which the annotator code will attempt to access the resource. Must be unique within this annotator.
description – a textual description of the dependency
interfaceName – the fully-qualified name of the Java interface through which the annotator will access the data. This is optional. If not specified, the annotator can only get an InputStream to the data.
optional – whether the resource is optional. If false, an exception will be thrown if no resource is assigned to satisfy this dependency. Defaults to false.
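A filled-in dependency declaration might look like the sketch below. The key myResource and the interface name org.myorg.mycomponent.StringMapResource are hypothetical names chosen for illustration.

```xml
<externalResourceDependencies>
  <externalResourceDependency>
    <!-- The annotator code looks the resource up under this key. -->
    <key>myResource</key>
    <description>Lookup table used by the annotator</description>
    <!-- Omit interfaceName to access the data only as an InputStream. -->
    <interfaceName>org.myorg.mycomponent.StringMapResource</interfaceName>
    <!-- false: an exception is thrown if nothing is bound to this key. -->
    <optional>false</optional>
  </externalResourceDependency>
</externalResourceDependencies>
```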

Resource Manager Configuration

<resourceManagerConfiguration> <name>[String]</name> <description>[String]</description> <version>[String]</version> <vendor>[String]</vendor>

<externalResources> <externalResource> <name>[String]</name> <description>[String]</description> <fileResourceSpecifier> <fileUrl>[URL]</fileUrl> </fileResourceSpecifier> <implementationName>[String]</implementationName> </externalResource> ... </externalResources>

<externalResourceBindings> <externalResourceBinding> <key>[String]</key> <resourceName>[String]</resourceName> </externalResourceBinding> ... </externalResourceBindings>

</resourceManagerConfiguration>

This element declares external resources and binds them to annotators’ external resource dependencies.

The resourceManagerConfiguration element may optionally contain an import, which allows resource definitions to be stored in a separate (shareable) file. See section 23.2 for details.

The externalResources element contains zero or more externalResource elements, each of which consists of:

name – the name of the resource. This name is referred to in the bindings (see below). Resource names need to be unique within any Aggregate Analysis Engine or Collection Processing Engine, so the Java-like org.myorg.mycomponent.MyResource syntax is recommended.
description – English description of the resource
resource specifier – Declares the location of the resource. There are different possibilities for how this is done (see below).
implementationName – The fully-qualified name of the Java class that will be instantiated from the resource data. This is optional; if not specified, the resource will be accessible as an input stream to the raw data. If specified, the Java class must implement the interfaceName that is specified in the External Resource Dependency to which it is bound.

One possibility for the resource specifier is a <fileResourceSpecifier>, as shown above. This simply declares a URL to the resource data. This support is built on the Java class URL and its method URL.openStream(); it supports the protocols "file", "http" and "jar" (for referring to files in jars) by default, and you can plug in handlers for other protocols. The URL has to start with file: (or some other protocol). It is relative to either the classpath or the "data path". The data path works like the classpath but can be set programmatically via ResourceManager.setDataPath(). Setting the Java System property uima.datapath also works.

file:com/ibm/d.txt is a relative path; relative paths for resources are resolved using the classpath and/or the data path. For the file protocol, URLs starting with file:/ or file:/// are absolute. Note that file://com/ibm/d.txt is NOT an absolute path starting with com. The '//' indicates that what follows is a host name. Therefore, if you try to use this URL, the framework will report that it cannot connect to the host "com".
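The two URL forms discussed above can be compared side by side. In this sketch the resource names and the absolute path /data/resources/d.txt are hypothetical.

```xml
<externalResources>
  <!-- Relative URL: resolved using the classpath and/or the data path. -->
  <externalResource>
    <name>org.myorg.RelativeExample</name>
    <fileResourceSpecifier>
      <fileUrl>file:com/ibm/d.txt</fileUrl>
    </fileResourceSpecifier>
  </externalResource>
  <!-- Absolute URL: a single slash after "file:" (file:/// also works).
       Two slashes would make "data" a host name. -->
  <externalResource>
    <name>org.myorg.AbsoluteExample</name>
    <fileResourceSpecifier>
      <fileUrl>file:/data/resources/d.txt</fileUrl>
    </fileResourceSpecifier>
  </externalResource>
</externalResources>
```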

Another option is a <fileLanguageResourceSpecifier>, which is intended to support resources, such as dictionaries, that depend on the language of the document being processed. Instead of a single URL, a prefix and suffix are specified, like this:

<fileLanguageResourceSpecifier> <fileUrlPrefix>file:FileLanguageResource_implTest_data_</fileUrlPrefix> <fileUrlSuffix>.dat</fileUrlSuffix> </fileLanguageResourceSpecifier>

The URL of the actual resource is then formed by concatenating the prefix, the language of the document (as an ISO language code, e.g. en or en-US – see Capabilities for more information), and the suffix.

The externalResourceBindings element declares which resources are bound to which dependencies. Each externalResourceBinding consists of:

key – identifies the dependency. For a binding declared in a primitive analysis engine descriptor, this must match the value of the key element of one of the externalResourceDependency elements. Bindings may also be specified in aggregate analysis engine descriptors, in which case a compound key is used – see section External Resource Bindings.
resourceName – the name of the resource satisfying the dependency. This must match the value of the name element of one of the externalResource declarations.

A given resource dependency may only be bound to one external resource; one external resource may be bound to many dependencies – to allow resource sharing.
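Putting the pieces of this section together, a primitive descriptor's resource manager configuration might look like the following sketch. It assumes the annotator declares a dependency with the hypothetical key myResource; the resource name and the implementation class org.myorg.mycomponent.StringMapResource_impl are likewise hypothetical.

```xml
<resourceManagerConfiguration>
  <externalResources>
    <externalResource>
      <name>org.myorg.mycomponent.MyResource</name>
      <description>Lookup table loaded from a data file</description>
      <fileResourceSpecifier>
        <fileUrl>file:com/ibm/d.txt</fileUrl>
      </fileResourceSpecifier>
      <!-- Optional: this class must implement the interfaceName
           declared by the dependency it is bound to. -->
      <implementationName>org.myorg.mycomponent.StringMapResource_impl</implementationName>
    </externalResource>
  </externalResources>
  <externalResourceBindings>
    <externalResourceBinding>
      <!-- key matches an externalResourceDependency key;
           resourceName matches an externalResource name. -->
      <key>myResource</key>
      <resourceName>org.myorg.mycomponent.MyResource</resourceName>
    </externalResourceBinding>
  </externalResourceBindings>
</resourceManagerConfiguration>
```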

Environment Variable References

In several places throughout the descriptor, it is possible to reference environment variables. In Java, these are actually references to Java system properties. To reference system environment variables from a Java analysis engine you must pass the environment variables into the Java virtual machine by using the -D option on the java command line.

The syntax for environment variable references is <envVarRef>[VariableName]</envVarRef>, where [VariableName] is any valid Java system property name. Environment variable references are valid in the following places:

The value of a configuration parameter (String-valued parameters only)
The <annotatorImplementationName> element of a primitive TAE descriptor
The <name> element within <analysisEngineMetaData>
Within a <fileResourceSpecifier> or <fileLanguageResourceSpecifier>

For example, if the value of a configuration parameter were specified as: <string><envVarRef>TEMP_DIR</envVarRef>/temp.dat</string>, and the value of the TEMP_DIR Java System property were c:/temp, then the configuration parameter's value would evaluate to c:/temp/temp.dat.
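Shown in context, the inline example above would sit inside a parameter setting roughly as follows. The parameter name OutputFile is hypothetical, and the nameValuePair structure is assumed to follow the configuration parameter settings syntax described earlier in this chapter.

```xml
<configurationParameterSettings>
  <nameValuePair>
    <name>OutputFile</name>
    <value>
      <!-- With the TEMP_DIR system property set to c:/temp,
           this evaluates to c:/temp/temp.dat. -->
      <string><envVarRef>TEMP_DIR</envVarRef>/temp.dat</string>
    </value>
  </nameValuePair>
</configurationParameterSettings>
```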

Aggregate Analysis Engine Descriptors

Aggregate Analysis Engines do not contain an annotator, but instead contain one or more component (also called delegate) analysis engines.

Aggregate Analysis Engine Descriptors maintain most of the same structure as Primitive Analysis Engine Descriptors. The differences are:

An Aggregate Analysis Engine Descriptor contains the element <primitive>false</primitive> rather than <primitive>true</primitive>.
An Aggregate Analysis Engine Descriptor must not include an <annotatorImplementationName> element.
In place of the <annotatorImplementationName>, an Aggregate Analysis Engine Descriptor must have a <delegateAnalysisEngineSpecifiers> element. See Delegate Analysis Engine Specifiers.
An Aggregate Analysis Engine Descriptor may provide a <flowController> element immediately following the <delegateAnalysisEngineSpecifiers>. See FlowController.
Under the analysisEngineMetaData element, an Aggregate Analysis Engine Descriptor may specify an additional element, <flowConstraints>. See FlowConstraints. Typically only one of <flowController> and <flowConstraints> is specified. If both are specified, the <flowController> takes precedence, and the flow controller implementation can use the information specified in the <flowConstraints> as part of its configuration input.
An Aggregate Analysis Engine Descriptor must not contain a <typeSystemDescription> element. The Type System of the Aggregate Analysis Engine is derived by merging the Type Systems of the Analysis Engines that the aggregate contains.
Within aggregate Analysis Engine Descriptors, <configurationParameter> elements may define <overrides>. See Configuration Parameter Overrides.
External Resource Bindings can bind resources to dependencies declared by any delegate AE within the aggregate. See External Resource Bindings.
An additional optional element, <sofaMappings>, may be included.

Delegate Analysis Engine Specifiers

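<delegateAnalysisEngineSpecifiers> <delegateAnalysisEngine key="[String]"> <analysisEngineDescription>...</analysisEngineDescription> | <import .../> </delegateAnalysisEngine>

<delegateAnalysisEngine key="[String]"> ... </delegateAnalysisEngine>

...

</delegateAnalysisEngineSpecifiers>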

The delegateAnalysisEngineSpecifiers element contains one or more delegateAnalysisEngine elements. Each of these must have a unique key, and must contain either:

A complete analysisEngineDescription element describing the delegate analysis engine OR
An import element giving the name or location of the XML descriptor for the delegate analysis engine (see section 23.2).

The latter is the much more common usage, and is the only form supported by the Component Descriptor Editor tool.
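A minimal sketch of the import form follows, showing one delegate imported by location and one by name. The keys, the file name Annotator1.xml, and the descriptor name org.myorg.Annotator2 are hypothetical; the import element is assumed to take either a location or a name attribute, as described in the section on imports.

```xml
<delegateAnalysisEngineSpecifiers>
  <!-- Imported by location (a URL, possibly relative to this descriptor). -->
  <delegateAnalysisEngine key="annotator1">
    <import location="Annotator1.xml"/>
  </delegateAnalysisEngine>
  <!-- Imported by name (looked up on the classpath or data path). -->
  <delegateAnalysisEngine key="annotator2">
    <import name="org.myorg.Annotator2"/>
  </delegateAnalysisEngine>
</delegateAnalysisEngineSpecifiers>
```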

FlowController

The optional flowController element identifies the descriptor of the FlowController component that will be used to determine the order in which delegate Analysis Engines are called.

The key attribute is optional, but recommended; it assigns the FlowController an identifier that can be used for configuration parameter overrides, Sofa mappings, or external resource bindings. The key must not be the same as any of the delegate analysis engine keys.

As with the delegateAnalysisEngine element, the flowController element may contain either a complete flowControllerDescription or an import, but the import is recommended. The Component Descriptor Editor tool only supports imports here.
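Based on the description above, a flowController element using the recommended import form might look like this sketch. The key flowControl and the file name MyFlowController.xml are hypothetical.

```xml
<!-- Placed immediately after </delegateAnalysisEngineSpecifiers>. -->
<flowController key="flowControl">
  <import location="MyFlowController.xml"/>
</flowController>
```

The key must differ from every delegate analysis engine key in the same aggregate.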

FlowConstraints

If a <flowController> is not specified, the order in which delegate Analysis Engines are called within the aggregate Analysis Engine is specified using the <flowConstraints> element, which must occur immediately following the configurationParameterSettings element. If a <flowController> is specified, then the <flowConstraints> are optional. They can be used to pass an ordering of delegate keys to the <flowController>.

There are two options for flow constraints -- <fixedFlow> or <capabilityLanguageFlow>. Each is discussed in a separate section below.

Fixed Flow

<flowConstraints> <fixedFlow> <node>[String]</node> <node>[String]</node> ... </fixedFlow> </flowConstraints>

The flowConstraints element must be included immediately following the configurationParameterSettings element.

For a fixed flow, the flowConstraints element contains a single fixedFlow element.

The fixedFlow element contains one or more node elements, each of which contains an identifier which must match the key of a delegate analysis engine specified in the delegateAnalysisEngineSpecifiers element.
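For instance, an aggregate with delegates keyed annotator1 and annotator2 (hypothetical keys) could declare a fixed flow as in this sketch:

```xml
<flowConstraints>
  <fixedFlow>
    <!-- Each node must match a delegateAnalysisEngine key. -->
    <node>annotator1</node>
    <node>annotator2</node>
  </fixedFlow>
</flowConstraints>
```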

Capability Language Flow

<flowConstraints> <capabilityLanguageFlow> <node>[String]</node> <node>[String]</node> ... </capabilityLanguageFlow> </flowConstraints>

If you use <capabilityLanguageFlow>, the delegate Analysis Engines named by the <node> elements are called in the given order, except that a delegate Analysis Engine is skipped if any of the following are true (according to that Analysis Engine's declared output capabilities):

It cannot produce any of the aggregate Analysis Engine's output capabilities for the language of the current document.
All of the output capabilities have already been produced by an earlier Analysis Engine in the flow.

For example, if two annotators produce org.myorg.TokenAnnotation feature structures for the same language, these feature structures will only be produced by the first annotator in the list.
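The skipping behavior described above can be pictured with the following sketch, in which all three delegate keys are hypothetical. If tokenizerA and tokenizerB both declare org.myorg.TokenAnnotation as an output for the document's language, tokenizerB is skipped because its output has already been produced.

```xml
<flowConstraints>
  <capabilityLanguageFlow>
    <node>tokenizerA</node>
    <!-- Skipped when tokenizerA has already produced its outputs. -->
    <node>tokenizerB</node>
    <node>personAnnotator</node>
  </capabilityLanguageFlow>
</flowConstraints>
```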

Configuration Parameter Overrides

In an aggregate Analysis Engine Descriptor, each <configurationParameter> element should contain an <overrides> element, with the following syntax:

<overrides> <parameter> [delegateAnalysisEngineKey]/[parameterName] </parameter>

<parameter> [delegateAnalysisEngineKey]/[parameterName] </parameter> ...

</overrides>

Since aggregate Analysis Engines have no code associated with them, the only way in which their configuration parameters can affect their processing is by overriding the parameter values of one or more delegate analysis engines. The <overrides> element determines which parameters, in which delegate Analysis Engines, are overridden by this configuration parameter.

For example, consider an aggregate Analysis Engine Descriptor that contains delegate Analysis Engines with keys annotator1 and annotator2 (as declared in the <delegateAnalysisEngine> element – see Delegate Analysis Engine Specifiers) and also declares a configuration parameter as follows:

<configurationParameter> <name>AggregateParam</name> <type>String</type> <overrides> <parameter>annotator1/param1</parameter> <parameter>annotator2/param2</parameter> </overrides> </configurationParameter>

The value of the AggregateParam parameter (whether assigned in the aggregate descriptor or at runtime by an application) will override the value of parameter param1 in annotator1 and also override the value of parameter param2 in annotator2. No other parameters will be affected.

For historical reasons only, if an aggregate Analysis Engine descriptor declares a configuration parameter with no explicit overrides, that parameter will override any parameters having the same name within any delegate analysis engine. This usage is strongly discouraged. The UIMA SDK currently supports this usage but logs a warning message to the log file. This support may be dropped in future versions.

External Resource Bindings

Aggregate analysis engine descriptors can declare resource bindings that bind resources to dependencies declared in any of the delegate analysis engines (or their subcomponents, recursively) within that aggregate. This allows resource sharing. Any binding at this level overrides (supersedes) any binding specified by a contained component or their subcomponents, recursively.

For example, consider an aggregate Analysis Engine Descriptor that contains delegate Analysis Engines with keys annotator1 and annotator2 (as declared in the <delegateAnalysisEngine> element – see Delegate Analysis Engine Specifiers), where annotator1 declares a resource dependency with key myResource and annotator2 declares a resource dependency with key someResource.

Within that aggregate Analysis Engine Descriptor, the following resourceManagerConfiguration would bind both of those dependencies to a single external resource file.

<resourceManagerConfiguration> <externalResources> <externalResource> <name>ExampleResource</name> <fileResourceSpecifier> <fileUrl>file:MyResourceFile.dat</fileUrl> </fileResourceSpecifier> </externalResource> </externalResources>

<externalResourceBindings> <externalResourceBinding> <key>annotator1/myResource</key> <resourceName>ExampleResource</resourceName> </externalResourceBinding> <externalResourceBinding> <key>annotator2/someResource</key> <resourceName>ExampleResource</resourceName> </externalResourceBinding> </externalResourceBindings>

</resourceManagerConfiguration>

The syntax for the externalResources declaration is exactly the same as described previously. In the resource bindings note the use of the compound keys, e.g. annotator1/myResource. This identifies the resource dependency key myResource within the annotator with key annotator1. Compound resource dependencies can be multiple levels deep to handle nested aggregate analysis engines.
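To illustrate a multi-level compound key, suppose the outer aggregate contains a nested aggregate with the hypothetical key innerAggregate, which in turn contains annotator1. A binding declared in the outer aggregate might then look like this sketch:

```xml
<externalResourceBinding>
  <!-- One path segment per aggregate level, then the dependency key. -->
  <key>innerAggregate/annotator1/myResource</key>
  <resourceName>ExampleResource</resourceName>
</externalResourceBinding>
```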

Sofa Mappings

Sofa mappings are specified between Sofa names declared in this aggregate descriptor as part of the <capability> section, and the Sofa names declared in the delegate components. For purposes of the mapping, all the declarations of Sofas in any of the capability sets contained within the <capabilities> element are considered together.

<sofaMappings> <sofaMapping> <componentKey>[keyName]</componentKey> <componentSofaName>[sofaName]</componentSofaName> <aggregateSofaName>[sofaName]</aggregateSofaName> </sofaMapping> ... </sofaMappings>

The <componentSofaName> may be omitted in the case where the component is not aware of Multiple Views or Sofas. In this case, the UIMA framework will arrange for the specified <aggregateSofaName> to be the one visible to the delegate component.

The <componentKey> is the key name for the component as specified in the list of delegate components for this aggregate.

The sofaNames used must be declared as input or output sofas in some capability set.
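As a sketch of the mapping rules above, the following maps one aggregate Sofa to two delegates. The component keys and Sofa names are hypothetical; annotator2 is assumed to be unaware of multiple views, so its componentSofaName is omitted.

```xml
<sofaMappings>
  <!-- annotator1 refers to this Sofa as InputText. -->
  <sofaMapping>
    <componentKey>annotator1</componentKey>
    <componentSofaName>InputText</componentSofaName>
    <aggregateSofaName>EnglishDocument</aggregateSofaName>
  </sofaMapping>
  <!-- annotator2 does not use Sofa names; the framework makes
       EnglishDocument the Sofa visible to it. -->
  <sofaMapping>
    <componentKey>annotator2</componentKey>
    <aggregateSofaName>EnglishDocument</aggregateSofaName>
  </sofaMapping>
</sofaMappings>
```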

The basic structure of a Flow Controller Descriptor is as follows:

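<?xml version="1.0" encoding="UTF-8" ?>

<flowControllerDescription> <frameworkImplementation>com.ibm.uima.java</frameworkImplementation> <implementationName>[ClassName]</implementationName>

...

<externalResourceDependencies> ... </externalResourceDependencies>

<resourceManagerConfiguration> ... </resourceManagerConfiguration>

</flowControllerDescription>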

The frameworkImplementation element must always be set to the value com.ibm.uima.java.

The implementationName element must contain the fully-qualified class name of the Flow Controller implementation. This must name a class that implements the FlowController interface.

The externalResourceDependencies and resourceManagerConfiguration elements are exactly the same as in Primitive Analysis Engine Descriptors (see External Resource Dependencies and Resource Manager Configuration).

There are three types of Collection Processing Components – Collection Readers, CAS Initializers, and CAS Consumers. Each type of component has a corresponding descriptor. The structure of these descriptors is very similar to that of primitive Analysis Engine Descriptors.

Collection Reader Descriptors

The basic structure of a Collection Reader descriptor is as follows:

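<?xml version="1.0" encoding="UTF-8" ?>

<collectionReaderDescription> <frameworkImplementation>com.ibm.uima.java</frameworkImplementation> <implementationName>[ClassName]</implementationName>

<processingResourceMetaData> ... </processingResourceMetaData>

<externalResourceDependencies> ... </externalResourceDependencies>

<resourceManagerConfiguration> ... </resourceManagerConfiguration>

</collectionReaderDescription>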

The frameworkImplementation element must always be set to the value com.ibm.uima.java.

The implementationName element contains the fully-qualified class name of the Collection Reader implementation. This must name a class that implements the CollectionReader interface.

The processingResourceMetaData element contains essentially the same information as a Primitive Analysis Engine Descriptor's analysisEngineMetaData element:

<processingResourceMetaData> <name> [String] </name> <description>[String]</description> <version>[String]</version> <vendor>[String]</vendor>

...

</processingResourceMetaData>

The contents of these elements are the same as described in Analysis Engine Metadata, with the exception that the capabilities section should not declare any inputs (because the Collection Reader is always the first component to receive the CAS).

The externalResourceDependencies and resourceManagerConfiguration elements are exactly the same as in the Primitive Analysis Engine Descriptors (see External Resource Dependencies and Resource Manager Configuration).

CAS Initializer Descriptors

The basic structure of a CAS Initializer Descriptor is as follows:

<?xml version="1.0" encoding="UTF-8" ?>

<casInitializerDescription> <frameworkImplementation>com.ibm.uima.java</frameworkImplementation> <implementationName>[ClassName] </implementationName>

<processingResourceMetaData> ... </processingResourceMetaData>

...

</casInitializerDescription>

The frameworkImplementation element must always be set to the value com.ibm.uima.java.

The implementationName element contains the fully-qualified class name of the CAS Initializer implementation. This must name a class that implements the CasInitializer interface.

The processingResourceMetaData element contains essentially the same information as a Primitive Analysis Engine Descriptor's analysisEngineMetaData element, as described in Analysis Engine Metadata, with the exception of some changes to the capabilities section. A CAS Initializer's capabilities element looks like this:

<capabilities> <capability> <outputs> <type allAnnotatorFeatures="true|false">[String]</type> <type>[TypeName]</type> ... <feature>[TypeName]:[Name]</feature> ... </outputs>

...

</capability> ... </capabilities>

The differences between a CAS Initializer's capabilities declaration and a TAE's capabilities declaration are that the CAS Initializer does not declare any input CAS types and features or input Sofas (because it is always the first to operate on a CAS), it does not have a language specifier, and it may declare a set of MIME types that it supports for its input documents. Examples include: text/plain, text/html, and application/pdf. For a list of MIME types see http://www.iana.org/assignments/media-types/. The MIME type declaration is currently informational only; the framework does not use it. This may change in future versions.
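A filled-in CAS Initializer capability set might look like the sketch below. The type name and Sofa name are hypothetical, and the mimeTypesSupported and mimeType element names are an assumption about how the MIME type declaration is spelled; check the schema files before relying on them.

```xml
<capabilities>
  <capability>
    <!-- No inputs or languagesSupported: the CAS Initializer
         is the first component to operate on the CAS. -->
    <outputs>
      <type allAnnotatorFeatures="true">org.myorg.ParagraphAnnotation</type>
    </outputs>
    <outputSofas>
      <sofaName>DetaggedText</sofaName>
    </outputSofas>
    <!-- Assumed element names for the supported input document types. -->
    <mimeTypesSupported>
      <mimeType>text/html</mimeType>
      <mimeType>text/plain</mimeType>
    </mimeTypesSupported>
  </capability>
</capabilities>
```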

CAS Consumer Descriptors

The basic structure of a CAS Consumer Descriptor is as follows:

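<?xml version="1.0" encoding="UTF-8" ?>

<casConsumerDescription> <frameworkImplementation>com.ibm.uima.java</frameworkImplementation> <implementationName>[ClassName] </implementationName>

<processingResourceMetaData> ... </processingResourceMetaData>

...

</casConsumerDescription>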

The frameworkImplementation element must always be set to the value com.ibm.uima.java.

The implementationName element must contain the fully-qualified class name of the CAS Consumer implementation. This must name a class that implements the CasConsumer interface.

The processingResourceMetaData element contains essentially the same information as a Primitive Analysis Engine Descriptor's analysisEngineMetaData element, described in Analysis Engine Metadata, except that the CAS Consumer Descriptor's capabilities element should not declare outputs or outputSofas (since CAS Consumers do not modify the CAS).
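A CAS Consumer's capabilities section, following the restriction above, might look like this sketch. The type name is hypothetical. The outputs element is kept but left empty, on the assumption that the rule from the Capabilities section (inputs and outputs are required but may be empty) applies here as well.

```xml
<capabilities>
  <capability>
    <inputs>
      <type allAnnotatorFeatures="true">org.myorg.PersonAnnotation</type>
    </inputs>
    <!-- Empty: a CAS Consumer reads the CAS but does not modify it. -->
    <outputs/>
  </capability>
</capabilities>
```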

Service Client Descriptors specify only a location of a remote service. They are therefore much simpler in structure. In the UIMA SDK, a Service Client Descriptor that refers to a valid Analysis Engine or CAS Consumer service can be used in place of the actual Analysis Engine or CAS Consumer Descriptor. The UIMA SDK will handle the details of calling the remote service. (For details on deploying an Analysis Engine or CAS Consumer as a service, see Section 6.6, Working with Analysis Engine and CAS Consumer Services.)

The UIMA SDK is extensible to support different types of remote services. In future versions, there may be different variations of service client descriptors that cater to different types of services. For now, the only type of service client descriptor is the uriSpecifier, which supports the SOAP and Vinci protocols.

<?xml version="1.0" encoding="UTF-8" ?> <uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier"> <resourceType>AnalysisEngine | CasConsumer </resourceType> <uri>[URI]</uri> <protocol>SOAP | SOAPwithAttachments | Vinci</protocol> <timeout>[Integer]</timeout> <parameters> <parameter name="VNS_HOST" value="some.internet.ip.name-or-address"/> <parameter name="VNS_PORT" value="9000"/> </parameters> </uriSpecifier>

The resourceType element is required for new descriptors, but is currently allowed to be omitted for backward compatibility. It specifies the type of component (Analysis Engine or CAS Consumer) that is implemented by the service endpoint described by this descriptor.

The uri element contains the URI for the web service. (Note that in the case of Vinci, this will be the service name, which is looked up in the Vinci Naming Service.)

The protocol element may be set to SOAP, SOAPwithAttachments, or Vinci; other protocols may be added later. These specify the particular data transport format that will be used.

The timeout element is optional. If present, it specifies the number of milliseconds to wait for a request to be processed before an exception is thrown. A value of zero or less will wait forever. If no timeout is specified, a default value (currently 60 seconds) will be used.

The parameters element is optional. If present, it specifies the Vinci naming service host and/or port number. If it is not present, these values are taken from the -DVNS_HOST=<host> and/or -DVNS_PORT=<port> system arguments passed on the Java command line. If neither the parameters element nor the system arguments are present, VNS_HOST defaults to localhost and VNS_PORT defaults to 9000.
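As a concrete illustration of the elements described above, a client descriptor for a Vinci service might look like the following sketch. The service name org.myorg.MyAnnotatorService, the host name, and the 30-second timeout are hypothetical values.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
  <resourceType>AnalysisEngine</resourceType>
  <!-- For Vinci, the uri is the service name registered
       with the Vinci Naming Service. -->
  <uri>org.myorg.MyAnnotatorService</uri>
  <protocol>Vinci</protocol>
  <!-- Milliseconds to wait before an exception is thrown. -->
  <timeout>30000</timeout>
  <parameters>
    <parameter name="VNS_HOST" value="vns.myorg.org"/>
    <parameter name="VNS_PORT" value="9000"/>
  </parameters>
</uriSpecifier>
```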

For details on how to deploy and call Analysis Engine and CAS Consumer services, see Section 6.6, Working with Analysis Engine and CAS Consumer Services.