UIMACPP API: uima::Language Class Reference

A language is specified as a string holding a 2-character language and an optional 2-character territory, i.e. as "ll-cc" or "ll".

The string representation of the language sub-part follows ISO standard 639 "Codes for the representation of the names of languages".

The string representation of the territory sub-part follows ISO standard 3166 "Codes for the Representation of Names of Countries".

The string representation of the full language object follows RFC 1766 "Tags for the Identification of Languages": <LANG>-<SUBTAG>

There is also an internal technical numeric representation of a language as a 4 byte number (32 bit, high-word language value and low-word territory value). Conversion to or from numeric representation is provided via a constructor or conversion operator.
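The layout described above can be illustrated with a minimal sketch. It assumes one byte per character; `packLanguage`, `languageWord` and `territoryWord` are hypothetical names chosen for this example and are not part of the UIMACPP API. The lowercase "us" only mirrors the 'enus' rendering given for asNumber; how the class itself normalises case is not covered here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the UIMACPP API: packs a 2-character
// language code and a 2-character territory code into one 32-bit value,
// language in the high word and territory in the low word.
// The array references accept exactly two characters plus the terminator.
constexpr std::uint32_t packLanguage(const char (&lang)[3], const char (&terr)[3]) {
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(lang[0])) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(lang[1])) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(terr[0])) << 8) |
            static_cast<std::uint32_t>(static_cast<unsigned char>(terr[1]));
}

// Keeps only the high word (the language characters).
constexpr std::uint32_t languageWord(std::uint32_t packed) {
    return packed & 0xFFFF0000u;
}

// Keeps only the low word (the territory characters).
constexpr std::uint32_t territoryWord(std::uint32_t packed) {
    return packed & 0x0000FFFFu;
}

// Usage: 'e' = 0x65, 'n' = 0x6E, 'u' = 0x75, 's' = 0x73.
void packLanguageExample() {
    const std::uint32_t enUs = packLanguage("en", "us");
    assert(enUs == 0x656E7573u);
    assert(languageWord(enUs) == 0x656E0000u);
    assert(territoryWord(enUs) == 0x00007573u);
}
```

Keeping each part in its own 16-bit word means either part can be compared on its own with a single mask.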

The class distinguishes between unspecified and invalid languages and territories. For example, in "en" the territory (or sub-language) is valid but unspecified, whereas in "en-US" it is specified as "US". In "en-FOO", however, the territory is invalid because there is no such territory code.

Because of this, there is more than one way for two language objects to be compatible with each other:

Match type 2 is used if an annotator specifies that it can deal with any kind of English text and is not limited to, or specialized for, US English.

     Language clLanguage(argv[1]);
     if (!clLanguage.isValid()) {
        // abort with error
        //...
     }
     if (!(   clLanguage.matches("en")
           || clLanguage.matches("de"))) {
        // abort with error
        //...
     }

Note: Because the class is simple, the compiler-generated copy constructor and assignment operator can be used.


Language constants and types
typedef long	TyLanguageAsNumber
	A typedef for representing a language as a numeric value.
char const *	INVALID
	Special constants for the invalid & unspecified languages.
char const *	UNSPECIFIED
Public Member Functions
Constructors
	Language (void)
	Default Constructor: Language::UNSPECIFIED.
	Language (const TCHAR *cpszLanguageCode)
	Constructor from a C string.
	Language (const std::string &languageCode)
	Constructor from a single-byte string (std::string).
	Language (const icu::UnicodeString &languageCode)
	Constructor from an ICU Unicode string.
	Language (const UnicodeStringRef &languageCode)
	Constructor from a UnicodeStringRef.
	Language (TyLanguageAsNumber ulLanguageAsLong)
	Constructor from a 32 bit representation of a language (see asNumber).
Match functions
bool	operator== (const Language &crclObject) const
	Returns TRUE, if this language is identical to the specified language.
bool	operator!= (const Language &crclObject) const
	Returns TRUE, if this language is not identical to the specified language.
bool	operator< (const Language &crclOther) const
	Returns TRUE, if this language code sorts before the specified language.
bool	matches (const Language &crclCompareLang) const
	Returns TRUE, if the languages are identical and either the territories are equal or one is unspecified, or if one of the languages is unspecified.
Miscellaneous
bool	isValid (void) const
	Returns TRUE if language is valid (territory may be missing).
const char *	getLanguageCode (void) const
	Get just the 2-character language part, or an empty string if unspecified.
TyLanguageAsNumber	getLanguage (void) const
	Get a numeric form of just the language (2 characters in the top 2 bytes).
bool	hasLanguage (void) const
	Returns TRUE if language has been specified.
const char *	getTerritoryCode (void) const
	Get just the 2-character territory part, or an empty string if unspecified.
TyLanguageAsNumber	getTerritory (void) const
	Get a numeric form of just the territory (2 characters in the bottom 2 bytes).
bool	hasTerritory (void) const
	Returns TRUE if territory has been specified.
void	setValue (const std::string &crclString)
	Sets the value according to string `crclString`.
std::string	asString (void) const
	Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>.
icu::UnicodeString	asUnicodeString (void) const
	Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>.
TyLanguageAsNumber	asNumber (void) const
	Returns the object as a 4-byte "number" (actually just the 4 character bytes, e.g. 0x656e7573 for 'enus').

Member Typedef Documentation

Constructor & Destructor Documentation

uima::Language::Language ( void ) [inline]

Default Constructor: Language::UNSPECIFIED.

uima::Language::Language ( const TCHAR * cpszLanguageCode ) [inline]

Constructor from a C string.

String must have the form language[-territory], for example "en-US" or just the language "en".

uima::Language::Language ( const std::string & languageCode ) [inline]

Constructor from a single-byte string (std::string).

String must have the form language[-territory], for example "en-US" or just the language "en".

uima::Language::Language ( const icu::UnicodeString & languageCode ) [inline]

Constructor from an ICU Unicode string.

String must have the form language[-territory], for example "en-US" or just the language "en".

uima::Language::Language ( const UnicodeStringRef & languageCode ) [inline]

Constructor from a UnicodeStringRef.

String must have the form language[-territory], for example "en-US" or just the language "en".

uima::Language::Language ( TyLanguageAsNumber ulLanguageAsLong )

Constructor from a 32 bit representation of a language (see asNumber).

Member Function Documentation

bool uima::Language::operator== ( const Language & crclObject ) const [inline]

Returns TRUE, if this language is identical to the specified language.

bool uima::Language::operator!= ( const Language & crclObject ) const [inline]

Returns TRUE, if this language is not identical to the specified language.

bool uima::Language::operator< ( const Language & crclOther ) const [inline]

Returns TRUE, if this language code sorts before the specified language.

bool uima::Language::matches ( const Language & crclCompareLang ) const

Returns TRUE, if the languages are identical and either the territories are equal or one is unspecified, or if one of the languages is unspecified.
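The rule stated above can be modelled in a few lines. This is a sketch of the documented behaviour only, not the library's implementation: `TagModel` and `modelMatches` are hypothetical names, an empty string stands in for "unspecified", and invalid codes are not modelled at all.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a language tag; an empty string means "unspecified".
struct TagModel {
    std::string language;
    std::string territory;
};

// Models the documented rule: two tags match if one language is unspecified,
// or if the languages are identical and the territories are either equal
// or unspecified on at least one side.
bool modelMatches(const TagModel& a, const TagModel& b) {
    if (a.language.empty() || b.language.empty()) {
        return true;
    }
    if (a.language != b.language) {
        return false;
    }
    return a.territory.empty() || b.territory.empty() || a.territory == b.territory;
}

void modelMatchesExample() {
    assert(modelMatches(TagModel{"en", "US"}, TagModel{"en", ""}));    // territory unspecified on one side
    assert(!modelMatches(TagModel{"en", "US"}, TagModel{"en", "GB"})); // different territories
    assert(!modelMatches(TagModel{"en", ""}, TagModel{"de", ""}));     // different languages
    assert(modelMatches(TagModel{"", ""}, TagModel{"de", "DE"}));      // language unspecified on one side
}
```

Under this rule the relation is symmetric but not transitive: "en-US" matches "en" and "en" matches "en-GB", yet "en-US" does not match "en-GB".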

bool uima::Language::isValid ( void ) const [inline]

Returns TRUE if language is valid (territory may be missing).

const char * uima::Language::getLanguageCode ( void ) const [inline]

Get just the 2-character language part, or an empty string if unspecified.

Language::TyLanguageAsNumber uima::Language::getLanguage ( void ) const [inline]

Get a numeric form of just the language (2 characters in the top 2 bytes).

bool uima::Language::hasLanguage ( void ) const [inline]

Returns TRUE if language has been specified.

const char * uima::Language::getTerritoryCode ( void ) const [inline]

Get just the 2-character territory part, or an empty string if unspecified.

Language::TyLanguageAsNumber uima::Language::getTerritory ( void ) const [inline]

Get a numeric form of just the territory (2 characters in the bottom 2 bytes).

bool uima::Language::hasTerritory ( void ) const [inline]

Returns TRUE if territory has been specified.

void uima::Language::setValue ( const std::string & crclString ) [inline]

Sets the value according to string crclString.

crclString must have the form <LANG_ID>[-<TERR_ID>].
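To make the <LANG_ID>[-<TERR_ID>] form concrete, the following sketch splits such a string at the first hyphen. `ParsedTag`, `splitTag` and `looksLikeTwoLetterCode` are hypothetical helpers written for this example; they are not how the class parses its input. The shape check only tests for two ASCII letters and is an assumption: it does not consult the ISO 639 or ISO 3166 code lists.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical result type: an empty territory means none was given.
struct ParsedTag {
    std::string language;
    std::string territory;
};

// Splits "ll-cc" or "ll" at the first hyphen.
ParsedTag splitTag(const std::string& tag) {
    ParsedTag parts;
    const std::string::size_type dash = tag.find('-');
    if (dash == std::string::npos) {
        parts.language = tag;
    } else {
        parts.language = tag.substr(0, dash);
        parts.territory = tag.substr(dash + 1);
    }
    return parts;
}

// Shape check only (an assumption of this sketch): exactly two ASCII letters.
bool looksLikeTwoLetterCode(const std::string& code) {
    return code.size() == 2 &&
           std::isalpha(static_cast<unsigned char>(code[0])) != 0 &&
           std::isalpha(static_cast<unsigned char>(code[1])) != 0;
}

void splitTagExample() {
    const ParsedTag full = splitTag("en-US");
    assert(full.language == "en" && full.territory == "US");
    assert(splitTag("en").territory.empty());
    assert(!looksLikeTwoLetterCode(splitTag("en-FOO").territory));
}
```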

std::string uima::Language::asString ( void ) const [inline]

Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>.

For example, en-US.

icu::UnicodeString uima::Language::asUnicodeString ( void ) const [inline]

Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>.

For example, en-US.

Language::TyLanguageAsNumber uima::Language::asNumber ( void ) const [inline]

Returns the object as a 4-byte "number" (actually just the 4 character bytes, e.g. 0x656e7573 for 'enus').
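Reading such a value back as text is a matter of taking the four bytes from most to least significant. The helper below is a hypothetical illustration, not a UIMACPP function, and it assumes the bytes are plain ASCII characters; for instance, the bytes 0x65 0x6E 0x75 0x73 spell 'enus'.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: spells out the four bytes of a packed value,
// most significant byte first.
std::string bytesAsText(std::uint32_t packed) {
    std::string text;
    for (int shift = 24; shift >= 0; shift -= 8) {
        text.push_back(static_cast<char>((packed >> shift) & 0xFFu));
    }
    return text;
}

void bytesAsTextExample() {
    assert(bytesAsText(0x656E7573u) == "enus");
}
```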

Member Data Documentation

char const* uima::Language::INVALID [static]

Special constants for the invalid & unspecified languages.
