Language
The class Language models languages in UIMACPP.
A language is specified as a string holding a 2-character language code and an optional 2-character territory code, i.e. as "ll-cc" or "ll".
String representation of the simple language sub-part is according to ISO standard 639 "Codes for the representation of the names of languages".
String representation of the territory sub-part is according to ISO standard 3166 "Codes for the Representation of Names of Countries".
String representation of the full language object is according to IETF RFC 1766 "Tags for the Identification of Languages": <LANG>-<SUBTAG>
There is also an internal, technical numeric representation of a language as a 4-byte number (32 bits: the high word holds the language value and the low word holds the territory value). Conversion to and from the numeric representation is provided via a constructor and a conversion function (see asNumber).
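The high-word/low-word layout described above can be illustrated with a small standalone sketch. The helper names `packLanguage`, `languagePart` and `territoryPart` are hypothetical and not part of UIMACPP, and the sketch assumes plain ASCII codes stored exactly as given; how the real class encodes unspecified or invalid parts, or which letter case it stores, is not stated here.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not part of UIMACPP: packs a 2-character language code
// into the high word and a 2-character territory code into the low word.
// Assumption: an argument that is not exactly 2 characters contributes 0.
std::uint32_t packLanguage(const std::string& lang, const std::string& terr) {
    auto word = [](const std::string& s) -> std::uint32_t {
        if (s.size() != 2) return 0;  // "no value" in this sketch
        return (static_cast<std::uint32_t>(static_cast<unsigned char>(s[0])) << 8)
             | static_cast<std::uint32_t>(static_cast<unsigned char>(s[1]));
    };
    return (word(lang) << 16) | word(terr);
}

// Keeps only the high word (language) of a packed value.
std::uint32_t languagePart(std::uint32_t packed) { return packed & 0xFFFF0000u; }

// Keeps only the low word (territory) of a packed value.
std::uint32_t territoryPart(std::uint32_t packed) { return packed & 0x0000FFFFu; }
```

With this layout, the characters 'e', 'n', 'u', 's' (ASCII 0x65, 0x6e, 0x75, 0x73) pack to 0x656e7573, and masking the high or low word recovers the language or territory part alone.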
The class distinguishes between unspecified and invalid languages and territories. For example, in "en" the territory (sub-language) is valid but unspecified, whereas in "en-US" it is specified as "US". In "en-FOO", however, the territory is invalid because there is no such territory code.
Because of this, there is more than one way for two language objects to be compatible with each other:
Match type 2 is used if an annotator specifies that it can deal with any kind of English text and is not limited to, or specialized for, US English.
Match type 3 is not supported.
    Language clLanguage(argv[1]);
    if (!clLanguage.isValid()) {
        // abort with error
        //...
    }
    if (!(clLanguage.matches("en") || clLanguage.matches("de"))) {
        // abort with error
        //...
    }
Note: As the class is simple, the compiler-generated copy constructor and assignment operator can be used.
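The difference between an unspecified part and an invalid part, as described above, can be sketched without the library. Everything in this block (`PartStatus`, `classifyPart`, `ParsedTag`, `parseTag`) is a hypothetical illustration, not the UIMACPP implementation. As a simplifying assumption it only checks the shape of each part (two ASCII letters) and does not consult the ISO 639 or ISO 3166 code tables, so "en-FOO" is rejected here because of its length rather than by a table lookup.

```cpp
#include <cctype>
#include <string>

// Hypothetical three-way status for one part of a language tag.
enum class PartStatus { Unspecified, Specified, Invalid };

// Empty means "unspecified", exactly two ASCII letters means "specified",
// anything else is "invalid". Shape check only; no ISO table lookup.
PartStatus classifyPart(const std::string& part) {
    if (part.empty()) return PartStatus::Unspecified;
    if (part.size() == 2 &&
        std::isalpha(static_cast<unsigned char>(part[0])) &&
        std::isalpha(static_cast<unsigned char>(part[1]))) {
        return PartStatus::Specified;
    }
    return PartStatus::Invalid;
}

struct ParsedTag {
    PartStatus language;
    PartStatus territory;
};

// Splits "ll-cc" or "ll" at the first '-' and classifies both halves.
ParsedTag parseTag(const std::string& tag) {
    const std::string::size_type dash = tag.find('-');
    if (dash == std::string::npos) {
        return {classifyPart(tag), PartStatus::Unspecified};
    }
    const std::string terr = tag.substr(dash + 1);
    // A trailing dash with nothing after it ("en-") is malformed, not unspecified.
    return {classifyPart(tag.substr(0, dash)),
            terr.empty() ? PartStatus::Invalid : classifyPart(terr)};
}
```

Keeping "unspecified" separate from "invalid" is what allows a tag such as "en" to remain usable for matching while a malformed tag can still be rejected up front.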
Language constants and types | |
typedef long | TyLanguageAsNumber |
A typedef for representing a language as a numeric value. | |
char const * | INVALID |
Special constants for the invalid & unspecified languages. | |
char const * | UNSPECIFIED |
Public Member Functions | |
Constructors | |
Language (void) | |
Default Constructor: Language::UNSPECIFIED. | |
Language (const TCHAR *cpszLanguageCode) | |
Constructor from a C string. | |
Language (const std::string &languageCode) | |
Constructor from a single-byte string (std::string). | |
Language (const icu::UnicodeString &languageCode) | |
Constructor from an ICU Unicode string. | |
Language (const UnicodeStringRef &languageCode) | |
Constructor from a UnicodeStringRef. | |
Language (TyLanguageAsNumber ulLanguageAsLong) | |
Constructor from a 32 bit representation of a language (see asNumber). | |
Match functions | |
bool | operator== (const Language &crclObject) const |
Returns TRUE, if this language is identical to the specified language. | |
bool | operator!= (const Language &crclObject) const |
Returns TRUE, if this language is not identical to the specified language. | |
bool | operator< (const Language &crclOther) const |
Returns TRUE, if this language code sorts before the specified language. | |
bool | matches (const Language &crclCompareLang) const |
Returns TRUE, if the languages are identical and either the territories are equal or one is unspecified, or if one of the languages is unspecified. | |
Miscellaneous | |
bool | isValid (void) const |
Returns TRUE if language is valid (territory may be missing). | |
const char * | getLanguageCode (void) const |
Get just the 2-character language part, or an empty string if unspecified. | |
TyLanguageAsNumber | getLanguage (void) const |
Get a numeric form of just the language (2-characters in top 2-bytes). | |
bool | hasLanguage (void) const |
Returns TRUE if language has been specified. | |
const char * | getTerritoryCode (void) const |
Get just the 2-character territory part, or an empty string if unspecified. | |
TyLanguageAsNumber | getTerritory (void) const |
Get a numeric form of just the territory (2-characters in bottom 2-bytes). | |
bool | hasTerritory (void) const |
Returns TRUE if territory has been specified. | |
void | setValue (const std::string &crclString) |
Sets the value according to string crclString. | |
std::string | asString (void) const |
Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>. | |
icu::UnicodeString | asUnicodeString (void) const |
Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>. | |
TyLanguageAsNumber | asNumber (void) const |
Returns the object as a 4-byte "number" (actually just the 4 character bytes, e.g. 0x656e7573 for 'enus'). | |
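typedef long TyLanguageAsNumber
A typedef for representing a language as a numeric value.

Language (void)
Default Constructor: Language::UNSPECIFIED.

Language (const TCHAR *cpszLanguageCode)
Constructor from a C string.
String must have the form "ll-cc" or "ll", as described in the class overview.

Language (const std::string &languageCode)
Constructor from a single-byte string (std::string).
String must have the form "ll-cc" or "ll", as described in the class overview.

Language (const icu::UnicodeString &languageCode)
Constructor from an ICU Unicode string.
String must have the form "ll-cc" or "ll", as described in the class overview.

Language (const UnicodeStringRef &languageCode)
Constructor from a UnicodeStringRef.
String must have the form "ll-cc" or "ll", as described in the class overview.

Language (TyLanguageAsNumber ulLanguageAsLong)
Constructor from a 32 bit representation of a language (see asNumber).

bool operator== (const Language &crclObject) const
Returns TRUE, if this language is identical to the specified language.

bool operator!= (const Language &crclObject) const
Returns TRUE, if this language is not identical to the specified language.

bool operator< (const Language &crclOther) const
Returns TRUE, if this language code sorts before the specified language.

bool matches (const Language &crclCompareLang) const
Returns TRUE, if the languages are identical and either the territories are equal or one is unspecified, or if one of the languages is unspecified.

bool isValid (void) const
Returns TRUE if language is valid (territory may be missing).

const char * getLanguageCode (void) const
Get just the 2-character language part, or an empty string if unspecified.

TyLanguageAsNumber getLanguage (void) const
Get a numeric form of just the language (2-characters in top 2-bytes).

bool hasLanguage (void) const
Returns TRUE if language has been specified.

const char * getTerritoryCode (void) const
Get just the 2-character territory part, or an empty string if unspecified.

TyLanguageAsNumber getTerritory (void) const
Get a numeric form of just the territory (2-characters in bottom 2-bytes).

bool hasTerritory (void) const
Returns TRUE if territory has been specified.

void setValue (const std::string &crclString)
Sets the value according to string crclString.

std::string asString (void) const
Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>. For example, en-US.

icu::UnicodeString asUnicodeString (void) const
Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>. For example, en-US.

TyLanguageAsNumber asNumber (void) const
Returns the object as a 4-byte "number" (actually just the 4 character bytes, e.g. 0x656e7573 for 'enus').

char const * INVALID
char const * UNSPECIFIED
Special constants for the invalid & unspecified languages.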
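The difference between operator== and matches, as documented above, can be made concrete with a small standalone sketch. `LangTag`, `identical` and `matches` below are hypothetical stand-ins written only from the one-line descriptions in this reference; they are not the UIMACPP code. The sketch assumes an empty string stands for an unspecified part and ignores letter case and invalid values.

```cpp
#include <string>

// Hypothetical stand-in for a parsed language; "" means "unspecified".
struct LangTag {
    std::string language;   // e.g. "en", or "" if unspecified
    std::string territory;  // e.g. "US", or "" if unspecified
};

// Strict comparison: both parts must be equal (the operator== rule).
bool identical(const LangTag& a, const LangTag& b) {
    return a.language == b.language && a.territory == b.territory;
}

// Relaxed comparison following the documented matches() rule:
// true if one language is unspecified, or if the languages are equal and
// the territories are equal or at least one territory is unspecified.
bool matches(const LangTag& a, const LangTag& b) {
    if (a.language.empty() || b.language.empty()) return true;
    if (a.language != b.language) return false;
    return a.territory.empty() || b.territory.empty() ||
           a.territory == b.territory;
}
```

Under these assumptions "en-US" matches "en" but is not identical to it, while "en-US" and "en-GB" neither match nor compare equal. This is the behaviour an annotator relies on when it declares support for "en" in general.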