Text Conversion Functions

This text describes the functions rtl_convertTextToUnicode() and rtl_convertUnicodeToText(), the meaning of all the accompanying RTL_TEXTTOUNICODE_FLAGS_XXX, RTL_TEXTTOUNICODE_INFO_XXX, RTL_UNICODETOTEXT_FLAGS_XXX, and RTL_UNICODETOTEXT_INFO_XXX flags, and the conversion context conventions.

It is valid to pass a null pointer instead of an rtl_TextToUnicodeContext or rtl_UnicodeToTextContext to the conversion functions. In that case, the functions behave as if they had received an initial context, as obtained from rtl_createTextToUnicodeContext(), rtl_resetTextToUnicodeContext(), rtl_createUnicodeToTextContext(), or rtl_resetUnicodeToTextContext(), and simply do not return any context information (which is effectively lost). This implies that you should always specify the FLAGS_FLUSH flag when using a null context, because otherwise it is generally not possible to find out whether the input buffer has been completely converted.
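
An undefined code is any of the following:

- A code in the input buffer that the input encoding does not assign to any character; for example, 0xA5 in ISO 8859-3, 0xA2A1 in EUC-CN, and 0x167F in Unicode.
- A code in the input buffer that the input encoding assigns to a character, but that cannot be mapped to the output encoding; for example, 0x0100 in Unicode, which cannot be mapped to ISO 8859-1; and 0xA698 in HangulTalk, which cannot be mapped to Unicode.

In the text-to-Unicode direction, the conversion functions distinguish between single-byte and multi-byte undefined codes. For example, 0xA5 in ISO 8859-3 and 0x80 in GB-18030 are single-byte undefined codes, while 0xA2A1 in EUC-CN and 0xFE39FE39 in GB-18030 are multi-byte undefined codes.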

When encountering an undefined code, the conversion functions allow any of the following behaviours (which are mutually exclusive):

- FLAGS_UNDEFINED_ERROR, FLAGS_MBUNDEFINED_ERROR: Set the INFO_UNDEFINED or INFO_MBUNDEFINED and the INFO_ERROR flags, and immediately quit the conversion (ignoring any FLAGS_FLUSH flag).
- FLAGS_UNDEFINED_IGNORE, FLAGS_MBUNDEFINED_IGNORE: Set the INFO_UNDEFINED or INFO_MBUNDEFINED flag, and continue with the conversion.
- FLAGS_UNDEFINED_MAPTOPRIVATE: Set the INFO_UNDEFINED flag, write U+F1xx into the output buffer (where xx is the single-byte undefined code), and continue with the conversion.
- FLAGS_UNDEFINED_0: Set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII NUL character (0x00) into the output buffer, and continue with the conversion.
- FLAGS_UNDEFINED_QUESTIONMARK: Set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII “?” character (0x3F) into the output buffer, and continue with the conversion.
- FLAGS_UNDEFINED_UNDERLINE: Set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII “_” character (0x5F) into the output buffer, and continue with the conversion.
- FLAGS_UNDEFINED_DEFAULT: Set the INFO_UNDEFINED flag, write some output-encoding–specific character (currently U+FFFD for Unicode and “?” for all other encodings) into the output buffer, and continue with the conversion.

In the Unicode-to-text direction, the conversion functions also allow any of the following extra flags (of which an arbitrary number can be specified). In all cases, the usual checks for an exhausted output buffer are made; if the buffer is not exhausted, the INFO_UNDEFINED flag is set.

- FLAGS_UNDEFINED_REPLACE: An undefined character is replaced with a similar character of the output encoding; for example, U+00A0 (NO-BREAK SPACE) could be mapped to 0x20 (SPACE) in ASCII. Expect this to be poorly supported by the current implementation.
- FLAGS_UNDEFINED_REPLACESTR: An undefined character is replaced with a similar string of characters of the output encoding; for example, U+00A9 (COPYRIGHT SIGN) could be mapped to the three-character string “(C)” in ASCII. Expect this to be poorly supported by the current implementation.
- FLAGS_PRIVATE_MAPTO0: Private Use Area characters (U+E000–U+F8FF, U+F0000–U+FFFFD, and U+100000–U+10FFFD) are mapped to an (appropriately encoded) ASCII NUL character (0x00) in the output buffer.
- FLAGS_NONSPACING_IGNORE: Non-spacing characters, such as U+200B (ZERO WIDTH SPACE) and U+FEFF (ZERO WIDTH NO-BREAK SPACE), are ignored. Expect some uncertainty in the current implementation as to which characters are affected.
- FLAGS_CONTROL_IGNORE: Control characters (U+0000–U+001F and U+007F–U+009F) are ignored.
- FLAGS_PRIVATE_IGNORE: Private Use Area characters (U+E000–U+F8FF, U+F0000–U+FFFFD, and U+100000–U+10FFFD) are ignored.

There is also a FLAGS_NOCOMPOSITE flag; I am not sure what it should be used for.
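
To illustrate how several of the extra Unicode-to-text flags can combine, here is a minimal sketch of a toy UTF-32-to-ASCII encoder. It is not the rtl implementation: the names sk_to_ascii, sk_is_private, sk_is_control and the SK_XXX flag values are hypothetical, the fallback for remaining undefined characters is fixed to “?” (as FLAGS_UNDEFINED_QUESTIONMARK would do), and the precedence of "ignore" over "map to NUL" is an assumption of the sketch. Only the character ranges are taken from the list above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch only; these are NOT the values of
   the real RTL_UNICODETOTEXT_FLAGS_XXX constants. */
enum {
    SK_CONTROL_IGNORE = 1u << 0,
    SK_PRIVATE_IGNORE = 1u << 1,
    SK_PRIVATE_MAPTO0 = 1u << 2
};

/* Private Use Area ranges as listed in the text. */
static int sk_is_private(uint32_t cp)
{
    return (cp >= 0xE000u && cp <= 0xF8FFu)
        || (cp >= 0xF0000u && cp <= 0xFFFFDu)
        || (cp >= 0x100000u && cp <= 0x10FFFDu);
}

/* Control character ranges as listed in the text. */
static int sk_is_control(uint32_t cp)
{
    return cp <= 0x1Fu || (cp >= 0x7Fu && cp <= 0x9Fu);
}

/* Toy encoder: code points below 0x80 map to themselves; everything else is
   treated as undefined in ASCII. Returns the number of bytes written and sets
   *undefined_seen (modelling INFO_UNDEFINED) when an undefined character was
   handled without running out of output space. */
static size_t sk_to_ascii(const uint32_t *src, size_t n, char *dst, size_t cap,
                          unsigned flags, int *undefined_seen)
{
    size_t out = 0;
    *undefined_seen = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t cp = src[i];
        int undefined = cp >= 0x80u;
        int skip = 0;
        char ch = '?';                      /* fixed fallback (assumption) */
        if (!undefined) {
            ch = (char)cp;
        } else if ((flags & SK_CONTROL_IGNORE) && sk_is_control(cp)) {
            skip = 1;
        } else if ((flags & SK_PRIVATE_IGNORE) && sk_is_private(cp)) {
            skip = 1;                       /* ignore wins over map-to-NUL here */
        } else if ((flags & SK_PRIVATE_MAPTO0) && sk_is_private(cp)) {
            ch = '\0';
        }
        if (!skip) {
            if (out == cap) {
                break;                      /* output buffer exhausted: stop */
            }
            dst[out++] = ch;
        }
        if (undefined) {
            *undefined_seen = 1;
        }
    }
    return out;
}
```

With SK_CONTROL_IGNORE and SK_PRIVATE_MAPTO0 both set, an input of 'A', U+0085, U+E000, U+0100, 'B' would produce the four bytes 'A', NUL, '?', 'B' in this sketch.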

An invalid code is a string of one or more units in the input buffer that is not valid according to the input encoding:

- A string of units that cannot be the start of any valid string (for example, 0x80 in ASCII, or 0xFF in GB-18030).
- A string of units that is a proper prefix of a valid string, but is not followed by the units needed to complete it (for example, 0xD800 in Unicode, with a following low-surrogate missing, or 0xA1 in EUC-CN, with a second byte in the range 0xA1–0xFE missing).

Invalid codes of the second category (that are potentially prefixes of valid strings) are handled specially at the end of the input buffer. If the FLAGS_FLUSH flag is specified, they are handled like all other invalid codes. Otherwise, the INFO_SRCBUFFERTOSMALL flag is set to indicate that the input buffer possibly ended in the middle of an input character (and the prefix is either not yet read, or is stored in the conversion context, or is partly read and partly stored in the conversion context).
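
The end-of-buffer rule can be sketched with a toy scanner for an EUC-CN-like stream (bytes below 0x80 are single-byte characters; a lead byte in 0xA1–0xFE needs a trail byte in the same range). This is only a model under stated assumptions: sk_scan and the SK_INFO_XXX values are hypothetical names, invalid codes inside the buffer are simply skipped, and an incomplete prefix is left unread rather than stored in a context (the text above allows either).

```c
#include <stddef.h>

/* Hypothetical result bits for this sketch only; these are NOT the values of
   the real RTL_TEXTTOUNICODE_INFO_XXX constants. */
enum {
    SK_INFO_INVALID = 1u << 0,
    SK_INFO_SRCBUFFERTOSMALL = 1u << 1
};

static int sk_in_range(unsigned char b)
{
    return b >= 0xA1 && b <= 0xFE;
}

/* Scans n bytes of src. Returns the number of source bytes consumed, stores
   the number of complete characters in *chars and SK_INFO_XXX bits in *info.
   flush models the FLAGS_FLUSH flag. */
static size_t sk_scan(const unsigned char *src, size_t n, int flush,
                      size_t *chars, unsigned *info)
{
    size_t i = 0;
    *chars = 0;
    *info = 0;
    while (i < n) {
        unsigned char b = src[i];
        if (b < 0x80) {                     /* single-byte character */
            ++i;
            ++*chars;
        } else if (!sk_in_range(b)) {       /* can never start a character */
            *info |= SK_INFO_INVALID;
            ++i;
        } else if (i + 1 == n) {            /* lead byte at the very end */
            if (flush) {
                *info |= SK_INFO_INVALID;   /* treated like any invalid code */
                ++i;
            } else {
                *info |= SK_INFO_SRCBUFFERTOSMALL; /* prefix left unread */
            }
            break;
        } else if (sk_in_range(src[i + 1])) {
            i += 2;                         /* complete two-byte character */
            ++*chars;
        } else {                            /* lead byte without valid trail */
            *info |= SK_INFO_INVALID;
            ++i;
        }
    }
    return i;
}
```

In this sketch, a caller that sees SK_INFO_SRCBUFFERTOSMALL would pass the unconsumed tail again together with the next chunk of input, and would set flush only for the final chunk.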

When encountering an invalid code (other than the special cases at the end of the input buffer), the conversion functions allow any of the following behaviours (which are mutually exclusive):

- FLAGS_INVALID_ERROR: Set the INFO_INVALID and the INFO_ERROR flags, and immediately quit the conversion (ignoring any FLAGS_FLUSH flag).
- FLAGS_INVALID_IGNORE: Set the INFO_INVALID flag, and continue with the conversion.
- FLAGS_INVALID_0: Set the INFO_INVALID flag, write an (appropriately encoded) ASCII NUL character (0x00) into the output buffer, and continue with the conversion.
- FLAGS_INVALID_QUESTIONMARK: Set the INFO_INVALID flag, write an (appropriately encoded) ASCII “?” character (0x3F) into the output buffer, and continue with the conversion.
- FLAGS_INVALID_UNDERLINE: Set the INFO_INVALID flag, write an (appropriately encoded) ASCII “_” character (0x5F) into the output buffer, and continue with the conversion.
- FLAGS_INVALID_DEFAULT: Set the INFO_INVALID flag, write some output-encoding–specific character (currently U+FFFD for Unicode and “?” for all other encodings) into the output buffer, and continue with the conversion.

If, in the course of conversion, there is not enough space left in the output buffer (either for a normal character mapping or for a special mapping of undefined or invalid codes), the INFO_DESTBUFFERTOSMALL flag is set, and the conversion is quit immediately (ignoring any FLAGS_FLUSH flag). It is unspecified whether the input units that would overflow the output buffer are already read (and stored in the conversion context) or not, but the number of processed input buffer units returned by the conversion function will be correct in either case.
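
The invalid-code behaviours and the output-buffer check can be modelled with a toy ASCII-to-UTF-16 decoder, in which every byte from 0x80 upwards is an invalid code. This is a sketch, not the rtl implementation: sk_ascii_to_utf16, sk_policy, and the SK_INFO_XXX values are hypothetical names, and the choice to stop before the offending byte in the error case (and before the unit that would overflow) is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the mutually exclusive FLAGS_INVALID_XXX
   behaviours; not the real constants. */
typedef enum {
    SK_INVALID_ERROR,
    SK_INVALID_IGNORE,
    SK_INVALID_0,
    SK_INVALID_QUESTIONMARK,
    SK_INVALID_UNDERLINE,
    SK_INVALID_DEFAULT
} sk_policy;

/* Hypothetical result bits; not the real INFO_XXX values. */
enum {
    SK_INFO_ERROR = 1u << 0,
    SK_INFO_INVALID = 1u << 1,
    SK_INFO_DESTBUFFERTOSMALL = 1u << 2
};

/* Decodes n ASCII bytes into at most cap UTF-16 units. Returns the number of
   units written; *src_used receives the number of processed input bytes and
   *info the SK_INFO_XXX bits. */
static size_t sk_ascii_to_utf16(const unsigned char *src, size_t n,
                                uint16_t *dst, size_t cap, sk_policy policy,
                                unsigned *info, size_t *src_used)
{
    size_t in = 0, out = 0;
    *info = 0;
    while (in < n) {
        unsigned char b = src[in];
        int invalid = b >= 0x80;
        int emit = 1;
        uint16_t ch = b;
        if (invalid) {
            switch (policy) {
            case SK_INVALID_ERROR:          /* quit immediately */
                *info |= SK_INFO_INVALID | SK_INFO_ERROR;
                *src_used = in;
                return out;
            case SK_INVALID_IGNORE:       emit = 0;    break;
            case SK_INVALID_0:            ch = 0x0000; break;
            case SK_INVALID_QUESTIONMARK: ch = 0x003F; break;
            case SK_INVALID_UNDERLINE:    ch = 0x005F; break;
            case SK_INVALID_DEFAULT:      ch = 0xFFFD; break;
            }
        }
        if (emit) {
            if (out == cap) {               /* no room left: quit immediately */
                *info |= SK_INFO_DESTBUFFERTOSMALL;
                break;
            }
            dst[out++] = ch;
        }
        if (invalid) {
            *info |= SK_INFO_INVALID;
        }
        ++in;
    }
    *src_used = in;
    return out;
}
```

A caller that sees SK_INFO_DESTBUFFERTOSMALL in this sketch would resume at src + *src_used with a fresh output buffer, which mirrors the guarantee above that the reported number of processed input units is reliable.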

Author: Stephan Bergmann (Last modification $Date: 2004/12/08 14:22:01 $). Copyright 2001 OpenOffice.org Foundation. All Rights Reserved.