URIs in UDK
This text describes how Uniform Resource Identifiers (URIs, see RFC 2396) are used within the UDK and within OpenOffice.org in general. URIs encompass both the widely known Uniform Resource Locators (URLs) and the lesser known Uniform Resource Names (URNs), but if you only know the term URL and neither URI nor URN, that's just fine, as the discussion here in fact centers around only a few URL schemes.
Used URL Schemes
Currently, the UDK only uses two URL schemes directly: file URLs and uno URLs. File URLs are defined in RFC 1738, but that definition leaves the semantics somewhat open. The OpenOffice.org code chose to use a certain interpretation of file URLs (described in this text), but that interpretation can be incompatible with the interpretations used by other programs.
Uno URLs follow a private scheme that is explained in the document UNO-Url.
Other URL schemes (http, ftp, etc.) are used within OpenOffice.org (mainly through the class INetURLObject in the tools project), and many of them suffer from the same problems as file URLs (see below).
File URL Basics
A file URL consists of encoded data, interspersed with some delimiting characters. Considering the file URL
file://host/seg1/seg2/…/segn
the host and seg1, seg2, …, segn parts are the encoded data, and the rest (the file:// and the single slashes) are syntactic delimiters.
The encoded data in each of the seg1, …, segn parts is a sequence of bytes, written using ASCII characters (that represent the numeric values of the ASCII characters themselves) and escape sequences (a % followed by two hexadecimal digits, representing any numeric value in the range 0–255). (The encoded data in the optional host part follows other rules.)
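The structure described above can be illustrated with a small Python sketch. It is not taken from the OpenOffice.org code; the function name split_file_url is hypothetical, and the sketch assumes a URL that starts with file:// and has no query or fragment part.

```python
from urllib.parse import unquote_to_bytes


def split_file_url(url):
    """Split a file URL into its host part and its decoded segments.

    Hypothetical helper for illustration only.
    """
    prefix = "file://"
    if not url.startswith(prefix):
        raise ValueError("not a file URL")
    # Everything up to the first single slash is the (possibly empty) host.
    host, _, path = url[len(prefix):].partition("/")
    # Each segment is a byte sequence: plain ASCII characters stand for
    # their own byte value, %XX escapes stand for the byte 0xXX.
    segments = [unquote_to_bytes(seg) for seg in path.split("/")]
    return host, segments
```

For instance, split_file_url("file://host/a%20b/%FF") yields the host "host" and the two byte sequences b"a b" and b"\xff".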
File URLs are used to locate files on a certain machine, that is, they somehow have to encode (platform-dependent) file system paths as used on that machine in their seg1, …, segn parts. The problem is that there is no global specification of exactly how file URLs encode file system paths.
The strategy chosen by the OpenOffice.org code maps from file system paths (as used in some operating system's interfaces) to file URLs in two steps. In the first step (which is platform dependent), a hierarchical file system path is translated into a sequence of segments, represented in Unicode. In the second step (which is platform independent), those segments are translated into sequences of bytes, using UTF-8, and then concatenated into URLs (encoding the individual bytes as single ASCII characters or as escape sequences).
As an example, consider the Windows file system path
C:\directory\other dir\file.txt
This is first translated into the four segments C:, directory, other dir, and file.txt (all represented using Unicode). Using UTF-8, these segments are then translated into the corresponding byte sequences (represented here as sequences of hexadecimal numbers):
- 43 3A (C:)
- 64 69 72 65 63 74 6F 72 79 (directory)
- 6F 74 68 65 72 20 64 69 72 (other dir)
- 66 69 6C 65 2E 74 78 74 (file.txt)
Then, these byte sequences are combined into a single file URL (adding the necessary syntactic delimiters), namely
file:///C:/directory/other%20dir/file.txt
Nothing exciting here (all the bytes from the four sequences are represented as the corresponding ASCII characters, except for the space in other dir, which is illegal in URLs and is thus escaped as %20).
Similarly, a Unix file system path
/directory/other dir/file.txt
is translated into the URL
file:///directory/other%20dir/file.txt
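The two-step translation can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual OpenOffice.org implementation: the function names are hypothetical, the splitting step ignores special cases such as UNC paths, and the set of characters left unescaped (letters, digits, a few punctuation marks, and the colon) is an assumption of this sketch.

```python
from urllib.parse import quote


def split_windows_path(path):
    # Step 1 (platform dependent): Windows path -> Unicode segments.
    return [seg for seg in path.split("\\") if seg]


def split_unix_path(path):
    # Step 1 (platform dependent): Unix path -> Unicode segments.
    return [seg for seg in path.split("/") if seg]


def segments_to_file_url(segments):
    # Step 2 (platform independent): encode each segment as UTF-8, then
    # write each byte as an ASCII character or as a %XX escape.
    encoded = [quote(seg.encode("utf-8"), safe=":") for seg in segments]
    return "file:///" + "/".join(encoded)


windows_url = segments_to_file_url(
    split_windows_path("C:\\directory\\other dir\\file.txt"))
unix_url = segments_to_file_url(
    split_unix_path("/directory/other dir/file.txt"))
```

With these definitions, windows_url is file:///C:/directory/other%20dir/file.txt and unix_url is file:///directory/other%20dir/file.txt. Only the splitting step differs between the two platforms.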
Non-ASCII Characters
On many platforms, file system paths can contain non-ASCII characters. This is typically handled within the operating system by naming files with byte strings, and letting the user choose a character encoding (via a locale) that specifies how these byte strings are to be interpreted.
Consider the Unix file system path /stränge (assuming the user's locale is such that it contains an ä in the supported character repertoire). Again, translation into a file URL is done in two steps: First, the path is split into the single segment stränge (represented as a Unicode string; some platform-dependent “magic” is needed to convert from the operating system's representation to this Unicode representation). Second, that segment is translated into the (hexadecimal) byte sequence 73 74 72 C3 A4 6E 67 65, and the URL file:///str%C3%A4nge is constructed from that byte sequence.
Other programs may handle file URLs differently, in that they directly use the operating system's byte strings (interpreted in a locale chosen by the user) within the file URL, without going via Unicode and UTF-8. For example, if the user had chosen an ISO 8859-1 locale, the Unix file system path /stränge (possibly represented as the hexadecimal byte string 2F 73 74 72 E4 6E 67 65 by the operating system) would correspond to the URL file:///str%E4nge.
If the OpenOffice.org code wants to exchange file URLs with such other programs, the URLs have to be converted back and forth at the interfaces, to avoid any misunderstandings. This problem is typically not noticed when the file system paths contain only ASCII characters, as these are always represented the same within file URLs.
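A conversion at such an interface might look like the following Python sketch. The function name recode_url_path is hypothetical, and the sketch assumes that the other program's encoding is known (here ISO 8859-1) and that only the path part of the URL is passed in.

```python
from urllib.parse import quote, unquote_to_bytes


def recode_url_path(url_path, from_encoding, to_encoding):
    """Re-encode the escaped bytes of a URL path, segment by segment.

    Hypothetical helper; assumes the caller knows both encodings.
    """
    recoded = []
    for seg in url_path.split("/"):
        # Undo the escaping, interpret the bytes in the source encoding,
        # then encode and escape again in the target encoding.
        text = unquote_to_bytes(seg).decode(from_encoding)
        recoded.append(quote(text.encode(to_encoding), safe=":"))
    return "/".join(recoded)


# Locale dependent (ISO 8859-1) path to UTF-8 path, and back again.
utf8_path = recode_url_path("/str%E4nge", "iso-8859-1", "utf-8")
latin1_path = recode_url_path(utf8_path, "utf-8", "iso-8859-1")
```

Here utf8_path is /str%C3%A4nge and latin1_path is /str%E4nge again. A path that contains only ASCII characters passes through unchanged in both directions, which is why the problem often goes unnoticed.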
Both approaches (“UTF-8 URLs” and “locale dependent URLs”) have pros and cons, but the main benefit of the OpenOffice.org (i.e., UTF-8 URLs) approach seems to be the stability of how a file URL locates a specific file, regardless of context. Imagine a text processor application that lets you include URLs as hyperlinks within your documents, and imagine that application used locale dependent URLs. When creating a document, a user has specified locale X for his operating system, and a file URL included in the document is therefore encoded using the conventions of locale X. Now, the user switches to locale Y and re-opens the document: The hyperlink is no longer guaranteed to point to the same file, as the file URL is not encoded using the conventions of locale Y.
Also, with the UTF-8 URLs approach, code that handles file URLs can often be made simpler and platform-independent. A typical scenario is a text edit field allowing the user to enter a (file) URL. Many users are not familiar with the nitty-gritty details of URLs, and will type in things like file:///stränge. Even though this is not a valid URL (since an ä is not allowed within a URL), it would be nice if the application were forgiving and would handle that input as locating the file /stränge. With the UTF-8 URLs approach, this is easy, as the text edit can silently convert the input into the correct URL file:///str%C3%A4nge. With the locale dependent URLs approach, this would be more difficult, as the text edit would have to know which locale is in use, to convert the input into something like file:///str%E4nge, but only in case the ISO 8859-1 locale was used.
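Such a forgiving text edit field could be sketched in Python as shown here. The function name forgiving_file_url is hypothetical; the sketch assumes that slashes, colons, and existing % escapes in the input are meant as written and must be left alone.

```python
from urllib.parse import quote


def forgiving_file_url(user_input):
    """Turn loosely typed input into a valid UTF-8 file URL.

    Hypothetical helper; no locale information is needed.
    """
    # Characters not allowed in a URL are encoded as UTF-8 and escaped.
    # "/" and ":" stay as delimiters, "%" keeps existing escapes intact.
    return quote(user_input, safe="/:%")
```

For the input file:///stränge this returns file:///str%C3%A4nge, and an already valid URL such as file:///str%C3%A4nge is returned unchanged.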
Note that these problems of interpreting non-ASCII characters are not restricted to file URLs. Other URL schemes that do not explicitly state what character encoding to use (like http and ftp URLs) have similar problems. As Richard Gillam puts it in his book Unicode Demystified (Addison-Wesley, 2003):
The industry is converging around always treating escape sequences in URLs as referring to UTF-8 code units. That is, the industry is leaning toward always interpreting R%c3%a9sum%c3%a9.html to mean Résumé.html (and always representing Résumé.html as R%c3%a9sum%c3%a9.html). If everyone agreed on this system, then you could use illegal URL characters (such as the accented é in our example) in URL references in other kinds of documents (such as HTML or XML files) and know that a universally understood method of transforming them into a legal URL existed. Web browsers or other software could do the reverse, displaying URLs that include escape sequences by using the characters the escape sequences represent (at least in the cases where they represent non-ASCII characters) and allowing you to type them in that way.
The UTF-8 URLs Approach
There are a number of problems specific to the UTF-8 URLs approach:
First, the UTF-8 URLs approach has an additional step compared to the locale dependent URLs approach, namely the translation between a system specific representation of some textual entity and a Unicode representation of that entity. In the one direction, a problem might occur when a certain entity represented in the system specific way cannot be represented in Unicode (e.g., because it contains characters that are not present in the Unicode repertoire). Given that Unicode was specifically designed to encompass the repertoires of all legacy character encodings in use today, chances for such a problem should be close to zero. In the other direction, different Unicode strings could translate to the same system specific representation (e.g., because two different Unicode characters are mapped to the same character in the system specific encoding). This leads to two different URLs locating the same resource, something that should not be considered much of a problem, since it is already a wide-spread phenomenon (think about file URLs differing in the use of upper and lower case letters on a case-insensitive file system, or think about file systems that support links).
Another problem stems from the fact that Unicode allows a single “conceptual character” (as interpreted by a user, e.g., the character “ä”) to be represented in different ways. For example, the conceptual character “ä” can be represented as either the single code point
- U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
or as a sequence of the two code points
- U+0061 LATIN SMALL LETTER A
- U+0308 COMBINING DIAERESIS
(a so-called “combining character sequence”). Both representations should be considered equivalent, so that the two URLs file:///str%C3%A4nge (containing a UTF-8 encoded U+00E4) and file:///stra%CC%88nge (containing a U+0061 represented as ASCII a followed by a UTF-8 encoded U+0308) should probably be considered equivalent also, and should both denote a file named stränge. In current versions of OpenOffice.org (“SRC644”), loading a file named stränge on Windows XP works with the former URL, but fails with the latter.
Two solutions for this problem seem possible: One solution would be to enhance the (platform dependent) code that maps from Unicode to a system specific character encoding so that it handles combining character sequences correctly. Another solution is to require all URLs to use only one form of Unicode strings (i.e., to use a normalization form), which would make this problem go away. The obvious choice is to use Unicode Normalization Form C, as is also recommended by the W3C's Character Model for the World Wide Web 1.0. Using that solution, only file:///str%C3%A4nge would be a valid URL to access the file stränge, while the URL file:///stra%CC%88nge would be ruled out as invalid.
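The normalization form requirement can be illustrated with Python's standard unicodedata module. This is a sketch only: the two function names are hypothetical, and the code assumes that all escape sequences in the URL are valid UTF-8.

```python
import unicodedata
from urllib.parse import quote, unquote


def is_nfc_file_url(url):
    # A URL passes if every decoded segment is already in Form C.
    for seg in url.split("/"):
        text = unquote(seg, encoding="utf-8", errors="strict")
        if unicodedata.normalize("NFC", text) != text:
            return False
    return True


def normalize_file_url_nfc(url):
    # Decode each segment, normalize it to Form C, and escape it again.
    parts = []
    for seg in url.split("/"):
        text = unquote(seg, encoding="utf-8", errors="strict")
        nfc = unicodedata.normalize("NFC", text)
        parts.append(quote(nfc.encode("utf-8"), safe=":"))
    return "/".join(parts)
```

With these helpers, is_nfc_file_url accepts file:///str%C3%A4nge and rejects file:///stra%CC%88nge, while normalize_file_url_nfc maps the latter to the former. Whether an application should reject or silently normalize such input is a design decision that this text leaves open.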
Following the approach of requiring URLs to use a normalization form, a new problem might show up: Consider an operating system that allows files to be named using some Unicode encoding (any of the UTF-n encoding forms/schemes), but that is non-compliant enough to allow two different files to have names that a Unicode-compliant system should consider equivalent (e.g., a file named stränge written using U+00E4 and another file named stränge written using U+0061 followed by U+0308). Now, the requirement to always use normalized Unicode strings within file URLs makes it impossible to access one of the two files with a URL. (This is another manifestation of the problem already described above, that an entity represented in the system specific way cannot be represented in Unicode or, more specifically, in some Unicode normalization form. Above, that problem was said to be ignorable; here, one can only hope for Unicode-compliant operating systems that do not allow two different files to have equivalent names.)
Drive Letters
In Windows, as well as in related systems like DOS and OS/2, file system paths start with a drive letter, followed by a colon (e.g., the C: in C:\directory\file.txt). That drive letter (together with the following colon) makes up the first segment of file URLs on these systems (as in file:///C:/directory/file.txt). Historically, however, a vertical bar has also been used instead of the colon, as in file:///C|/directory/file.txt (note that the vertical bar is not escaped as %7C in this special case, even though it is an illegal character).
The OpenOffice.org code can handle both cases of file URLs (with either a colon or a vertical bar), but URLs generated by the code always follow the “standard” convention of using a colon.
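Accepting both forms while always producing the colon form could be sketched in Python like this. The regular expression and the function name are assumptions of this sketch, not the actual OpenOffice.org code, and the sketch only makes sense for URLs that are known to denote Windows-style paths.

```python
import re

# A single letter followed by "|" as the complete first path segment.
_DRIVE_WITH_BAR = re.compile(r"^(file://[^/]*/[A-Za-z])\|(?=/|$)")


def normalize_drive_letter(url):
    """Rewrite file:///C|/... to file:///C:/... (hypothetical helper)."""
    return _DRIVE_WITH_BAR.sub(r"\1:", url)
```

For example, normalize_drive_letter("file:///C|/directory/file.txt") returns file:///C:/directory/file.txt, and a URL that already uses a colon is returned unchanged.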
Hosts
The file URL scheme allows for an optional host component after the file:// prefix. Specifying localhost (in any combination of upper and lower case letters) is the same as leaving that component empty: the URL locates a file system path on the current machine.
The intended use for this host component was to specify the DNS name (or IPv4/IPv6 address) of a machine, to indicate that the URL locates a file system path on that machine. The problem is that there is no protocol that details how such a remote file should be accessed, so interpretation of file URLs containing a host component was left unspecified.
On Unix, the OpenOffice.org code does not support file URLs with a host component. Windows, on the other hand, knows the concept of UNC paths, file system paths containing the name of a remote machine:
\\somewhere\somedir\file.txt
The machine names used in UNC paths (somewhere in the above example) have a different structure from DNS names, so, strictly speaking, they could not be used in the host component of file URLs. Nevertheless, many applications on Windows, including the OpenOffice.org code, follow the convention of allowing UNC machine names within file URLs. This means that the above UNC path corresponds to the URL
file://somewhere/somedir/file.txt
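The correspondence between UNC paths and file URLs can be sketched in Python as follows. The function name unc_to_file_url is hypothetical, and placing the machine name into the host component without any escaping is an assumption of this sketch.

```python
from urllib.parse import quote


def unc_to_file_url(unc_path):
    """Map \\\\machine\\seg1\\...\\segn to file://machine/seg1/.../segn.

    Hypothetical helper for illustration only.
    """
    if not unc_path.startswith("\\\\"):
        raise ValueError("not a UNC path")
    # The first component is the machine name, the rest are path segments.
    machine, *segments = unc_path[2:].split("\\")
    encoded = [quote(seg.encode("utf-8"), safe="") for seg in segments]
    return "file://" + machine + "/" + "/".join(encoded)
```

Calling unc_to_file_url(r"\\somewhere\somedir\file.txt") gives file://somewhere/somedir/file.txt, matching the example above.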
Author: Stephan Bergmann (Last modification $Date: 2004/12/08 12:03:50 $).