URIs in UDK

This text describes how Uniform Resource Identifiers (URIs, see RFC 2396) are used within the UDK and within OpenOffice.org in general. URIs encompass both the widely known Uniform Resource Locators (URLs) and the lesser known Uniform Resource Names (URNs), but if you only know the term URL and neither URI nor URN, that's just fine, as the discussion here in fact centers around only a few URL schemes.

Used URL Schemes

Currently, the UDK only uses two URL schemes directly: file URLs and uno URLs. File URLs are defined in RFC 1738, but that definition leaves the semantics somewhat open. The OpenOffice.org code chose to use a certain interpretation of file URLs (described in this text), but that interpretation can be incompatible with the interpretations used by other programs.

Uno URLs follow a private scheme that is explained in the document UNO-Url.

Other URL schemes (http, ftp, etc.) are used within OpenOffice.org (mainly through the class INetURLObject in the tools project), and many of them suffer from the same problems as file URLs (see below).

File URL Basics

A file URL consists of encoded data, interspersed with some delimiting characters. Considering the file URL

file://host/seg₁/seg₂/…/seg_n

the host and seg₁, seg₂, …, seg_n parts are the encoded data, and the rest (the file:// and the single slashes) are syntactic delimiters. The encoded data in the seg_i parts are sequences of bytes, written using ASCII characters (that represent the numeric values of the ASCII characters themselves) and escape sequences (a % followed by two hexadecimal digits, representing any numeric value in the range 0–255). (The encoded data in the optional host part follows other rules.)
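The structure described above can be made concrete with a small sketch. It is only an illustration, not code taken from OpenOffice.org: the function name split_file_url is hypothetical, and the host part is returned as-is because its rules differ from those of the segments.

```python
from urllib.parse import unquote_to_bytes


def split_file_url(url):
    """Split a file URL into its host part and its decoded segments.

    Hypothetical helper for illustration only. Each segment is returned
    as a byte string, with escape sequences (%XX) turned into the byte
    they denote. The host part is returned undecoded.
    """
    prefix = "file://"
    if not url.lower().startswith(prefix):
        raise ValueError("not a file URL: " + url)
    rest = url[len(prefix):]
    # Everything up to the first single slash is the (possibly empty) host.
    host, slash, path = rest.partition("/")
    if not slash:
        raise ValueError("file URL has no path: " + url)
    # The remaining single slashes are delimiters between segments.
    segments = [unquote_to_bytes(seg) for seg in path.split("/")]
    return host, segments
```

Here the file:// prefix and the slashes are consumed as delimiters, and only the data between them is decoded.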

File URLs are used to locate files on a certain machine, that is, they somehow have to encode (platform-dependent) file system paths as used on that machine in their seg_i parts. The problem is that there is no global specification of exactly how file URLs encode file system paths.

The strategy chosen by the OpenOffice.org code maps from file system paths (as used in some operating system's interfaces) to file URLs in two steps. In the first step (which is platform dependent), a hierarchical file system path is translated into a sequence of segments, represented in Unicode. In the second step (which is platform independent), those segments are translated into sequences of bytes, using UTF-8, and then concatenated into URLs (encoding the individual bytes as single ASCII characters or as escape sequences).

As an example, consider the Windows file system path

C:\directory\other dir\file.txt

This is first translated into the four segments C:, directory, other dir, and file.txt (all represented using Unicode). Using UTF-8, these segments are then translated into the corresponding byte sequences (represented here as sequences of hexadecimal numbers):

43 3A (C:)
64 69 72 65 63 74 6F 72 79 (directory)
6F 74 68 65 72 20 64 69 72 (other dir)
66 69 6C 65 2E 74 78 74 (file.txt)

Then, these byte sequences are combined into a single file URL (adding the necessary syntactic delimiters), namely

file:///C:/directory/other%20dir/file.txt

Nothing exciting here (all the bytes from the four sequences are represented as the corresponding ASCII characters, except for the space in other dir, which is illegal in URLs and is thus escaped as %20).

Similarly, a Unix file system path

/directory/other dir/file.txt

is translated into the URL

file:///directory/other%20dir/file.txt
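The two-step mapping used in both examples might be sketched as follows. This is a simplified illustration under stated assumptions, not the actual OpenOffice.org implementation: the function names are hypothetical, the path splitting ignores special cases such as relative paths or UNC paths, and the colon is left unescaped so that drive letters appear as in the example above.

```python
from urllib.parse import quote


def windows_path_to_segments(path):
    """Step 1 (platform dependent): split a Windows path into segments."""
    return [seg for seg in path.split("\\") if seg]


def unix_path_to_segments(path):
    """Step 1 (platform dependent): split an absolute Unix path into segments."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path: " + path)
    return [seg for seg in path.split("/") if seg]


def segments_to_file_url(segments):
    """Step 2 (platform independent): UTF-8 encode and escape each segment."""
    encoded = [quote(seg.encode("utf-8"), safe=":") for seg in segments]
    # Empty host part, then the segments separated by single slashes.
    return "file:///" + "/".join(encoded)
```

Passing C:\directory\other dir\file.txt through windows_path_to_segments and then segments_to_file_url yields the URL shown above, and the Unix path goes through unix_path_to_segments in the same way.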

Non-ASCII Characters

On many platforms, file system paths can contain non-ASCII characters. This is typically handled within the operating system by naming files with byte strings, and letting the user choose a character encoding (via a locale) that specifies how these byte strings are to be interpreted.

Consider the Unix file system path /stränge (assuming the user's locale supports a character repertoire that contains ä). Again, translation into a file URL is done in two steps: First, the path is split into the single segment stränge (represented as a Unicode string; some platform-dependent “magic” is needed to convert from the operating system's representation to this Unicode representation). Second, that segment is translated into the (hexadecimal) byte sequence 73 74 72 C3 A4 6E 67 65, and the URL file:///str%C3%A4nge is constructed from that byte sequence.

Other programs may handle file URLs differently, in that they directly use the operating system's byte strings (interpreted in a locale chosen by the user) within the file URL, without going via Unicode and UTF-8. For example, if the user had chosen an ISO 8859-1 locale, the Unix file system path /stränge (possibly represented as the hexadecimal byte string 2F 73 74 72 E4 6E 67 65 by the operating system) would correspond to the URL file:///str%E4nge.

If the OpenOffice.org code wants to exchange file URLs with such other programs, the URLs have to be converted back and forth at the interfaces, to avoid any misunderstandings. This problem is typically not noticed when the file system paths contain only ASCII characters, as these are always represented the same within file URLs.
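A conversion at such an interface could look roughly like the sketch below. It assumes that the encoding used by the other program is known (here ISO 8859-1, as in the example above). The function name convert_url_segment is hypothetical, and the sketch works on a single segment rather than on a complete URL.

```python
from urllib.parse import quote, unquote


def convert_url_segment(escaped, from_encoding, to_encoding):
    """Re-encode one escaped URL segment from one character encoding to another.

    Illustrative helper only. The escape sequences are first decoded to
    text using from_encoding, then the text is escaped again using
    to_encoding. Invalid input raises UnicodeDecodeError.
    """
    text = unquote(escaped, encoding=from_encoding, errors="strict")
    return quote(text, safe=":", encoding=to_encoding)
```

Converting str%E4nge from ISO 8859-1 to UTF-8 gives str%C3%A4nge, and the reverse call gives str%E4nge again. A segment that contains only ASCII characters passes through unchanged, which is why the problem often goes unnoticed.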

Both approaches (“UTF-8 URLs” and “locale dependent URLs”) have pros and cons, but the main benefit of the OpenOffice.org (i.e., UTF-8 URLs) approach seems to be the stability of how a file URL locates a specific file, regardless of context. Imagine a text processor application that lets you include URLs as hyperlinks within your documents, and imagine that application used locale dependent URLs. When creating a document, a user has specified locale X for his operating system, and a file URL included in the document is therefore encoded using the conventions of locale X. Now, the user switches to locale Y and re-opens the document: The hyperlink is no longer guaranteed to point to the same file, as the file URL is not encoded using the conventions of locale Y.

Also, with the UTF-8 URLs approach, code that handles file URLs can often be made simpler and platform-independent. A typical scenario is a text edit field allowing the user to enter a (file) URL. Many users are not familiar with the finer details of URLs, and will type in things like file:///stränge. Even though this is not a valid URL (since an ä is not allowed within a URL), it would be nice if the application were forgiving and handled that input as locating the file /stränge. With the UTF-8 URLs approach, this is easy, as the text edit can silently convert the input into the correct URL file:///str%C3%A4nge. With the locale dependent URLs approach, this would be more difficult, as the text edit would have to know which locale is in use in order to convert the input into something like file:///str%E4nge, which is correct only if an ISO 8859-1 locale is in use.
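A forgiving text edit of this kind might do something like the following. This is a minimal sketch, not the behavior of any actual OpenOffice.org control: the function name forgiving_url is hypothetical, and the rule applied (escape spaces and everything outside printable ASCII, leave the rest alone) is an assumption chosen for brevity.

```python
def forgiving_url(text):
    """Turn loosely typed URL text into a URL using UTF-8 escape sequences.

    Illustrative only. Printable ASCII characters other than the space
    are kept as typed, so delimiters and existing escape sequences
    survive. Every other character is replaced by the escape sequences
    of its UTF-8 bytes.
    """
    out = []
    for ch in text:
        if ch.isascii() and ch.isprintable() and ch != " ":
            out.append(ch)
        else:
            out.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
    return "".join(out)
```

No locale lookup is needed anywhere in this function, which is the simplification the UTF-8 URLs approach buys.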

Note that these problems of interpreting non-ASCII characters are not restricted to file URLs. Other URL schemes that do not explicitly state what character encoding to use (like http and ftp URLs) have similar problems. As Richard Gillam puts it in his book Unicode Demystified (Addison-Wesley, 2003):

The industry is converging around always treating escape sequences in URLs as referring to UTF-8 code units. That is, the industry is leaning toward always interpreting R%c3%a9sum%c3%a9.html to mean Résumé.html (and always representing Résumé.html as R%c3%a9sum%c3%a9.html). If everyone agreed on this system, then you could use illegal URL characters (such as the accented é in our example) in URL references in other kinds of documents (such as HTML or XML files) and know that a universally understood method of transforming them into a legal URL existed. Web browsers or other software could do the reverse, displaying URLs that include escape sequences by using the characters the escape sequences represent (at least in the cases where they represent non-ASCII characters) and allowing you to type them in that way.

The UTF-8 URLs Approach

There are a number of problems specific to the UTF-8 URLs approach:

First, the UTF-8 URLs approach has an additional step compared to the locale dependent URLs approach, namely the translation between a system specific representation of some textual entity and a Unicode representation of that entity. In the one direction, a problem might occur when a certain entity represented in the system specific way cannot be represented in Unicode (e.g., because it contains characters that are not present in the Unicode repertoire). Given that Unicode was specifically designed to encompass the repertoires of all legacy character encodings in use today, chances for such a problem should be close to zero. In the other direction, different Unicode strings could translate to the same system specific representation (e.g., because two different Unicode characters are mapped to the same character in the system specific encoding). This leads to two different URLs locating the same resource, which should not be considered much of a problem, since it is already a widespread phenomenon (think about file URLs differing in the use of upper and lower case letters on a case-insensitive file system, or think about file systems that support links).

Another problem stems from the fact that Unicode allows a single “conceptual character” (as interpreted by a user, e.g., the character “ä”) to be represented in different ways. For example, the conceptual character “ä” can be represented as either the single code point

U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

or as a sequence of the two code points

U+0061 LATIN SMALL LETTER A
U+0308 COMBINING DIAERESIS

(a so-called “combining character sequence”). Both representations should be considered equivalent, so that the two URLs file:///str%C3%A4nge (containing a UTF-8 encoded U+00E4) and file:///stra%CC%88nge (containing a U+0061 represented as ASCII a followed by a UTF-8 encoded U+0308) should probably be considered equivalent also, and should both denote a file named stränge. In current versions of OpenOffice.org (“SRC644”), loading a file named stränge on Windows XP works with the former URL, but fails with the latter.

Two solutions for this problem seem possible: One solution would be to enhance the (platform dependent) code that maps from Unicode to a system specific character encoding so that it handles combining character sequences correctly. Another solution is to require all URLs to use only one form of Unicode strings (i.e., to use a normalization form), which would make this problem go away. The obvious choice is to use Unicode Normalization Form C, as is also recommended by the W3C's Character Model for the World Wide Web 1.0. Using that solution, only file:///str%C3%A4nge would be a valid URL to access the file stränge, while the URL file:///stra%CC%88nge would be ruled out as invalid.
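The second solution can be illustrated with Python's standard unicodedata module. The sketch below is an assumption about how such a normalization step might look, not a description of existing OpenOffice.org code. The function names are hypothetical, and the functions operate on a single escaped segment.

```python
import unicodedata
from urllib.parse import quote, unquote


def normalize_url_segment(escaped):
    """Rewrite an escaped UTF-8 URL segment so that its text is in NFC."""
    text = unquote(escaped, encoding="utf-8", errors="strict")
    normalized = unicodedata.normalize("NFC", text)
    return quote(normalized, safe=":", encoding="utf-8")


def is_normalized_segment(escaped):
    """Report whether the text of an escaped UTF-8 URL segment is already in NFC."""
    text = unquote(escaped, encoding="utf-8", errors="strict")
    return unicodedata.normalize("NFC", text) == text
```

With these helpers, stra%CC%88nge is reported as not normalized and is rewritten to str%C3%A4nge, while str%C3%A4nge is left unchanged. Whether an application should silently rewrite such a segment or reject it as invalid is a design decision this sketch does not settle.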

Following the approach of requiring URLs to use a normalization form, a new problem might show up: Consider an operating system that allows files to be named using some Unicode encoding (any of the UTF-n encoding forms/schemes), but that is non-compliant enough to allow two different files to have names that a Unicode-compliant system should consider equivalent (e.g., a file named stränge written using U+00E4 and another file named stränge written using U+0061 followed by U+0308). Now, the requirement to always use normalized Unicode strings within file URLs makes it impossible to access one of the two files with a URL. (This is another manifestation of the problem already described above, that an entity represented in the system specific way cannot be represented in Unicode or, more specifically, in some Unicode normalization form. Above, that problem was said to be ignorable; here, one can only hope for Unicode-compliant operating systems that do not allow two different files to have equivalent names.)

Drive Letters

In Windows, as well as in related systems like DOS and OS/2, file system paths start with a drive letter, followed by a colon (e.g., the C: in C:\directory\file.txt). That drive letter (together with the following colon) makes up the first segment of file URLs on these systems (as in file:///C:/directory/file.txt). Historically, however, a vertical bar has also been used instead of the colon, as in file:///C|/directory/file.txt (note that the vertical bar is not escaped as %7C in this special case, even though it is an illegal character).

The OpenOffice.org code can handle both cases of file URLs (with either a colon or a vertical bar), but URLs generated by the code always follow the “standard” convention of using a colon.
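Accepting both forms while always producing the colon form could be sketched like this. The regular expression and the function name normalize_drive_letter are illustrative assumptions. The sketch only covers URLs with an empty host part and a single-letter drive.

```python
import re

# A single drive letter directly after "file:///", followed by a vertical
# bar, and then either a slash or the end of the URL.
_DRIVE_WITH_BAR = re.compile(r"^(file:///[A-Za-z])\|(?=/|$)", re.IGNORECASE)


def normalize_drive_letter(url):
    """Replace the historical 'C|' drive notation with the usual 'C:' form."""
    return _DRIVE_WITH_BAR.sub(r"\1:", url)
```

URLs that already use a colon, and URLs in which a vertical bar appears anywhere other than directly after the drive letter, are returned unchanged.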

Hosts

The file URL scheme allows for an optional host component after the file:// prefix. Specifying localhost (in any combination of upper and lower case letters) is the same as leaving that component empty: the URL locates a file system path on the current machine.
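A check for this rule might look as follows. The function name is_local_file_url is hypothetical, and the sketch does no validation beyond looking at the host component.

```python
def is_local_file_url(url):
    """Report whether a file URL refers to the current machine.

    Illustrative only: the URL is local if its host component is empty
    or is 'localhost' in any combination of upper and lower case.
    """
    prefix = "file://"
    if url[:len(prefix)].lower() != prefix:
        raise ValueError("not a file URL: " + url)
    # The host component ends at the first slash after the prefix.
    host = url[len(prefix):].partition("/")[0]
    return host == "" or host.lower() == "localhost"
```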

The intended use for this host component was to specify the DNS name (or IPv4/IPv6 address) of a machine, to indicate that the URL locates a file system path on that machine. The problem is that there is no protocol that details how such a remote file should be accessed, so interpretation of file URLs containing a host component was left unspecified.

On Unix, the OpenOffice.org code does not support file URLs with a host component. Windows, on the other hand, knows the concept of UNC paths, file system paths containing the name of a remote machine:

\\somewhere\somedir\file.txt

The machine names used in UNC paths (somewhere in the above example) have a different structure from DNS names, so, strictly speaking, they could not be used in the host component of file URLs. Nevertheless, many applications on Windows, including the OpenOffice.org code, follow the convention of allowing UNC machine names within file URLs. This means that the above UNC path corresponds to the URL

file://somewhere/somedir/file.txt
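The convention can be sketched as a small conversion function. This is an illustration under stated assumptions, not the OpenOffice.org implementation: the function name unc_path_to_file_url is hypothetical, the machine name is copied into the host component without any escaping or checking, and the remaining segments are treated as UTF-8 text.

```python
from urllib.parse import quote


def unc_path_to_file_url(path):
    """Map a UNC path such as \\\\machine\\dir\\file to a file URL."""
    if not path.startswith("\\\\"):
        raise ValueError("not a UNC path: " + path)
    parts = path[2:].split("\\")
    host = parts[0]
    if not host:
        raise ValueError("UNC path has no machine name: " + path)
    # Assumption: the machine name is used as the host component as-is.
    segments = [quote(seg, safe=":", encoding="utf-8") for seg in parts[1:] if seg]
    return "file://" + host + "/" + "/".join(segments)
```

Applied to the UNC path from the example, the function produces the URL shown above.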