Proposed changes for the next draft | |||
---|---|---|---|
name | title | type | status |
032-component-examples | add more examples for generic syntax components | examples | pending |
Issues already incorporated within the draft | |||
---|---|---|---|
name | title | type | status |
003-relative-query | inconsistent resolution of query-only relative URI | relative | fixed 00 |
004-pathless-base | resolution algorithm fails for base URI with no path | relative | fixed 00 |
006-absoluteURIref | need BNF term for absolute URI with optional fragment | bnf | fixed 00 |
007-empty-rel_path | relative URI syntax does not allow empty path | relative | fixed 00 |
014-empty-opaque_part | syntax does not allow "dav:" or "about:" as URI | opaque_part | fixed 00 |
015-fragment-handling | clarify how URI processor is expected to handle fragment | fragment | fixed 00 |
016-hostname-toplabel | hostname toplabel syntax could be improved | hostname | fixed 00 |
023-URI-plural | URI or URIs for plural | terminology | fixed 00 |
025-rel_segment | rel_segment is defined without distinguishing param | segment | fixed 00 |
026-ABNF | replace existing BNF with standard ABNF of RFC 2234 | bnf | fixed 00 |
011-IPv6-literal | integrate IPv6 syntax of RFC 2732 | IPv6 | added 00 |
012-simplify-IPv6 | change BNF to incorporate IPv6 better than RFC 2732 | IPv6 | added 00 |
018-IPv6-example | RFC 2732 example bug | IPv6 | added 00 |
027-ref-HTML | draft 00 contains an obsolete ref to RFC 1866 | references | fixed 01 |
028-ref-rfc0952 | draft 00 normative reference to RFC 952 | references | fixed 01 |
030-IPv6-bnf | draft 00 errors in IPv6 syntax | IPv6 | fixed 01 |
035-scheme-escaping | %HH escaping should not be scheme-dependent | characters | fixed 01 |
010-gethostbyname | gethostbyname allows much more than hostname BNF | hostname | added 01 |
029-decimal-IP | add security considerations for misleading use of decimal IP | security | added 01 |
037-uri-comparison | define how to compare URIs | characters | added 01 |
008-URIvsURIref | URI versus URI Reference | terminology | fixed 02 |
017-rdf-fragment | RDF does not believe in same-document references | fragment | fixed 02 |
019-URI-URL-URN | URI/URL/URN contemporary view | terminology | fixed 02 |
021-relative-examples | relative URI examples could be improved | examples | fixed 02 |
024-identity | Resource should not be defined as anything that has identity | terminology | fixed 02 |
031-query-def | query definition | query | fixed 02 |
033-dot-segments | relativising an absolute reference should be invertible | relative | fixed 02 |
038-qualified | qualified production in hostname is ambiguous | hostname | fixed 02 |
039-LALR-BNF | BNF should be more LALR-parser friendly | bnf | fixed 02 |
040-reg-name | Remove registry-based name syntax from authority | authority | fixed 02 |
041-encoding | Section 2 on encoding causes too much confusion | characters | fixed 02 |
042-fragment-when | fragment identifiers applied before entire content is retrieved | fragment | fixed 02 |
022-definitions | definitions for operations on URIs | terminology | added 02 |
Issues that will not be incorporated | |||
---|---|---|---|
name | title | type | status |
009-nullable-netpath | syntax for netpath allows empty authority | netpath | closed |
013-query-slash | slash character should be forbidden in query | query | closed |
020-utf8-default | Defaulting to UTF-8 for unknown encoding | characters | closed |
034-identifier | identifier is not just a sequence of characters | terminology | closed |
036-host-escaping | %HH escaping should be allowed on hostname | characters | closed |
001-file | file scheme implementations vary on use of authority component | scheme | postponed |
002-undefined-schemes | schemes from RFC 1738 need their own specs | scheme | postponed |
005-ftp | background on ftp extensions | scheme | postponed |
001-file | file scheme implementations vary on use of authority component |
---|---|
postponed | scheme |
report:
Charles C. Fu,
15 Jul 1998,
libwww-perl mailing list:
[under Windows] it's perfectly legal while on host "foo" to request file://server/folder/item. On Win32, and on other systems, this requests the "item" stored in "folder" on the "server" machine. On Win32, it magically works. Actually, it is illegal but happens to work with Explorer, does not work with Netscape under Windows, and may or may not work with other Windows clients. In general, the exact details of file URL handling is up to the client you're using. It's pretty uniform on UNIX systems but is NOT uniform amongst Windows clients. In particular, Netscape and Explorer handle file URLs differently under Windows. Here are some examples: - Netscape correctly handles escapes (like file:///c%3A/ for the C drive), but Explorer does not. - Netscape allows file:/// (which is empty), but Explorer does not. - Explorer allows file:///\\remotehost\share\ and file:////remotehost/share/, but Netscape does not. I'm sure there are other differences. [Windows Examples] file://c:/temp/test.txt => open (FH, "c:/temp/test.txt"); file://c:\temp\test.txt => open (FH, "c:\\temp\\test.txt"); file://localhost/c:/temp/test.txt => open (FH, "c:/temp/test.txt"); file://remotehost/c:/temp/test.txt is not legal Only the localhost example above is technically legal since host portions of file URLs must be fully qualified domain names, 'localhost', or empty. The second example is also illegal because a mandatory '/' must follow the host portion. For the details, see RFC1738 (Uniform Resource Locators). The first two examples can be made legal by writing them as <file:///c:/temp/test.txt>. This happens to work with both Explorer and Netscape. Again, be warned that it may or may not work with other Windows clients. As for UNC paths, I am not aware of a legal way to use them in file URLs which works with both Netscape and Explorer. |
002-undefined-schemes | schemes from RFC 1738 need their own specs |
---|---|
postponed | scheme |
report:
Larry Masinter,
09 Sep 1998,
URI-WG mailing list:
RFC 2396 obsoletes 1738, which contained: ftp (File Transfer protocol), http (Hypertext Transfer Protocol), gopher (The Gopher protocol), mailto (Electronic mail address), news (USENET news), nntp (USENET news using NNTP access), telnet (Reference to interactive sessions), wais (Wide Area Information Servers), file (Host-specific file names), prospero (Prospero Directory Service). Of these, 'http' and 'mailto' are covered by their own RFCs now, but 'ftp', 'news', 'telnet', 'file' should be re-issued. (It's OK with me if we leave 'gopher', 'wais', and 'prospero' behind.) 'ftp' has never been properly specified, as actually implemented. 'news' should be updated to merge 'news' and 'nntp' according to current practice, and 'file' needs a proper specification that handles things like volume names on the windows platform and suggests that other OS profiles should be developed for local name mapping. |
003-relative-query | inconsistent resolution of query-only relative URI |
---|---|
fixed 00 | relative |
report:
Miles Sabin,
23 Mar 1999,
private mail:
I've been working through the relative URI resolution mechanism in RFC 2396, and I've spotted something which seems a little odd. The example resolution on p.29 for, ?y from, http://a/b/c/d;p?q is given as, http://a/b/c/?y but as far as I can make out, the resolution algorithm suggests the result ought to be, http://a/b/c/d;p?y which is the result that was given in RFC 1808. It's also the result that both Netscape 4 and IE 4 deliver. Given that this would be an observable change in behaviour between the two RFCs, I'm a little surprised that it wasn't flagged up as such if the change really was intended ... Strangely enough, Sun's badly broken java.net.URL class _does_ give the result specified in 2396, which makes me suspect that something must be wrong ;-) |
|
report:
Henry Holtzman,
09 Jul 2002,
private mail:
rfc2396 specifies a different browser behavior from rfc1808 in a particular situation that I believe may be unintentional. IE & Netscape implement the rfc1808 behavior while Opera implements the rfc2396 behavior. As appendix G of rfc2396 makes no mention of this change, we would appreciate your opinion on the matter. In rfc1808, when the relative URL has no path component, but has a fragment or a query, the client is supposed to skip step 6 of forming the absolute URI. In step 6, among other things, the base URI is stripped of all characters beyond the final "/". In rfc2396, when the relative URI has no path and has a fragment, it is specified that processing should be stopped as no new document should be loaded, but rather navigation within the document is specified. This change is explained in appendix G. However, when there is no path component, but there is a query component, processing continues. The instruction to skip stripping the post-final-/-characters is gone in rfc2396, which means that the final part of the base URI is stripped and so the query is not performed on the same page as was loaded (unless that page's URI ended with a "/"). Was this change between rfc1808 and rfc2396 intended? The following small php application illustrates the issue. You can run it at http://www.media.mit.edu/opera/r-url.php. You will note that Opera (6.03) behaves very differently from Netscape and IE when executing this page. With IE and Netscape, you can navigate within the application. With Opera, when you click on the links within the app, you get an index page of the directory containing the app. It is my belief that the final characters should *not* be stripped, and that rfc2396 should be amended to skip the stripping in the case of a relative URI with only a query component. 
<html> <head> <title>Example application using empty path relative URLs</title> </head> <body> <h4>Example application using empty path relative URLs</h4> <?php if ($action=="here") { ?> Thank you for clicking here!<br><br> <?php } else if ($action=="there") { ?> Hey, you weren't supposed to click there!<br><br> <?php } ?> Please click <a href="?action=here">here</a>.<br> Please do not click <a href="?action=there">there</a>.<br> <br> Thank you. </body> </html> |
|
action:
Roy T. Fielding,
14 Oct 2002,
draft 00:
Fixed by rewriting the algorithm as pseudocode and restoring the original RFC 1808 behavior, with the example changed accordingly. |
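The restored behavior can be checked against an existing resolver. The following is a minimal sketch, assuming Python's standard `urllib.parse.urljoin` as the resolver; that function is not part of the draft and is used here only for illustration. The base URI and the reference are the ones quoted in the reports above.

```python
from urllib.parse import urljoin

# Base URI and query-only reference taken from the reports above.
BASE = "http://a/b/c/d;p?q"

# A query-only reference keeps the whole base path (the RFC 1808 behavior)
# instead of dropping the last segment "d;p" as the RFC 2396 example did.
resolved = urljoin(BASE, "?y")
print(resolved)
```

A resolver that followed the RFC 2396 example would instead produce `http://a/b/c/?y`, which is the difference both reporters observed between browsers.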
004-pathless-base | resolution algorithm fails for base URI with no path |
---|---|
fixed 00 | relative |
report:
Ronald Tschalär,
16 Sep 1999,
private mail:
I tried to follow the algorithm in my implementation, but it gives http://ab :-( I'm doing: Input: base: scheme = `http', authority = `a', path = `', query undefined reference: `b' Step 1): path = `b'; scheme, authority, query are undefined Step 2): is a nop Step 3): scheme = `http' Step 4): authority = `a' Step 5): doesn't apply Step 6): a) gives buffer = `' b) gives buffer = `b' c) - g) don't apply h) gives path = `b' Step 7): says `http' + `:' + `//' + `a' + `b' |
|
report:
Adam M. Costello,
21 Apr 2000,
private mail:
I think there's a slight bug in the relative URI resolution algorithm in RFC 2396. Consider: Base URI = http://foo.com URI-reference = bar As far as I can tell, the algorithm yields: http://foo.combar This base URI is allowed according to the statement in section 5.2: Note that only the scheme component is required to be present in the base URI; the other components may be empty or undefined. Here's a walk through the algorithm: step 1: parse reference (no problem) step 2: query/fragment not inherited from base (no problem) step 3: scheme inherited from base (no problem) step 4: authority inherited from base (no problem) step 5: reference is not absolute (no problem) step 6a: base URI's path (which is undefined) is copied into buffer (So the buffer is empty? This may be part of the problem.) step 6b: "bar" is appended to the buffer (which now contains "bar") step 6c: remove ./ (no-op) step 6d: remove trailing . (no-op) step 6e: remove segment/../ (no-op) step 6f: remove trailing segment/.. (no-op) step 6g: check for leading .. (none found) step 6h: buffer is the new path ("bar") step 7: result = "" append "http" append ":" append "//" append "foo.com" append "bar" (No check for initial slash, this may be part of the problem.) return "http://foo.combar" Presumably the desired absolute URI is http://foo.com/bar. Possible ways to achieve this include: 1) Alter step 6a to initialize the buffer to "/" if the base URI has no path. 2) Alter step 7 to insert a slash before any path that does not begin with a slash (including an empty path). 3) Alter step 7 to insert a slash before any path that begins with a non-slash (but not before an empty path). I think proposals 1 and 2 are equivalent, but I haven't considered it carefully. Proposal 3 gives a different result if the reference is "./" and the base URI has no path. Proposal 1 looks the cleanest to me. |
|
action:
Roy T. Fielding,
17 Sep 1999,
private mail:
I guess step 6a should be a) All but the last segment of the base URI's path component is copied to the buffer. In other words, any characters after the last (right-most) slash character, if any, are excluded. If the base URI's path component is the empty string, then a single slash character ("/") is copied to the buffer. |
|
action:
Roy T. Fielding,
14 Sep 2002,
draft 00:
Fixed as described above. |
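The revised step 6a can be sketched in a few lines. This is only an illustration of the wording proposed above, not the draft's pseudocode; the helper names `merge_base_path` and `recompose` are hypothetical, and dot-segment removal (steps 6c to 6g) is left out because it is a no-op for these inputs.

```python
def merge_base_path(base_path: str, ref_path: str) -> str:
    """Hypothetical helper: steps 6a and 6b with the proposed fix applied."""
    if base_path == "":
        # Proposed fix: an empty base path contributes a single slash.
        buffer = "/"
    else:
        # All but the last segment: keep everything up to and including
        # the right-most slash, if any.
        buffer = base_path[: base_path.rfind("/") + 1]
    # Step 6b: append the reference's path to the buffer.
    return buffer + ref_path


def recompose(scheme: str, authority: str, path: str) -> str:
    """Step 7, reduced to the three components used in the reports."""
    return scheme + ":" + "//" + authority + path


# The case from the second report: base http://foo.com, reference "bar".
result = recompose("http", "foo.com", merge_base_path("", "bar"))
print(result)
```

Without the special case for the empty base path, the same inputs concatenate to `http://foo.combar`, which is the failure both reports describe.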
005-ftp | background on ftp extensions |
---|---|
postponed | scheme |
report:
Gregory A Lundberg,
9 Dec 1999,
Apache httpd dev mailing list:
If you've already done any server-side commands, you should take a look at the current specification and consider re-implementing them if you want any clients to use them. http://www.wu-ftpd.org/rfc/draft-ietf-ftpext-mlst-09.txt or ftp://ftp.ietf.org/internet-drafts/draft-ietf-ftpext-mlst-09.txt MIME types are a "Standard Fact". They may or may not be present. If present, they must conform to the IANA-approved list of type names. While you're at it, you should notice that language negotiation is, to some extent, also possible. For this, in addition to the MLST draft, you should also take a look at RFC 2640, "Internationalization of the File Transfer Protocol". The site http://www.wu-ftpd.org/rfc/ contains a complete list of the FTP RFCs. (Well, nearly complete. I'm told there's another URL RFC I should include.) If you don't want to browse the site, or have a local mirror of the RFCs, the complete list of current RFCs which define the FTP is: 959, 1123, 1579, 1635, 1738, 1808, 2228, 2415, 2428, 2577 and 2640. The MLST draft just underwent a major change (splitting a feature out for a separate draft). Other than that, it is fairly mature and should be progressing to submission to the RFC Editor. The other FTP-related IETF drafts have, by now, expired and are not expected to progress to submission. |
006-absoluteURIref | need BNF term for absolute URI with optional fragment |
---|---|
fixed 00 | bnf |
report:
Dan Connolly,
10 Jan 2000,
URI-WG mailing list:
I have recently spent a considerable amount of time studying the URI spec [1] http://www.ietf.org/rfc/rfc2396.txt and I discovered, somewhat to my surprise, that it defines the terms "URI reference" and "absolute URI" very precisely, but (a) it doesn't define the term "URI", syntactically (!!!) and (b) it doesn't give a term for an absolute-URI-with-optional-fragment-id , i.e. the result of combining a URI reference with an absolute URI. This is pretty awkward, since an absolute-URI-with-optional-fragment-id is really what we meant when we wrote "URI reference" in: "An XML namespace is a collection of names, identified by a URI reference" -- http://www.w3.org/TR/1999/REC-xml-names-19990114/#sec-intro We used "URI reference" because "absolute URI" excludes fragment identifiers, and we wanted http://example.net/#vocab to be a valid namespace identifier. But ../xyz/ isn't a namespace identifier, until you combine it with a base absoluteURI. Another example: "The locator attribute provides a URI-reference that identifies a remote resource (or sub-resource)" -- http://www.w3.org/TR/1999/WD-xlink-19991220/#Local Resources for an Extended Link URI-references don't identify remote resources; absoluteURIs do. The "or sub-resource" makes it clear that the author intends to allow #fragids. So again, what's needed is a term for absolute-URI-with-optional-fragment-id. It was called fragmentaddress in RFC1630. If formal systems float your boat, you can take a look at my formalism of this stuff in larch: http://www.w3.org/XML/9711theory/URI http://www.w3.org/XML/9711theory/URI.html (HTML version with nasty hacks for math symbols) http://www.w3.org/XML/9711theory/URI.lsl (original ascii LSL version) part of "Specifying Web Architecture with Larch" http://www.w3.org/XML/9711theory/ which gives pointers explaining larch etc. I used the term URIwf for absolute-URI-with-optional-fragment-id, and I used absoluteURI and URI_reference with their rfc2396 meanings. |
|
action:
Roy T. Fielding,
27 Oct 2002,
draft 00:
absolute-URI-reference has been added to the section on URI reference and the ABNF. |
007-empty-rel_path | relative URI syntax does not allow empty path |
---|---|
fixed 00 | relative |
report:
Reese Anschultz,
17 Feb 2000,
private mail:
I have an observation regarding section -- "C. Examples of Resolving Relative URI References" -- within this document. The document cites that given the well-defined base URI of http://a/b/c/d;p?q relative URI ?y would be resolved as follows: http://a/b/c/?y By my interpretation from the BNF, a query can exist as either relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ] or hier_part = ( net_path | abs_path ) [ "?" query ] Since net_path, abs_path and rel_path must each be at least one character in length, I believe that the example "?y" is not a valid URI because no characters precede the question mark (?). |
|
report:
Henry Zongaro,
12 Nov 2001,
RFC editor:
Appendix C shows an example of a relative URI Reference of "?y" with respect to the base URI "http://a/b/c/d;p?q". However, according to the collected syntax that appears in Appendix A, "?y" doesn't appear to be a valid relative URI reference. The syntactic category URI-reference must begin with an absoluteURI, a relativeURI or a pound sign. An absoluteURI begins with a scheme, which cannot begin with a question mark; a relativeURI begins with a net_path or abs_path, both of which begin with a slash, or with a rel_path. A rel_path begins with a non-empty rel_segment, which again cannot begin with a question mark. |
|
report:
Bruce Lilly,
16 Jan 2002,
private mail:
Section C.2 mentions an empty reference, but the formal syntax does not provide for that. There are several possible changes to the formal syntax which would permit it, e.g. change 1* to * in the definition of rel_segment, which would permit an empty rel_path and therefore relativeURI (however, it would then permit a relativeURI consisting of "?" query, which might not be desired). Alternatively, the entire RHS of the relativeURI definition could be bracketed, i.e. made optional, which would permit an empty relativeURI without permitting a lone delimited query. |
|
action:
Roy T. Fielding,
20 Mar 2000,
private mail:
I don't even remember making this change, but it was broken when draft-fielding-uri-syntax-02.txt changed from rel_path = [ path_segments ] [ "?" query ] to (in 03): rel_path = rel_segment [ abs_path ] rel_segment = 1*( unreserved | escaped | ";" | "@" | "&" | "=" | "+" | "$" | "," ) |
|
action:
Roy T. Fielding,
14 Sep 2002,
draft 00:
Fixed by making the path optional in the ABNF: 2396: relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ] hier_part = ( net_path | abs_path ) [ "?" query ] draft-00: relative-URI = [ net-path / abs-path / rel-path ] [ "?" query ] hier-part = [ net-path / abs-path ] [ "?" query ] |
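The effect of making the path optional can be illustrated with two regular expressions. This is a simplified sketch rather than a full translation of either grammar: it models only the rel_path alternative (net_path and abs_path are omitted), approximates abs_path and query loosely, and expands "unreserved" to alphanumerics plus the RFC 2396 mark characters. The names `OLD_RELATIVE` and `NEW_RELATIVE` are hypothetical.

```python
import re

# Characters allowed in rel_segment per the RFC 2396 rule quoted above.
SEG_CHAR = r"(?:[A-Za-z0-9\-_.!~*'();@&=+$,]|%[0-9A-Fa-f]{2})"

# rel_path = rel_segment [ abs_path ]; abs_path is approximated loosely.
REL_PATH = SEG_CHAR + r"+(?:/[^?#]*)?"

# "?" query, with the query approximated as anything up to a "#".
QUERY = r"(?:\?[^#]*)"

# RFC 2396: the path is required before an optional query.
OLD_RELATIVE = re.compile(r"(?:" + REL_PATH + r")" + QUERY + r"?")

# draft-00: the path itself is optional.
NEW_RELATIVE = re.compile(r"(?:" + REL_PATH + r")?" + QUERY + r"?")

for candidate in ["?y", "", "g?y"]:
    old_ok = OLD_RELATIVE.fullmatch(candidate) is not None
    new_ok = NEW_RELATIVE.fullmatch(candidate) is not None
    print(repr(candidate), "2396:", old_ok, "draft-00:", new_ok)
```

Under these simplifications the query-only reference and the empty reference are rejected by the first pattern and accepted by the second, which matches the three reports.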
008-URIvsURIref | URI versus URI Reference |
---|---|
fixed 02 | terminology |
report:
Larry Masinter,
26 May 2000,
xml-uri mailing list:
When we update RFC 2396, I suggest we add an introductory paragraph explaining that the term "URI" is used ambiguously in the community to mean "a URI reference" (corresponding to the URI-reference BNF entity) or "an absolute URI", and that for this reason, the term "URI" itself is not defined in the document. I'd probably fix the Abstract correspondingly, e.g., "Informally, a Uniform Resource Identifier is a compact string...." so that people don't think that the abstract is normative. |
|
report:
Jeff Hodges,
01 Jun 2001,
URI-WG mailing list:
It seems to me, in considering points raised in the "Are URI-References bound to resources?" thread, that some subtleties might be a bit more clear if changes along the following lines were made to RFC 2396 (i.e. in a future revision of that doc, if any).. 4. URI References The term "URI-reference" is used here to denote the common usage of a ^^^^ ^^^^^^^^^^^^^^^ ^ production (delete) s resource identifier. A URI reference may be absolute or relative, ^ The term "URI reference" is a casual (i.e. natural language) description for artifacts that are parsable using the "URI-reference" production. and may have additional information attached in the form of a fragment identifier. However, "the URI" that results from such a reference includes only the absolute URI after the fragment identifier (if any) is removed and after any relative URI is resolved to its absolute form. Although it is possible to limit the discussion of URI syntax and semantics to that of the absolute result, most usage of URI is within general URI references, and it is impossible to obtain the URI from such a reference without also parsing the fragment and resolving the relative form. URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (delete) add: URI = absoluteURI | relativeURI add: URI-reference = [ URI ] [ "#" fragment ] . . . It seems to me that the above suggested re-write of the URI-reference production, and the additions to the preceding text, would make it easier and clearer to talk about "URI" artifacts and "URI-reference" artifacts and their different abstract semantics. Also, the _term_ "URI reference" isn't defined prior to section 4 (wherein it is only tangentially defined, imho). Terms that are also used in sections prior to section 4 whose explicit definition would help the document convey it's rather abstract notions to the reader are: "document" and "reference". 
Explicitly defining how those terms are used and what their semantics are in the context of URI and URI-reference artifacts are, would be immensely helpful to readers. |
|
report:
Tim Berners-Lee,
23 Jan 2003,
URI-WG mailing list:
I would very much like us to take the opportunity to clean up the terminology on the URI spec which has confused people. It is my considered opinion that this would be far preferable: URI - the actual identifier string, with or without a #fragid. URI reference - a string used in a language to specify a URI, for which relative form may be used where a base exists. ((This is not the only way of specifying the value of a URI - one can use various character sets, namespace prefixes, etc)) |
|
action:
Roy T. Fielding,
23 May 2003,
draft 02:
An ABNF production for URI has been introduced to correspond to the common usage of the term: an absolute URI with optional fragment. The fragment identifier has been moved back into the section on generic syntax components and within the URI and relative-URI productions, though it remains excluded from absolute-URI. The entire text of the specification has been revised accordingly. |
009-nullable-netpath | syntax for netpath allows empty authority |
---|---|
closed | netpath |
report:
Kohsuke Kawaguchi,
15 Mar 2001,
private mail:
I found that according to BNF of RFC 2396 "URI Generic Syntax", the following string is accepted as a valid URI. "http://12345.678/" I assumed this should be rejected because substring "12345.678" does not match hostname production of BNF. However, actually this string is accepted by the following derivation. absoluteURI -> scheme ":" hier_part -> "http" ":" abs_path -> "http:" "/" path_segments -> "http:/" segment "/" segment "/" -> "http:/" *pchar "/" *pchar "/" -> "http:/" "/" "12345.678" "/" -> "http://12345.678/" As you see, the fact that segment is nullable makes net_path production meaningless. Is this the intention of authors? Or should it be considered as a bug in BNF? If so, is it appropriate to fix this bug by changing segment as follows? segment = 1*pchar *( ";" param ) |
|
action:
Roy Fielding,
17 Oct 2002,
issues list:
That URI is valid (maybe not for http, but for the URI syntax in general). The generic syntax requires that the components be extracted first in order to disambiguate these cases (the greedy rule). Only after the components are extracted can the syntax of those components be tested for correctness. |
|
report:
James Clark,
20 Jul 2001,
URI-WG mailing list:
Is "foo://" a legal URI in RFC 2396? If so, is the path component "//" or empty? On the one hand, "//" doesn't parse as net_path so it parses unambiguously as an abs_path, so the disambiguating rule in 4.3 is arguably not applicable. This would suggest it is legal, and the path component is "//". On the other hand, if you use the regex in appendix B, the // will be treated as an empty authority component (which is not legal) rather than as a path component. Maybe the regex should use //([^/?#]+) instead of //([^/?#]*) so that the regex splits things consistently with the grammar. Alternatively, reg_name could be changed so that it matches the empty string, so that // would parse as a net_path, and hence there would be an ambiguity to which 4.3 could be applied, and the existing regex would be consistent. |
|
action:
Larry Masinter,
11 Aug 2001,
private mail:
I just looked at this again, and an empty authority is fine; it turns out to look like an empty 'server', rather than an empty 'regname'. server = [ [ userinfo "@" ] hostport ] So "//" does parse as net_path, and the regex in appendix B is fine. |
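Both answers above rest on extracting the components first and validating them afterwards. A minimal sketch of that first step, using the regular expression from Appendix B of RFC 2396; the wrapper function `split_components` is a hypothetical name added for illustration.

```python
import re

# The component-splitting expression from RFC 2396, Appendix B.
APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_components(uri: str) -> dict:
    """Hypothetical wrapper: name the capture groups as Appendix B does."""
    m = APPENDIX_B.match(uri)  # always matches; every part is optional
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


# The greedy split assigns "12345.678" to the authority, not to the path.
print(split_components("http://12345.678/"))

# "foo://" yields an empty (but present) authority and an empty path.
print(split_components("foo://"))
```

Whether `12345.678` is an acceptable host for a given scheme is a separate check made on the extracted authority, which is the point of the first reply.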
010-gethostbyname | gethostbyname allows much more than hostname BNF |
---|---|
added 01 | hostname |
report:
Tomas Rokicki,
02 Jun 2001,
URI-WG mailing list:
RFC 2396 contains the following BNF for the host part of a URI: host = hostname | IPv4address hostname = *( domainlabel "." ) toplabel [ "." ] domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum toplabel = alpha | alpha *( alphanum | "-" ) alphanum IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit port = *digit Typical implementations use // and / to locate the hostport part, and break things apart and use gethostbyname() to resolve the IP address. Gethostbyname() has quite a different syntax, however, allowing IP addresses such as http://63.197.151.31/ (as above; class C syntax) http://63.197.151.037/ (leading zero means octal, but still within the BNF of above) http://63.197.38687/ (two-dot notation; class B syntax) http://63.12949279/ (one-dot notation; class A syntax) http://1069913887/ (numeric IP syntax) and of course all combinations of above, including http://07761313437/ (octal) http://000000077.0000000305.000000000227.00000000037/ (leading zeros) I have two points. First, the implementations are out of sync with the specification. Does this matter? Secondly, one can argue that the implied semantics of the BNF given above for a four-dot representation is a decimal interpretation, where the implementations use octal if any component of the IP address begins with a leading zero (unlike what happens for the port, where http://63.197.151.31:0000000080/ accesses port 80). |
|
action:
Roy T. Fielding,
01 Mar 2002,
draft 01:
Added to the Security Considerations for draft 01. |
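The equivalences listed in the report can be reproduced with a small parser. This is a sketch of the traditional BSD-style address parsing that the report attributes to gethostbyname(), written in pure Python so that it does not depend on any platform's resolver; the function names are hypothetical, and the parsing rules (C-style octal and hex prefixes, the last part filling the remaining bytes) are stated here as assumptions about that legacy behavior.

```python
def parse_c_number(text: str) -> int:
    """Parse one part as C would: "0x" means hex, a leading "0" means octal."""
    if text.lower().startswith("0x"):
        return int(text[2:], 16)
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text, 10)


def legacy_inet_aton(host: str) -> int:
    """Hypothetical helper: return the 32-bit address for 1 to 4 dotted parts."""
    parts = [parse_c_number(p) for p in host.split(".")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected between one and four parts")
    *leading, last = parts
    if any(p > 0xFF for p in leading):
        raise ValueError("leading parts must each fit in one byte")
    # The final part fills all of the bytes not claimed by the leading parts.
    remaining_bits = 8 * (4 - len(leading))
    if last >= 1 << remaining_bits:
        raise ValueError("final part too large")
    value = 0
    for p in leading:
        value = (value << 8) | p
    return (value << remaining_bits) | last


FORMS = [
    "63.197.151.31",
    "63.197.151.037",   # 037 is read as octal 31, not decimal 37
    "63.197.38687",
    "63.12949279",
    "1069913887",
    "07761313437",
    "000000077.0000000305.000000000227.00000000037",
]
for form in FORMS:
    print(form, "->", legacy_inet_aton(form))
```

Every form from the report maps to the same 32-bit value, which is why such spellings can mislead a reader about where a URI actually points.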
011-IPv6-literal | integrate IPv6 syntax of RFC 2732 |
---|---|
added 00 | IPv6 |
report:
Larry Masinter,
01 Dec 1999,
private mail:
http://www.ietf.org/rfc/rfc2732.txt |
|
action:
Roy T. Fielding,
26 Oct 2002,
draft 00:
IPv6 literals have been added to the list of possible identifiers for the host portion of a server component, as described by RFC 2732, with the addition of "[" and "]" to the reserved, uric, and uric-no-slash sets. Square brackets are now specified as reserved for the authority component, allowed within the opaque part of an opaque URI, and not allowed in the hierarchical syntax except for their use as delimiters for an IPv6reference within host. In order to make this change without changing the technical definition of the path, query, and fragment components, those rules were redefined to directly specify the characters allowed rather than continuing to be defined in terms of uric. Since RFC 2732 defers to RFC 2373 for definition of an IPv6 literal address, which unfortunately has an incorrect ABNF description of IPv6address, I created a new ABNF rule for IPv6address that matches the text representations defined by Section 2.2 of RFC 2373. Likewise, the definition of IPv4address has been improved in order to limit each decimal octet to the range 0-255. |
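Limiting each octet to 0-255 can be expressed directly in a grammar by enumerating the decimal ranges. The pattern below is a sketch of that idea and not the draft's rule, which is not reproduced in this list; in particular, the decision to reject leading zeros is an assumption of this example, and `DEC_OCTET` and `IPV4_ADDRESS` are hypothetical names.

```python
import re

# One decimal octet in the range 0-255, written as explicit alternatives:
# 250-255, 200-249, 100-199, 10-99, 0-9 (no leading zeros, by assumption).
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# Four octets separated by literal dots.
IPV4_ADDRESS = re.compile(r"\.".join([DEC_OCTET] * 4))

for candidate in ["63.197.151.31", "256.1.1.1", "63.197.151.037", "63.197.38687"]:
    print(candidate, IPV4_ADDRESS.fullmatch(candidate) is not None)
```

By contrast, the RFC 2396 rule `1*digit "." 1*digit "." 1*digit "." 1*digit` accepts `256.1.1.1` and `63.197.151.037`.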
012-simplify-IPv6 | change BNF to incorporate IPv6 better than RFC 2732 |
---|---|
added 00 | IPv6 |
report:
James Clark,
20 Jul 2001,
URI-WG mailing list:
The XML schema anyURI simple type allows any string which after escaping disallowed characters as described in Section 5.4 of XLink is a URI reference as defined in RFC 2396, as amended by RFC 2732. This raises the question of what exactly it takes for an implementation to check this. Putting on one side the RFC 2732 amendments (and the consequent non-escaping of square brackets by the XLink algorithm), I believe it's very simple. To check a string, do the following: 1. Check that every % is followed by two hex digits. 2. Check that there is at most one # character in the string. 3. If the string contains a ":" character that precedes all "/", "?" and "#" characters, then the string is an absolute URI and the substring preceding the first such colon must match the regex [a-zA-Z][-+.a-zA-Z0-9]*. 4. If the string is an absolute URI (as in 3), then the first colon must not be immediately followed by a # or the end of the string. (For example, "foo:" and "foo:#bar" are illegal.) I think that's it. It's not straightforward to deduce this from RFC 2396 and XLink, so I am not 100% confident. RFC 2732 seems to radically complicate things. It adds "[" and "]" to the set of reserved characters and removes them from unwise. This has the effect of allowing square brackets in the query component and the fragment component. The first problem arises with the path component. Since pchar is defined in RFC 2396 as unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | "," it is unaffected by RFC 2732 and thus square brackets are not allowed in the path component. This is a little bit strange, since intuitively pchar is any uric other than "/", "?" and ";", but it complicates checking only a little. The big problem is with the authority component. Before RFC 2732, checking generic URI syntax did not require any complex parsing of the authority component, because an authority can be a reg_name, which allows one or more of any uric other than "/" and "?". 
The problem is that because reg_name is defined as: 1*( unreserved | escaped | "$" | "," | ";" | ":" | "@" | "&" | "=" | "+" ) it is unaffected by RFC 2732. Thus square brackets are not allowed to appear arbitrarily in the authority component, but can only appear if the authority component matches the server production (as amended by RFC 2732). This means that a generic URI checker now has to do a complex parse of the authority component. This seems completely at variance with the intent of section 3.2.1 of RFC 2396: "The structure of a registry-based naming authority is specific to the URI scheme, but constrained to the allowed characters for an authority component." I would therefore suggest at a minimum that RFC 2732 should be fixed to allow "[" and "]" in reg_name. I also think it would be cleaner and more in harmony with RFC 2396 to also allow them in the path component. In terms of the BNF I would suggest introducing an other_reserved symbol: other_reserved = "&" | "=" | "+" | "$" | "," | "[" | "]" Then in each place in RFC 2396 replace occurrences of "&" | "=" | "+" | "$" | "," (specifically in uric_no_slash, rel_segment, reg_name, userinfo, pchar, reserved) by a reference to other_reserved. I believe this would also make the BNF in RFC 2396 easier to understand. |
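The four checks in this report translate almost directly into code. The sketch below implements only those four numbered steps as written, for the pre-RFC 2732 case the reporter sets aside; it assumes, as the report does, that disallowed characters have already been escaped, and the function name `passes_clark_checks` is hypothetical.

```python
import re

# Scheme pattern quoted in step 3 of the report.
SCHEME = re.compile(r"[a-zA-Z][-+.a-zA-Z0-9]*")


def passes_clark_checks(s: str) -> bool:
    """Hypothetical helper applying the four numbered checks in order."""
    # 1. Every "%" must be followed by two hex digits.
    if re.search(r"%(?![0-9A-Fa-f]{2})", s):
        return False
    # 2. At most one "#" character.
    if s.count("#") > 1:
        return False
    # 3. A ":" that precedes every "/", "?" and "#" marks an absolute URI,
    #    and the text before that colon must be a valid scheme.
    first = re.search(r"[:/?#]", s)
    if first is not None and first.group() == ":":
        colon = first.start()
        if SCHEME.fullmatch(s[:colon]) is None:
            return False
        # 4. The colon must not be followed by "#" or the end of the string.
        rest = s[colon + 1:]
        if rest == "" or rest.startswith("#"):
            return False
    return True


for candidate in ["http://a/b?c#d", "../xyz/", "foo:", "foo:#bar", "a%2", "a#b#c"]:
    print(repr(candidate), passes_clark_checks(candidate))
```

Nothing in this checker inspects the authority component, which is the property the reporter argues RFC 2732 takes away from generic URI checking.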
|
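The four checks listed in the report above can be sketched in Python as follows. This is only an illustration under the report's own assumptions: the input has already been escaped per XLink, and the RFC 2732 bracket rules are set aside. The function name `check_uri_reference` is hypothetical and does not come from any specification.

```python
import re

# Scheme regex quoted in the report: [a-zA-Z][-+.a-zA-Z0-9]*
SCHEME = re.compile(r"[a-zA-Z][-+.a-zA-Z0-9]*\Z")
# A "%" that is NOT followed by two hex digits.
BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def check_uri_reference(s):
    """Apply the report's four checks to an already-escaped string."""
    # Step 1: every "%" must be followed by two hex digits.
    if BAD_ESCAPE.search(s):
        return False
    # Step 2: at most one "#" in the whole string.
    if s.count("#") > 1:
        return False
    # Step 3: a ":" that precedes every "/", "?" and "#" marks an absolute URI.
    first = re.search(r"[:/?#]", s)
    if first and first.group() == ":":
        scheme, rest = s[:first.start()], s[first.start() + 1:]
        if not SCHEME.match(scheme):
            return False
        # Step 4: the colon must not be followed by "#" or end of string.
        if rest == "" or rest.startswith("#"):
            return False
    return True
```

With this sketch, `check_uri_reference("foo:")` and `check_uri_reference("foo:#bar")` are both rejected, matching the two examples the report calls illegal, while a relative reference such as `"../x:y"` passes because the colon does not precede the first slash.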
report:
Grégoire Vatry,
04 Apr 2002,
private mail:
I report what I suspect to be an error in RFC 2732 which updates RFC 2396. I suspect that 'uric_no_slash' set of characters has been forgotten in the list of changes made to the URI generic syntax by RFC 2732. Here is my line of argument: Since: 1. The set 'uric_no_slash' stands for "same as 'uric' BUT without slash"; 2. The set 'uric' is defined as: uric = reserved | unreserved | escaped 3. Slash ("/") is part of 'reserved' set; 4. Set of 'reserved' characters is modified in RFC 2732. As a result, point (3) of section 3. in RFC 2732 should be: (3) Add "[" and "]" to both the set of 'reserved' characters and the 'uric_no_slash' set: reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | "," | "[" | "]" uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | "," | "[" | "]" and remove them from the 'unwise' set: unwise = "{" | "}" | "|" | "\" | "^" | "`" |
|
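The line of argument in the report above can be checked mechanically by treating each rule's punctuation as a set. The sketch below lists only the literal punctuation characters of the RFC 2396 rules (the `unreserved` and `escaped` alternatives are omitted because they are common to both rules); the helper name `missing_from_uric_no_slash` is hypothetical.

```python
# Literal punctuation in the RFC 2396 rules, as quoted in the report.
RESERVED_2396 = set(";/?:@&=+$,")
URIC_NO_SLASH_2396 = set(";?:@&=+$,")

# RFC 2732 adds "[" and "]" to 'reserved' but leaves 'uric_no_slash' alone.
RESERVED_2732 = RESERVED_2396 | set("[]")


def missing_from_uric_no_slash(reserved, uric_no_slash):
    """Characters that 'reserved minus slash' has but 'uric_no_slash' lacks."""
    return (reserved - {"/"}) - uric_no_slash
```

Under RFC 2396 alone the difference is empty; after the RFC 2732 change it is exactly the two square brackets, which is the omission the report describes.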
action:
Brian E. Carpenter,
04 Apr 2002,
private mail:
This indeed appears to be an oversight, thanks. Larry Masinter is thinking about combining these two RFCs in their next update so this needs to go on his list. |
|
action:
Larry Masinter,
04 Apr 2002,
URI-WG mailing list:
I agree that this is an error in RFC 2732, and should be folded in when we merge RFC 2732 with RFC 2396. We would need two independent interoperable implementations of RFC 2732 (with ipv6 addresses), though. |
|
action:
Roy T. Fielding,
22 Oct 2002,
issues list:
Adding square brackets to uric_no_slash is fine, since it only affects the opaque URI syntax. However, adding it to the other places that James Clark suggested would allow square brackets to be used anywhere, which is simply unwise (and why they were not allowed at all before). I can understand why IPv6 chose square brackets as delimiters, but allowing them in path, query, and fragment would cause too many interoperability issues with deployed systems. |
|
action:
Roy T. Fielding,
26 Oct 2002,
draft 00:
IPv6 literals have been added to the list of possible identifiers for the host portion of a server component, as described by RFC 2732, with the addition of "[" and "]" to the reserved, uric, and uric-no-slash sets. Square brackets are now specified as reserved for the authority component, allowed within the opaque part of an opaque URI, and not allowed in the hierarchical syntax except for their use as delimiters for an IPv6reference within host. In order to make this change without changing the technical definition of the path, query, and fragment components, those rules were redefined to directly specify the characters allowed rather than continuing to be defined in terms of uric. |
013-query-slash | slash character should be forbidden in query |
---|---|
closed | query |
report:
A. Carl Douglas,
26 Apr 2001,
RFC editor:
Section 3.4, "Query Component", of RFC2396 (URI syntax) refers to the "/" character as being reserved. Reserving this character creates an inconsistency for some of today's web servers, which confuse part of the Query Component as being part of the Path Component when the "/" character is present in the Query Component. The "/" character should only be permitted in the Path Component of a URI, and elsewhere in the URI it should be escaped by using its hex value. |
|
action:
Roy T. Fielding,
24 May 2001,
private mail:
This is not an error in the spec, though it could be useful as a note in future revisions. The specification cannot disallow characters that commonly do appear in a URI query string, even if it is inadvisable for them to be used. That is why they are listed as reserved in that context (i.e., should not be used unencoded except when the reserved meaning is intended). |
014-empty-opaque_part | syntax does not allow "dav:" or "about:" as URI |
---|---|
fixed 00 | opaque_part |
report:
Julian Reschke,
19 Nov 2001,
WebDAV-WG mailing list:
(1) RFC2518 (WebDAV) is based on XML + namespaces and has chosen to use the namespace name "DAV:" to identify its elements. (Note that "DAV:" *is* a properly registered URI scheme.) (2) The XML namespaces recommendation says that an XML namespace is identified by a URI reference as defined in RFC2396. (3) RFC2396 gives the following grammar for absolute URIs: absoluteURI = scheme ":" ( hier_part | opaque_part ) opaque_part = uric_no_slash *uric "DAV:" doesn't seem to be a valid "opaque_part", because "opaque_part" MUST start with "uric_no_slash", thus it may not be empty. (4) I became aware of this mismatch when trying to develop a RELAX NG schema for WebDAV. James Clark's JING validator rejects the namespace name "DAV:" as invalid URI. So this has become a real-world problem (maybe it was "just" academic before). |
|
action:
Roy T. Fielding,
24 May 2001,
private mail:
will fix BNF |
|
action:
Roy T. Fielding,
14 Sep 2002,
draft 00:
Fixed by making the path optional in the BNF: 2396: relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ] hier_part = ( net_path | abs_path ) [ "?" query ] draft-00: relative-URI = [ net-path | abs-path | rel-path ] [ "?" query ] hier-part = [ net-path | abs-path ] [ "?" query ] |
015-fragment-handling | clarify how URI processor is expected to handle fragment |
---|---|
fixed 00 | fragment |
report:
Jason Diamond,
11 Jan 2002,
URI-WG mailing list:
I'm gathering you want resolveURI to take any URI ref and return an absolute URI reference. Instead, what I would do is define resolveURI as a function that takes any URI-reference-up-to-but-not-including-the-fragment-id and returns the appropriate absolute URI. The fragment id part is never sent to resolveURI and is always re-appended to what resolveURI returns. I based my implementation on the example algorithm in Section 5.2. Despite being titled "Resolving Relative References to Absolute Form", it does cover non-relative URI references (see step 3). Step 2 covers the case where the URI reference is the empty string or just a fragment identifier. In that case, it states that the reference is a "reference to the current document and we are done". Hmm. Looking at this paragraph again, I now think that it might be slightly flawed. It says "and we are done". It doesn't mention that the fragment identifier, if present, should be appended to the URI of the current document. In this model, if resolveURI is handed a null string, it just returns a null string and the calling code would know to use the fragment id to access into the current resource without anyone having to talk about a document URI (which may not exist if, say, you're working on some in-memory view of a dynamic document--and even if there is such a URI, you wouldn't want to use the URI to do a fetch of the document that is the current one anyway). I'm fairly certain that my implementation will produce the correct result as would the model that you suggest above. It passes all of the tests in Appendix C. I'm actually working on an RDF parser (in XSLT) so am not fetching any resources but I do need to convert all URI references to their absolute form and would like that encapsulated into a single function. |
|
action:
Roy T. Fielding,
14 Oct 2002,
draft 00:
Fixed by rewriting the algorithm as pseudocode. |
016-hostname-toplabel | hostname toplabel syntax could be improved |
---|---|
fixed 00 | hostname |
report:
Bruce Lilly,
16 Jan 2002,
private mail:
I believe that there is a discrepancy between 3.2.2 and the DNS specifications referenced there. The definition in 3.2.2 for hostname is: hostname = *( domainlabel "." ) toplabel [ "." ] domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum toplabel = alpha | alpha *( alphanum | "-" ) alphanum That permits a lone toplabel as the hostname, which could of course apply to the URI "http://localhost". The definitions of domainlabel and toplabel appear to be consistent with the DNS specifications, as amended by RFC 1123 (but with the proviso that the length limits specified by DNS are missing), but I believe that there are some problems with the definition of hostname in terms of those tokens. In particular, the semantics of the above example differ from what is implied by the name "toplabel". The syntax permits URIs like "http://localhost." and "http://edu", which don't seem quite right, and it forbids "http://1xyz", where "1xyz" is a valid unqualified host name (in the DNS sense). I believe that a more consistent (with DNS and the text of sect. 3.2.2) definition of hostname syntax would be: hostname = domainlabel [ *( "." domainlabel ) "." toplabel [ "." ] ] Does that seem reasonable? The grouping within the specifications of domainlabel and toplabel could be clarified by parenthesization: domainlabel = alphanum | ( alphanum *( alphanum | "-" ) alphanum ) toplabel = alpha | ( alpha *( alphanum | "-" ) alphanum ) or equivalently but more compactly as: domainlabel = alphanum [ *( alphanum | "-" ) alphanum ] toplabel = alpha [ *( alphanum | "-" ) alphanum ] |
|
action:
Roy T. Fielding,
28 Oct 2002,
draft 00:
Changed to reflect all of the suggestions: hostname = domainlabel [ qualified ] qualified = *( "." domainlabel ) [ "." toplabel "." ] domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ] toplabel = alpha [ 0*61( alphanum / "-" ) alphanum ] alphanum = ALPHA / DIGIT |
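The draft 00 hostname rules quoted above can be transcribed into a regular expression to see which names they accept. This is a rough Python sketch of that ABNF only, not of DNS itself, and the names `DOMAINLABEL`, `TOPLABEL`, `HOSTNAME` and `is_hostname` are hypothetical.

```python
import re

ALNUM = "[A-Za-z0-9]"
# domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ]
DOMAINLABEL = ALNUM + "(?:[A-Za-z0-9-]{0,61}" + ALNUM + ")?"
# toplabel = alpha [ 0*61( alphanum / "-" ) alphanum ]
TOPLABEL = "[A-Za-z](?:[A-Za-z0-9-]{0,61}" + ALNUM + ")?"
# hostname = domainlabel [ qualified ]
# qualified = *( "." domainlabel ) [ "." toplabel "." ]
HOSTNAME = re.compile(
    DOMAINLABEL + r"(?:\." + DOMAINLABEL + r")*(?:\." + TOPLABEL + r"\.)?"
)


def is_hostname(name):
    """True if the whole string matches the transcribed hostname rule."""
    return HOSTNAME.fullmatch(name) is not None
```

Under this transcription the unqualified name `"1xyz"` from the report is accepted and `"localhost."` is rejected, while a fully qualified name with a trailing dot such as `"example.org."` is accepted because its last label is alphabetic. Labels are capped at 63 characters by the `0*61` repetition.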
017-rdf-fragment | RDF does not believe in same-document references |
---|---|
fixed 02 | fragment |
report:
Jeremy Carroll,
10 Apr 2002,
URI-WG mailing list:
This is a comment about RFC 2396 that I have been actioned to send on behalf of the W3C RDF Core Working Group [1]. The key issue concerns resolving same document references and/or resolving against non-hierarchical URIs. These have been causing us difficulty in using xml:base. As one of our deliverables we produce test cases [2]. A summary table of our URI resolution problems is as follows; the answers we have agreed are in the attached HTML file. EASY: a "http://example.org/dir/file" "../relfile" b "http://example.org/dir/file" "/absfile" c "http://example.org/dir/file" "//another.example.org/absfile" GETTING HARDER: d "http://example.org/dir/file" "../../../relfile" e "http://example.org/dir/file" "" f "http://example.org/dir/file" "#frag" MASTER CLASS: g "http://example.org" "relfile" h "http://example.org/dir/file#frag" "relfile" i "http://example.org/dir/file#frag" "#foo" j "http://example.org/dir/file#frag" "" k "mailto:Jeremy_Carroll@hp.com" "#foo" l "mailto:Jeremy_Carroll@hp.com" "" m "mailto:Jeremy_Carroll@hp.com" "relfile" We have reached consensus on and approved all these tests except for the last which some of us consider an error and others resolve as indicated in the html file. The rationales for our views are approximately as follows: d "http://example.org/dir/file" "../../../relfile" [[[RFC2396 In practice, some implementations strip leading relative symbolic elements (".", "..") after applying a relative URI calculation, based on the theory that compensating for obvious author errors is better than allowing the request to fail. ]]] Not permitted in RDF/XML. e,f,i,j,k,l Base does apply to same document references in RDF/XML g Failure to insert / is a bug with RFC 2396 h,i,j Strip frag id from base uri ref before resolving. Notice j is particularly surprising. k,l Same document reference resolution even works for non-hierarchical uris. 
m - no consensus The test suite is structured as follows: The positive tests on the test cases web site show a usage of xml:base in RDF/XML and the resolution of that usage in terms of the RDF graph produced (with absolute URI ref labels). Each test consists of two files, an RDF/XML document and an n-triple file (substitute .rdf with .nt in the URL), being a list of the edges of the graph. The negative test case shows possibly illegal usage of xml:base in RDF/XML. [1] http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Apr/0008.html [2] http://www.w3.org/2000/10/rdf-tests/rdfcore/xmlbase/ |
|
report:
Jeremy Carroll,
15 Apr 2002,
URI-WG mailing list:
I do not recall the RDF Core WG having resolved a justification of the decision in favour of these test cases. Hence I will give my own justification. First: The actual decisions of the RDF Core WG reflect what 'same document references' mean within an RDF/XML document within the scope of an xml:base attribute. Primarily the WG decisions reflect the meaning of RDF/XML rather than XML Base or RFC 2396. However, these decisions do point to weaknesses in RFC 2396. The RDF Core WG has consistently (with or without xml:base) interpreted all uri references as absolute uri references. The decisions clarify that when the normal uri resolution mechanisms deliver a same document reference, we form the absolute uri ref using the currently in scope xml:base uri. Second: The definition of same-document references is unfortunately focussed on browsing: [[[ 4.2. Same-document References A URI reference that does not contain a URI is a reference to the current document. In other words, an empty URI reference within a document is interpreted as a reference to the start of that document, and a reference containing only a fragment identifier is a reference to the identified fragment of that document. Traversal of such a reference should not result in an additional retrieval action. However, if the URI reference occurs in a context that is always intended to result in a new request, as in the case of HTML's FORM element, then an empty URI reference represents the base URI of the current document and should be replaced by that URI when transformed into a request. ]]] line 3 "start of that document" is meaningless for an RDF document. RDF is a graph and is not a linear structure. line 6 "no additional retrieval action" All URIrefs in RDF are absolute, and none are retrieved except when the application content "is always intended to result in a new request". The RDF Core is trying to clarify which absolute URI ref corresponds to a same document ref. 
line 9 The answer, at least for empty same document refs, is the "base URI". We discover what a base URI is in section "5.1 Establishing a Base URI" [[[ 5.1. Establishing a Base URI The term "relative URI" implies that there exists some absolute "base URI" against which the relative reference is applied. Indeed, the base URI is necessary to define the semantics of any relative URI reference; without it, a relative reference is meaningless. In order for relative URI to be usable within a document, the base URI of that document must be known to the parser. ]]] I note that the algorithm in 5.2. Resolving Relative References to Absolute Form amongst its defects, does not implement line 9 of section 4.2. Once we are dynamically changing the xml:base from one element to the next, we are outside the design bounds of RFC 2396. If we consider only documents with a single xml:base on their outermost elements, then as far as RDF goes, the resolution of the same document test cases is consistent with section 4.2 of RFC 2396. A same document reference, like any uri ref, in an RDF file means an absolute URI ref. The absolute URI ref is formed by taking "the base URI" of the document, as suggested in line 9 of 4.2. The fragment part is taken from the same document reference. |
|
report:
Al Gilman,
15 Apr 2002,
URI-WG mailing list:
The bad news: In fact, "the same document" in fragment-only relative references should be taken even more locally and particularly than "the URI from which this representation was recovered." The latter reading is inadequate, an error. It should be read as "this representation." So the type is known, and with it the semantics of #fragment references. Without recourse to _even_ the URI from which it was recovered. As Paul suggested. For hyperlinks with goTo semantics, where the absolute URI equivalent of the reference is unnecessary, it is moot and therefore not defined. The best available absolute reference (nearest to equivalent) would be base-ified using the URI from which this representation was recovered, but that question has no need and no standing in the case of following hyperlinks in browsing the same "recovered representation." There is no general answer, absent a universal document type (see next). The good news: The semantics of #fragment in "the current document" is governed by the _type_ of the recovered representation of the URI accessed. So for RDF to apply the semantic constraint that a #fragment reference is equivalent to a given absolute URI -- within a representation which belongs to a type which by its type definition is bound to the constraints of the RDF model -- is entirely within the purview of the specification of the RDF model and the languages in which it is represented. This violates the universality goal that any URI-reference can be used in any place a URI-reference can be used, but that is a different matter. This is also violated by having some references take anyURI and others limited to IDREF in the same document. The RDF restriction to absolute-URI-reference senses for fragment-URI-reference signs does not violate RFC-2396, at least. This is just that the RDF model only admits of 'absolute' references. So references in any syntax binding of the RDF model will only contain 'absolute' URI-references. |
|
report:
Brian McBride,
15 Apr 2002,
URI-WG mailing list:
First: the problem RDF is trying to solve. The current RDF specs have encouraged the use of the following idiom: <rdf:Description rdf:about="#foo"> ... The value of the rdf:about attribute is turned into an absolute URI reference by concatenating the '#foo' with the URI of the containing document. This causes problems. Folks copy the file from the web to their hard drive so they can work on it in a plane, and the uri changes to something like file:c:\temp\....rdf and this is really useless for rdf users. Or folks wish to include RDF in say a message protocol where there is no base uri of the document. This is the cause of one of, if not the, most frequent newbie problem with DAML that we see on jena-dev. So we are looking for a way to retain this convenient syntax, but have the uri's produced not change when the file is copied or mirrored. To appreciate what is happening here, we need to look at a semi-fictional RDF processing pipeline: input xml document -- xml parser -- rfc2396 processor -- rdf parser -- rdf graph We start with an xml document and end up with a datastructure. The datastructure is not a DOM; it's not a representation of an xml document. It is as far as xml is concerned, an application data structure. For each value of an rdf:about attribute, the rfc2396 processor outputs either an absolute URI or a same document reference. The absolute URI is processed according to RFC2396. Same document references are recognised according to RFC 2396. All is in conformance with rfc 2396 at this point. Now the RDF parser comes into play and it is required to transform the value of each rdf:about attribute into an absolute uri reference. If the RFC 2396 processor has produced an absolute uri reference, it need do nothing. If however, it is a same document reference, then, just as a browser will handle same document references specially, so does RDF. It transforms the same document reference into an absolute URI according to an algorithm defined by the RDF specs. 
The mimetype of an rdf document will be text/xml+rdf. As far as xml base and rfc 2396 are concerned, this is application code over which they have no say. What I have tried to do here is to position RDF as an application built on top of XML and to suggest that XML should not be allowed to express constraints on how applications process it. There is a deal of sophistry in this argument :( but RFC 2396 doesn't really meet our needs. Are there any plans to update/refine it in the near future? |
|
report:
Brian McBride,
30 Jan 2003,
URI-WG mailing list:
Please review the RDFCore last call working drafts which are linked from http://www.w3.org/2001/sw/RDFCore/#documents Whilst we would welcome your comments on any and all aspects of these documents, the WG particularly requests feedback on: o the proposed used of xml:base, and especially its handling of same document references http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-ID-xml-base http://www.w3.org/TR/rdf-syntax-grammar/#section-baseURIs o the rdf interpretation of fragment identifiers http://www.w3.org/TR/rdf-concepts/#section-fragID The last call period for these documents ends on 21 Feb 2003. |
|
report:
Graham Klyne,
05 Mar 2003,
URI-WG mailing list:
Is there a way to specify a fragment identifier relative to the document in the current base URI? I can't see a way to do this. If I use ./#frag, then the final path component of the base URI is omitted. So I see no way of indicating a fragment of the base URI without including some part of the base URI. Er, "#frag", right? What am I missing? According to the URI spec, that is relative to the *current document*, as opposed to the current base URI. For example, when xml:base is used within an XML document, the #frag is not (as I understand) relative to the base URI. The URI spec is quite explicit about stating that when resolving #frag relative to some base URI, it refers to a fragment of the *current document* as distinct from the base URI; cf. algorithm in section 5.2. |
|
action:
Roy T. Fielding,
07 Mar 2003,
URI-WG mailing list:
Note that this issue is a request to change the "current document" algorithm. This can be accomplished by changing the spec to remove the bit about current document and instead replace the empty URI with the base URI, later stating that a retrieval action must not take place if the new URI differs from the base URI only by its fragment. |
|
report:
Rob Cameron,
05 May 2003,
URI-WG mailing list:
In my implementation, I've assumed the following change in the pseudocode for the algorithm in 5.2 if (R.path == "") then if defined(R.query) then T.path = Base.path; T.query = R.query; else -- An empty reference refers to the current document return (current-document, fragment); endif; becomes if (R.path == "") then T.path = Base.path; if defined(R.query) then T.query = R.query; else T.query = Base.query; endif; This seems consistent with the requests of the RDF group and gives a clean, well-behaved algorithm. |
|
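The revised empty-path branch proposed in the report above can be sketched in Python. The sketch covers only references with no scheme, no authority and an empty path (the "same-document" cases i, j, k and l from the RDF list); the splitting regular expression is the one given in Appendix B of RFC 2396, and the function names `split_ref` and `resolve_empty_path` are hypothetical.

```python
import re

# Component-splitting regex from RFC 2396, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_ref(ref):
    """Split a URI reference; absent components are None, not ""."""
    m = URI_RE.match(ref)
    return {"scheme": m.group(2), "authority": m.group(4),
            "path": m.group(5), "query": m.group(7), "fragment": m.group(9)}


def resolve_empty_path(base, ref):
    """Resolve a reference whose path is empty, per the proposed branch."""
    b, r = split_ref(base), split_ref(ref)
    if r["scheme"] is not None or r["authority"] is not None or r["path"] != "":
        raise ValueError("this sketch only handles empty-path references")
    # T.path = Base.path; T.query = R.query if defined, else Base.query.
    query = r["query"] if r["query"] is not None else b["query"]
    target = b["scheme"] + ":"
    if b["authority"] is not None:
        target += "//" + b["authority"]
    target += b["path"]
    if query is not None:
        target += "?" + query
    # The base fragment is dropped; only the reference's fragment is kept.
    if r["fragment"] is not None:
        target += "#" + r["fragment"]
    return target
```

For example, `resolve_empty_path("http://example.org/dir/file#frag", "#foo")` yields `"http://example.org/dir/file#foo"` in this sketch, and the same function works on a non-hierarchical base such as `"mailto:Jeremy_Carroll@hp.com"`. These outputs are what the sketch computes; they are not quoted from the RDF Core test suite.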
action:
Roy T. Fielding,
23 May 2003,
draft 02:
Removed the special-case treatment of same-document references in favor of a section that explains that a new retrieval action should not be made if the target URI and base URI, excluding fragments, match. |
018-IPv6-example | RFC 2732 example bug |
---|---|
added 00 | IPv6 |
report:
Robert Graf,
24 Apr 2002,
private mail:
On RFC 2732 Page 1 / Point 2 you can find this example: http://[::192.9.5.5]/ipng 1. When I take a look on the RFC 2373 logic (Page 21/Appendix B): IPv6address = hexpart [ ":" IPv4address ] IPv4address = 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT IPv6prefix = hexpart "/" 1*2DIGIT hexpart = hexseq | hexseq "::" [ hexseq ] | "::" [ hexseq ] hexseq = hex4 *( ":" hex4) hex4 = 1*4HEXDIG 2. When I take a look on the RFC 2732 logic update (Page 2): host = hostname | IPv4address | IPv6reference ipv6reference = "[" IPv6address "]" 3. Let's do the example. 3.1. When we split the 'host' we land in 'IPv6reference' and then in 'IPv6address'. 3.2. In the 'hexpart' we land in the 3rd part with "::192" which is ok. But what should happen now with '.9.5.5'? It's definitely not a part of the description above but should be valid as described in RFC 2732. |
|
report:
Robert Graf,
26 Apr 2002,
private mail:
You should also change "host = hostname | IPv4address | IPv6reference" to "host = hostname | IPv6reference | IPv4address" because the IP4address is filled via the IPv6reference |
|
action:
Roy T. Fielding,
26 Oct 2002,
draft 00:
IPv6 literals have been added to the list of possible identifiers for the host portion of a server component, as described by RFC 2732, but in the reverse order to reflect disambiguation rules. Since RFC 2732 defers to RFC 2373 for definition of an IPv6 literal address, which unfortunately has an incorrect ABNF description of IPv6address, I created a new ABNF rule for IPv6address that matches the text representations defined by Section 2.2 of RFC 2373. Likewise, the definition of IPv4address has been improved in order to limit each decimal octet to the range 0-255. |
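To illustrate the last point, the sketch below contrasts the older `1*3DIGIT` form of IPv4address (quoted in the report above) with a pattern that limits each octet to 0-255. The draft's actual ABNF is not quoted on this page, so the range-limited pattern here is an assumption about one way to express that limit; rejecting leading zeros is a choice made by this sketch. The names `OLD_IPV4`, `DEC_OCTET` and `NEW_IPV4` are hypothetical.

```python
import re

# Older form: IPv4address = 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT
OLD_IPV4 = re.compile(r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}")

# Assumed range-limited octet: 0-9, 10-99, 100-199, 200-249, 250-255.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
NEW_IPV4 = re.compile(DEC_OCTET + r"(?:\." + DEC_OCTET + r"){3}")


def matches(pattern, text):
    """True if the whole string matches the given compiled pattern."""
    return pattern.fullmatch(text) is not None
```

The address `"192.9.5.5"` matches both patterns, whereas `"999.999.999.999"` matches only the older one.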
019-URI-URL-URN | URI/URL/URN contemporary view |
---|---|
fixed 02 | terminology |
report:
Michael Mealling,
01 May 2002,
URI-WG mailing list:
I think the consensus built in the IG and reported in draft-mealling-uri-ig-02.txt is a good place to start. Especially the recommendation: 1. The W3C and IETF should jointly develop and endorse a model for URIs, URLs and URNs consistent with the "Contemporary View" described in section 1, and which considers the additional URI issues listed or alluded to in section 3. Just so you won't have to go dig the draft up, this is the "Contemporary View": Over time, the importance of this additional level of hierarchy seemed to lessen; the view became that an individual scheme does not need to be cast into one of a discrete set of URI types such as "URL", "URN", "URC", etc. Web-identifier schemes are in general URI schemes; a given URI scheme may define subspaces. Thus "http:" is a URI scheme. "urn:" is also a URI scheme; it defines subspaces, called "namespaces". For example, the set of URNs of the form "urn:isbn:n-nn-nnnnnn-n" is a URN namespace. ("isbn" is an URN namespace identifier. It is not a "URN scheme" nor a "URI scheme"). Further according to the contemporary view, the term "URL" does not refer to a formal partition of URI space; rather, URL is a useful but informal concept: a URL is a type of URI that identifies a resource via a representation of its primary access mechanism (e.g., its network "location"), rather than by some other attributes it may have. Thus as we noted, "http:" is a URI scheme. An http URI is a URL. The phrase "URL scheme" is now used infrequently, usually to refer to some subclass of URI schemes which exclude URNs. |
|
action:
Roy T. Fielding,
27 Oct 2002,
draft 00:
Fixed by rewriting the section on URI, URL, and URN, and changing all use of the term URL in the specification to URI. |
|
report:
Tim Bray,
21 Feb 2003,
URI-WG mailing list:
Sec 1.2 - the spec says it deprecates the terms URL and URN and I'm not sure it really does. What it's really deprecating is the notion of a clean useful separation between locators and names. I've never seen "URN" used in this sense anyhow, in fact I've never seen it used aside from a reference to what the URN RFC defines, which is hard to argue against. If you want to deprecate the term URL that's at least consistent, although once again I have some nervousness about trying, in the Academie Francaise style, to stop people from using words they want to use. Potential reword of the paragraph: 'An individual scheme does not need to be classified as being just one of "name" and "locator". Instances of URIs from any given scheme may have the characteristics of names or locators or both, often depending on the persistence and care in the assignment of identifiers by the naming authority, rather than any quality of the scheme. For this reason, this specification deprecates the use of the term URN for anything but URIs in the "urn" scheme as described in RFC 2141. This specification also deprecates the term "URL".' Sec 1.2, fourth para; the phrase "just like any identifier" is superfluous. |
|
action:
Roy T. Fielding,
02 May 2002,
draft 02:
Done. |
020-utf8-default | Defaulting to UTF-8 for unknown encoding |
---|---|
closed | characters |
report:
Roy T. Fielding,
01 May 2002,
URI-WG mailing list:
The only thing I want to include is the default: %xx means the character encoded as xx in UTF-8. That is already the default for MSIE and should be for other browsers as well, and will simplify the specification. |
|
report:
Bjoern Hoehrmann,
04 May 2002,
URI-WG mailing list:
I disagree. While it's the default in MSIE for URIs the user enters into the address bar, it's not the default for the vast majority of %xx encoded octets requested by MSIE. They originate from HTML forms where MSIE uses the document or user selected character encoding scheme to generate the octets, hence most %xx encoded octets representing non-ASCII characters are not part of valid UTF-8 sequences. There is no facility to define any other encoding than UTF-8, hence applications assuming UTF-8 encoding are said to fail. |
|
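The disagreement above is easy to see with a single non-ASCII character. The Python sketch below uses the standard library's `urllib.parse.quote` and `unquote`; the choice of "é" and of ISO-8859-1 as the form's encoding is an assumption made purely for illustration.

```python
from urllib.parse import quote, unquote

# The same character escapes to different octets under different encodings.
as_utf8 = quote("é", encoding="utf-8")       # two octets
as_latin1 = quote("é", encoding="latin-1")   # one octet

# Decoding Latin-1 octets under a UTF-8 default does not recover the text:
# the invalid sequence becomes U+FFFD because errors="replace" is the default.
misread = unquote(as_latin1, encoding="utf-8")

# Decoding succeeds only when the original encoding is known.
recovered = unquote(as_latin1, encoding="latin-1")
```

Here `as_utf8` is `"%C3%A9"` and `as_latin1` is `"%E9"`. A receiver that assumes UTF-8 gets the replacement character for the second form, which is the failure mode the report describes for form submissions.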
report:
Martin Duerst,
29 May 2002,
URI-WG mailing list:
I would be extremely delighted if we could just go and say "it's UTF-8, and nothing else". Unfortunately, that's not possible. But I think it's a very good idea to make clear in the revision that UTF-8 is where things are moving, rather than just the current "For example, UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences of characters in the repertoire of ISO 10646." |
|
action:
Roy T. Fielding,
02 Mar 2003,
URI-WG mailing list:
More UTF-8 examples are given in draft 01. That's all for now. |
021-relative-examples | relative URI examples could be improved |
---|---|
fixed 02 | examples |
report:
Larry Masinter,
16 May 2002,
URI-WG mailing list:
The example of resolving a relative URL could be improved. It uses a base of http://a/b/c/d;p?q Not wanting to read the RFC end to end, it took me a bit of searching to find that the ;p part is a "parameter" and the ?q part is a "query". But I have no idea what their relevance is to this example. If they are to be ignored when attaching the relative parts, it would be nice to say so. The basic expansion has one very confusing and not explained aspect. The relative path g is said to expand to http://a/b/c/g instead of http://a/b/c/d/g. The other expansions are obvious once the "remove d" rule is applied. Would a base of http://a/b/c/d/ plus g expand to http://a/b/c/d/g? The examples should have enough annotation to mostly stand on their own and to reinforce the concepts. |
|
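The "remove d" rule asked about above is the merge step of relative resolution: everything after the last "/" of the base path is discarded before the relative path is appended, and the base query is not carried over. A minimal Python sketch of just that step follows; the function name `merge_paths` is hypothetical, and only paths (not whole URIs) are handled.

```python
def merge_paths(base_path, ref_path):
    """Keep the base path up to and including its last "/", then append."""
    return base_path[: base_path.rfind("/") + 1] + ref_path
```

With the base path `"/b/c/d;p"` the reference `"g"` merges to `"/b/c/g"`, because `d;p` is the last segment and is dropped together with its parameter. With the base path `"/b/c/d/"` the last segment is empty, so the same reference merges to `"/b/c/d/g"`.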
report:
Stefan Eissing,
17 May 2002,
URI-WG mailing list:
I found them to be very helpful in their current form. The only thing I would state differently is the handling of too many ../ in the resolved uri. The RFC currently states that base http://host/a/b ref ../../c resolves to http://host/../c and continues that removing the /.. at the beginning is allowed. My observation is that removing /.. is the norm nowadays and therefore the example should be the other way with a note that keeping /.. is allowed. |
|
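The two behaviours contrasted above, keeping or dropping an excess "..", can be sketched as one Python function with a switch. This is a simplified illustration that assumes an absolute path starting with "/" which has already been merged with the base; it is not the algorithm text of any RFC or draft, and the names `remove_dot_segments` and `keep_excess` are hypothetical.

```python
def remove_dot_segments(path, keep_excess=False):
    """Collapse "." and ".." segments in an absolute path."""
    segments = path.split("/")[1:]  # drop the empty string before the leading "/"
    out = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            if out and out[-1] != "..":
                out.pop()            # ".." cancels the previous segment
            elif keep_excess:
                out.append("..")     # nothing left to cancel: keep it
            continue                 # otherwise silently drop the excess ".."
        out.append(seg)
    # A path that ended in "." or ".." names a directory: keep the trailing "/".
    if segments and segments[-1] in (".", ".."):
        out.append("")
    return "/" + "/".join(out)
```

For the report's example, base `http://host/a/b` and reference `../../c` merge to the path `/a/../../c`. The sketch returns `"/c"` by default and `"/../c"` with `keep_excess=True`, which correspond to the two outcomes discussed.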
action:
Roy T. Fielding,
17 May 2002,
URI-WG mailing list:
The examples are intended to identify common bugs or deprecated features in software. The role of ";" changed from RFC 1808, so the tests can be used to differentiate between an 1808-compliant parser and a 2396-compliant parser, thus identifying places where changes are needed. I'd like to expand the tests, particularly with other example base URI, since there is one errata that would have been discovered that way. More annotation is welcome. |
|
action:
Roy T. Fielding,
23 May 2003,
draft 02:
The examples have been updated for the new treatment of dot-segments and double-quotes have been added as delimiters to prevent the RFC editing process from losing them during nroff processing. The examples have been moved to the end of the section on relative resolution, since that is where the process is described. |
022-definitions | definitions for operations on URIs |
---|---|
added 02 | terminology |
report:
Larry Masinter,
13 Jul 2002,
URI-WG mailing list:
http://lists.w3.org/Archives/Public/www-tag/2002Jul/0169.html These look like interesting possible additions to the URI specification. URI Resolution: The process of determining an access mechanism and appropriate parameters necessary to dereference a URI. e.g. in the case of an HTTP URI, this process resolves the URI into an IP address, a port number, a host name (possibly optional) and a request URI. Resolution may require several iterations. URI Dereference: The process of using an access mechanism and parameters generated by URI resolution to create, inspect or modify resource state. URI Retrieval: The use of URI dereference to retrieve representations of resource state. |
|
action:
Roy T. Fielding,
06 May 2003,
draft 02:
I have added the definitions and reorganized the entire section 1. |
023-URI-plural | URI or URIs for plural |
---|---|
fixed 00 | terminology |
report:
Tim Bray,
09 Aug 2002,
www-tag mailing list:
I note that Roy of late has been using URI as its own plural. Elegant and defensible, but I prefer URIs as less surprising to the eye. Even more, I prefer consistency. Clearly this is a subject on which consensus is not remotely possible. |
|
action:
Roy T. Fielding,
17 May 2002,
URI-WG mailing list:
I prefer whichever one is easier to say while speaking, since I do not believe in the theory that people expand acronyms as they read. I am fine with either one, provided I only have to change it once. |
|
action:
Roy T. Fielding,
17 Oct 2002,
draft 00:
Fixed by rewriting URI to "a URI" or URIs, as appropriate. |
024-identity | Resource should not be defined as anything that has identity |
---|---|
fixed 02 | terminology |
report:
Miles Sabin,
09 Sep 2002,
URI-WG mailing list:
http://lists.w3.org/Archives/Public/uri/2002Sep/0016.html At issue is the first sentence of the informal definition of resource in RFC 2396 1.1, A resource can be anything that has identity. "that has identity" is redundant because *everything* has identity in the only reasonably straightforward understanding of identity, ie. the logical truth in all but the most obscure formal systems that, (Vx) x = x Even though redundant, this qualifier has had the unfortunate consequence of leaving this sentence open to wildly different interpretations, * It has been read as implying that the set of possible resources is a subset of the set of things: the subset that has identity as opposed to the subset that doesn't. Dan Brickley reports that this confusion, and the subsequent hunt for things which *don't* have identity and some means for identifying them, has caused trouble in RDF circles. * It has been misread as, A resource can be anything that has an identifier (eg. a URI). * It has been misread as, A resource can be anything that can be identified (via some effective mechanism). I don't believe that any of these were the authors' intent, so to clear up any confusion, the "that has identity" qualifier should be dropped. That still leaves open the question of whether or not the residual, A resource can be anything. is either true or makes sense. This is controversial, no doubt, but it's better not to have the controversy obscured by a distracting qualification. |
|
action:
Roy T. Fielding,
12 Sep 2002,
issues list:
The sentence says "can be", which implies exactly what I meant it to imply: that anything with identity can be a resource but not necessarily is a resource. I see no reason to change it. The important bit is that sameness of identity is the important characteristic -- the defining characteristic -- of a resource. The goal of the sentence is to describe the essence of what it means to be a resource. None of the other suggestions do that. |
|
report:
Pat Hayes,
21 Apr 2003,
URI-WG mailing list:
1. I appeal to the WG to please explain in more detail what the word "resource" is intended to refer to, if only in broad outline. In particular, if there is an intent to limit the meaning of "resource" to some subset of the universe of logically possible entities, it would be most valuable if this could be spelled out as clearly as possible. This issue appears to be central to many aspects of the semantic web, and probably to the web more generally. The language of the introductory sections of RFC 2396, reproduced in the current version of your document draft, is not sufficient to achieve a clear communication of this intent as it stands. As some examples, are any of the following NOT resources in the sense used in your document? a. A document which has not yet been written, eg a book in progress, which has not (yet) been assigned a title or ISBN number. b. A particular elephant, eg one in a zoo. c. A particular elephant which is now dead, eg the original Jumbo. d. A particular elephant which it is hoped will be the product of a future mating between two elephants. e. Santa Claus (in any sense, eg as a fictional character, or as a concept in folk mythology, or whatever. Or use Sherlock Holmes or Superman or any other fictional character, if you prefer.) f. The planet Mars. g. The number one thousand seven hundred and twenty-nine. h. An abstract class or category, such as the class of all types of French red wine. ---- 2. Miles Sabin, in an archived email comment, points out that the phrase 'that has an identity' is redundant as a qualifier, since everything necessarily has an identity. Your response says that 'The goal of the sentence ("A resource can be anything that has an identity.") is to describe the essence of what it means to be a resource' and that 'sameness of identity is the ... defining characteristic of a resource'. 
The only way I can interpret this is as saying that a resource can be anything, since the defining characteristic is apparently a tautology. Is that what you intended? If not, can you clarify your intended meaning? In particular, how do the following sentences differ in meaning, in your view? A. Anything with identity can be a resource but not necessarily is a resource. B. Anything can be a resource but not necessarily is a resource. It might help if you could indicate what you consider the phrase 'has an identity' to mean, particularly when used as a qualifier, perhaps by giving an example of something that does not have an identity, in your sense. ---- 3. I would like to ask for some explication of the use of the words "can be" in the definition, to which you draw attention in your reply to Sabin. I take it that this is intended to convey that there is a distinction between entities which could possibly be resources, and those that actually are resources. If this is right, can you explain the criteria for distinguishing actual from merely possible resources? That is, suppose X is something which *could be* a resource; what would make X *actually be* a resource? Can something become an actual resource at a time, or cease to be a resource at a time? Can something be intermittently an actual resource, or must each actual resource have an uninterrupted period during which it is being the resource that it in fact is? Questions like this will be central if we try to make formal theories of resource-hood for use by reasoners. ---- 4. RFC 2396 includes a particular note which is very hard to interpret: "The resource is the conceptual mapping to an entity or set of entities, not necessarily the entity which corresponds to that mapping at any particular instance in time. Thus, a resource can remain constant even when its content---the entities to which it currently corresponds---changes over time, provided that the conceptual mapping is not changed in the process." 
There are several problems with this. First, it does not specify what it means by "conceptual mapping", nor how such a mapping can remain constant while its range changes. Second, it does not say what is meant by the phrase "entity which corresponds to [a] mapping at [an] instant of time". What does it mean for something to 'correspond to' a mapping? Third, the use of the word "content" seems to suggest that resources are something like representations or descriptions, rather than the entities which are represented or described; but this seems to be at odds with what the document says in the immediately preceding paragraphs. For example, we are told explicitly that a person or a book can be a resource, but neither people nor books are the kinds of entity which would normally be described as having "content". Fourth, the reference to time and change seems to imply that resources are inherently temporal or dynamic in their nature; but this does not seem to be reflected in any other part of the document, or in URI syntax, or in the examples given explicitly in the immediately previous paragraphs. For example, what kind of mappings can have different things 'corresponding' to them at different times? Fifth, is this paragraph supposed to apply to all resources, or only to indicate that some resources may be dynamic in the way indicated? (My purpose, let me emphasize, is not to urge that any particular interpretation be put on these words, only that their intended meaning be spelled out more clearly. ) ---- 5. 
The RFC 2396 text explicitly asserts that "not all resources are network "retrievable" ", but almost immediately then says: "having identified a resource, a system may perform a variety of operations on the resource, as might be characterized by such words as 'access', 'update', 'replace' or 'find attributes' " These assertions seem to be at odds with each other, and to reflect different notions of 'resource', since the second sentence seems to refer only to entities which are "network-retrievable". Clearly, a resource which is not retrievable is not available to have operations performed on it, even if it is in some sense identified. As an example, the SS number of a dead US citizen is sufficient to 'identify' that person in a sense, but does not provide any way to perform operations on the deceased. Again, it would be helpful if the apparent contradiction could be explained. |
|
action:
Roy T. Fielding,
22 Apr 2003,
issues list:
I explained rfc2396's usage of "identity" in http://lists.w3.org/Archives/Public/www-tag/2002Jul/0128.html |
|
report:
Tim Bray,
22 Apr 2003,
URI-WG mailing list:
I have a suggested wording change, because while I have been largely unimpressed by the philosophical jargon being thrown around here recently, I do agree that the current definition "A resource can be anything that has identity" offers significant room for improvement; among other things it deserves to be called out and not sequestered in a <dd>. Here you go: Resources and URIs Many different abstract, informational, and physical things may be resources. URIs exist to identify resources, but this "identity" relationship has both social and technical dimensions. For example, it is incontrovertible that the URI http://www.tbray.org/A0.png identifies a resource which is a particular bitmapped graphic (I assert this, I control tbray.org, and the assertion is verifiable via technical means) and that the URI http://www.w3.org/1999/xhtml identifies a resource which is a well-known markup vocabulary (established by social convention). It is possible for ambiguity to enter this relationship; for example, does http://www.w3.org/Consortium identify an organization or a particular HTML page on its website? A few principles apply: - While the definitions of URI and Resource are somewhat circular, the existence of a URI does not imply the existence of a resource. For example, the URI http://example.com/386751531 identifies no resource. - Formally, resources could exist without URIs - for example, there is a picture of my cat somewhere on http://www.tbray.org but I'm not publishing a URI. However, such resources have no practical import or utility. - URI schemes may impose constraints on the types of resource they identify; for example, ftp: URIs identify files and directories accessible using the FTP protocol. - Ambiguity in the characterization of what resource a URI identifies is always undesirable and reduces the utility of both the resource and the URI. |
|
action:
Roy T. Fielding,
23 Apr 2003,
issues list:
A ridiculous amount of discussion took place on the mailing list regarding this issue without illuminating it further, so I won't copy it here except by reference to the main threads: http://lists.w3.org/Archives/Public/uri/2003Apr/0028.html http://lists.w3.org/Archives/Public/uri/2003Apr/0041.html http://lists.w3.org/Archives/Public/uri/2003Apr/0062.html |
|
report:
Joshua Allen,
23 Apr 2003,
URI-WG mailing list:
As far as I can tell, it is only the differing choice of words that makes everyone appear to be disagreeing. As long as the words chosen are clearly defined, I see no point in getting hung up over *which* word is used. We divide the world up like this: A. There are things. Everything is a "thing". There are no exceptions. B. There are things which *might* have a URI bound to them. C. There are things which *do* have a URI bound to them. Is B the same thing as A? *That* question is irrelevant and not worth arguing about IMO. It seems like the only *legitimate* confusion is around the names for A/B and C. I personally have always thought that: A="thing", B="resource", and C="resource with a URI". You (MM) are saying that: A="thing", B="thing", C="Resource" I personally have no problem accepting your naming for "C", so long as it is very clear that this is different than A or B. I would also (personally) suggest that terminology be kept clear by using: A="thing", B="thing which hasn't been bound to a URI", C="Resource". |
|
action:
Roy T. Fielding,
27 Apr 2003,
draft 02:
I have rewritten the definitions. It would be pointless to attempt to further define words that can be found in any dictionary. Instead, I added more examples and chose words that are less likely to prick the sensibilities of those who use URIs only for denotation. Additional terminology will be addressed in issue 022-definitions. |
025-rel_segment | rel_segment is defined without distinguishing param |
---|---|
fixed 00 | segment |
report:
Martin Duerst,
10 Oct 2002,
URI-WG mailing list:
Looking through the URI syntax in detail, I became aware of the following 'anomaly': parameters are not allowed in the first segment of a relative URI (if it doesn't start with a slash). The relevant rules are:
  relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]
  net_path = "//" authority [ abs_path ]
  abs_path = "/" path_segments
  rel_path = rel_segment [ abs_path ]
  rel_segment = 1*( unreserved | escaped | ";" | "@" | "&" | "=" | "+" | "$" | "," )
  path_segments = segment *( "/" segment )
  segment = *pchar *( ";" param )
  param = *pchar
  pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
So in "abc;def/ghi;jkl", 'jkl' is a parameter, but 'def' isn't. On the other hand, in "/abc;def/ghi;jkl", both 'def' and 'jkl' are parameters. Is this an error in the syntax, or can somebody explain this? |
|
action:
Roy T. Fielding,
11 Oct 2002,
URI-WG mailing list:
No, but I agree that it is confusing. They are defined differently because rel_segment cannot be empty. Syntactically they are equivalent. I'll find a better way to write it. |
|
action:
Roy T. Fielding,
28 Oct 2002,
draft 00:
Fixed by removing the rule for param and simply stating why ";" and "=" are reserved within path segments. |
026-ABNF | replace existing BNF with standard ABNF of RFC 2234 |
---|---|
fixed 00 | bnf |
report:
Roy T. Fielding,
22 Oct 2002,
URI-WG mailing list:
It also looks like we'll have to switch to the formal ABNF of RFC 2234 in order to define IPv4 addresses correctly. At least that will make the IESG happier, but it sure is a pain in the editorial fingers. |
|
action:
Roy T. Fielding,
28 Oct 2002,
draft 00:
The ad-hoc BNF syntax has been replaced with the ABNF of RFC 2234. This change required all rule names that formerly included underscore characters to be renamed with a dash instead. Likewise, absoluteURI and relativeURI have been changed to absolute-URI and relative-URI, respectively, for consistency. |
027-ref-HTML | draft 00 contains an obsolete ref to RFC 1866 |
---|---|
fixed 01 | references |
report:
Dan Kohn,
09 Nov 2002,
URI-WG mailing list:
draft 00 contains an obsolete reference to RFC 1866, which was obsoleted by RFC 2854. This reference should be replaced with one to http://www.w3.org/TR/html401 |
|
action:
Roy T. Fielding,
28 Feb 2003,
issues list:
I replaced it with [HTML] Raggett, D., Le Hors, A. and Jacobs, I., "Hypertext Markup Language (HTML 4.01) Specification", December 1999. in draft 01. |
028-ref-rfc0952 | draft 00 normative reference to RFC 952 |
---|---|
fixed 01 | references |
report:
Dan Kohn,
09 Nov 2002,
URI-WG mailing list:
I question whether a normative reference to RFC 952, status unknown (http://www.normos.org/en/summaries/ietf/rfc/rfc952.html), is appropriate for dotted-decimal notation, versus a normative reference to RFC 791, or to section 2.1 of RFC 1123, which is already referenced. |
|
action:
Roy T. Fielding,
25 Feb 2003,
issues list:
I chose 952 because it was the only description of the notation, and is in fact referenced as such by 1123. In any case, you are right that it should be non-normative, as should the other related references because we define our own syntax rather than depend on those RFCs. I have fixed this in draft 01. |
029-decimal-IP | add security considerations for misleading use of decimal IP |
---|---|
added 01 | security |
report:
Dan Kohn,
09 Nov 2002,
URI-WG mailing list:
I would suggest adding a paragraph to the Security Considerations about how "malicious URLs" can be crafted combining misleading usernames/passwords with decimal IP addresses, such as http://www.microsoft.com@3492563303/ as described in http://www.counterpane.com/crypto-gram-0102.html#7 http://rr.sans.org/threats/semantic.php This is, of course, an attack on users and not on the URI specification, but it is possible because regular users don't understand the URI spec (and never will). |
|
action:
Roy T. Fielding,
25 Feb 2003,
issues list:
This is tied to the gethostbyname issue as well. |
|
action:
Roy T. Fielding,
01 Mar 2002,
draft 01:
Added to the Security Considerations for draft 01. |
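To make the misleading URL in the report concrete, the following sketch splits it into userinfo and host and converts the decimal host to dotted-decimal form. It assumes Python's standard `urllib.parse` and `ipaddress` modules, neither of which is mentioned in the issue; the variable names are illustrative only.

```python
import ipaddress
from urllib.parse import urlsplit

# The example URL from the report: everything before "@" is userinfo.
parts = urlsplit("http://www.microsoft.com@3492563303/")

userinfo = parts.username   # what a casual reader takes for the host
host = parts.hostname       # the actual host: a 32-bit decimal number

# Interpret the decimal number as an IPv4 address.
dotted = str(ipaddress.IPv4Address(int(host)))

print("userinfo:", userinfo)
print("host:    ", host, "=", dotted)
```

The text that looks like a familiar host name is only the userinfo component; the request would go to the numeric address.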
030-IPv6-bnf | draft 00 errors in IPv6 syntax |
---|---|
fixed 01 | IPv6 |
report:
Zefram,
22 Nov 2002,
private mail:
Finally, wherever the ABNF ends up, note that the ABNF given in rfc2396bis has several errors. In summary: dec-octet matches "12345"; dec-octet doesn't match "039" (convention does allow leading zeroes, up to three digits total); IPv6address matches "::123:"; IPv6address doesn't match "1:2:3:4:5:6::" or "1:2:3:4:5:6:7::"; IPv6address doesn't match "1:2:3:4:5::9.9.9.9". The revised ABNF that I give below corrects all of these errors, and I strongly believe it to be completely correct. (I also revised the layout, and having experimented with variants I think this is as neat as it can be subject to RFC line length limits.)
  IPv6address = 7(h4 ":") h4
              / "::" 6(h4 ":") h4
              / [ h4 ] "::" 5(h4 ":") h4
              / [ *1(h4 ":") h4 ] "::" 4(h4 ":") h4
              / [ *2(h4 ":") h4 ] "::" 3(h4 ":") h4
              / [ *3(h4 ":") h4 ] "::" 2(h4 ":") h4
              / [ *4(h4 ":") h4 ] "::" h4 ":" h4
              / [ *5(h4 ":") h4 ] "::" h4
              / [ *6(h4 ":") h4 ] "::"
              / 6(h4 ":") IPv4address
              / "::" 5(h4 ":") IPv4address
              / [ h4 ] "::" 4(h4 ":") IPv4address
              / [ *1(h4 ":") h4 ] "::" 3(h4 ":") IPv4address
              / [ *2(h4 ":") h4 ] "::" 2(h4 ":") IPv4address
              / [ *3(h4 ":") h4 ] "::" h4 ":" IPv4address
              / [ *4(h4 ":") h4 ] "::" IPv4address
  h4 = 1*4HEXDIG
  IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
  dec-octet = 1*2DIGIT ; 0-9, 00-99
            / ( "0" / "1" ) 2DIGIT ; 000-199
            / "2" %x30-34 DIGIT ; 200-249
            / "25" %x30-35 ; 250-255
(It's possible to considerably shorten the IPv6address rule by factoring out a production of ( h4 ":" h4 / IPv4address ), but I don't think it's any clearer, since we pedagogically distinguish IPv6 addresses with embedded IPv4 addresses from those that don't.) |
|
action:
Roy T. Fielding,
05 Dec 2002,
URI-WG mailing list:
How about this one:
  IPv6address = 6( h4 ":" ) ls32
              / "::" 5( h4 ":" ) ls32
              / [ h4 ] "::" 4( h4 ":" ) ls32
              / [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32
              / [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32
              / [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32
              / [ *4( h4 ":" ) h4 ] "::" ls32
              / [ *5( h4 ":" ) h4 ] "::" h4
              / [ *6( h4 ":" ) h4 ] "::"
  ls32 = ( h4 ":" h4 ) / IPv4address ; least-significant 32 bits of address
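The dec-octet rule proposed in the report can be transcribed into a regular expression to check the cases the report lists. This is a sketch only: the names `DEC_OCTET`, `IPV4`, `is_dec_octet`, and `is_ipv4` are hypothetical, and the pattern follows the ABNF as quoted in the report, not any later revision of the draft.

```python
import re

# dec-octet = 1*2DIGIT / ( "0" / "1" ) 2DIGIT / "2" %x30-34 DIGIT / "25" %x30-35
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01][0-9]{2}|[0-9]{1,2})"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4 = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def is_dec_octet(text):
    """True if text matches the dec-octet rule in full."""
    return re.fullmatch(DEC_OCTET, text) is not None


def is_ipv4(text):
    """True if text matches the IPv4address rule in full."""
    return IPV4.fullmatch(text) is not None


for sample in ["0", "039", "255", "256", "12345"]:
    print(sample, is_dec_octet(sample))
```

With this transcription "039" is accepted and "12345" is rejected, which are the two dec-octet corrections named in the report.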
031-query-def | query definition |
---|---|
fixed 02 | query |
report:
Hrvoje Simic,
13 Nov 2002,
URI-WG mailing list:
In section 3.4, RFC 2396 says: "The query component is a string of information to be interpreted by the resource." If the resource is identified before the query component is interpreted, why is the query a part of the identifier? [1] I believe the RFC 2396 revision should redefine the query component of the URI. I found that Jim Whitehead had the same complaint on the definition four years ago: [[ This implies to me that if it is to be interpreted by the resource, it cannot also be identifying that resource. My rationale is the resource needs to be identified first, before the query component can be passed to it for interpretation, hence the query component cannot be part of the resource identifier. ]] [2] Larry Masinter replied: [[ I can see now how you'd come to that conclusion; it does sound that way. But I'll claim that we didn't MEAN IT. ]] [3] More recent posts by Mark Nottingham: [[ mailto allows you to specify a subject, body, etc. in the query component, which is defined by 2396 as: "...a string of information to be interpreted by the resource." Considering other uses of queries, this seems to fit in nicely. ]] [4] [[ This touches on something that's been on my mind for a while. If a query is "a string of information to be interpreted by the resource," isn't it the case that a URI with a query refers to a resource, rather than just identifies one? E.g., <http://www.example.com/foo?bar=baz> is a reference to the resource <http://www.example.com/foo>. I.e., shouldn't the definition of URI-Reference (rather than URI) include not only fragments, but also queries? ]] [5] Reply by Martin Duerst: [[ Definitions are often chosen on their practical value, rather than on philosophical considerations. In this case, the URI is what you (e.g.) send to the server, the URI Reference is what you (e.g.) put into an attribute. ]] [6] My ideas on redefinition: query should be "identifying the resource within the scope of that scheme and authority" just as the path is. 
The difference between the components may be in ordering: while the path segments must be in strict order (defining the path through a hierarchy), query segments may be in arbitrary order, like "parameters" or "switches". Information in query segments may also be optional and generally more detailed than the path segments [1]. As for the troubling "mailto query", no such thing exists. The "mailto" scheme doesn't comply with the "generic URI" syntax from the section 3 of the RFC 2396. The defining document, RFC 2368, in section 2 defines "headers" with similar syntax but unrelated to RFC 2396 "query". Hrvoje Simic FER, University of Zagreb, Croatia mailto:hrvoje.simic@fer.hr mailto:hrvoje.simic@zg.hinet.hr [1] http://www.tel.fer.hr/users/hsimic/cuc2002 [2] http://lists.w3.org/Archives/Public/w3c-dist-auth/1998OctDec/0180.html [3] http://lists.w3.org/Archives/Public/w3c-dist-auth/1998OctDec/0201.html [4] http://lists.w3.org/Archives/Public/uri/2002Apr/0010.html [5] http://lists.w3.org/Archives/Public/uri/2002Apr/0011.html [6] http://lists.w3.org/Archives/Public/uri/2002Apr/0014.html |
|
report:
Mark Nottingham,
13 Nov 2002,
URI-WG mailing list:
Those feel like guidelines more than hard semantics; IIRC, the main distinction between URI path segments and URI parameters is that parameters aren't ordered, so that aspect doesn't distinguish queries. Perhaps what does distinguish queries is that while they are used in identifying the resource, they aren't used directly in locating/dereferencing it; just as fragment identifier semantics are interpreted on the client side in the scope of the resource's representation, so queries are interpreted on the server side in the scope of the located resource (which may be a new concept). |
|
report:
Graham Klyne,
13 Nov 2002,
URI-WG mailing list:
How they are interpreted is entirely up to the software that provides access to resources for the indicated authority. |
|
report:
Hrvoje Simic,
14 Nov 2002,
URI-WG mailing list:
1) Should the query component be redefined, and how? Yes, but it's hard to think up a good definition. In the "classic" Web, it was the parameters you passed to the program found in a file on a computer using a protocol. Now these concepts of protocol, computer, file path and parameters are much more abstract. Should it be "http://about.example.org" or "http://example.org/about"? "/messages/1-10" or "/messages?from=1&to=10"? Are there any "hard semantic" reasons for preferring one solution over the other, or just guidelines? Evolution of URI towards an abstract identifier blurred the differences between its components. Path is effectively defined for URIs "hierarchical in nature", which sounds like a guideline. Query may be left opaque and abstract, something like: "URI component of arbitrary syntax left for server-specific purposes". Or we may crack it open and come to the next issue: 2) Should the definition include details about the query structure (like it did for the path)? I see that almost every message in this thread mentions query structure. But RFC 2396 and RFC 2616 (defining http-URI) don't include such details. My name for the parts of the query (separated with ampersands or semicolons) is "query segments" - just to make query sound more like the path. I agree that the query should preserve the order of its segments. The order may matter to the specific server. Anyway, the segments must be listed in _some_ order, and I see no advantage in allowing the network to shuffle them. What I really meant was: path segments must be parsed in the fixed order, from left to right. If you have "a/b/c" you parse "a" to identify the branch in the next level of hierarchy and you hand over "b/c" to it. But if you have "?a;b;c" you can look for a "b" and then continue to parse the "?a;c". This allows clients to communicate information about resource's identity that isn't naturally placed in the hierarchy, i.e. 
that doesn't fit nicely in a sequence of steps through the hierarchy. [1] http://www.w3.org/TR/html401/appendix/notes.html#h-B.2.2 [2] http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4 |
|
report:
Mark Nottingham,
14 Nov 2002,
URI-WG mailing list:
'Semantics' isn't the correct term to use; Graham pointed out that this implies too much. His suggestion was 'processing model', and that seems to capture it very well; When used as a locator, a URI has a processing model that is used (usually to retrieve a representation of the resource). Each URI scheme defines its own processing model that enables location of resources of that type. The question, then, is whether these (not resource-, site- or non-location) processing models are exclusive to the agent that is doing the location. If the query is considered as data to be consumed by the resource, this means that the processing model is effectively distributed; the resource consumes part of the URI as well (which indeed seems to be the case today for most uses of query). Remember that 'Resource' is an abstract concept; it does not have to have a one to one mapping to code on the back end. Therefore, I see no problem with saying that query is data to be consumed by the resource during the process of location; if the resource happens to be spread across several back-end facilities on the server, so be it. To summarize, then, the idea is that: * Query is part of the URI for the purposes of identification; every URI with a different query string is a different identifier (just as it is now). * Query is data to be consumed by the resource during the process of location (just as 2396 says). * It is worthwhile to distinguish between URIs and URLs, not because they identify different things, but because these terms can be used to distinguish between different contexts of use - identification vs. location. 
E.g., Given: http://www.example.com/foo?bar When used as an identifier (URI), the resource is http://www.example.com/foo?bar While, in the context of a locator (URL), the resource located is considered http://www.example.com/foo I realise that this is a largely theoretical problem, in that it doesn't affect how anything actually works; however, it may affect how people think, which is just as, if not more, important. |
|
action:
Roy T. Fielding,
15 May 2003,
draft 02:
The definition of query has been rewritten for draft 02: The query component contains non-hierarchical data that, along with data in the path component, serves to identify a resource within the scope of that URI's scheme and naming authority (if any). The query component is indicated by the first question-mark ("?") character and terminated by a crosshatch ("#") character or by the end of the URI. query = *( pchar / "/" / "?" ) The characters slash ("/") and question-mark ("?") are allowed to represent data within the query component, but such use is discouraged; incorrect implementations of relative URI resolution often fail to distinguish them from hierarchical separators, thus resulting in non-interoperable results while parsing relative references. However, since query components are often used to carry identifying information in the form of "key=value" pairs, and one frequently used value is a reference to another URI, it is sometimes better for usability to include those characters unescaped. |
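The delimiting rule in the rewritten definition (first "?" starts the query, "#" or the end of the URI terminates it) can be sketched in a few lines. The function name `query_of` is hypothetical, and the sketch does not validate the characters against the `query` rule; it only locates the component.

```python
def query_of(uri):
    """Return the query component of uri, or None if there is no "?"."""
    # "#" terminates the query, so set the fragment aside first.
    before_fragment, _, _ = uri.partition("#")
    # The query starts after the first "?"; later "?" and "/" are data.
    _, separator, query = before_fragment.partition("?")
    return query if separator else None


print(query_of("http://www.example.com/foo?bar=baz"))
print(query_of("http://example.org/messages?from=1&to=10#top"))
print(query_of("http://example.org/?ref=http://a/b?c"))
print(query_of("http://example.org/about"))
```

The third call shows the case the definition warns about: an unescaped URI carried as a value leaves "/" and "?" inside the query, and a parser must not treat them as hierarchical separators.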
032-component-examples | add more examples for generic syntax components |
---|---|
pending | examples |
report:
Tim Bray,
21 Feb 2003,
URI-WG mailing list:
Section 3 is awfully short of examples. I would think the usefulness would be improved by including at least one example for each of 3.1, 3.2.1, 3.2.2, 3.3, and 3.4. If others agree, I would volunteer to cook up the examples. |
|
action:
Roy T. Fielding,
25 Feb 2003,
URI-WG mailing list:
Sure. It would be best if they covered the range of variance, so that the folks who try to implement according to examples (and not BNF) are not led too far astray. [I have seen plenty of cases where implementers of RFC 2616 looked at the examples and implemented only the cases described, ignoring the actual syntax specification.] |
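As one possible starting point for component examples that cover a range of variance, the sketch below applies the regular expression from RFC 2396, Appendix B to several references. The helper name `split_components` and the choice of sample references are assumptions made for illustration; they are not the examples proposed for the draft.

```python
import re

# Regular expression given in RFC 2396, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_components(reference):
    """Split a URI reference into its five generic components.

    A component that is absent is reported as None; the path may be "".
    """
    match = URI_RE.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


samples = [
    "http://www.ics.uci.edu/pub/ietf/uri/#Related",
    "http://www.example.com/foo?bar=baz",
    "mailto:a@example.com",
    "//example.com",
    "?y",
]
for sample in samples:
    print(sample, split_components(sample))
```

Including references with missing components (no authority, no path, query only) is what keeps an implementation written from examples from silently assuming that every part is present.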
033-dot-segments | relativising an absolute reference should be invertible |
---|---|
fixed 02 | relative |
report:
Tim Berners-Lee,
18 Nov 2002,
TAG meeting:
RFC2396 doesn't say that xxx/./yyy is equivalent to xxx/yyy for any xxx and yyy. However, the only tenable situation is that they are equivalent, because we require that any URI can be relative-ized and absolute-ized back to its original. That is an (unspoken) axiom. When you relative-ize things and re-absolutize them, you cannot distinguish between the two, and so they HAVE to be equivalent. The URI spec should say that. We need to write down the axioms: if you take a URI, make it relative w.r.t. a base URI, then make it absolute w.r.t. the same base URI, you get the same starting URI... http://www.w3.org/2002/11/18-tag-summary |
|
action:
Roy T. Fielding,
17 Apr 2003,
issues list:
I would rephrase that as: When you relativize an absolute URI (A) using base (B) producing the relative reference (R), and then re-absolutize R using the same base B to produce an absolute reference A', then A' must equal A. |
|
report:
Tim Berners-Lee,
23 Jan 2003,
URI-WG mailing list:
The spec would do well to define the function from base and reference to URI and back again rel(u, base) and abs(u, base) and to point out that you can use abs(rel(u, base), base) for u in all circumstances. |
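The round-trip property abs(rel(u, base), base) = u can be exercised with a small sketch. Here `abs_ref` wraps Python's standard `urllib.parse.urljoin`, while `rel` is a hypothetical relativizer written for this illustration only; there is no standard one, and this version falls back to returning u unchanged whenever its candidate does not resolve back to u.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def abs_ref(reference, base):
    """abs(r, base): resolve a reference against a base URI."""
    return urljoin(base, reference)


def rel(u, base):
    """rel(u, base): hypothetical relativizer with an absolute fallback."""
    us, bs = urlsplit(u), urlsplit(base)
    if (us.scheme, us.netloc) != (bs.scheme, bs.netloc):
        return u  # different scheme or authority: keep the absolute form
    if not (us.path.startswith("/") and bs.path.startswith("/")):
        return u
    u_segments = us.path.split("/")[1:]
    base_dirs = bs.path.split("/")[1:-1]  # base path minus its last segment
    common = 0
    while (common < len(base_dirs) and common < len(u_segments) - 1
           and base_dirs[common] == u_segments[common]):
        common += 1
    climbs = [".."] * (len(base_dirs) - common)
    path = "/".join(climbs + u_segments[common:]) or "./"
    candidate = urlunsplit(("", "", path, us.query, us.fragment))
    # Keep the candidate only if resolution really inverts it.
    return candidate if abs_ref(candidate, base) == u else u


base = "http://example.com/a/b/d"
for u in ["http://example.com/a/c", "http://example.com/a/b/e?x=1#f",
          "http://example.com/", "http://other.example/a/c"]:
    r = rel(u, base)
    print(u, "->", r, "->", abs_ref(r, base))
```

The fallback is a deliberate design choice: a URI containing "." or ".." segments cannot be produced by resolving a relative reference, so the sketch keeps such a URI in absolute form instead of claiming a relative equivalent.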
|
report:
Tim Bray,
25 Feb 2003,
URI-WG mailing list:
If I am given the URI http://example.com/a/./b/../c I will always, 100% of the time, regard that as http://example.com/a/c. I have just verified that the first two randomly-picked web browsers I picked in fact do this. So the assertion that this only applies to the relative form is, I assert, simply wrong and should be removed. I think you need to look more closely at what the browsers are doing. They send the /../ and /./ stuff to the server, whereupon an httpd will respond with a redirect to the correct URI. Nope. Peering deep into my high-powered research lab... I created a test file as follows: foo <a href="http://example.com/a/./b/../c">foo</a> bar I open it, put my mouse over the blue underlined "foo" and observe what appears in the status-bar of the browser. Under OS X, in each of IE, Mozilla, and Safari, the status bar shows http://example.com/a/c and I'm pretty sure it doesn't call out to the server to check. So I stand by my claim that deployed software normalizes /./ and /../ regardless of whether it's relative or absolute. |
|
action:
Larry Masinter,
25 Feb 2003,
URI-WG mailing list:
Whether "a/./b/../c" in a path component is equivalent to "a/c" is entirely dependent on the definition of the URI scheme. Some schemes may define the two as equivalent, others may not. The current definition of the 'http' URI scheme (in RFC 2616) does not specify this equivalence, although apparently popular browsers will turn http://example.dom/a/./b/../c into http://example.dom/a/c before sending. Do you think it should apply to all URI schemes that use the "generic syntax"? "rtsp:"? "ldap:"? What about schemes that use something like the "generic syntax" but make modifications? Note that mailto:a/./b/../@test.com sends a message to a/./b/../@test.com, i.e., it doesn't process them. I'm having trouble telling what happens without a protocol trace with ftp://ftp.ietf.org/ietf/../ietf/00dec/, or with ldap:. But I think it is a good idea to resist the tendency to jump from examination of the behavior of http URIs to assert properties of all URIs. |
|
action:
Roy T. Fielding,
25 Feb 2003,
issues list:
I still get those segments in httpd access log files, but all we need are two independent implementations to justify a change. I think it is safe to remove them based on the theory that "/" is reserved for the hierarchical syntax. I can't think of a real mailto example that would break, since even distinguished-name-based addresses are not going to have ".." or "." as a DN. |
|
action:
Roy T. Fielding,
23 May 2003,
draft 02:
Defined "." and ".." path segments as being applicable to all URIs; they should be removed by resolvers and normalizers. Clearly defined that a path segment including a colon cannot be used as the first segment in a relative-path reference. The relative resolution process is invertible, though I have not included a single process for doing so because there is no agreed upon standard for converting absolute references to a relative form. |
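As an illustration of what a normalizer does with these segments, here is a minimal sketch. It assumes an absolute path (one starting with "/"); the function name `remove_dot_segments` is chosen for illustration and the code is not taken from the draft.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from an absolute path (starting with "/")."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg == ".":
            # "." names the current level; a trailing "." leaves a trailing slash.
            if is_last:
                out.append("")
        elif seg == "..":
            # ".." removes the previous segment, but never the leading root marker.
            if len(out) > 1:
                out.pop()
            if is_last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


print(remove_dot_segments("/a/./b/../c"))  # the path from Tim Bray's report
```

Applied to the path in Tim Bray's test, this yields /a/c, matching what the browsers displayed.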
034-identifier | identifier is not just a sequence of characters |
---|---|
closed | terminology |
report:
John A. Kunze,
23 Jan 2003,
URI-WG mailing list:
If changes to basic terminology are up, consider how damaging it is that this seminal web spec defines an identifier as "a sequence of characters". It's then impossible to talk sensibly about a broken identifier (hmmm, are we talking about missing or damaged characters?). It's the reference role that breaks. Much better to be explicit: An identifier is an association between a string (a sequence of characters) and an information resource. In full generality, that association is made manifest by a "record" (eg, a cataloging or other metadata record) that binds the identifier string to a set of identifying resource characteristics. For the average URL, that record's existence is implied if the URL string, when submitted to a web server, returns some document that is a webmaster's attempt to realize the correct binding. An especially nice result of this definition is that it permits people to more quickly conclude that there's no reason why a URL can't be just as persistent as any other identifier (if not more so). It's all about the service behind it. But that's a case to be made elsewhere. The URI spec would do electronic permanence a favor if it included this one definitional change. |
|
action:
Roy T. Fielding,
20 Mar 2003,
URI BOF:
The definition actually says "An identifier is an object that can act as a reference to something that has identity. In the case of a URI, the object is a sequence of characters with a restricted syntax." and thus the suggestion seems to be based on something other than RFC 2396. |
|
action:
Roy T. Fielding,
27 Apr 2003,
draft 02:
The definition has been rewritten in response to issue 024-identity. |
035-scheme-escaping | %HH escaping should not be scheme-dependent |
---|---|
fixed 01 | characters |
report:
Martin Duerst,
30 Jan 2003,
URI-WG mailing list:
Doing careful readings of RFC 2396 for various purposes, I found the following paragraph in "2.1 URI and non-ASCII characters": A URI scheme may define a mapping from URI characters to octets; whether this is done depends on the scheme. Commonly, within a delimited component of a URI, a sequence of characters may be used to represent a sequence of octets. For example, the character "a" represents the octet 97 (decimal), while the character sequence "%", "0", "a" represents the octet 10 (decimal). This seems to indicate that a scheme is free to define whether it wants to use %0a for the octet 10 (decimal) or not, and whether it indeed wants to define a mapping from URI characters to octets. As far as I understand, %hh is always usable, and I don't know about any schemes that define explicitly that this can be used. It may have been that this paragraph was written to take into account schemes such as data:, where an additional mechanism for encoding octets (base64) is used. My understanding is that even in a data: URI, I should still be able to replace "A" by "%41", and it should still resolve to the same data. |
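The character-to-octet mapping quoted above can be observed with Python's standard `urllib.parse` module. This is only an illustration of the %HH convention as one library implements it, not a statement about any particular scheme.

```python
from urllib.parse import unquote, unquote_to_bytes

# An unescaped character stands for its US-ASCII octet.
assert unquote_to_bytes("a") == bytes([97])

# "%", "0", "a" stands for the octet 10 (decimal), in either hex case.
assert unquote_to_bytes("%0a") == bytes([10])
assert unquote_to_bytes("%0A") == bytes([10])

# "%41" and "A" decode to the same data.
assert unquote("%41") == "A"
```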
|
action:
Roy T. Fielding,
02 Mar 2003,
draft 01:
I removed the misleading first sentence and replaced it with a later example of a scheme defined as requiring UTF-8. |
036-host-escaping | %HH escaping should be allowed on hostname |
---|---|
closed | characters |
report:
Martin Duerst,
22 Jul 2002,
URI-WG mailing list:
Update the syntax of host names: Currently, this is one of the only places where %hh-escaping isn't allowed. Implementations are mixed, some browsers e.g. accept http://www.w%33.org while others don't. So this may go under "(b) document variations in current practice, as warnings to implementors." below. With Internationalized Domain Names, allowing %hh in host names is necessary for consistency. The actual text is currently in http://www.ietf.org/internet-drafts/draft-ietf-idn-uri-02.txt, and there is some chance that the IDN WG moves this forward. But in either way, it should be folded into the URI spec. |
|
action:
Roy T. Fielding,
26 Feb 2003,
issues list:
I do not think it is appropriate for the URI spec to suggest that users give hostnames in forms that are unacceptable to DNS. This is better solved by using IDNA encoding without changing the URI syntax. |
|
action:
Roy T. Fielding,
20 Mar 2003,
URI BOF:
Further discussion with Martin revealed that the reason IRI processors could not convert to IDNA forms automatically was because of the potential for a reg-name syntax. It was decided at the URI BOF that the best solution would be to remove reg-name, since nobody has used it anyway, thus clearing the way for IRI conversion to take place prior to URI handling. |
037-uri-comparison | define how to compare URIs |
---|---|
added 01 | characters |
report:
Tim Bray,
21 Feb 2003,
URI-WG mailing list:
In connection with the work of the W3C TAG, I undertook the task of documenting in-the-field practices as to how software can and should go about the very common task of comparing URIs. The latest draft of this, which I think represents TAG consensus, is at http://www.textuality.com/tag/uri-comp-4.html |
|
action:
Roy T. Fielding,
02 Mar 2003,
draft 01:
I have added most of the URI comparison document to section 6, with appropriate rewrites where necessary. I also modified the descriptions of escaping and unreserved to be (hopefully) clearer. |
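A simple syntax-based comparison might look like the sketch below. It is illustrative only and is not the algorithm from the TAG document or from section 6: the helper names `normalize` and `equivalent` are hypothetical, and the unreserved set used (letters, digits, "-", ".", "_", "~") is an assumption borrowed from the later RFC 3986.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

# Assumed unreserved set (RFC 3986); escapes of these characters are decoded.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _fix_escape(match: re.Match) -> str:
    char = chr(int(match.group(1), 16))
    # Decode escaped unreserved characters; upper-case the hex digits otherwise.
    return char if char in UNRESERVED else "%" + match.group(1).upper()


def _fix(component: str) -> str:
    return re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, component)


def normalize(uri: str) -> str:
    parts = urlsplit(uri)
    # Only the host is case-insensitive; any userinfo before "@" is left alone.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit((parts.scheme.lower(), netloc, _fix(parts.path),
                       _fix(parts.query), _fix(parts.fragment)))


def equivalent(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


assert equivalent("HTTP://Example.COM/%7euser", "http://example.com/~user")
assert not equivalent("http://example.com/a%2Fb", "http://example.com/a/b")
```

The second assertion shows why an escaped reserved character such as %2F is left escaped: decoding it would change how the path is split into segments.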
038-qualified | qualified production in hostname is ambiguous |
---|---|
fixed 02 | hostname |
report:
Graham Klyne,
02 Feb 2003,
URI-WG mailing list:
Ref: [[
hostname = domainlabel [ qualified ]
qualified = *( "." domainlabel ) [ "." toplabel [ "." ] ]
domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ]
toplabel = alpha [ 0*61( alphanum / "-" ) alphanum ]
alphanum = ALPHA / DIGIT
]]
I think the syntax production 'qualified' is ambiguous (i.e. permits more than one parse tree for some valid values). Consider .abc.def: is this "." <domainlabel> "." <toplabel> or "." <domainlabel> "." <domainlabel>? I think the production could be written thus:
qualified = *( "." domainlabel ) [ "." toplabel "." ] |
|
report:
Clive D.W. Feather,
02 Mar 2002,
URI-WG mailing list:
Is this the only place "qualified" is used? If so, then there's a further ambiguity - if a hostname consists only of a single domainlabel, is it followed by a zero-length qualified or not. I would suggest that the correct resolution is either: hostname = domainlabel [ qualified ] qualified = *( "." domainlabel ) "." toplabel [ "." ] if you want to forbid hostnames like "abc.123", or: hostname = domainlabel [ qualified ] qualified = 1*( "." domainlabel ) [ "." ] or hostname = domainlabel [ qualified ] [ "." ] qualified = 1*( "." domainlabel ) (these are not equivalent) if you want to allow them. |
|
action:
Roy T. Fielding,
03 Mar 2003,
issues list:
Fixed in draft 01. |
|
report:
Graham Klyne,
05 Mar 2002,
URI-WG mailing list:
I have to say that the 'hostname' syntax as specified in RFC2396bis is a pain to parse accurately. I think it's sufficiently difficult to get exactly right that it won't be correctly implemented as specified in many applications, which leaves me wondering if it really should be so fussily correct with respect to domain name usage. (The reason I'm noticing this is that I've been using the URI parsing task to experiment with some programming tools and techniques that offer a more direct correspondence between specification and the source code. If I were doing this as part of a real application, I would long ago have ignored the detailed syntax and done something very similar but much easier to implement.) The problem is in the production for 'qualified'. To determine whether an incoming ".abc" is a 'domainlabel' or a 'toplabel' requires a significant lookahead, to the following '.' (if present) and the character following that. To determine if an incoming ".123" is valid can require an arbitrarily long lookahead (e.g. http://0.123.4.5.6.7.8.9.10.11.12.13.x/). I think parsing precisely according to the syntax would be greatly simplified if the syntax were relaxed so that: qualified = *( "." domainlabel ) [ "." ] i.e. drop the syntactic prohibition of URIs like this: http://www.example.123./foo I appreciate this is not strictly correct, but I see no practical harm from defining the syntax in this way and asserting the form of the final domain label as an extra-syntactic constraint. A (limited) few tests with my browser suggest that it does not syntactically prohibit numeric top-level domain labels, but simply reports that the domain cannot be found. ... If you really want to keep the syntactic constraint in place, I suggest an alternative approach: hostname = qualified qualified = numericlabel "." qualified / toplabel [ "." [qualified] ] numericlabel = DIGIT [ 0*61( alphanum / "-" ) alphanum ...
I think there's a typo in the syntax production for 'toplabel': s/alpha/ALPHA/ ? |
|
action:
Roy T. Fielding,
18 Mar 2003,
issues list:
Reopened. It would be best to have a syntax that was both unambiguous and easy for LALR parsers to process, but that may require too many changes. |
|
action:
Roy T. Fielding,
14 May 2003,
draft 02:
Fixed as suggested: qualified = *( "." domainlabel ) [ "." ] with additional text added for disambiguation of host. |
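With the relaxed production, a hostname can be matched without lookahead. The sketch below assumes hostname = domainlabel [ qualified ] together with the relaxed qualified rule; the names `is_hostname` and `has_alpha_toplabel` are hypothetical, and the second function stands in for the extra-syntactic check on the final label that Graham Klyne describes.

```python
import re

# domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ]
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
# hostname = domainlabel *( "." domainlabel ) [ "." ]
HOSTNAME = re.compile(rf"{LABEL}(?:\.{LABEL})*\.?")


def is_hostname(s: str) -> bool:
    """Purely syntactic check against the relaxed grammar."""
    return HOSTNAME.fullmatch(s) is not None


def has_alpha_toplabel(s: str) -> bool:
    """Extra-syntactic constraint: the final label starts with a letter.

    Intended to be called only on strings that already pass is_hostname.
    """
    last = s.rstrip(".").split(".")[-1]
    return last[:1].isalpha()


assert is_hostname("www.example.123.")             # syntactically allowed now
assert not has_alpha_toplabel("www.example.123.")  # rejected by the side check
assert is_hostname("0.123.4.5.6.7.8.9.10.11.12.13.x")
```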
039-LALR-BNF | BNF should be more LALR-parser friendly |
---|---|
fixed 02 | bnf |
report:
Graham Klyne,
27 Feb 2002,
URI-WG mailing list:
I'm finding there are a number of other areas in which the grammar is ambiguous. Unfortunately, using a "greedy" parse approach doesn't always work, since some parts of the grammar fail if the previous sections are greedy-matched; e.g. domainlabel; qualified. (These are pretty picky points, which I'm noticing because I've tried to build a functional-language parser directly from the grammar as given. I've got all test cases parsing OK now, but I've had to add in a few messy patches to get the kind of behaviour I'd expect from a "normal" parser.) |
|
action:
Roy T. Fielding,
10 Apr 2003,
issues list:
It would be best to have a syntax that was both unambiguous and easy for LALR parsers to process, but that may require too many changes. |
|
report:
Rob Cameron,
05 May 2003,
URI-WG mailing list:
The production rule for path is a bit problematic. path = [ abs-path / opaque-part ]
- it is not used in the grammar
- presumably, it is meant to say that whatever is parsed as either abs-path or opaque-part is interpreted as a "path".
- the production does not include rel-path, but rel-path needs to be processed as a path for the algorithms in 5.2 |
|
action:
Roy T. Fielding,
23 May 2003,
draft 02:
The ABNF for URI and URI-reference has been redesigned to make them more friendly to LALR parsers and significantly reduce complexity. As a result, the layout form of syntax description has been removed, along with the uric-no-slash, opaque-part, and rel-segment productions. All references to "opaque" URIs have been replaced with a better description of how the path component may be opaque to hierarchy. The fragment identifier has been moved back into the section on generic syntax components and within the URI and relative-URI productions, though it remains excluded from absolute-URI. The ambiguity regarding the parsing of URI-reference as a URI or a relative-URI with a colon in the first segment is now explained and disambiguated in the section defining relative-URI. |
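The colon ambiguity can be seen with any generic parser. The example below uses Python's `urllib.parse` purely as an illustration of the behaviour described, with made-up example.com references: a first segment containing a colon is read as a scheme, while prefixing "./" keeps the reference relative.

```python
from urllib.parse import urljoin, urlsplit

base = "http://example.com/x/y"

# "a:b/c" parses as a URI with scheme "a", so it is not resolved against base.
assert urlsplit("a:b/c").scheme == "a"
assert urljoin(base, "a:b/c") == "a:b/c"

# "./a:b/c" has no scheme and resolves as a relative-path reference.
assert urlsplit("./a:b/c").scheme == ""
assert urljoin(base, "./a:b/c") == "http://example.com/x/a:b/c"
```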
040-reg-name | Remove registry-based name syntax from authority |
---|---|
fixed 02 | authority |
report:
Martin Duerst,
20 Mar 2003,
URI BOF:
In order for internationalized characters in the authority component to be handled directly by an IRI processor, it must either a) be able to encode the authority characters as %hh and rely on gethostbyname to do the conversion, or b) know that the scheme uses hostport and not registry-based names and thus be able to convert the hostname to IDNA form. |
|
action:
Roy T. Fielding,
20 Mar 2003,
URI BOF:
Note that IDNA was created specifically to avoid (a), so that doesn't seem to be a viable alternative for the IETF. The reg-name production will be removed, since we do not have enough independent implementations of that syntax to justify its existence. That will allow IRI processors to be scheme-independent and simply convert to IDNA based on the presence of a hostname. |
|
action:
Roy T. Fielding,
27 Apr 2003,
draft 02:
Registry-based naming authorities that use the hierarchical authority syntax component are now limited to DNS hostnames, since those have been the only such URIs in deployment. This change was necessary to enable internationalized domain names to be processed in their native character encodings at the application layers above URI processing. The reg_name, server, and hostport productions have been removed to simplify parsing of the URI syntax. |
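To illustrate conversion of a hostname to IDNA form above the URI layer, the sketch below uses Python's built-in "idna" codec, which implements the IDNA 2003 rules (RFC 3490). The helper name `to_idna_host` and the hostname are illustrative assumptions.

```python
def to_idna_host(host: str) -> str:
    """Convert an internationalized hostname to its ASCII (IDNA) form."""
    return host.encode("idna").decode("ascii")


ascii_host = to_idna_host("bücher.example")
print(ascii_host)  # this ASCII form is what would appear in the URI

# The conversion is reversible for display purposes.
assert ascii_host.encode("ascii").decode("idna") == "bücher.example"
```

Because the result contains only letters, digits, hyphens, and dots, it fits the hostname syntax without needing %HH escapes.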
041-encoding | Section 2 on encoding causes too much confusion |
---|---|
fixed 02 | characters |
report:
Stefan Eissing,
31 Jan 2003,
URI-WG mailing list:
It is context dependent if '%61' can be considered equivalent to the character 'a' or not. The argument basically is that RFC 2396 allows other character encodings than US-ASCII and that '%61' could denote basically any character unless the character encoding becomes known. I argue that any 7 bit octet, escape-encoded in a URI, MUST be equivalent (apart from reserved characters like %2f) to its US-ASCII character. In my opinion, RFC 2396 already defines this: In RFC 2396, Ch. 2.1 "In the simplest case, the original character sequence contains only characters that are defined in US-ASCII, and the two levels of mapping are simple and easily invertible: each 'original character' is represented as the octet for the US-ASCII code for it, which is, in turn, represented as either the US-ASCII character, or else the "%" escape sequence for that octet." In RFC 2396, Ch. 2.4.2: "For example, "%7e" is sometimes used instead of "~" in an http URL path, but the two are equivalent for an http URL." According to this, my argument should be valid at least for HTTP URIs. |
|
report:
Martin Duerst,
22 Feb 2003,
URI-WG mailing list:
The characters in a URI (the ones that are compared character-by-character in namespaces) are just that, characters. URIs are defined independent of any particular representation. The URI spec says that /dir/a and /dir/%61 are equivalent, independent of the representation. They are equivalent if they appear in ASCII. They are equivalent if they appear on paper, on the side of a bus, and so on. They are equivalent when spoken over the radio. And they are equivalent when encoded as UTF-16 (as your Java example shows) or in EBCDIC. RFC 2396 gives three levels, condensed in the following line:
URI character sequence->octet sequence->original character sequence
In practice, there are two more layers, one on each side. We then get:
a) substrate: paper, metal, audio waves, ascii, UTF-16, EBCDIC,... We don't want to limit that to a particular encoding.
   ^
   | conversion depending on substrate representation
   V
b) URI character sequence (just characters)
   ^
   | conversion defined by RFC 2396 (always US-ASCII!)
   V
c) octet sequence (just octets)
   ^
   | conversion currently scheme/server dependent, moving towards UTF-8
   V
d) original character sequence (file names on server, query strings,...)
   ^
   | conversion server-dependent
   V
e) original octet sequence (e.g. UTF-16 for a filename on WinNT, EBCDIC on an EBCDIC server, and so on)
Maybe this diagram should go into the new version of RFC 2396. |
|
report:
Misha Wolf,
22 Feb 2003,
URI-WG mailing list:
The one piece of terminology I have some trouble with, and which is already in RFC 2396, is the phrase "original character sequence". Presumably, the sequence is "original" in the sense that the entity managing the resource has used this character sequence (eg a file pathname) to identify it. If that is the case, then the problem I have is simply due to the, possibly selfish, perception that the characters I enter into the browser's address box are the "original" characters and that these are transformed in various ways before arriving at the entity managing the resource. The direction of the arrows in the RFC 2396 diagram strengthens this way of perceiving the flow. I wonder whether some word other than "original" would be clearer? |
|
report:
Martin Duerst,
24 Feb 2003,
URI-WG mailing list:
I repeat: if I'm on an EBCDIC computer, and the URI reads out as /dir/a, that is *different* from /dir/%61. Yes, this is egregiously broken and stupid, but it's within the bounds set by RFC2396. I agree that it may not be extremely clear. But I disagree that your interpretation is within the bounds of RFC 2396. For example, in "2. URI Characters and Escape Sequences", we have: Within a URI, characters are either used as delimiters, or to represent strings of data (octets) within the delimited portions. Octets are either represented directly by a character (using the US-ASCII character for that octet [ASCII]) or by an escape encoding. This representation is elaborated below. Now let's take your example, "/dir/a". Let's assume that's a directory name 'dir' and a file name 'a' on a computer that uses EBCDIC. We don't have to care about the '/' here, because this is a separator that is part of the URI syntax, independent of local usage (see e.g. MSWin). So now let's look at how the EBCDIC server exposes 'dir' and 'a'. It can either decide to expose them as EBCDIC (which makes server implementation easier) or to expose them as ASCII (which makes the URI more readable). If the server on the EBCDIC system decides to expose as EBCDIC, then this will give us the following octets: /<84><89><99>/<81> This then results in a URI of /%84%89%99/%81. There is no other choice, as we have in "2.4.1. Escaped Encoding" An escaped octet is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing the octet code. (Well, you could claim that instead of %84, it may also be %48, because the RFC doesn't say which order the digits go, but I hope you don't want to go there.) For an example that is a bit different, let's say '/d+r/a', we would get /<84><4E><99>/<81> in terms of octets, and then /%84N%99/%81 in the actual URI (because the RFC clearly says that the octet <4E> is encoded with US-ASCII, which results in an 'N').
We could also use /%84%4E%99/%81. The other alternative is to expose the resource as US-ASCII, i.e. have the conversion work being done on the server. In that case, we have /<64><69><72>/<61>, which trivially results in /dir/a. It could of course also result in /dir/%61, because %61 is the escape for octet <61>. Please remember that it says: Octets are either represented directly by a character (using the US-ASCII character for that octet [ASCII]) or by an escape encoding. This representation is elaborated below. So overall, the server can make the choice of how to expose a resource name as a series of octets. But it doesn't have a choice to expose the resource name as one octet if the octet is escaped, and as another octet if the octet is not escaped. |
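The EBCDIC case can be reproduced with Python's standard codecs. This sketch assumes code page 037 as the server's EBCDIC variant (the message does not name one); in that code page "+" is the octet 0x4E, which is the US-ASCII code for "N".

```python
from urllib.parse import quote, unquote_to_bytes

EBCDIC = "cp037"  # assumed EBCDIC variant for this illustration

# Server exposes the names as EBCDIC octets, then escapes them.
assert quote("dir".encode(EBCDIC)) == "%84%89%99"
assert quote("a".encode(EBCDIC)) == "%81"

# Octet 0x4E may appear unescaped as the US-ASCII character "N".
assert quote("d+r".encode(EBCDIC)) == "%84N%99"

# Escaped or not, the same octet comes back, and so does the same name.
assert unquote_to_bytes("%84N%99") == unquote_to_bytes("%84%4E%99")
assert unquote_to_bytes("%84N%99").decode(EBCDIC) == "d+r"
```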
|
report:
Tim Bray,
07 Mar 2003,
URI-WG mailing list:
In explaining matters of character encoding, section 2.1 envisions something sort of standing behind the URI, the phrase original character is used (occasionally in quotes), as well as "original character sequences" (not in quotes). So maybe there's a notion of an "original URI" hiding behind the URI? This is confusing because the "original URI" might differ from the actual URI because (a) it contains ASCII characters which are reserved, e.g. '/' or '%' (b) it contains non-ASCII characters (c) it contains non-character octets A question: what gets painted on the side of a bus? The URI or the "original" behind it? The answer is probably "The URI", except for case (b), when it might become an IRI with the native non-ASCII characters appearing on the side of the bus. (c) is kind of confusing and counter-intuitive, but is the only way I can explain the baffling language about mapping from characters to octets, and the phrase in 2.2 "The data for a URI component". If section 2 were redrafted as follows, all the ambiguity and hand-waving would be squashed like a bug. =============================================== 2. Characters and URIs [New title] A URI consists of a restricted set of characters, primarily chosen to aid transcribability and usability both in computer systems and in non-computer communications. Characters used conventionally as delimiters around a URI are excluded. The restricted set of characters consists of digits, letters, and a few graphic symbols chosen from those common to most of the character encodings and input facilities available to Internet users. uric = reserved / unreserved / escaped Within a URI, characters are either used as delimiters or to represent strings of data (octets) within the delimited portions. [Same as now except lose last 2 sentences.] 
2.1 Encoding of Characters In the general case, there are many mappings between characters as abstractions comprising the smallest atomic units of text and the octets used to store them in a computer. The US-ASCII standard specifies not only a set of characters but a particular mode of storage where each character's numeric value (in the range 0-127) is stored directly in a single octet. Note that many widely deployed systems for storing characters which include non-ASCII characters nonetheless store ASCII characters as specified by US-ASCII directly one per octet. This includes Shift-JIS, EUC, UTF-8, and ISO-8859 (all parts). This RFC does not mandate the use of any particular mapping between its character set and octets of computer storage. 2.2 The Characters in the URI Scheme The "scheme" part of a URI consists of a sequence of ASCII characters which represent nothing except themselves. 2.3 The Characters in Non-Scheme Parts of the URI The ASCII characters making up a component of a URI other than the scheme may represent an arbitrary sequence of octets. The definitions of URI schemes MUST specify the interpretation of the characters in the components of URIs of that scheme. There are some constraints on these interpretations: - The interpretation MUST conform to the productions in this RFC, i.e. cannot rely on using a character which is forbidden to appear in the component. - The interpretation must be consistent: two instances of a URI component which are equal in length and made of pairwise-identical ASCII characters MUST represent the same octets. - The character "%" MUST always be followed by two hexadecimal values encoding the numeric value of a single octet. The hexadecimal digits 'A' through 'F' are used identically to the digits 'a' through 'f', so that two URI components which differ only in the case of hexadecimal digits used in %-encoded octets may safely be considered identical. 
2.4 Textual URIs Many schemes may wish to constrain the components of URIs to encode textual data, consisting only of characters from Unicode (ISO10646). This section describes a procedure for encoding textual data for use in URIs. Schemes which describe textual URIs SHOULD use the procedure described in this section to generate URI components from textual data. - ASCII characters which may legally appear in the component MUST appear directly as themselves, i.e. 'a' may not be encoded as %61. - ASCII characters which may not legally appear in the component MUST be %-encoded using the numeric value specified by the US-ASCII standard, using the upper-case hexadecimal digits 'A' - 'F'. i.e. '<' must always be encoded as %3C. - Non-ASCII characters MUST be converted to a sequence of octets as specified by UTF-8, with each octet then %-encoded. I.e. Ç (U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA) must always be encoded as %C3%87. =============================== I think most of the rest of section 2 is pretty well OK. |
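The three rules of the proposed "Textual URIs" procedure happen to match the default behaviour of Python's `urllib.parse.quote`, which is used here only as an illustration; the proposal itself does not reference any implementation.

```python
from urllib.parse import quote, unquote

# Allowed ASCII characters appear as themselves.
assert quote("a") == "a"

# Disallowed ASCII characters are %-encoded with upper-case hex digits.
assert quote("<") == "%3C"

# Non-ASCII characters are converted to UTF-8 octets, each %-encoded.
assert "\u00c7".encode("utf-8") == b"\xc3\x87"
assert quote("\u00c7") == "%C3%87"

# Decoding with UTF-8 recovers the original character.
assert unquote("%C3%87") == "\u00c7"
```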
|
report:
Stefan Eissing,
10 Mar 2003,
URI-WG mailing list:
There is also chapter 1.5 (Transcribability) which uses the term URI both for the thing on the side of a bus and a string conforming to the EBNF rules of the RFC. Also, the last sentence of 1.5 should probably also be removed since now 6.3 recommends UTF-8 usage. |
|
action:
Roy T. Fielding,
02 May 2003,
draft 02:
Section 2.1 has been rewritten. I started with the text provided by Tim Bray, but soon found that much of it was simply repeating information that is explained better in later sections. So, I replaced it with the meat of the explanation (what should happen when characters get encoded), added the simplest recommendation for UTF-8, and specified the rest by clarifying the appropriate sections below it. |
042-fragment-when | fragment identifiers applied before entire content is retrieved |
---|---|
fixed 02 | fragment |
report:
Larry Masinter,
17 Apr 2003,
URI-WG mailing list:
During the discussion of temporal fragment identifiers, I've noted that some of the wording in RFC 2396 might need some minor tweaking: When a URI reference is used to perform a retrieval action on the identified resource, the optional fragment identifier, separated from the URI by a crosshatch ("#") character, consists of additional reference information to be interpreted by the user agent after the retrieval action has been successfully completed. .... The fragment identifier can be interpreted by the user agent before "the retrieval action has been successfully completed" but after it's been successfully initiated. For example, in HTML pages, the browser can scroll to the identified fragment as soon as it's been parsed, and not wait until the _entire_ document has been retrieved. Similarly, if you open a PDF file with a page identifier http://acroeng.adobe.com/BrowserTestSuite/auxurl/auxurl_testpage.html it doesn't need to download the entire file before it can turn to the appropriate page, using byte range retrieval. |
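Since the fragment is interpreted by the user agent, a client typically separates it from the rest of the reference before starting retrieval. The sketch below shows that split with Python's `urllib.parse.urldefrag`; the URI and the fragment value are made up for illustration.

```python
from urllib.parse import urldefrag

reference = "http://example.com/manual.pdf#page=12"  # hypothetical reference

target, fragment = urldefrag(reference)

# The retrieval action uses only the part before "#".
assert target == "http://example.com/manual.pdf"

# The fragment stays with the user agent, which may act on it as soon as
# enough of the representation is available.
assert fragment == "page=12"
```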
|
report:
Mark Baker,
17 Apr 2003,
URI-WG mailing list:
In the context of interpretation and not processing (which I'd classify the scroll action you describe as), I'd say that the fragid can be interpreted as soon as the (authoritative) media type of the representation is determined. |
|
action:
Roy T. Fielding,
15 May 2003,
draft 02:
This has been fixed in the rewrite for draft 02. |