003-relative-query | inconsistent resolution of query-only relative URI |
fixed 00 | relative |
report:
Miles Sabin,
23 Mar 1999,
private mail:
I've been working through the relative URI resolution
mechanism in RFC 2396, and I've spotted something which
seems a little odd. The example resolution on p.29 for,
?y
from,
http://a/b/c/d;p?q
is given as,
http://a/b/c/?y
but as far as I can make out, the resolution algorithm
suggests the result ought to be,
http://a/b/c/d;p?y
which is the result that was given in RFC 1808. It's
also the result that both Netscape 4 and IE 4 deliver.
Given that this would be an observable change in
behaviour between the two RFCs, I'm a little surprised
that it wasn't flagged up as such if the change really
was intended ...
Strangely enough, Sun's badly broken java.net.URL class
_does_ give the result specified in 2396, which makes me
suspect that something must be wrong ;-)
|
report:
Henry Holtzman,
09 Jul 2002,
private mail:
rfc2396 specifies a different browser behavior from rfc1808 in a particular
situation that I believe may be unintentional. IE & Netscape implement the
rfc1808 behavior while Opera implements the rfc2396 behavior. As appendix
G of rfc2396 makes no mention of this change, we would appreciate your
opinion on the matter.
In rfc1808, when the relative URL has no path component, but has a fragment
or a query, the client is supposed to skip step 6 of forming the absolute
URI. In step 6, among other things, the base URI is stripped of all
characters beyond the final "/".
In rfc2396, when the relative URI has no path and has a fragment, it is
specified that processing should be stopped as no new document should be
loaded, but rather navigation within the document is specified. This
change is explained in appendix G.
However, when there is no path component, but there is a query component,
processing continues. The instruction to skip stripping the
post-final-/-characters is gone in rfc2396, which means that the final part
of the base URI is stripped and so the query is not performed on the same
page as was loaded (unless that page's URI ended with a "/"). Was this
change between rfc1808 and rfc2396 intended?
The following small php application illustrates the issue. You can run it
at http://www.media.mit.edu/opera/r-url.php. You will note that Opera
(6.03) behaves very differently from Netscape and IE when executing this
page. With IE and Netscape, you can navigate within the application. With
Opera, when you click on the links within the app, you get an index page of
the directory containing the app.
It is my belief that the final characters should *not* be stripped, and
that rfc2396 should be amended to skip the stripping in the case of a
relative URI with only a query component.
<html>
<head>
<title>Example application using empty path relative URLs</title>
</head>
<body>
<h4>Example application using empty path relative URLs</h4>
<?php if ($action=="here") { ?>
Thank you for clicking here!<br><br>
<?php } else if ($action=="there") { ?>
Hey, you weren't supposed to click there!<br><br>
<?php } ?>
Please click <a href="?action=here">here</a>.<br>
Please do not click <a href="?action=there">there</a>.<br>
<br>
Thank you.
</body>
</html>
|
action:
Roy T. Fielding,
14 Oct 2002,
draft 00:
Fixed by rewriting the algorithm as pseudocode and restoring the
original RFC 1808 behavior, with the example changed accordingly.
|
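A minimal sketch of the two behaviours discussed in issue 003, assuming
Python 3 and its standard urllib.parse module. The helper
resolve_query_only and its keep_last_segment flag are hypothetical names
introduced only for this illustration; they do not come from RFC 1808,
RFC 2396, or the draft.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE = "http://a/b/c/d;p?q"


def resolve_query_only(base: str, query: str, keep_last_segment: bool) -> str:
    """Resolve a query-only reference ("?" + query) against base.

    keep_last_segment=True  keeps the whole base path (RFC 1808 behaviour).
    keep_last_segment=False strips everything after the final "/" first,
    which yields the result printed in the RFC 2396 example.
    """
    parts = urlsplit(base)
    path = parts.path
    if not keep_last_segment:
        path = path[: path.rfind("/") + 1]
    # The base query is replaced by the reference's query; no fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


rfc1808_style = resolve_query_only(BASE, "y", keep_last_segment=True)
rfc2396_example = resolve_query_only(BASE, "y", keep_last_segment=False)

print(rfc1808_style)         # http://a/b/c/d;p?y
print(rfc2396_example)       # http://a/b/c/?y
print(urljoin(BASE, "?y"))   # the standard library's own answer
```

Python's urljoin keeps the last path segment for a query-only reference,
so it agrees with the restored RFC 1808 result.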
007-empty-rel_path | relative URI syntax does not allow empty path |
fixed 00 | relative |
report:
Reese Anschultz,
17 Feb 2000,
private mail:
I have an observation regarding section -- "C. Examples of Resolving
Relative URI References" -- within this document.
The document cites that given the well-defined base URI of
http://a/b/c/d;p?q
relative URI
?y
would be resolved as follows:
http://a/b/c/?y
By my interpretation from the BNF, a query can exist as either
relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]
or
hier_part = ( net_path | abs_path ) [ "?" query ]
Since net_path, abs_path and rel_path must each be at least one character in
length, I believe that the example "?y" is not a valid URI because no
characters precede the question mark (?).
|
report:
Henry Zongaro,
12 Nov 2001,
RFC editor:
Appendix C shows an example of a relative URI Reference of "?y" with
respect to the base URI "http://a/b/c/d;p?q". However, according to the
collected syntax that appears in Appendix A, "?y" doesn't appear to be a
valid relative URI reference. The syntactic category URI-reference must
begin with an absoluteURI, a relativeURI or a pound sign. An absoluteURI
begins with a scheme, which cannot begin with a question mark; a
relativeURI begins with a net_path or abs_path, both of which begin with a
slash, or with a rel_path. A rel_path begins with a non-empty
rel_segment, which again cannot begin with a question mark.
|
report:
Bruce Lilly,
16 Jan 2002,
private mail:
Section C.2 mentions an empty reference, but the
formal syntax does not provide for that. There are
several possible changes to the formal syntax which
would permit it, e.g. change 1* to * in the
definition of rel_segment, which would permit an
empty rel_path and therefore relativeURI (however,
it would then permit a relativeURI consisting of
"?" query, which might not be desired).
Alternatively, the entire RHS of the relativeURI
definition could be bracketed, i.e. made optional,
which would permit an empty relativeURI without
permitting a lone delimited query.
|
action:
Roy T. Fielding,
20 Mar 2000,
private mail:
I don't even remember making this change, but it was broken
when draft-fielding-uri-syntax-02.txt changed from
rel_path = [ path_segments ] [ "?" query ]
to (in 03):
rel_path = rel_segment [ abs_path ]
rel_segment = 1*( unreserved | escaped |
";" | "@" | "&" | "=" | "+" | "$" | "," )
|
action:
Roy T. Fielding,
14 Sep 2002,
draft 00:
Fixed by making the path optional in the ABNF:
2396:
relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]
hier_part = ( net_path | abs_path ) [ "?" query ]
draft-00:
relative-URI = [ net-path / abs-path / rel-path ] [ "?" query ]
hier-part = [ net-path / abs-path ] [ "?" query ]
|
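A minimal sketch of what an optional path means for "?y", assuming Python 3.
The regular expression is the non-validating component-splitting expression
from Appendix B of RFC 2396, not the ABNF itself; the function name
split_reference and the dictionary keys are hypothetical.

```python
import re

# Component-splitting expression from RFC 2396, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(ref: str) -> dict:
    """Split a URI reference into its five generic components.

    A component that is absent is reported as None; the path is always
    present but may be the empty string.
    """
    m = URI_REFERENCE.match(ref)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


query_only = split_reference("?y")
empty = split_reference("")
print(query_only)
print(empty)
```

Both "?y" and the empty reference split into an empty path, which is the
case the draft-00 productions admit by making the path optional.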
008-URIvsURIref | URI versus URI Reference |
fixed 02 | terminology |
report:
Larry Masinter,
26 May 2000,
xml-uri mailing list:
When we update RFC 2396, I suggest we add an introductory paragraph
explaining that the term "URI" is used ambiguously in the community
to mean "a URI reference" (corresponding to the URI-reference BNF entity)
or "an absolute URI", and that for this reason, the term "URI" itself
is not defined in the document.
I'd probably fix the Abstract correspondingly, e.g.,
"Informally, a Uniform Resource Identifier is a compact string...."
so that people don't think that the abstract is normative.
|
report:
Jeff Hodges,
01 Jun 2001,
URI-WG mailing list:
It seems to me, in considering points raised in the "Are URI-References bound
to resources?" thread, that some subtleties might be a bit more clear if
changes along the following lines were made to RFC 2396 (i.e. in a future
revision of that doc, if any)..
4. URI References
The term "URI-reference" is used here to denote the common usage of a
^^^^ ^^^^^^^^^^^^^^^ ^
production (delete) s
resource identifier. A URI reference may be absolute or relative,
^
The term "URI reference" is a casual (i.e. natural
language) description for artifacts that are parsable
using the "URI-reference" production.
and may have additional information attached in the form of a
fragment identifier. However, "the URI" that results from such a
reference includes only the absolute URI after the fragment
identifier (if any) is removed and after any relative URI is resolved
to its absolute form. Although it is possible to limit the
discussion of URI syntax and semantics to that of the absolute
result, most usage of URI is within general URI references, and it is
impossible to obtain the URI from such a reference without also
parsing the fragment and resolving the relative form.
URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(delete)
add: URI = absoluteURI | relativeURI
add: URI-reference = [ URI ] [ "#" fragment ]
.
.
.
It seems to me that the above suggested re-write of the URI-reference
production, and the additions to the preceding text, would make it easier and
clearer to talk about "URI" artifacts and "URI-reference" artifacts and their
different abstract semantics.
Also, the _term_ "URI reference" isn't defined prior to section 4 (wherein it
is only tangentially defined, imho). Terms that are also used in sections
prior to section 4 whose explicit definition would help the document convey
its rather abstract notions to the reader are: "document" and "reference".
Explicitly defining how those terms are used and what their semantics are in
the context of URI and URI-reference artifacts would be immensely helpful
to readers.
|
report:
Tim Berners-Lee,
23 Jan 2003,
URI-WG mailing list:
I would very much like us to take the opportunity to clean up the terminology
on the URI spec which has confused people. It is my considered opinion that
this would be far preferable:
URI - the actual identifier string, with or without a #fragid.
URI reference - a string used in a language to specify a URI, for which
relative form may be used where a base exists. ((This is not the only way of
specifying the value of a URI - one can use various
character sets, namespace prefixes, etc))
|
action:
Roy T. Fielding,
23 May 2003,
draft 02:
An ABNF production for URI has been introduced to correspond to the
common usage of the term: an absolute URI with optional fragment.
The fragment identifier has been moved back into the section on
generic syntax components and within the URI and relative-URI productions,
though it remains excluded from absolute-URI. The entire text of the
specification has been revised accordingly.
|
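A small illustration of the distinction between a URI with an optional
fragment and the fragment-free absolute form, assuming Python 3 and its
standard urllib.parse.urldefrag function. The example URI and the variable
names are assumptions made only for this sketch.

```python
from urllib.parse import urldefrag

# A URI in the common sense of the term: absolute, with a fragment.
uri = "http://example.org/dir/file#frag"

# urldefrag separates the fragment from the rest of the identifier.
absolute_part, fragment = urldefrag(uri)

print(absolute_part)  # http://example.org/dir/file
print(fragment)       # frag
```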
012-simplify-IPv6 | change BNF to incorporate IPv6 better than RFC 2732 |
added 00 | IPv6 |
report:
James Clark,
20 Jul 2001,
URI-WG mailing list:
The XML schema anyURI simple type allows any string which after escaping
disallowed characters as described in Section 5.4 of XLink is a URI
reference as defined in RFC 2396, as amended by RFC 2732. This raises the
question of what exactly it takes for an implementation to check this.
Putting on one side the RFC 2732 amendments (and the consequent
non-escaping of square brackets by the XLink algorithm), I believe it's
very simple. To check a string, do the following:
1. Check that every % is followed by two hex digits.
2. Check that there is at most one # character in the string.
3. If the string contains a ":" character that precedes all "/", "?" and
"#" characters, then the string is an absolute URI and the substring
preceding the first such colon must match the regex [a-zA-Z][-+.a-zA-Z0-9]*.
4. If the string is an absolute URI (as in 3), then the first colon must not
be immediately followed by a # or the end of the string. (For example,
"foo:" and "foo:#bar" are illegal.)
I think that's it. It's not straightforward to deduce this from RFC 2396
and XLink, so I am not 100% confident.
RFC 2732 seems to radically complicate things. It adds "[" and "]" to the
set of reserved characters and removes them from unwise. This has the
effect of allowing square brackets in the query component and the fragment
component. The first problem arises with the path component. Since pchar
is defined in RFC 2396 as
unreserved | escaped |
":" | "@" | "&" | "=" | "+" | "$" | ","
it is unaffected by RFC 2732 and thus square brackets are not allowed in
the path component. This is a little bit strange, since intuitively pchar
is any uric other than "/", "?" and ";", but it complicates checking
only a little.
The big problem is with the authority component. Before RFC 2732, checking
generic URI syntax did not require any complex parsing of the authority
component, because an authority can be a reg_name, which allows one or more
of any uric other than "/" and "?". The problem is that because reg_name
is defined as:
1*( unreserved | escaped | "$" | "," |
";" | ":" | "@" | "&" | "=" | "+" )
it is unaffected by RFC 2732. Thus square brackets are not allowed to
appear arbitrarily in the authority component, but can only appear if the
authority component matches the server production (as amended by RFC 2732).
This means that a generic URI checker now has to do a complex parse of the
authority component.
This seems completely at variance with the intent of section 3.2.1 of RFC
2396:
"The structure of a registry-based naming authority is specific to the URI
scheme, but constrained to the allowed characters for an authority
component."
I would therefore suggest at a minimum that RFC 2732 should be fixed to
allow "[" and "]" in reg_name. I also think it would be cleaner and more
in harmony with RFC 2396 to also allow them in the path component. In
terms of the BNF I would suggest introducing an other_reserved symbol:
other_reserved = "&" | "=" | "+" | "$" | "," | "[" | "]"
Then in each place in RFC 2396 replace occurrences of
"&" | "=" | "+" | "$" | ","
(specifically in uric_no_slash, rel_segment, reg_name, userinfo, pchar,
reserved) by a reference to other_reserved. I believe this would also make
the BNF in RFC 2396 easier to understand.
|
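The four checks listed in James Clark's report can be sketched as follows,
assuming Python 3 and an input string whose disallowed characters have
already been escaped. The function name passes_clark_checks is hypothetical,
and the sketch sets the RFC 2732 square-bracket rules aside, as the report
does for this part; the report itself notes that the list may be incomplete.

```python
import re

# Step 3: the scheme regex quoted in the report.
SCHEME = re.compile(r"[a-zA-Z][-+.a-zA-Z0-9]*")
# Step 1: a "%" that is not followed by two hex digits.
BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def passes_clark_checks(s: str) -> bool:
    """Apply the four checks from the report to an already-escaped string."""
    # 1. Every % must be followed by two hex digits.
    if BAD_ESCAPE.search(s):
        return False
    # 2. At most one "#" in the string.
    if s.count("#") > 1:
        return False
    # 3. A ":" that precedes all "/", "?" and "#" marks an absolute URI,
    #    and the text before it must be a valid scheme.
    colon = s.find(":")
    if colon != -1:
        delimiters = [i for i in (s.find(c) for c in "/?#") if i != -1]
        if all(colon < i for i in delimiters):
            if not SCHEME.fullmatch(s[:colon]):
                return False
            # 4. The colon must not be followed by "#" or end of string.
            rest = s[colon + 1:]
            if rest == "" or rest.startswith("#"):
                return False
    return True


for candidate in ["http://a/b%2Fc", "?y", "foo:", "foo:#bar", "a%zz", "a#b#c"]:
    print(candidate, passes_clark_checks(candidate))
```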
report:
Grégoire Vatry,
04 Apr 2002,
private mail:
I report what I suspect to be an error in RFC 2732 which updates RFC 2396.
I suspect that 'uric_no_slash' set of characters has been forgotten
in the list of changes made to the URI generic syntax by RFC 2732.
Here is my line of argument:
Since:
1. The set 'uric_no_slash' stands for "same as 'uric' BUT without slash";
2. The set 'uric' is defined as:
uric = reserved | unreserved | escaped
3. Slash ("/") is part of 'reserved' set;
4. Set of 'reserved' characters is modified in RFC 2732.
As a result, point (3) of section 3. in RFC 2732 should be:
(3) Add "[" and "]" to both the set of 'reserved' characters and
the 'uric_no_slash' set:
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | "," | "[" | "]"
uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
"&" | "=" | "+" | "$" | "," | "[" | "]"
and remove them from the 'unwise' set:
unwise = "{" | "}" | "|" | "\" | "^" | "`"
|
action:
Brian E. Carpenter,
04 Apr 2002,
private mail:
This indeed appears to be an oversight, thanks. Larry Masinter is thinking about
combining these two RFCs in their next update so this needs to go on his list.
|
action:
Larry Masinter,
04 Apr 2002,
URI-WG mailing list:
I agree that this is an error in RFC 2732, and should be
folded in when we merge RFC 2732 with RFC 2396. We would
need two independent interoperable implementations of
RFC 2732 (with ipv6 addresses), though.
|
action:
Roy T. Fielding,
22 Oct 2002,
issues list:
Adding square brackets to uric_no_slash is fine, since it only affects
the opaque URI syntax. However, adding it to the other places that
James Clark suggested would allow square brackets to be used anywhere,
which is simply unwise (and why they were not allowed at all before).
I can understand why IPv6 chose square brackets as delimiters, but
allowing them in path, query, and fragment would cause too many
interoperability issues with deployed systems.
|
action:
Roy T. Fielding,
26 Oct 2002,
draft 00:
IPv6 literals have been added to the list of possible identifiers
for the host portion of a server component, as described by RFC 2732,
with the addition of "[" and "]" to the reserved, uric, and
uric-no-slash sets. Square brackets are now specified as reserved
for the authority component, allowed within the opaque part of an
opaque URI, and not allowed in the hierarchical syntax except for
their use as delimiters for an IPv6reference within host. In order
to make this change without changing the technical definition of
the path, query, and fragment components, those rules were redefined
to directly specify the characters allowed rather than continuing
to be defined in terms of uric.
|
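A short illustration of square brackets acting as delimiters for an IPv6
literal in the host, assuming Python 3 and its standard urllib.parse.urlsplit
function. The address 2001:db8::1, the port, and the path are example values
chosen for this sketch.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://[2001:db8::1]:8080/index.html")

# The brackets stay in the authority component, where they separate the
# colons of the IPv6 literal from the colon that introduces the port.
print(parts.netloc)    # [2001:db8::1]:8080
print(parts.hostname)  # 2001:db8::1
print(parts.port)      # 8080
print(parts.path)      # /index.html
```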
017-rdf-fragment | RDF does not believe in same-document references |
fixed 02 | fragment |
report:
Jeremy Carroll,
10 Apr 2002,
URI-WG mailing list:
This is a comment about RFC 2396 that I have been actioned to send on behalf
of the W3C RDF Core Working Group [1]
The key issue concerns resolving same document references and/or resolving
against non-hierarchical URIs.
These have been causing us difficulty in using xml:base.
As one of our deliverables we produce test cases [2].
A summary table of our URI resolution problems is as follows;
the answers we have agreed are in the attached HTML file.
EASY:
a "http://example.org/dir/file" "../relfile"
b "http://example.org/dir/file" "/absfile"
c "http://example.org/dir/file" "//another.example.org/absfile"
GETTING HARDER:
d "http://example.org/dir/file" "../../../relfile"
e "http://example.org/dir/file" ""
f "http://example.org/dir/file" "#frag"
MASTER CLASS:
g "http://example.org" "relfile"
h "http://example.org/dir/file#frag" "relfile"
i "http://example.org/dir/file#frag" "#foo"
j "http://example.org/dir/file#frag" ""
k "mailto:Jeremy_Carroll@hp.com" "#foo"
l "mailto:Jeremy_Carroll@hp.com" ""
m "mailto:Jeremy_Carroll@hp.com" "relfile"
We have reached consensus on and approved all these tests except for the
last which some of us consider an error and others resolve as indicated in
the html file.
The rationales for our views are approximately as follows:
d "http://example.org/dir/file" "../../../relfile"
[[[RFC2396
In practice, some implementations strip leading relative symbolic
elements (".", "..") after applying a relative URI calculation, based
on the theory that compensating for obvious author errors is better
than allowing the request to fail.
]]]
Not permitted in RDF/XML.
e,f,i,j,k,l
Base does apply to same document references in RDF/XML
g
Failure to insert / is a bug with RFC 2396
h,i,j
Strip frag id from base uri ref before resolving.
Notice j is particularly surprising.
k,l
Same document reference resolution even works for non-hierarchical uris.
m
- no consensus
The test suite is structured as follows:
The positive tests on the test cases web site show a usage of xml:base in
RDF/XML and the resolution of that usage in terms of the RDF graph produced
(with absolute URI ref labels). Each test consists of two files, an RDF/XML
document and an n-triple file (substitute .rdf with .nt in the URL), being a
list of the edges of the graph.
The negative test case shows possibly illegal usage of xml:base in RDF/XML.
[1] http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Apr/0008.html
[2] http://www.w3.org/2000/10/rdf-tests/rdfcore/xmlbase/
|
report:
Jeremy Carroll,
15 Apr 2002,
URI-WG mailing list:
I do not recall the RDF Core WG having resolved a justification of the
decision in favour of these test cases. Hence I will give my own
justification.
First:
The actual decisions of the RDF Core WG reflect what 'same document
references' mean within an RDF/XML document within the scope of an xml:base
attribute. Primarily the WG decisions reflect the meaning of RDF/XML rather
than XML Base or RFC 2396. However, these decisions do point to weaknesses
in RFC 2396.
The RDF Core WG has consistently (with or without xml:base) interpreted all
uri references as absolute uri references. The decisions clarify that when
the normal uri resolution mechanisms deliver a same document reference, we
form the absolute uri ref using the currently in scope xml:base uri.
Second:
The definition of same-document references is unfortunately focussed on
browsing:
[[[
4.2. Same-document References
A URI reference that does not contain a URI is a reference to the
current document. In other words, an empty URI reference within a
document is interpreted as a reference to the start of that document,
and a reference containing only a fragment identifier is a reference
to the identified fragment of that document. Traversal of such a
reference should not result in an additional retrieval action.
However, if the URI reference occurs in a context that is always
intended to result in a new request, as in the case of HTML's FORM
element, then an empty URI reference represents the base URI of the
current document and should be replaced by that URI when transformed
into a request.
]]]
line 3 "start of that document" is meaningless for an RDF document.
RDF is a graph and is not a linear structure.
line 6 "no additional retrieval action" All URIrefs in RDF are absolute, and
none are retrieved except when the application context "is always intended
to result in a new request".
The RDF Core is trying to clarify which absolute URI ref corresponds to a
same document ref.
line 9 The answer, at least for empty same document refs, is the "base
URI".
We discover what a base URI is in section "5.1 Establishing a Base URI"
[[[
5.1. Establishing a Base URI
The term "relative URI" implies that there exists some absolute "base
URI" against which the relative reference is applied. Indeed, the
base URI is necessary to define the semantics of any relative URI
reference; without it, a relative reference is meaningless. In order
for relative URI to be usable within a document, the base URI of that
document must be known to the parser.
]]]
I note that the algorithm in
5.2. Resolving Relative References to Absolute Form
amongst its defects, does not implement line 9 of section 4.2.
Once we are dynamically changing the xml:base from one element to the next,
we are outside the design bounds of RFC 2396.
If we consider only documents with a single xml:base on their outermost
elements, then as far as RDF goes, the resolution of the same document test
cases is consistent with section 4.2 of RFC 2396. A same document
reference, like any uri ref, in an RDF file means an absolute URI ref. The
absolute URI ref is formed by taking "the base URI" of the document, as
suggested in line 9 of 4.2. The fragment part is taken from the same
document reference.
|
report:
Al Gilman,
15 Apr 2002,
URI-WG mailing list:
The bad news:
In fact, "the same document" in fragment-only relative references should be
taken even more locally and particularly than "the URI from which this
representation was recovered." The latter reading is inadequate, an error.
It should be read as "this representation." So the type is known, and with
it the semantics of #fragment references. Without recourse to _even_ the
URI from which it was recovered. As Paul suggested. For hyperlinks with
goTo semantics, where the absolute URI equivalent of the reference is
unnecessary, it is moot and therefore not defined. The best available
absolute reference (nearest to equivalent) would be base-ified using the
URI from which this representation was recovered, but that question has
no need and no standing in the case of following hyperlinks in browsing
the same "recovered representation." There is no general answer, absent
a universal document type (see next).
The good news:
The semantics of #fragment in "the current document" is governed by the
_type_ of the recovered representation of the URI accessed. So for RDF
to apply the semantic constraint that a #fragment reference is equivalent
to a given absolute URI -- within a representation which belongs to a type
which by its type definition is bound to the constraints of the RDF
model -- is entirely within the purview of the specification of the
RDF model and the languages in which it is represented.
This violates the universality goal that any URI-reference can be used
in any place a URI-reference can be used, but that is a different matter.
This is also violated by having some references take anyURI and others
limited to IDREF in the same document. The RDF restriction to
absolute-URI-reference senses for fragment-URI-reference signs does not
violate RFC-2396, at least. This is just that the RDF model only admits
of 'absolute' references. So references in any syntax binding of the
RDF model will only contain 'absolute' URI-references.
|
report:
Brian McBride,
15 Apr 2002,
URI-WG mailing list:
First: the problem RDF is trying to solve. The current RDF specs have
encouraged the use of the following idiom:
<rdf:Description rdf:about="#foo">
...
The value of the rdf:about attribute is turned into an absolute URI
reference by concatenating the '#foo' with the URI of the containing document.
This causes problems. Folks copy the file from the web to their hard drive
so they can work on it in a plane, and the uri changes to something like
file:c:\temp\....rdf and this is really useless for rdf users. Or folks
wish to include RDF in say a message protocol where there is no base uri
of the document.
This is the cause of one of, if not the, most frequent newbie problem with
DAML that we see on jena-dev.
So we are looking for a way to retain this convenient syntax, but have the
uri's produced not change when the file is copied or mirrored.
To appreciate what is happening here, we need to look at a semi-fictional
RDF processing pipeline:
input xml document --
xml parser -- rfc2396 processor -- rdf parser -- rdf graph
We start with an xml document and end up with a datastructure. The
datastructure is not a DOM; it's not a representation of an xml
document. It is as far as xml is concerned, an application data structure.
For each value of an rdf:about attribute, the rfc2396 processor outputs
either an absolute URI or a same document reference. The absolute URI is
processed according to RFC2396. Same document references are recognised
according to RFC 2396.
All is in conformance with rfc 2396 at this point.
Now the RDF parser comes in to play and it is required to transform the
value of each rdf:about attribute into an absolute uri reference. If the
RFC 2396 processor has produced an absolute uri reference, it need do
nothing. If however, it is a same document reference, then, just as a
browser will handle same document references specially, so does RDF. It
transforms the same document reference into an absolute URI according to an
algorithm defined by the RDF specs. The mimetype of an rdf document will
be text/xml+rdf. As far as xml base and rfc 2396 are concerned, this is
application code over which they have no say.
What I have tried to do here is to position RDF as an application built on
top of XML and to suggest that XML should not be allowed to express
constraints on how applications process it.
There is a deal of sophistry in this argument :( but RFC 2396 doesn't
really meet our needs. Are there any plans to update/refine it in the near
future?
|
report:
Brian McBride,
30 Jan 2003,
URI-WG mailing list:
Please review the RDFCore last call working drafts which are linked from
http://www.w3.org/2001/sw/RDFCore/#documents
Whilst we would welcome your comments on any and all aspects of these
documents, the WG particularly requests feedback on:
o the proposed use of xml:base, and especially its handling of
same document references
http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-ID-xml-base
http://www.w3.org/TR/rdf-syntax-grammar/#section-baseURIs
o the rdf interpretation of fragment identifiers
http://www.w3.org/TR/rdf-concepts/#section-fragID
The last call period for these documents ends on 21 Feb 2003.
|
report:
Graham Klyne,
05 Mar 2003,
URI-WG mailing list:
Is there a way to specify a fragment identifier relative to the document
in the current base URI? I can't see a way to do this.
If I use ./#frag, then the final path component of the base URI is omitted.
So I see no way of indicating a fragment of the base URI without including
some part of the base URI.
Er, "#frag", right? What am I missing?
According to the URI spec, that is relative to the *current document*,
as opposed to the current base URI. For example, when xml:base is used
within an XML document, the #frag is not (as I understand) relative to
the base URI.
The URI spec is quite explicit about stating that when resolving
#frag relative to some base URI, it refers to a fragment of the
*current document* as distinct from the base URI;
cf. algorithm in section 5.2.
|
action:
Roy T. Fielding,
07 Mar 2003,
URI-WG mailing list:
Note that this issue is a request to change the "current document"
algorithm. This can be accomplished by changing the spec to remove
the bit about current document and instead replace the empty URI with
the base URI, later stating that a retrieval action must not take place
if the new URI differs from the base URI only by its fragment.
|
report:
Rob Cameron,
05 May 2003,
URI-WG mailing list:
In my implementation, I've assumed the following change in
the pseudocode for the algorithm in 5.2
if (R.path == "") then
if defined(R.query) then
T.path = Base.path;
T.query = R.query;
else
-- An empty reference refers to the current document
return (current-document, fragment);
endif;
becomes
if (R.path == "") then
T.path = Base.path;
if defined(R.query) then
T.query = R.query;
else
T.query = Base.query;
endif;
This seems consistent with the requests of the RDF group and
gives a clean, well-behaved algorithm.
|
action:
Roy T. Fielding,
23 May 2003,
draft 02:
Removed the special-case treatment of same-document references in favor of a
section that explains that a new retrieval action should not be made if the
target URI and base URI, excluding fragments, match.
|
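A minimal sketch of the revised empty-path branch from Rob Cameron's report,
together with the "match excluding fragments" test described in the draft 02
action, assuming Python 3 and its standard urllib.parse module. The function
names resolve_empty_path_reference and is_same_document are hypothetical, and
only references with no scheme, authority, or path are handled.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit


def resolve_empty_path_reference(base: str, ref: str) -> str:
    """Resolve "", "#frag", "?query" or "?query#frag" against base."""
    b = urlsplit(base)
    r = urlsplit(ref)
    if r.scheme or r.netloc or r.path:
        raise ValueError("this sketch only handles empty-path references")
    # urlsplit cannot tell "?" with an empty query from no query at all,
    # so look for the "?" delimiter before any fragment.
    query_defined = "?" in ref.split("#", 1)[0]
    query = r.query if query_defined else b.query
    # The base's own fragment is dropped; the reference's fragment is used.
    return urlunsplit((b.scheme, b.netloc, b.path, query, r.fragment))


def is_same_document(target: str, base: str) -> bool:
    """True when target and base match once fragments are excluded."""
    return urldefrag(target).url == urldefrag(base).url


base = "http://example.org/dir/file#frag"
for ref in ["", "#foo", "?y"]:
    target = resolve_empty_path_reference(base, ref)
    print(repr(ref), target, is_same_document(target, base))
```

With this branch the fragment of the base is stripped before resolving, in
line with the rationale given for test cases h, i and j, and a fragment-only
reference against a non-hierarchical base such as a mailto URI still yields
an absolute result.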
019-URI-URL-URN | URI/URL/URN contemporary view |
fixed 03 | terminology |
report:
Michael Mealling,
01 May 2002,
URI-WG mailing list:
I think the consensus built in the IG and reported in
draft-mealling-uri-ig-02.txt is a good place to start.
Especially the recommendation:
1. The W3C and IETF should jointly develop and endorse a model for
URIs, URLs and URNs consistent with the "Contemporary View"
described in section 1, and which considers the additional URI
issues listed or alluded to in section 3.
Just so you won't have to go dig the draft up, this is the "Contemporary
View":
Over time, the importance of this additional level of hierarchy
seemed to lessen; the view became that an individual scheme does not
need to be cast into one of a discrete set of URI types such as
"URL", "URN", "URC", etc. Web-identifier schemes are in general URI
schemes; a given URI scheme may define subspaces. Thus "http:" is a
URI scheme. "urn:" is also a URI scheme; it defines subspaces,
called "namespaces". For example, the set of URNs of the form
"urn:isbn:n-nn-nnnnnn-n" is a URN namespace. ("isbn" is an URN
namespace identifier. It is not a "URN scheme" nor a "URI scheme").
Further according to the contemporary view, the term "URL" does not
refer to a formal partition of URI space; rather, URL is a useful but
informal concept: a URL is a type of URI that identifies a resource
via a representation of its primary access mechanism (e.g., its
network "location"), rather than by some other attributes it may
have. Thus as we noted, "http:" is a URI scheme. An http URI is a
URL. The phrase "URL scheme" is now used infrequently, usually to
refer to some subclass of URI schemes which exclude URNs.
|
action:
Roy T. Fielding,
27 Oct 2002,
draft 00:
Fixed by rewriting the section on URI, URL, and URN, and changing
all use of the term URL in the specification to URI.
|
report:
Tim Bray,
21 Feb 2003,
URI-WG mailing list:
Sec 1.2 - the spec says it deprecates the terms URL and URN and
I'm not sure it really does. What it's really deprecating is the notion
of a clean useful separation between locators and names. I've never seen
"URN" used in this sense anyhow, in fact I've never seen it used aside
from a reference to what the URN RFC defines, which is hard to argue
against. If you want to deprecate the term URL that's at least
consistent, although once again I have some nervousness about trying,
in the Academie Francaise style, to stop people from using words they
want to use. Potential reword of the paragraph:
'An individual scheme does not need to be classified as being just one of
"name" and "locator". Instances of URIs from any given scheme may have
the characteristics of names or locators or both, often depending on the
persistence and care in the assignment of of identifiers by the naming
authority, rather than any quality of the scheme. For this reason,
this specification deprecates the use of the term URN for anything
but URIs in the "urn" scheme as described in RFC 2141.
This specification also deprecates the term "URL".'
Sec 1.2, fourth para; the phrase "just like any identifier" is superfluous.
|
action:
Roy T. Fielding,
02 May 2002,
draft 02:
Done.
|
report:
Tony Hammond,
27 May 2003,
URI-WG mailing list:
Just a brief comment on the revised draft. This passage from end 2nd para,
section 1.1.3, strikes me as very peculiar:
'This specification deprecates use of the term "URN" for anything
but URIs in the "urn" scheme [RFC2141]. This specification also deprecates
the term "URL".'
Given that a URI scheme may be classified as a 'locator', a 'name' or both,
how can the term 'URL' be deprecated while maintaining the term 'URN'? This
seems to introduce an imbalance into the glossary of terms. Surely in the
contemporary view the only term of any significance is 'URI'. IMO the term
'URN' should be deprecated wholesale along with the term 'URL'. The 'urn'
scheme just marks out a certain class of URIs which have a particular
semantics - i.e. 'persistence'.
|
action:
Roy T. Fielding,
06 Jun 2003,
draft 03:
Removed the two sentences on deprecation -- they are not worth the
effort that has already been spent on it.
|
024-identity | Resource should not be defined as anything that has identity |
fixed 06 | terminology |
report:
Miles Sabin,
09 Sep 2002,
URI-WG mailing list:
http://lists.w3.org/Archives/Public/uri/2002Sep/0016.html
At issue is the first sentence of the informal definition of resource in
RFC 2396 1.1,
A resource can be anything that has identity.
"that has identity" is redundant because *everything* has identity in
the only reasonably straightforward understanding of identity, ie. the
logical truth in all but the most obscure formal systems that,
(Vx) x = x
Even though redundant, this qualifier has had the unfortunate
consequence of leaving this sentence open to wildly different
interpretations,
* It has been read as implying that the set of possible resources is a
subset of the set of things: the subset that has identity as opposed
to the subset that doesn't. Dan Brickley reports that this confusion,
and the subsequent hunt for things which *don't* have identity and
some means for identifying them, has caused trouble in RDF circles.
* It has been misread as,
A resource can be anything that has an identifier (eg. a URI).
* It has been misread as,
A resource can be anything that can be identified (via some
effective mechanism).
I don't believe that any of these were the authors' intent, so to clear
up any confusion, the "that has identity" qualifier should be dropped.
That still leaves open the question of whether or not the residual,
A resource can be anything.
is either true or makes sense. This is controversial, no doubt, but it's
better not to have the controversy obscured by a distracting
qualification.
|
action:
Roy T. Fielding,
12 Sep 2002,
issues list:
The sentence says "can be", which implies exactly what I meant it to
imply: that anything with identity can be a resource but not necessarily
is a resource. I see no reason to change it. The important bit is that
sameness of identity is the important characteristic -- the defining
characteristic -- of a resource.
The goal of the sentence is to describe the essence of what it means
to be a resource. None of the other suggestions do that.
|
report:
Pat Hayes,
21 Apr 2003,
URI-WG mailing list:
1. I appeal to the WG to please explain in more detail what the word
"resource" is intended to refer to, if only in broad outline. In particular,
if there is an intent to limit the meaning of "resource" to some subset of
the universe of logically possible entities, it would be most valuable if
this could be spelled out as clearly as possible. This issue appears to be
central to many aspects of the semantic web, and probably to the web more
generally. The language of the introductory sections of RFC 2396, reproduced
in the current version of your document draft, is not sufficient to achieve
a clear communication of this intent as it stands.
As some examples, are any of the following NOT resources in the sense used
in your document?
a. A document which has not yet been written, eg a book in progress, which
has not (yet) been assigned a title or ISBN number.
b. A particular elephant, eg one in a zoo.
c. A particular elephant which is now dead, eg the original Jumbo.
d. A particular elephant which it is hoped will be the product of a future
mating between two elephants.
e. Santa Claus (in any sense, eg as a fictional character, or as a concept
in folk mythology, or whatever. Or use Sherlock Holmes or Superman or any
other fictional character, if you prefer.)
f. The planet Mars.
g. The number one thousand seven hundred and twenty-nine.
h. An abstract class or category, such as the class of all types of French
red wine.
----
2. Miles Sabin, in an archived email comment, points out that the phrase 'that
has an identity' is redundant as a qualifier, since everything necessarily has
an identity. Your response says that 'The goal of the sentence ("A resource can
be anything that has an identity.") is to describe the essence of what it means
to be a resource' and that 'sameness of identity is the ... defining
characteristic of a resource'. The only way I can interpret this is as saying
that a resource can be anything, since the defining characteristic is
apparently a tautology. Is that what you intended? If not, can you clarify your
intended meaning? In particular, how do the following sentences differ in
meaning, in your view?
A. Anything with identity can be a resource but not necessarily is a resource.
B. Anything can be a resource but not necessarily is a resource.
It might help if you could indicate what you consider the phrase 'has an
identity' to mean, particularly when used as a qualifier, perhaps by giving an
example of something that does not have an identity, in your sense.
----
3. I would like to ask for some explication of the use of the words "can be" in
the definition, to which you draw attention in your reply to Sabin. I take it
that this is intended to convey that there is a distinction between entities
which could possibly be resources, and those that actually are resources. If
this is right, can you explain the criteria for distinguishing actual from
merely possible resources? That is, suppose X is something which *could be* a
resource; what would make X *actually be* a resource?
Can something become an actual resource at a time, or cease to be a resource at
a time? Can something be intermittently an actual resource, or must each actual
resource have an uninterrupted period during which it is being the resource
that it in fact is? Questions like this will be central if we try to make
formal theories of resource-hood for use by reasoners.
----
4. RFC 2396 includes a particular note which is very hard to interpret: "The
resource is the conceptual mapping to an entity or set of entities, not
necessarily the entity which corresponds to that mapping at any particular
instance in time. Thus, a resource can remain constant even when its
content---the entities to which it currently corresponds---changes over time,
provided that the conceptual mapping is not changed in the process."
There are several problems with this.
First, it does not specify what it means by "conceptual mapping", nor how such
a mapping can remain constant while its range changes.
Second, it does not say what is meant by the phrase "entity which corresponds
to [a] mapping at [an] instant of time". What does it mean for something to
'correspond to' a mapping?
Third, the use of the word "content" seems to suggest that resources are
something like representations or descriptions, rather than the entities which
are represented or described; but this seems to be at odds with what the
document says in the immediately preceding paragraphs. For example, we are told
explicitly that a person or a book can be a resource, but neither people nor
books are the kinds of entity which would normally be described as having
"content".
Fourth, the reference to time and change seems to imply that resources are
inherently temporal or dynamic in their nature; but this does not seem to be
reflected in any other part of the document, or in URI syntax, or in the
examples given explicitly in the immediately previous paragraphs. For example,
what kind of mappings can have different things 'corresponding' to them at
different times?
Fifth, is this paragraph supposed to apply to all resources, or only to
indicate that some resources may be dynamic in the way indicated?
(My purpose, let me emphasize, is not to urge that any particular
interpretation be put on these words, only that their intended meaning be
spelled out more clearly. )
----
5. The RFC 2396 text explicitly asserts that "not all resources are network
"retrievable" ", but almost immediately then says: "having identified a
resource, a system may perform a variety of operations on the resource, as
might be characterized by such words as 'access', 'update', 'replace' or
'find attributes' "
These assertions seem to be at odds with each other, and to reflect different
notions of 'resource', since the second sentence seems to refer only to
entities which are "network-retrievable". Clearly, a resource which is not
retrievable is not available to have operations performed on it, even if it is
in some sense identified. As an example, the SS number of a dead US citizen
is sufficient to 'identify' that person in a sense, but does not provide any
way to perform operations on the deceased.
Again, it would be helpful if the apparent contradiction could be explained.
|
action:
Roy T. Fielding,
22 Apr 2003,
issues list:
I explained rfc2396's usage of "identity" in
http://lists.w3.org/Archives/Public/www-tag/2002Jul/0128.html
|
report:
Tim Bray,
22 Apr 2003,
URI-WG mailing list:
I have a suggested wording change, because while I have been largely
unimpressed by the philosophical jargon being thrown around here recently,
I do agree that the current definition "A resource can be anything that has
identity" offers significant room for improvement; among other things it
deserves to be called out and not sequestered in a <dd>.
Here you go:
Resources and URIs
Many different abstract, informational, and physical things may be resources.
URIs exist to identify resources, but this "identity" relationship has both
social and technical dimensions.
For example, it is incontrovertible that the URI http://www.tbray.org/A0.png
identifies a resource which is a particular bitmapped graphic (I assert this, I
control tbray.org, and the assertion is verifiable via technical means) and
that the URI http://www.w3.org/1999/xhtml identifies a resource which is a
well-known markup vocabulary (established by social convention). It is
possible for ambiguity to enter this relationship; for example, does
http://www.w3.org/Consortium identify an organization or a particular HTML page
on its website?
A few principles apply:
- While the definitions of URI and Resource are somewhat circular, the
existence of a URI does not imply the existence of a resource. For example,
the URI http://example.com/386751531 identifies no resource.
- Formally, resources could exist without URIs - for example, there is a
picture of my cat somewhere on http://www.tbray.org but I'm not publishing a
URI. However, such resources have no practical import or utility.
- URI schemes may impose constraints on the types of resource they identify;
for example, ftp: URIs identify files and directories accessible using the
FTP protocol.
- Ambiguity in the characterization of what resource a URI identifies is always
undesirable and reduces the utility of both the resource and the URI.
|
action:
Roy T. Fielding,
23 Apr 2003,
issues list:
A ridiculous amount of discussion took place on the mailing list regarding
this issue without illuminating it further, so I won't copy it here
except by reference to the main threads:
http://lists.w3.org/Archives/Public/uri/2003Apr/0028.html
http://lists.w3.org/Archives/Public/uri/2003Apr/0041.html
http://lists.w3.org/Archives/Public/uri/2003Apr/0062.html
|
report:
Joshua Allen,
23 Apr 2003,
URI-WG mailing list:
As far as I can tell, it is only the differing choice of words
that makes everyone appear to be disagreeing. As long as the words
chosen are clearly defined, I see no point in getting hung up over
*which* word is used.
We divide the world up like this:
A. There are things. Everything is a "thing". There are no exceptions.
B. There are things which *might* have a URI bound to them.
C. There are things which *do* have a URI bound to them.
Is B the same thing as A? *That* question is irrelevant and not worth
arguing about IMO. It seems like the only *legitimate* confusion is
around the names for A/B and C.
I personally have always thought that:
A="thing", B="resource", and C="resource with a URI".
You (MM) are saying that:
A="thing", B="thing", C="Resource"
I personally have no problem accepting your naming for "C", so long as
it is very clear that this is different than A or B. I would also
(personally) suggest that terminology be kept clear by using:
A="thing", B="thing which hasn't been bound to a URI", C="Resource".
|
action:
Roy T. Fielding,
27 Apr 2003,
draft 02:
I have rewritten the definitions. It would be pointless to attempt to
further define words that can be found in any dictionary. Instead, I
added more examples and chose words that are less likely to prick the
sensibilities of those who use URIs only for denotation. Additional
terminology will be addressed in issue 022-definitions.
|
action:
Roy T. Fielding,
02 Apr 2004,
draft 05:
The definition has been updated to
Resource
Anything that has been named or described can be a resource.
Familiar examples include an electronic document, an image, a
service (e.g., "today's weather report for Los Angeles"), and a
collection of other resources. A resource is not necessarily
accessible via the Internet; e.g., human beings, corporations, and
bound books in a library can also be resources. Likewise, abstract
concepts can be resources, such as the operators and operands of a
mathematical equation, the types of a relationship (e.g., "parent"
or "employee"), or numeric values (e.g., zero, one, and infinity).
These things are called resources because they each can be
considered a source of supply or support, or an available means,
for some system, where such systems may be as diverse as the World
Wide Web, a filesystem, an ontological graph, a theorem prover, or
some other form of system for the direct or indirect observation
and/or manipulation of resources. Note that "supply" is not
necessary for a thing to be considered a resource: the ability to
simply refer to that thing is often sufficient to support the
operation of a given system.
|
report:
Dan Connolly,
24 May 2004,
URI-WG mailing list:
Dan requests that this issue be reopened due to the change from "can be"
to "has been" in draft 05. The thread can be viewed at
http://lists.w3.org/Archives/Public/uri/2004May/0026.html
Suggested change:
A resource can be anything; familiar examples include an
electronic document, an image, a service (e.g., "today's weather
report for Los Angeles"), and a collection of other resources,
but there is no constraint on what is a resource.
|
action:
Roy T. Fielding,
15 Jul 2004,
draft 06:
The definition has been updated to
Resource
This specification does not limit the scope of what might be a
resource; rather, the term "resource" is used in a general sense
for whatever might be identified by a URI. Familiar examples
include an electronic document, an image, a source of information
with consistent purpose (e.g., "today's weather report for
Los Angeles"), a service (e.g., an HTTP to SMS gateway), a
collection of other resources, and so on. A resource is not
necessarily accessible via the Internet; e.g., human beings,
corporations, and bound books in a library can also be resources.
Likewise, abstract concepts can be resources, such as the
operators and operands of a mathematical equation, the types
of a relationship (e.g., "parent" or "employee"), or numeric
values (e.g., zero, one, and infinity).
Identifier
An identifier embodies the information required to distinguish
what is being identified from all other things within its scope
of identification. Our use of the terms "identify" and
"identifying" refer to this purpose of distinguishing one resource
from all other resources, regardless of how that purpose is
accomplished (e.g., by name, address, context, etc.). These terms
should not be mistaken as an assumption that an identifier
defines or embodies the identity of what is referenced, though that
may be the case for some identifiers. Nor should it be assumed that
a system using URIs will access the resource identified: in many cases,
URIs are used to denote resources without any intention that they
be accessed. Likewise, the "one" resource identified might not be
singular in nature (e.g., a resource might be a named set or a
mapping that varies over time).
|
031-query-def | query definition |
fixed 02 | query |
report:
Hrvoje Simic,
13 Nov 2002,
URI-WG mailing list:
In section 3.4. RFC 2396 says: "The query component is a string of
information to be interpreted by the resource." If the resource is
identified before the query component is interpreted, why is the query a
part of the identifier? [1] I believe the RFC 2396 revision should
redefine the query component of the URI.
I found that Jim Whitehead had the same complaint on the definition four
years ago:
[[ This implies to me that if it is to be interpreted by the resource,
it cannot also be identifying that resource. My rationale is the
resource needs to be identified first, before the query component can be
passed to it for interpretation, hence the query component cannot be
part of the resource identifier. ]] [2]
Larry Masinter replied:
[[ I can see now how you'd come to that conclusion; it does sound that
way. But I'll claim that we didn't MEAN IT. ]] [3]
More recent posts by Mark Nottingham:
[[ mailto allows you to specify a subject, body, etc. in the query
component, which is defined by 2396 as: "...a string of information to
be interpreted by the resource." Considering other uses of queries, this
seems to fit in nicely. ]] [4]
[[ This touches on something that's been on my mind for a while. If a
query is "a string of information to be interpreted by the resource,"
isn't it the case that a URI with a query refers to a resource, rather
than just identifies one? E.g., <http://www.example.com/foo?bar=baz> is
a reference to the resource <http://www.example.com/foo>. I.e.,
shouldn't the definition of URI-Reference (rather than URI) include not
only fragments, but also queries? ]] [5]
Reply by Martin Duerst:
[[ Definitions are often chosen on their practical value, rather than on
philosophical considerations. In this case, the URI is what you (e.g.)
send to the server, the URI Reference is what you (e.g.) put into an
attribute. ]] [6]
My ideas on redefinition: query should be "identifying the resource
within the scope of that scheme and authority" just as the path is. The
difference between the components may be in ordering: while the path
segments must be in strict order (defining the path through a
hierarchy), query segments may be in arbitrary order, like "parameters"
or "switches". Information in query segments may also be optional and
generally more detailed than the path segments [1].
As for the troubling "mailto query", no such thing exists. The "mailto"
scheme doesn't comply with the "generic URI" syntax from the section 3
of the RFC 2396. The defining document, RFC 2368, in section 2 defines
"headers" with similar syntax but unrelated to RFC 2396 "query".
Hrvoje Simic
FER, University of Zagreb, Croatia
mailto:hrvoje.simic@fer.hr
mailto:hrvoje.simic@zg.hinet.hr
[1] http://www.tel.fer.hr/users/hsimic/cuc2002
[2] http://lists.w3.org/Archives/Public/w3c-dist-auth/1998OctDec/0180.html
[3] http://lists.w3.org/Archives/Public/w3c-dist-auth/1998OctDec/0201.html
[4] http://lists.w3.org/Archives/Public/uri/2002Apr/0010.html
[5] http://lists.w3.org/Archives/Public/uri/2002Apr/0011.html
[6] http://lists.w3.org/Archives/Public/uri/2002Apr/0014.html
|
report:
Mark Nottingham,
13 Nov 2002,
URI-WG mailing list:
Those feel like guidelines more than hard semantics; IIRC, the main
distinction between URI path segments and URI parameters is that
parameters aren't ordered, so that aspect doesn't distinguish queries.
Perhaps what does distinguish queries is that while they are used in
identifying the resource, they aren't used directly in
locating/dereferencing it; just as fragment identifier semantics are
interpreted on the client side in the scope of the resource's
representation, so queries are interpreted on the server side in the scope
of the located resource (which may be a new concept).
|
report:
Graham Klyne,
13 Nov 2002,
URI-WG mailing list:
How they are interpreted is entirely up to the software that provides
access to resources for the indicated authority.
|
report:
Hrvoje Simic,
14 Nov 2002,
URI-WG mailing list:
1) Should the query component be redefined, and how?
Yes, but it's hard to think up a good definition. In the "classic" Web,
it was the parameters you passed to the program found in a file on a
computer using a protocol. Now these concepts of protocol, computer,
file path and parameters are much more abstract. Should it be
"http://about.example.org" or "http://example.org/about"?
"/messages/1-10" or "/messages?from=1&to=10"? Are there any "hard
semantic" reasons for preferring one solution over the other, or just
guidelines? Evolution of URI towards an abstract identifier blurred the
differences between its components. Path is effectively defined for URIs
"hierarchical in nature", which sounds like a guideline.
Query may be left opaque and abstract, something like: "URI component of
arbitrary syntax left for server-specific purposes". Or we may crack it
open and come to the next issue:
2) Should the definition include details about the query structure (like
it did for the path)?
I see that almost every message in this thread mentions query structure.
But RFC 2396 and RFC 2616 (defining http-URI) don't include such
details. My name for the parts of the query (separated with ampersands
or semicolons) is "query segments" - just to make query sound more like
the path.
I agree that the query should preserve the order of its segments. The
order may matter to the specific server. Anyway, the segments must be
listed in _some_ order, and I see no advantage in allowing the network
to shuffle them. What I really meant was: path segments must be parsed
in the fixed order, from left to right. If you have "a/b/c" you parse
"a" to identify the branch in the next level of hierarchy and you hand
over "b/c" to it. But if you have "?a;b;c" you can look for a "b" and
then continue to parse the "?a;c". This allows clients to communicate
information about resource's identity that isn't naturally placed in the
hierarchy, i.e. that doesn't fit nicely in a sequence of steps through
the hierarchy.
[1] http://www.w3.org/TR/html401/appendix/notes.html#h-B.2.2
[2] http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4
|
report:
Mark Nottingham,
14 Nov 2002,
URI-WG mailing list:
'Semantics' isn't the correct term to use; Graham pointed out that this
implies too much. His suggestion was 'processing model', and that seems
to capture it very well; When used as a locator, a URI has a processing
model that is used (usually to retrieve a representation of the
resource). Each URI scheme defines its own processing model that enables
location of resources of that type.
The question, then, is whether these (not resource-, site- or
non-location) processing models are exclusive to the agent that is doing
the location. If the query is considered as data to be consumed by the
resource, this means that the processing model is effectively
distributed; the resource consumes part of the URI as well (which indeed
seems to be the case today for most uses of query).
Remember that 'Resource' is an abstract concept; it does not have to
have a one to one mapping to code on the back end. Therefore, I see no
problem with saying that query is data to be consumed by the resource
during the process of location; if the resource happens to be spread
across several back-end facilities on the server, so be it.
To summarize, then, the idea is that:
* Query is part of the URI for the purposes of identification; every URI
with a different query string is a different identifier (just as it is
now).
* Query is data to be consumed by the resource during the process of
location (just as 2396 says).
* It is worthwhile to distinguish between URIs and URLs, not because
they identify different things, but because these terms can be used to
distinguish between different contexts of use - identification vs.
location.
E.g.,
Given:
http://www.example.com/foo?bar
When used as an identifier (URI), the resource is
http://www.example.com/foo?bar
While, in the context of a locator (URL), the resource located is
considered
http://www.example.com/foo
I realise that this is a largely theoretical problem, in that it doesn't
affect how anything actually works; however, it may affect how people
think, which is just as, if not more, important.
|
action:
Roy T. Fielding,
15 May 2003,
draft 02:
The definition of query has been rewritten for draft 02:
The query component contains non-hierarchical data that, along with
data in the path component, serves to identify a resource within
the scope of that URI's scheme and naming authority (if any).
The query component is indicated by the first question-mark ("?")
character and terminated by a crosshatch ("#") character or by the
end of the URI.
query = *( pchar / "/" / "?" )
The characters slash ("/") and question-mark ("?") are allowed to
represent data within the query component, but such use is discouraged;
incorrect implementations of relative URI resolution often fail to
distinguish them from hierarchical separators, thus resulting in
non-interoperable results while parsing relative references.
However, since query components are often used to carry identifying
information in the form of "key=value" pairs, and one frequently used
value is a reference to another URI, it is sometimes better for
usability to include those characters unescaped.
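
The delimiting rule in this definition can be illustrated with a short
sketch. The Python below is not part of the draft; the function name
split_query and the example URIs are assumptions made for illustration.
It only locates the component boundaries (from the first "?" up to the
first "#" or the end of the URI) and does not validate characters
against the query production.

```python
def split_query(uri):
    """Split a URI into (part before the query, query, fragment).

    The query and fragment are None when the corresponding delimiter
    is absent, so an empty query ("?" followed by nothing) can be told
    apart from a missing one.
    """
    # The fragment starts at the first "#"; a "?" after that point
    # belongs to the fragment, not to the query.
    rest, hash_sep, fragment = uri.partition("#")
    # The query starts at the first "?"; later "?" and "/" characters
    # are ordinary data inside the query.
    before, q_sep, query = rest.partition("?")
    return (before,
            query if q_sep else None,
            fragment if hash_sep else None)


# A query value that is itself a URI keeps its "/" and "?" unescaped.
parts = split_query("http://example.org/go?u=http://x/y?z#top")
```

Splitting on "#" before "?" is the design point: it keeps a question
mark that appears inside a fragment from being mistaken for the start
of a query.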
|
033-dot-segments | relativising an absolute reference should be invertible |
fixed 03 | relative |
report:
Tim Berners-Lee,
18 Nov 2002,
TAG meeting:
RFC2396 doesn't say that xxx/./yyy is equivalent to xxx/yyy for any
xxx and yyy. However, the only tenable situation is that they are
equivalent, because we require that any URI can be relative-ized and
absolute-ized back to its original. That is an (unspoken) axiom.
When you relative-ize things and re-absolutize them,
you cannot distinguish between the two, and so they HAVE to be
equivalent. The URI spec should say that.
We need to write down the axioms: if you take a URI, make it
relative w.r.t. a base URI, then make it absolute w.r.t. the same base
URI, you get the same starting URI...
http://www.w3.org/2002/11/18-tag-summary
|
action:
Roy T. Fielding,
17 Apr 2003,
issues list:
I would rephrase that as:
When you relativize an absolute URI (A) using base (B) producing
the relative reference (R), and then re-absolutize R using the same
base B to produce an absolute reference A', then A' must equal A.
|
report:
Tim Berners-Lee,
23 Jan 2003,
URI-WG mailing list:
The spec would do well to define the function from base and reference to
URI and back again
rel(u, base) and abs(u, base)
and to point out that you can use abs(rel(u, base), base) for u in all
circumstances.
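
The round-trip property requested here can be sketched as an
executable check. The Python below is illustrative only: rel, abs_
and round_trips are hypothetical names, the relativizer is one simple
strategy among many (no standard conversion is defined), and Python's
urllib.parse.urljoin stands in for the specification's resolution
algorithm. It assumes u has a non-empty path with no "." or ".."
segments.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def rel(u, base):
    """Return a reference that resolves to u against base (hypothetical)."""
    target, origin = urlsplit(u), urlsplit(base)
    if (target.scheme, target.netloc) != (origin.scheme, origin.netloc):
        return u  # nothing shared with the base: keep the absolute form
    t_segs = target.path.split("/")
    b_dir = origin.path.split("/")[:-1]  # base path minus its last segment
    # Count the leading segments shared by the base directory and the
    # target, never consuming the target's final segment.
    shared = 0
    while (shared < len(b_dir) and shared < len(t_segs) - 1
           and b_dir[shared] == t_segs[shared]):
        shared += 1
    path = "/".join([".."] * (len(b_dir) - shared) + t_segs[shared:])
    if path == "":
        path = "./"  # an empty reference would resolve to the base itself
    elif ":" in path.split("/")[0]:
        path = "./" + path  # a first segment with ":" would look like a scheme
    return urlunsplit(("", "", path, target.query, target.fragment))


def abs_(r, base):
    """Resolve reference r against base using the standard library."""
    return urljoin(base, r)


def round_trips(u, base):
    """True when abs(rel(u, base), base) gives back u unchanged."""
    return abs_(rel(u, base), base) == u
```

A URI that still contains "." or ".." segments would not survive this
round trip unchanged, which is the reason given above for treating
xxx/./yyy and xxx/yyy as equivalent.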
|
report:
Tim Bray,
25 Feb 2003,
URI-WG mailing list:
If I am given the URI http://example.com/a/./b/../c I will always,
100% of the time, regard that as http://example.com/a/c. I have just
verified that the first two randomly-picked web browsers I picked
in fact do this. So the assertion that this only applies to the
relative form is, I assert, simply wrong and should be removed.
I think you need to look more closely at what the browsers are doing.
They send the /../ and /./ stuff to the server, whereupon an httpd
will respond with a redirect to the correct URI.
Nope. Peering deep into my high-powered research lab... I created a
test file as follows:
foo <a href="http://example.com/a/./b/../c">foo</a> bar
I open it, put my mouse over the blue underlined "foo" and observe
what appears in the status-bar of the browser. Under OS X, in each of
IE, Mozilla, and Safari, the status bar shows http://example.com/a/c
and I'm pretty sure it doesn't call out to the server to check.
So I stand by my claim that deployed software normalizes /./ and /../
regardless of whether it's relative or absolute.
|
action:
Larry Masinter,
25 Feb 2003,
URI-WG mailing list:
Whether "a/./b/../c" in a path component is equivalent to
"a/c" is entirely dependent on the definition of
the URI scheme. Some schemes may define the two as
equivalent, others may not.
The current definition of the 'http' URI scheme
(in RFC 2616) does not specify this equivalence,
although apparently popular browsers will turn
http://example.dom/a/./b/../c into
http://example.dom/a/c before sending.
Do you think it should apply to all URI schemes
that use the "generic syntax"? "rtsp:"? "ldap:"?
What about schemes that use something like
the "generic syntax" but make modifications?
Note that mailto:a/./b/../@test.com sends a message
to a/./b/../@test.com, i.e., it doesn't process
them.
I'm having trouble telling what happens without
a protocol trace with
ftp://ftp.ietf.org/ietf/../ietf/00dec/, or
with ldap:.
But I think it is a good idea to resist the
tendency to jump from examination of the
behavior of http URIs to assert properties
of all URIs.
|
action:
Roy T. Fielding,
25 Feb 2003,
issues list:
I still get those segments in httpd access log files, but all we need
are two independent implementations to justify a change. I think it
is safe to remove them based on the theory that "/" is reserved for
the hierarchical syntax. I can't think of a real mailto example that
would break, since even distinguished-name-based addresses are not
going to have ".." or "." as a DN.
|
action:
Roy T. Fielding,
23 May 2003,
draft 02:
Defined "." and ".." path segments as being applicable to all URIs,
to be removed by resolvers and normalizers. Clearly defined that
a path segment including a colon cannot be used as the first segment
in a relative-path reference. The relative resolution process is
invertible, though I have not included a single process for doing so
because there is no agreed upon standard for converting absolute
references to a relative form.
|
report:
Rob Cameron,
25 May 2003,
URI-WG mailing list:
Following up on issue 033, the following equation is an
important test of consistent implementations:
resolve_relative_URI(compute_relative_URI(u, base), base) = u
To satisfy this for all examples in section 5.4 of rfc2396bis-02,
a change in the algorithm of section 5.2 is required.
The change is needed to ensure that the ./ and ../ processing
is applied even when a relative URI is a path starting with "/".
In my implementation:
if R_path[0] == "/": T_path = merge('', R_path[1:])
A parser and URI processing algorithms that pass all the tests
are available for examination and comment at the following URL.
http://www.cs.sfu.ca/~cameron/uri/URIbis2.py
The parser is based on abnf2re and the algorithms include
compute_relative_URI as described yesterday, with one bug fix
to handle the authorityless relative URI.
The test program is available as well.
http://www.cs.sfu.ca/~cameron/uri/bis2tests.py
|
action:
Roy T. Fielding,
03 Jun 2003,
draft 03:
Separated the path merge routine into two routines: merge, for
describing combination of the base URI path with a relative-path
reference, and remove_dot_segments, for describing how to remove
the special "." and ".." segments from a composed path. The
remove_dot_segments algorithm is now applied to all URI reference
paths in order to match common implementations and improve the
normalization of URIs in practice. This change only impacts the
parsing of abnormal references and same-scheme references wherein
the base URI has a non-hierarchical path.
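
A minimal Python sketch of the two routines named above. It is modelled on
the formulation later published in RFC 3986 section 5.2, so it is an
assumption that the draft 03 text behaves identically; the parameter
base_has_authority is a name chosen here for illustration.

```python
def merge(base_path, ref_path, base_has_authority=False):
    """Combine a base path with a relative-path reference."""
    if base_has_authority and base_path == "":
        return "/" + ref_path
    # Keep the base path up to and including its last "/", then append.
    return base_path[: base_path.rfind("/") + 1] + ref_path


def remove_dot_segments(path):
    """Remove the special "." and ".." segments from a composed path."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to output.
            cut = path.find("/", 1)
            if cut == -1:
                cut = len(path)
            output.append(path[:cut])
            path = path[cut:]
    return "".join(output)
```

Because remove_dot_segments is a separate routine, it can be applied to a
reference path that starts with "/" without going through merge at all.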
|
038-qualified | qualified production in hostname is ambiguous |
fixed 02 | hostname |
report:
Graham Klyne,
02 Feb 2003,
URI-WG mailing list:
Ref:
[[
hostname = domainlabel [ qualified ]
qualified = *( "." domainlabel ) [ "." toplabel [ "." ] ]
domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ]
toplabel = alpha [ 0*61( alphanum / "-" ) alphanum ]
alphanum = ALPHA / DIGIT
]]
I think the syntax production 'qualified' is ambiguous
(i.e. permits more than one parse tree for some valid values).
consider:
.abc.def
is this
"." <domainlabel> "." <toplabel>
or
"." <domainlabel> "." <domainlabel>
?
I think the production could be written thus:
qualified = *( "." domainlabel ) [ "." toplabel "." ]
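
The overlap behind the ambiguity can be checked mechanically. The following
is a rough Python sketch, assuming the two label productions quoted above
are transcribed directly into regular expressions; label_readings is a
hypothetical helper, not part of any cited implementation.

```python
import re

ALPHANUM = "[A-Za-z0-9]"
# [ 0*61( alphanum / "-" ) alphanum ]
TAIL = f"(?:[A-Za-z0-9-]{{0,61}}{ALPHANUM})?"
DOMAINLABEL = re.compile(f"{ALPHANUM}{TAIL}")
TOPLABEL = re.compile(f"[A-Za-z]{TAIL}")


def label_readings(label):
    """List every production that accepts the whole label."""
    readings = []
    if DOMAINLABEL.fullmatch(label):
        readings.append("domainlabel")
    if TOPLABEL.fullmatch(label):
        readings.append("toplabel")
    return readings
```

Any label that starts with a letter is accepted by both productions, which
is why a trailing ".def" admits two parse trees.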
|
report:
Clive D.W. Feather,
02 Mar 2002,
URI-WG mailing list:
Is this the only place "qualified" is used? If so, then there's a further
ambiguity - if a hostname consists only of a single domainlabel, is it
followed by a zero-length qualified or not. I would suggest that the
correct resolution is either:
hostname = domainlabel [ qualified ]
qualified = *( "." domainlabel ) "." toplabel [ "." ]
if you want to forbid hostnames like "abc.123", or:
hostname = domainlabel [ qualified ]
qualified = 1*( "." domainlabel ) [ "." ]
or
hostname = domainlabel [ qualified ] [ "." ]
qualified = 1*( "." domainlabel )
(these are not equivalent) if you want to allow them.
|
action:
Roy T. Fielding,
03 Mar 2003,
issues list:
Fixed in draft 01.
|
report:
Graham Klyne,
05 Mar 2002,
URI-WG mailing list:
I have to say that the 'hostname' syntax as specified in RFC2396bis is a pain
to parse accurately. I think it's sufficiently difficult to get exactly
right that it won't be correctly implemented as specified in many
applications, which leaves me wondering if it really should be so
fussily correct with respect to domain name usage.
(The reason I'm noticing this is that I've been using the URI parsing task
to experiment with some programming tools and techniques that offer a more
direct correspondence between specification and the source code. If I were
doing this as part of a real application, I would long ago have ignored the
detailed syntax and done something very similar but much easier to implement.)
The problem is in the production for 'qualified'. To determine whether an
incoming ".abc" is a 'domainlabel' or a 'toplabel' requires a significant
lookahead, to the following '.' (if present) and the character following
that. To determine if an incoming ".123" is valid can require an
arbitrarily long lookahead (e.g. http://0.123.4.5.6.7.8.9.10.11.12.13.x/).
I think parsing precisely according to the syntax would be greatly
simplified if the syntax were relaxed so that:
qualified = *( "." domainlabel ) [ "." ]
i.e. drop the syntactic prohibition of URIs like this:
http://www.example.123./foo
I appreciate this is not strictly correct, but I see no practical harm
from defining the syntax in this way and asserting the form of the final
domain label as an extra-syntactic constraint. A (limited) few tests
with my browser suggest that it does not syntactically prohibit numeric
top-level domain labels, but simply reports that the domain cannot be found.
...
If you really want to keep the syntactic constraint in place, I suggest
an alternative approach:
hostname = qualified
qualified = numericlabel "." qualified /
toplabel [ "." [qualified] ]
numericlabel = DIGIT [ 0*61( alphanum / "-" ) alphanum ]
...
I think there's a typo in the syntax production for 'toplabel':
s/alpha/ALPHA/ ?
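
A sketch of how the relaxed syntax might be handled in Python, assuming the
production qualified = *( "." domainlabel ) [ "." ] and treating the form of
the final label as a separate check. Both function names are illustrative.

```python
import re

# domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ]
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def parse_hostname(host):
    """Return the labels of host under the relaxed grammar, or None."""
    if host.endswith("."):
        host = host[:-1]          # optional trailing "."
    labels = host.split(".")
    if all(LABEL.fullmatch(label) for label in labels):
        return labels
    return None


def final_label_is_alphabetic(labels):
    """Extra-syntactic constraint: the last label starts with a letter."""
    return labels[-1][0].isalpha()
```

With this split no lookahead is needed: every label is matched the same
way, and only the last one is inspected afterwards.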
|
action:
Roy T. Fielding,
18 Mar 2003,
issues list:
Reopened. It would be best to have a syntax that was both unambiguous
and easy for LALR parsers to process, but that may require too many changes.
|
action:
Roy T. Fielding,
14 May 2003,
draft 02:
Fixed as suggested:
qualified = *( "." domainlabel ) [ "." ]
with additional text added for disambiguation of host.
|
action:
Roy T. Fielding,
10 Feb 2004,
draft 04:
All of these productions have been removed from draft 04 in favor
of the reg-name production.
|
041-encoding | Section 2 on encoding causes too much confusion |
fixed 04 | characters |
report:
Stefan Eissing,
31 Jan 2003,
URI-WG mailing list:
It is context dependent if '%61' can be considered equivalent
to the character 'a' or not. The argument basically is that RFC 2396 allows
other character encodings than US-ASCII and that '%61' could denote
basically any character unless the character encoding becomes known.
I argue that any 7-bit octet, escape-encoded in a URI, MUST
be equivalent (apart from reserved characters like %2f) to its
US-ASCII character. In my opinion, RFC 2396 already defines this:
In RFC 2396, Ch. 2.1
"In the simplest case, the original character sequence contains only
characters that are defined in US-ASCII, and the two levels of
mapping are simple and easily invertible: each 'original character'
is represented as the octet for the US-ASCII code for it, which is,
in turn, represented as either the US-ASCII character, or else the
"%" escape sequence for that octet."
In RFC 2396, Ch. 2.4.2:
"For example, "%7e" is sometimes used instead of "~" in an http URL
path, but the two are equivalent for an http URL."
According to this, my argument should be valid at least for HTTP URIs.
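
The equivalence argued for here can be illustrated with a small Python
sketch. It assumes the unreserved set as later fixed by RFC 3986 (letters,
digits, "-", ".", "_", "~"), which differs slightly from RFC 2396, and
normalize_escapes is a hypothetical name.

```python
import re
import string

# Assumption: unreserved characters as defined in RFC 3986.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def normalize_escapes(uri):
    """Decode escapes of unreserved characters; keep all other escapes."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved or non-ASCII octet: keep escaped, upper-case the hex.
        return "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)
```

Under this sketch "/dir/%61" and "/dir/a" compare equal after
normalization, while "%2f" stays escaped because "/" is reserved.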
|
report:
Martin Duerst,
22 Feb 2003,
URI-WG mailing list:
The characters in a URI (the ones that are compared character-by-
character in namespaces) are just that, characters. URIs are
defined independent of any particular representation. The URI
spec says that /dir/a and /dir/%61 are equivalent, independent
of the representation. They are equivalent if they appear in
ASCII. They are equivalent if they appear on paper, on the
side of a bus, and so on. They are equivalent when spoken
over the radio. And they are equivalent when encoded as UTF-16
(as your Java example shows) or in EBCDIC.
RFC 2396 gives three levels, condensed in the following line:
URI character sequence->octet sequence->original character sequence
In practice, there are two more layers, one on each side.
We then get:
a) substrate: paper, metal, audio waves, ascii, UTF-16, EBCDIC,...
We don't want to limit that to a particular encoding.
^
| conversion depending on substrate representation
V
b) URI character sequence (just characters)
^
| conversion defined by RFC 2396 (always US-ASCII!)
V
c) octet sequence (just octets)
^
| conversion currently scheme/server dependent, moving towards UTF-8
V
d) original character sequence (file names on server, query strings,...)
^
| conversion server-dependent
V
e) original octet sequence (e.g. UTF-16 for a filename on WinNT, EBCDIC
on an EBCDIC server, and so on)
Maybe this diagram should go into the new version of RFC 2396.
|
report:
Misha Wolf,
22 Feb 2003,
URI-WG mailing list:
The one piece of terminology I have some trouble with, and which
is already in RFC 2396, is the phrase "original character sequence".
Presumably, the sequence is "original" in the sense that the entity
managing the resource has used this character sequence (eg a file
pathname) to identify it. If that is the case, then the problem I
have is simply due to the, possibly selfish, perception that the
characters I enter into the browser's address box are the "original"
characters and that these are transformed in various ways before
arriving at the entity managing the resource. The direction of the
arrows in the RFC 2396 diagram strengthens this way of perceiving
the flow. I wonder whether some word other than "original" would
be clearer?
|
report:
Martin Duerst,
24 Feb 2003,
URI-WG mailing list:
I repeat: if I'm on an EBCDIC computer, and the URI reads out as
/dir/a, that is *different* from /dir/%61. Yes, this is egregiously
broken and stupid, but it's within the bounds set by RFC2396.
I agree that it may not be extremely clear. But I disagree that your
interpretation is within the bounds of RFC 2396. For example, in
"2. URI Characters and Escape Sequences", we have:
Within a URI, characters are either used as delimiters, or to
represent strings of data (octets) within the delimited portions.
Octets are either represented directly by a character (using the US-
ASCII character for that octet [ASCII]) or by an escape encoding.
This representation is elaborated below.
Now let's take your example, "/dir/a". Let's assume that's a directory
name 'dir' and a file name 'a' on a computer that uses EBCDIC.
We don't have to care about the '/' here, because this is a separator
that is part of the URI syntax, independent of local usage (see e.g.
MSWin).
So now let's look at how the ebcdic server exposes 'dir' and 'a'.
It can either decide to expose them as EBCDIC (which makes server
implementation easier) or to expose them as ASCII (which makes the
URI more readable).
If the server on the EBCDIC system decides to expose as EBCDIC,
then this will give us the following octets:
/<84><89><99>/<81>
This then results in an URI of /%84%89%99/%81. There is no other
choice, as we have in "2.4.1. Escaped Encoding"
An escaped octet is encoded as a character triplet, consisting of the
percent character "%" followed by the two hexadecimal digits
representing the octet code.
(Well, you could claim that instead of %84, it may also be %48, because
the RFC doesn't say which order the digits go, but I hope you don't
want to go there.) For an example that is a bit different, let's
say '/d+r/a', we would get /<84><4E><99>/<81> in terms of octets,
and then /%84N%99/%81 in the actual URI (because the RFC clearly
says that the octet <4E> is encoded with US-ASCII, which results in
an 'N'). We could also use /%84%4E%99/%81.
The other alternative is to expose the resource as US-ASCII,
i.e. have the conversion work being done on the server. In that
case, we have /<64><69><72>/<61>, which trivially results in
/dir/a. It could of course also result in /dir/%61, because
%61 is the escape for octet <61>. Please remember that it says:
Octets are either represented directly by a character (using the US-
ASCII character for that octet [ASCII]) or by an escape encoding.
This representation is elaborated below.
So overall, the server can make the choice of how to expose a
resource name as a series of octets. But it doesn't have a choice
to expose the resource name as one octet if the octet is escaped,
and as another octet if the octet is not escaped.
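
The two ways of exposing a name can be reproduced with Python's standard
library. This is only a sketch: it assumes code page 037 ("cp037") as the
EBCDIC variant, and expose is a hypothetical helper.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def expose(name, charset):
    """Map a path segment to octets in charset, then to URI characters."""
    # Octets that are unreserved US-ASCII characters appear directly;
    # every other octet is written as a "%" escape.
    return quote_from_bytes(name.encode(charset), safe="")


ebcdic_dir = expose("dir", "cp037")    # server exposes EBCDIC octets
ascii_dir = expose("dir", "ascii")     # server converts to US-ASCII first
```

Either way, an escaped octet and the same octet written directly as its
US-ASCII character decode to identical bytes, so the choice of escaping
never changes which octet is meant.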
|
report:
Tim Bray,
07 Mar 2003,
URI-WG mailing list:
In explaining matters of character encoding, section 2.1 envisions
something sort of standing behind the URI, the phrase original character
is used (occasionally in quotes), as well as "original character sequences"
(not in quotes). So maybe there's a notion of an "original URI" hiding
behind the URI?
This is confusing because the "original URI" might differ from the actual
URI because
(a) it contains ASCII characters which are reserved, e.g. '/' or '%'
(b) it contains non-ASCII characters
(c) it contains non-character octets
A question: what gets painted on the side of a bus? The URI or the
"original" behind it? The answer is probably "The URI", except for
case (b), when it might become an IRI with the native non-ASCII characters
appearing on the side of the bus.
(c) is kind of confusing and counter-intuitive, but is the only way
I can explain the baffling language about mapping from characters to
octets, and the phrase in 2.2 "The data for a URI component".
If section 2 were redrafted as follows, all the ambiguity and hand-waving
would be squashed like a bug.
===============================================
2. Characters and URIs [New title]
A URI consists of a restricted set of characters, primarily chosen to
aid transcribability and usability both in computer systems and in
non-computer communications. Characters used conventionally as delimiters
around a URI are excluded. The restricted set of characters consists
of digits, letters, and a few graphic symbols chosen from those common
to most of the character encodings and input facilities available to
Internet users.
uric = reserved / unreserved / escaped
Within a URI, characters are either used as delimiters or to represent
strings of data (octets) within the delimited portions.
[Same as now except lose last 2 sentences.]
2.1 Encoding of Characters
In the general case, there are many mappings between characters as
abstractions comprising the smallest atomic units of text and the octets
used to store them in a computer. The US-ASCII standard specifies not
only a set of characters but a particular mode of storage where each
character's numeric value (in the range 0-127) is stored directly in a
single octet. Note that many widely deployed systems for storing
characters which include non-ASCII characters nonetheless store ASCII
characters as specified by US-ASCII directly one per octet. This
includes Shift-JIS, EUC, UTF-8, and ISO-8859 (all parts).
This RFC does not mandate the use of any particular mapping between
its character set and octets of computer storage.
2.2 The Characters in the URI Scheme
The "scheme" part of a URI consists of a sequence of ASCII characters
which represent nothing except themselves.
2.3 The Characters in Non-Scheme Parts of the URI
The ASCII characters making up a component of a URI other than the scheme
may represent an arbitrary sequence of octets. The definitions of URI
schemes MUST specify the interpretation of the characters in the
components of URIs of that scheme. There are some constraints on these
interpretations:
- The interpretation MUST conform to the productions in this RFC, i.e.
cannot rely on using a character which is forbidden to appear in
the component.
- The interpretation must be consistent: two instances of a URI
component which are equal in length and made of pairwise-identical
ASCII characters MUST represent the same octets.
- The character "%" MUST always be followed by two hexadecimal values
encoding the numeric value of a single octet. The hexadecimal
digits 'A' through 'F' are used identically to the digits 'a' through
'f', so that two URI components which differ only in the case of
hexadecimal digits used in %-encoded octets may safely be considered
identical.
2.4 Textual URIs
Many schemes may wish to constrain the components of URIs to encode
textual data, consisting only of characters from Unicode (ISO10646).
This section describes a procedure for encoding textual data for use
in URIs. Schemes which describe textual URIs SHOULD use the procedure
described in this section to generate URI components from textual data.
- ASCII characters which may legally appear in the component MUST
appear directly as themselves, i.e. 'a' may not be encoded as %61.
- ASCII characters which may not legally appear in the component MUST
be %-encoded using the numeric value specified by the US-ASCII
standard, using the upper-case hexadecimal digits 'A' - 'F'. i.e.
'<' must always be encoded as %3C.
- Non-ASCII characters MUST be converted to a sequence of octets as
specified by UTF-8, with each octet then %-encoded. I.e. Ç
(U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA) must always be encoded
as %C3%87.
===============================
I think most of the rest of section 2 is pretty well OK.
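
The procedure proposed for textual URIs can be approximated with Python's
urllib.parse.quote. This is a sketch only: it assumes that just the
characters quote treats as always safe may appear directly in the component,
whereas a real scheme would define that set per component.
encode_text_component is a hypothetical name.

```python
from urllib.parse import quote


def encode_text_component(text):
    """%-encode text for a URI component via UTF-8, upper-case hex."""
    # safe="" means no reserved character is passed through unescaped.
    return quote(text, safe="", encoding="utf-8")
```

Letters and digits pass through unchanged, "<" becomes %3C, and a non-ASCII
character becomes one escape per UTF-8 octet.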
|
report:
Stefan Eissing,
10 Mar 2003,
URI-WG mailing list:
There is also chapter 1.5 (Transcribability) which uses the term URI both for
the thing on the side of a bus and a string conforming to the EBNF rules of the RFC.
Also, the last sentence of 1.5 should probably also be removed since now 6.3
recommends UTF-8 usage.
|
action:
Roy T. Fielding,
02 May 2003,
draft 02:
Section 2.1 has been rewritten. I started with the text provided by
Tim Bray, but soon found that much of it was simply repeating information
that is explained better in later sections. So, I replaced it with the
meat of the explanation (what should happen when characters get encoded),
added the simplest recommendation for UTF-8, and specified the rest by
clarifying the appropriate sections below it.
|
043-same-scheme | Should reference resolver ignore scheme if same as base URI? |
closed | relative |
report:
Rob Cameron,
30 May 2003,
URI-WG mailing list:
From [draft 02] section 4.1, I infer that the difference between a validating
and non-validating parser is that the former confirms that the syntax
of individual URI components precisely matches the specified
grammar, while the latter simply breaks the URI into its
components. This makes sense.
But in section 5.2, the algorithm for resolving a relative
reference seems to suggest that different behaviours are
required depending on whether "parse" is validating or
nonvalidating. An example at the end of 5.4.2 also refers
to this, drawing a distinction between "validating parsers" and
"backwards compatibility."
"http:g" = "http:g" ; for validating parsers
/ "http://a/b/c/g" ; for backward compatibility
Read literally, the spec could be interpreted to mean that
a new implementation of a nonvalidating parser should
actually produce "backwards-compatible" behaviour.
I'm wondering whether the intent is really to suggest
that the behaviour that produces "http://a/b/c/g" is
deprecated, but that implementors should be aware of this
behaviour in older implementations.
I was also trying to track down the genesis of this behaviour
in RFC 1630, but find only that in "the context of URI
magic://a/b/c//d/e/f" g:h expands as g:h. Is there another
reference?
Perhaps the algorithm of section 5.2 could be simplified,
leaving the caveat on deprecated behaviour as a footnote.
|
action:
Roy T. Fielding,
03 Jun 2003,
URI-WG mailing list:
Yes, that was a poor choice of words. What I wanted to say was
that an application testing for invalid links should consider
those references to be invalid because their interpretation will
be inconsistent.
I'm wondering whether the intent is really to suggest
that the behaviour that produces "http://a/b/c/g" is
deprecated, but that implementors should be aware of this
behaviour in older implementations.
A little more than that. Some browser implementations have
insisted that this behavior is necessary for backward compatibility,
even though it has a negative impact on parsing non-hierarchical
schemes. However, I've managed to tweak the algorithm for abnormal
reference parsing to fix that, so maybe we should just restore
the deprecated behavior.
For now, I have changed it to "strict" instead of "validating".
I was also trying to track down the genesis of this behaviour
in RFC 1630, but find only that in "the context of URI
magic://a/b/c//d/e/f" g:h expands as g:h. Is there another
reference?
Where 1630 says:
The rules for the use of a partial name relative to the URI of the
context are:
If the scheme parts are different, the whole absolute URI must
be given. Otherwise, the scheme is omitted, and:
many implementations (including CERN/W3C libwww) interpreted that
as meaning the scheme parts are ignored if they are the same.
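
The difference between the two readings can be sketched in Python. The
function resolve and its strict flag are illustrative assumptions; the
sketch leans on urllib.parse.urljoin for the ordinary relative-path case.

```python
from urllib.parse import urljoin, urlsplit


def resolve(base, ref, strict=True):
    """Resolve ref against base, optionally ignoring a same-scheme prefix."""
    ref_scheme = urlsplit(ref).scheme
    if ref_scheme:
        if strict or ref_scheme != urlsplit(base).scheme:
            return ref                      # treated as an absolute URI
        # Non-strict: drop "scheme:" and resolve the rest as relative.
        ref = ref[len(ref_scheme) + 1:]
    return urljoin(base, ref)
```

With base "http://a/b/c/d;p?q", the reference "http:g" stays "http:g" when
strict is true and becomes "http://a/b/c/g" when it is false.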
|
action:
Roy T. Fielding,
04 Jun 2003,
issues list:
The changes already made to draft 03 for the remove_dot_segments
algorithm make it possible to remove the exception altogether
and define the "backward compatible" result as the standard.
The reason this could not be done in the past was because a document
with an opaque base URI, such as "this:top", would have similar
references within it mangled by the parser: e.g., an absolute ref
to "this:that" would be forced into "this:/that". Since this is
no longer the case with the new algorithm, we could fix this once
and for all by specifying the loophole result as the standard.
|
action:
Roy T. Fielding,
06 Jun 2003,
issues list:
Closed due to lack of interest.
|
044-empty-path | no path is defined for scheme://ABCD?query |
fixed 05 | bnf |
report:
Mark Thomson,
08 Jun 2003,
URI-WG mailing list:
"A path is always defined for a URI, though the defined path may
be empty (zero length) or opaque (not containing any "/" delimiters)"
The production for net-path says that abs-path is optional, so for a URI like
http://ABCD?query
we have both abs-path and rel-path undefined and not empty and therefore
path would be undefined. Do we still have to assume that path is empty
even when both abs-path and rel-path are undefined ?
or is the above statement from the draft incorrect ?
|
action:
Roy T. Fielding,
08 Jun 2003,
URI-WG mailing list:
The statement is correct, but I'll need to fix the ABNF so
that it always ends up with a matching production.
|
report:
Rob Cameron,
11 Jun 2003,
URI-WG mailing list:
I've been playing with an experimental grammar modification that
addresses this problem and also addresses the following
additional wrinkle: http://ABCD+y is a legal URI according to
the ABNF (as translated to regexps by abnf2re).
parseURI('http://ABCD+y')
('http', None, '//ABCD+y', None, None)
That is, because ABCD+y is not a legal authority, the
regular expression matching rules for http://ABCD+y backtrack
to accept //ABCD+y as a path.
To address both the problem reported by Mark and the
problem above, I have found that there may be merit
to simplifying the URI production to directly reflect the
opening statement of section 3:
"The generic URI syntax consists of a hierarchical sequence of
components referred to as the scheme, authority, path, query, and
fragment."
URI = scheme ":" ["//" authority] path [ "?" query ] [ "#" fragment ]
This rule reflects the five-component structure and the statement
that a path always exists, even if it is empty. It can be made
to work with either of the two following definitions of path:
path = abs-path / rel-path
path = segment *( "/" segment )
Running a parser based on either of these changes with
all the test cases listed in section 5.4 (both normal and
abnormal examples) gives precisely the same results as
with the grammar of bis-02 or bis-03. (By the way, it
might be good to have some IPv6 literals in the test
examples.)
On the problem case of http://ABCD+y, the following results.
parseURI('http://ABCD+y')
('http', 'ABCD', '+y', None, None)
Arguably, this is a better parse if http://ABCD+y is to be
accepted as a URI. It is also a better parse if http://ABCD+y
is to be ruled out by the extra-grammatical restriction: "when
an authority exists, the path must either be empty or an
abs-path." (Alternatively, "when an authority exists, the
first segment of the path must be empty.")
Overall, I think the theme of grammar simplification reflected in the
change from bis02 to bis03 is a good idea. One other
area that could use some attention is the grammar of IPv6
literals.
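
The point that a path is always defined, even when empty, can be seen with
the component-splitting regular expression from Appendix B of RFC 2396. The
Python sketch below is not the abnf2re-based parser discussed above: it
treats everything between "//" and the next "/", "?" or "#" as the
authority without checking its syntax, so it does not reproduce the
http://ABCD+y results. split_uri is a hypothetical name.

```python
import re

# Regular expression from RFC 2396, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Return (scheme, authority, path, query, fragment)."""
    m = URI_RE.match(uri)
    # Undefined components are None; the path group always matches,
    # so an absent path comes back as the empty string.
    return (m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))
```

For http://ABCD?query this gives an empty string for the path and None only
for the fragment, matching the five-component reading.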
|
report:
Mark Thomson,
12 Jun 2003,
URI-WG mailing list:
One other simpler modification to the grammar to address the problem
of the path being undefined is:
net-path = "//" authority net-path-suffix
net-path-suffix = ["/" path-segments]
path = abs-path / rel-path / net-path-suffix
Of course, //ABCD+y in http://ABCD+y would still be parsed as a path.
|
action:
Roy T. Fielding,
09 Feb 2004,
draft 04:
I have followed Rob Cameron's suggestion for draft 04, since it greatly
simplifies the grammar (and the associated text that had to explain it).
URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]
path = segment *( "/" segment )
As a result, almost all of the intermediate rules are no longer useful
and have been removed, including hier-part, net-path, abs-path, rel-path,
and path-segments.
|
action:
Roy T. Fielding,
02 Apr 2004,
draft 05:
Upon further review, the above change had to be reverted in favor
of five separate definitions for path.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
/ path-abs
/ path-rootless
/ path-empty
relative-URI = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
/ path-abs
/ path-noscheme
/ path-empty
path = path-abempty ; begins with "/" or is empty
/ path-abs ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-abs = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nzc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nzc = 1*( unreserved / pct-encoded / sub-delims / "@" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
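
As a rough check of how the five path rules partition the possibilities,
they can be transcribed into Python regular expressions. The character
classes for unreserved and sub-delims are taken from RFC 3986, which is an
assumption about draft 05; matching_rules is a hypothetical helper.

```python
import re

# Assumption: unreserved and sub-delims as defined in RFC 3986.
NC = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=@]|%[0-9A-Fa-f]{2})"      # no ":"
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"
TAIL = rf"(?:/{PCHAR}*)*"                                     # *( "/" segment )

RULES = {
    "path-abempty": re.compile(TAIL),
    "path-abs": re.compile(rf"/(?:{PCHAR}+{TAIL})?"),
    "path-noscheme": re.compile(rf"{NC}+{TAIL}"),
    "path-rootless": re.compile(rf"{PCHAR}+{TAIL}"),
    "path-empty": re.compile(r""),
}


def matching_rules(path):
    """Names of all path rules that accept the whole string."""
    return [name for name, rule in RULES.items() if rule.fullmatch(path)]
```

A path beginning with "//" is accepted only by path-abempty, which is the
rule that appears solely after an authority, and a first segment containing
":" is rejected by path-noscheme.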
|
045-double-slash | inconsistent handling of .//g relative references |
closed | relative |
report:
Rob Cameron,
13 Jun 2003,
URI-WG mailing list:
It appears that there is a semantic bug in the URI spec in the
resolution of relative URIs. Consider resolution of the
relative URI ".//g" in the context of the base URI "f:/a".
Should the result be "f:/.//g" or "f://g"?
It seems to me that only "f:/.//g" makes sense; the URI "f://g"
re-interprets the path segment "g" as an authority component.
However, the spec and three reference implementations all seem
to agree that "f://g" is the answer specified.
uripath.py
join('f:/a','.//g')
'f://g'
URI.hs (Graham Klyne)
URI> absoluteUriPart "f:/a" ".//g"
"f://g"
URIbis3.py (me)
resolve_relative_URI('f:/a', './/g')
'f://g'
At least part of the solution seems to me that the
recomposition of parsed URIs (section 5.3) must be
modified to deal with the potential ambiguity. If the
authority is undefined and the path begins "//", then
"/." must be appended to the result before the path is.
|
report:
Mark Thomson,
14 Jun 2003,
URI-WG mailing list:
In addition to the semantic bug, I think the assumption that the base URI
doesn't contain dot-segments in its path should be dropped. If a URI has
an absolute path beginning with a "//" and doesn't have an authority,
then the absolute path must be written as /.//...
Another example where the target URI would be invalid is if the
relative URI is scheme:/.//ff or scheme:/..//ff and the parser is strict
or the parser is non-strict and base URI's scheme != relative URI's scheme.
One more thing. If the relative URI has a scheme then, regardless of
the base URI, the target URI will equal the relative URI for a strict
parser. Should the algorithm fail in this case if the base URI is
illegal (doesn't have a scheme) even though the target URI has nothing
to do with the base URI? (i.e., should the assumption "only the scheme
component is required to be present in the base URI" be dropped in this case?)
|
report:
Rob Cameron,
17 Jun 2003,
URI-WG mailing list:
Below I suggest modifications to section 5.3 that ensure correct URI
construction for all (scheme, authority, path, query, fragment) 5-tuples.
Motivation:
Suppose that an infostructure is to be moved from h://a/b/c/d to f:/d with all
links made relative.
It is not inconceivable that the document at h://a/b/c/d contains URIs like
h://a/b/c//e and h://a/b/c/this:that
In the first case, the following relative_URI calculation may be performed.
(URIbis3.py)
compute_relative_URI('h://a/b/c/d', 'h://a/b/c//e')
'.//e'
This relative URI is fine when resolved with respect to the original base,
applying the algorithm of 5.2.
resolve_relative_URI('h://a/b/c/d', './/e')
'h://a/b/c//e'
But when interpreted relative to f:/d, we have a problem.
resolve_relative_URI('f:/d', './/e')
'f://e'
Here e has been erroneously interpreted as an authority.
build_URI('f', None, '//e', None, None) should ensure that "/."
is prepended to path.
In the second case, the computation of a relative URI might attempt the
following construction: build_URI(None, None, 'this:that', None, None)
yielding "this:that" rather than "./this:that" as mentioned at the end of
section 4.2
For example, uripath.py exhibits this behaviour.
refTo('h://a/b/c/d', 'h://a/b/c/this:that')
'this:that'
With only slight modifications to 5.3, these ambiguities of URI
construction can be avoided.
if defined(scheme) then
append scheme to result;
append ":" to result;
endif;
if defined(authority) then
append "//" to result;
append authority to result;
endif;
if defined(path) then
if defined(authority) then
if path is neither empty nor begins with "/" then
error('an absolute or empty path is required')
endif
elsif path begins "//" then append "/." to result
elsif not defined(scheme) and
the first path segment contains ":" then
append "./" to result
endif;
append path to result
else
error('undefined path')
endif;
if defined(query) then
append "?" to result;
append query to result;
endif;
if defined(fragment) then
append "#" to result;
append fragment to result;
endif;
return result;
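
The pseudocode above translates fairly directly into Python. In this sketch
undefined components are represented by None, the errors become ValueError,
and build_uri is named after the build_URI calls mentioned earlier; it is
not the code from URIbis3.py.

```python
def build_uri(scheme, authority, path, query, fragment):
    """Recompose a URI reference from its five parsed components."""
    if path is None:
        raise ValueError("undefined path")
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        if path and not path.startswith("/"):
            raise ValueError("an absolute or empty path is required")
        result += "//" + authority
    elif path.startswith("//"):
        # Keep the path from being re-read as an authority.
        result += "/."
    elif scheme is None and ":" in path.split("/", 1)[0]:
        # Keep the first segment from being re-read as a scheme.
        result += "./"
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

For example, build_uri('f', None, '//e', None, None) yields "f:/.//e"
rather than "f://e".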
|
report:
Mark Thomson,
18 Jun 2003,
URI-WG mailing list:
What about URIs that have paths beginning with // but can't be interpreted
as an authority (e.g., scheme://@@). Do we have to add a /. also?
(Of course, not doing so complicates both the resolution and
recomposition algorithms)
|
action:
Roy T. Fielding,
05 Feb 2004,
issues list:
I don't know if handling bad relative URI references is worth the hassle.
|