
Re: Gen-ART LC Review of draft-wilde-text-fragment-06

2007-02-20 05:50:26
Hi, Martin,

Thanks for the quick response. Now I can remember what I was thinking when I wrote the review...

I deleted everything that we're already good on. Rest is inline.

Spencer

2.5.  Fragment Identifier Robustness

  Hash sums may specify the character encoding that has been used when
  creating the hash sums, and if such a specification is present,
  clients MUST check whether the character encoding specified for the
  hash sum and the character encoding of the retrieved MIME entity are
  equal, and clients MUST NOT check the hash sum if these values
  differ.  However, clients MAY choose to transcode the retrieved MIME
  entity in the case of differing character encodings, and after doing
  so, check the hash sum.  Please note that this method is inherently
  unreliable, because certain characters or character sequences may
  have been lost or normalized due to restrictions in one of the
  character encodings used.

Spencer: I have a concern about using MAY to allow clients to check reliability in an inherently unreliable way. I would prefer at least SHOULD NOT.

I agree that at first, this looks a bit scary, and in general, is a bad
idea. But I don't think this is a big concern in this case in practice.
The failure cases of this method are highly skewed towards false negatives
(transcoding back to what the charset information in the fragment ID says
doesn't match) as opposed to false positives (a match despite the fact that
the document has actually changed). This should be obvious for MD5 hashes,
and also applies to length 'hashes'. In the length case, there is a basic
risk of false positives independent of character encoding anyway
(the document gets changed, but with the same exact resulting length).
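
To make the false-negative argument concrete, here is a minimal Python sketch. It assumes the MD5 sum is computed over the bytes of the entity in the charset named in the fragment identifier, and it uses made-up sample text; the helper names are hypothetical and not taken from the draft.

```python
import hashlib


def md5_hex(text, charset):
    """MD5 over the text encoded in the given charset (assumed hashing model)."""
    return hashlib.md5(text.encode(charset)).hexdigest()


def check_after_transcoding(retrieved, retrieved_charset, hash_charset, expected):
    """Transcode the retrieved bytes to the hash's charset, then compare MD5 sums."""
    text = retrieved.decode(retrieved_charset)
    return md5_hex(text, hash_charset) == expected


# Hypothetical document: the curly quotes do not exist in ISO-8859-1.
original = "na\u00efve \u201cquote\u201d"
expected = md5_hex(original, "utf-8")

# The same document served as ISO-8859-1; unrepresentable characters are lost.
served = original.encode("iso-8859-1", errors="replace")
lossy_ok = check_after_transcoding(served, "iso-8859-1", "utf-8", expected)

# Text that survives the round trip still matches after transcoding.
plain = "na\u00efve quote"
lossless_ok = check_after_transcoding(
    plain.encode("iso-8859-1"), "iso-8859-1", "utf-8", md5_hex(plain, "utf-8")
)

# The character count is unaffected by the lossy replacement.
same_length = len(served.decode("iso-8859-1")) == len(original)
```

Here the lossy case fails the MD5 check although the document was never edited (a false negative), while the length comparison still passes because the replacement preserves the character count.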

Do you agree that this can stay as is? Or do you think some wording
change would make it easier to understand that this as such isn't a
big risk?

("This fails in theory, but usually works in practice"? Nice :-)

Thanks for the insights here. Trying to think about this specific case... The types of actions you describe in the document (cursor placement, text highlighting)

- would not be taken at all without the fragment identifier, and

- would not be taken if the client does not understand the fragment identifier, but

- would be taken if the client understands fragment identifiers and there was no "hash" present, and

- would be taken if the client understands fragment identifiers, sees a fragment identifier with a hash, and does the transcoding-plus-hash-check operation described here.

Let me suggest text, but please read it critically.

"Please note that this method is inherently unreliable, because certain characters or character sequences may have been lost or normalized due to restrictions in one of the character encodings used. Most hash value mismatches may be 'false negatives' - the hash fails because of the transcoding operation, not because of a problem with the fragment identifier."

3.  Fragment Identification Syntax

  The syntax for the fragment identifiers is straightforward.  The
  syntax defines four schemes, 'char', 'line', 'match', and hash (which
  can either be 'length' or 'md5').  The 'char' and 'line' schemes can
  be used in two different variants, either the position variant (with
  a single number), or the range variant (with two comma-separated
  numbers).  The 'match' scheme has a regular expression as its
  parameter, which must be specified as a string with escaped
  semicolons (because the semicolon is used to concatenate multiple
  fragment identification scheme parts).  The hash scheme can either
  use the 'length' or the 'md5' scheme to specify a hash value.
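
As a rough illustration of the structure described above, the following Python sketch splits a fragment identifier into scheme parts. It assumes each part is written as name=value, and it does not handle escaped semicolons inside a 'match' expression because the escaping mechanism is not given in this excerpt. The function name and the sample inputs are hypothetical.

```python
def parse_fragment(fragment):
    """Split a fragment identifier into (scheme, variant, value) tuples.

    Assumption: parts are joined with ';' and each part is 'name=value'.
    Escaped semicolons (needed for 'match') are NOT handled in this sketch.
    """
    parts = []
    for part in fragment.split(";"):
        name, _, value = part.partition("=")
        if name in ("char", "line"):
            numbers = [int(n) for n in value.split(",")]
            if len(numbers) == 1:
                variant = "position"  # a single number
            elif len(numbers) == 2:
                variant = "range"     # two comma-separated numbers
            else:
                raise ValueError("expected one or two numbers: " + part)
            parts.append((name, variant, numbers))
        else:
            # 'match', 'length', 'md5': kept as uninterpreted strings here
            parts.append((name, "raw", value))
    return parts


single = parse_fragment("char=100")
combined = parse_fragment("line=10,20;length=9876")
```

With these inputs, `single` holds one position part and `combined` holds a line range followed by an uninterpreted length part.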

Spencer: The use of the word "hash" to describe the length of a resource in characters violates the Principle of Least Astonishment. Could "length" and "md5" not be grouped together, just for ease of understanding?

This is a good point. I'm a bit reluctant to make all the changes,
which would be quite extensive, but will try to do so if you insist.
An alternative is to make it much clearer in the text that talking about
length as a 'hash' may be misleading. (We really use it as a hash, but it
is only a very, very weak, but on the other side extremely cheap, hash).

Before making these changes, please see what your AD thinks! ("Last Call comments", and all that) At least adding some text that explains this would be less astonishing...

4.3.  Handling of Hash Sums

  Clients are not required to implement the handling of hash sums, so
  they MAY choose to ignore hash sum information altogether.  However,
  if they do implement hash sum handling, the following applies:

  If a fragment identifier contains a hash sum, and a client retrieves
  a MIME entity and detects that the hash sum has changed (observing
  the character encoding specification as described in Section 3.2, if
  present), then the client SHOULD NOT interpret any other text/plain

Spencer: why SHOULD NOT, and not MUST NOT?

In many cases (e.g. additions to the end of a file), the fragment id
may still be valid. In other cases (e.g. small edits shifting things
by a character or two), the user still may find the right place.
So going ahead is not always completely useless, and therefore we
wanted to give implementations some leeway to do what seems to
work best in their context (e.g. an interactive application vs.
something like an automatic extractor).

  fragment identifier scheme part.  A client MAY signal this situation
  to the user.
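
The leeway discussed here could look roughly like the Python sketch below. The policy shown (an interactive client goes ahead and warns, an automatic extractor skips the fragment) is only one hypothetical reading of the SHOULD NOT, not something the draft prescribes; all names are invented for illustration.

```python
from enum import Enum


class Action(Enum):
    APPLY = "apply fragment"
    APPLY_AND_WARN = "apply fragment and warn user"
    IGNORE = "ignore other scheme parts"


def on_hash_result(hash_matches, interactive):
    """Choose what to do with the remaining scheme parts (hypothetical policy)."""
    if hash_matches:
        return Action.APPLY
    if interactive:
        # A human can judge whether the highlighted place is still right,
        # so this client knowingly departs from the SHOULD NOT and warns.
        return Action.APPLY_AND_WARN
    # An automatic extractor has nobody to notice a shifted position.
    return Action.IGNORE


interactive_choice = on_hash_result(hash_matches=False, interactive=True)
extractor_choice = on_hash_result(hash_matches=False, interactive=False)
```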

SHOULD NOT would be fine with me if you add a sentence or two explaining the risks for human-in-the-loop clients versus automatic extractor clients. Gen-ART reviewers usually aren't questioning SHOULD/NOTs, we're usually asking for help in understanding the tradeoffs in the document.

