ietf

RE: Last Call: draft-klensin-net-utf8 (Unicode Format for Network Interchange) to Proposed Standard

2008-01-14 06:26:45

John C Klensin wrote:

Kent,

I will try to address the comments that are essentially
editorial after the Last Call closes, but you have raised a few
points that have been discussed over and over again (not just in

I raised a number of non-editorial issues that you did not
address below...

the net-utf8 context) and that I think are worth my responding
to now (in addition to comments already made by others).  FWIW,
I'm working on several things right now that are of much higher
priority to me than net-utf8, so, especially since I think that
authors should generally be quiet during Last Call and let other
responses accumulate, comments from me are likely to come very
slowly.

I think this document is at least one draft, maybe two or three
drafts, away from being of sufficient clarity and of sufficient
quality to become a standards document. In addition you state that
you don't have time right now to deal with this. I would therefore
suggest that the document be withdrawn from last call, to allow
time for clearing up the document.
 

To be "restrictive in what one emits and permissive/liberal in
what one receives" might be applicable here.

What we have consistently found out about the robustness
principle is that it is extremely useful if used as a tool for
interpreting protocol requirements. Otherwise it is useful only
in moderation.  When we have explicit requirements that
receivers accept garbage, we have repeatedly seen the
combination of those requirements with the robustness principle
used by senders to say "we can send any sort of garbage and you
are required to clean it up".   That does not promote either
interoperability or a smoothly-running network.

The question of why net-utf8 expects receivers to clean up
normalization but does not expect them to tolerate aberrant
line-endings is a reasonable one, for which see below.  But I

I would not refer to most other line separators (that is how they
are best seen) as in any way "aberrant". Except for RS, GS, FS and
IND, they are not aberrant, and they are in no way malformed,
irregular, or any such. I agree that the situation is not ideal.
But I do think it is perfectly manageable without undue effort.

I do, however, regard using pure CR or BS to achieve accenting
or underlining to be highly aberrant, malformed, irregular and
unmanageable. (But that aberrant zoo you want to keep...)

think your invocation of the robustness principle is
inappropriate.

I'm not sure why...

Upon receipt, the following SHOULD be seen as at least line
ending (or line separating), and in some cases more than that: 

LF, CR+LF, VT, CR+VT, FF, CR+FF, CR (not followed by NUL...),
NEL, CR+NEL, LS, PS
where
LF  U+000A
VT  U+000B
FF  U+000C
CR  U+000D
NEL U+0085
LS  U+2028
PS  U+2029

even FS, GS, RS
where
FS  U+001C
GS  U+001D
RS  U+001E
should be seen as line separating (Unicode specifies these as
having bidi property B, which effectively means they are
paragraph separating).
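
As a rough illustration of the receiving side, the list above can be written as a single regular expression. This is only a minimal Python sketch, not text from the draft: the names `LINE_SEP` and `split_lines` are hypothetical, and leaving CR+NUL untouched is an assumption based on the "CR (not followed by NUL...)" remark above.

```python
import re

# Separators to recognise on receipt, per the list above (hypothetical name).
# CR followed by LF, VT, FF or NEL counts as ONE separator; a lone CR counts
# too, unless it is followed by NUL (assumed to be the NVT "bare CR" case).
LINE_SEP = re.compile(
    r"\r[\n\x0b\x0c\x85]"              # CR+LF, CR+VT, CR+FF, CR+NEL
    r"|\r(?!\x00)"                     # lone CR, not followed by NUL
    r"|[\n\x0b\x0c\x85\u2028\u2029]"   # LF, VT, FF, NEL, LS, PS
    r"|[\x1c\x1d\x1e]"                 # FS, GS, RS (bidi class B)
)

def split_lines(text: str) -> list:
    """Split text at every recognised separator, dropping the separators."""
    return LINE_SEP.split(text)
```

Python's built-in `str.splitlines()` recognises the same individual characters, but it would report CR+VT, CR+FF and CR+NEL as two line breaks each, which is why the pairs are matched explicitly here.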


There is also a security issue associated with this.  When there
is a single standard form, we know how to construct digital
signatures over the text.   When there are lots of things that
are to be "treated as" the same, and suggestions that various
systems along the line might appropriately convert one form into
another, methods for computing digital signatures need their own
canonicalization rules and rules about exactly when they are
applied.  That can be done, but, as I have suggested in several
other places in this note, why go out of our way to make our
lives more complex?

You already have NFC as a SHOULD not a SHALL. Which makes your
argument here entirely moot.

Apart from CR+LF, these SHOULD NOT be emitted for net-utf8,
unless that is overridden by the protocol specification (like
allowing FF, or CR+FF). When faced with any of these in input
**to be emitted as net-utf8**, each of these SHOULD be
converted to a CR+LF (unless that is overridden by the
protocol in question).
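
The emitting side of this rule might look like the following. It is a sketch under assumptions, not a definitive implementation: `to_crlf` is a hypothetical name, FS/GS/RS are left out of the conversion, and CR+NUL is passed through unchanged.

```python
import re

# Separators to rewrite before emitting (hypothetical helper). Pairs that
# start with CR are matched first so that each collapses to one line end.
_ANY_SEP = re.compile(
    r"\r[\n\x0b\x0c\x85]"              # CR+LF, CR+VT, CR+FF, CR+NEL
    r"|\r(?!\x00)"                     # lone CR, not followed by NUL
    r"|[\n\x0b\x0c\x85\u2028\u2029]"   # LF, VT, FF, NEL, LS, PS
)

def to_crlf(text: str) -> str:
    """Rewrite every recognised line separator as CR+LF."""
    return _ANY_SEP.sub("\r\n", text)
```

The conversion is idempotent, but it is lossy: after it has run, a former PS can no longer be told apart from a former LF.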

While I may not have succeeded (and, if I didn't, specific
suggestions would be welcome), the net-utf8 draft was intended
to be very specific that it didn't apply to protocols that
didn't reference it, nor was its use mandatory for new
protocols.

Agreed. But that is not related to any of my comments.

 That means that a protocol doesn't need to
"override" it; it should just not reference it.  Yes, I think it
makes a new protocol harder to write if it doesn't reference
net-utf8 than if it does.  It may also generate "do you really
need to do this" pushback against such protocols and against
protocols that try "all of this _except_" references to this
document.  That is, IMO, as it should be.  But, if other forms
can be justified for particular applications, then they should
be.   

I think it is perfectly reasonable for a protocol to define a
"profile" of Net-UTF-8; e.g. saying that "[use Net-UTF-8] except that
FF and CR+FF are allowed, and that FF is converted to CR+FF [while
normally those would have been converted to CR+LF]." Note: that
was only an example.
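
To make the example profile concrete, here is a small Python sketch of it. The function name `emit_with_ff_profile` is hypothetical, and the profile itself is only the illustration given above, not something the draft defines.

```python
import re

# Same separator set as for plain conversion to CR+LF (assumed, see above).
_SEP = re.compile(
    r"\r[\n\x0b\x0c\x85]"              # CR+LF, CR+VT, CR+FF, CR+NEL
    r"|\r(?!\x00)"                     # lone CR, not followed by NUL
    r"|[\n\x0b\x0c\x85\u2028\u2029]"   # LF, VT, FF, NEL, LS, PS
)

def emit_with_ff_profile(text: str) -> str:
    """Example profile: FF and CR+FF become CR+FF, all else becomes CR+LF."""
    def replace(match):
        # A separator ending in FF is kept as a page break, written as CR+FF.
        return "\r\x0c" if match.group().endswith("\x0c") else "\r\n"
    return _SEP.sub(replace, text)
```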

It just makes no sense, at least to me, to include a lot of text
in a spec whose purpose is to tell applications that don't use
the spec what to do.

Of course not. I did not say so either.

You have made an exception for FF (because they occur in
RFCs?).

We made an exception for FF --while cautioning against its use--
because it is permitted in NVT and fairly widely in text streams
and because some reasonable interpretations of its semantics are
moderately well-understood.   On the other hand, it comes with
some cautions and, if there were consensus to remove the
exception, I wouldn't personally hesitate to do that.

See my original comment, and above, for how I think this should be
resolved.

I think FF SHOULD be avoided, just like VT, NEL, and
more (see above).

I think the cautions about use of FF are just about that strong,
but it does have significant current use (albeit not as an Internet
text-stream line separator).   One could put HT with it and
explain that it should be used only when its interpretation (as
a fixed number of spaces or a jump to a well-established column)
is known, but, because those are rarely known, I/we made another
choice.

Even when it is allowed, it, and CR+FF,
should be seen as line separating.

That has never been permitted in the protocols that reference,
even implicitly, NVT.  Why make things more complicated now by
(i) introducing more flexibility for its own sake and putting
more burden on receivers and (ii) giving bodies of text that use
FF a different interpretation under this specification than they

FF is always line separating (though it has been rarely used,
fortunately): if you change page, the line preceding the change of
page is of course ended (though the paragraph need not end).
Even when interpreted as an "empty line", it is line separating
(as it should be).

have under NVT.  Deliberately introducing incompatibilities, it
seems to me, requires much more justification than  added
flexibility.

You have also (by implication) dismissed HT, U+0009. The
reason for this is unclear. Especially since HT is so common
in plain texts (often with some default tab setting). Mapping
HT to SPs is often a bad idea. I don't think a default tab
setting should be specified, but the effect of somewhat (not
wildly) different defaults for that is not much worse than
using variable width fonts.

But you have just summarized the reasons for avoiding HT.  We

No I have not. You tried to give some arguments, but I'm far from
persuaded. (Nor, it seems, is Frank Ellerman.) In practice the "problems"
are not worse than those of using variable width fonts. And that is
just about an absolute necessity when going beyond Latin/Greek/Cyrillic,
and also commonly used for Latin/Greek/Cyrillic, however detrimental
it is for "ASCII art". Unless the default tab setting is really
wacko, which it usually isn't.

don't have any standard that would give it unambiguous
semantics.  There is no way to incorporate tab column settings
(in characters or millimeters) in a text stream, so one can't

Yes there is:

0088;<control>;Cc;0;BN;;;;;N;CHARACTER TABULATION SET;;;;

**NOT** that I suggest using that! Definitely not! I'm just pointing
out that there **is** an already defined control code for setting
tab stops, that however has clear disadvantages, and is outdated.

even disambiguate with an in-band option.  That makes HT
appropriate in marked-up text (which might or might not have
better ways to specify what is wanted) or when options are being
transmitted out of band, but not in running text streams.   If
there is consensus that this needs to be addressed more
explicitly in the document, we can try to do so.

I (and apparently at least also Frank Ellerman) think that HT
should be allowed in Net-UTF-8. The default settings for tab stops
for plain text seem to work well.

You silently seem to suggest that the original HT should be replaced
by one or more spaces. But how many spaces in each instance? I think
it would be better to keep the HT (which I agree is not an ideal
character, but it is very common) as is. Note that I do NOT suggest
to ever replace spaces with HT. Doing that would be a really bad
idea (but still seen sometimes, with ill effects; like for the
subject line I "got" for this message...).
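
The "how many spaces" question can be made concrete with a few lines of Python. This is only a sketch: `spaces_per_tab` is a hypothetical helper, and it assumes evenly spaced tab stops and one column per character (fixed-width display, no combining marks).

```python
def spaces_per_tab(line: str, tab_width: int) -> list:
    """Return how many spaces each HT in `line` would expand to."""
    counts = []
    column = 0
    for ch in line:
        if ch == "\t":
            # Pad up to the next multiple of tab_width.
            pad = tab_width - (column % tab_width)
            counts.append(pad)
            column += pad
        else:
            column += 1  # assumption: every other character is one column wide
    return counts

print(spaces_per_tab("ab\tcdef\tg", 4))  # [2, 4]
print(spaces_per_tab("ab\tcdef\tg", 8))  # [6, 4]
```

The same HT stands for a different number of spaces depending on both its position and the tab setting, so a sender that replaces it with spaces has to guess at both.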

B.t.w. many programs (long ago) had a bug that deleted the
last line if it was not ended with an LF.

Not that long ago.  I discovered, at a recent IETF meeting, a
printer and printer driver that would drop the entire last page
of a document, or drop the document entirely, if it didn't end
in CR LF.  I think the technical term for that is "nasty bug",
not something that requires protocol changes.

As the current Net-UTF-8 draft is written, that printer behaviour
seems entirely within what is permissible (for printing, say,
Net-UTF-8 plain text documents).

What do you expect to happen if line separators other than CR+LF
are used? Rejection of the text, error messages, ignoring/deleting
them, treating them as spaces, or what?


As an additional
comment, I think that the Net-UTF-8 document should state
that the last line need not be ended by CR+LF (or any other
line end/separator), though it should be. This is just as a
matter of normalising the line ends for Net-UTF8, not for
UTF-8 in general.

So now End of Document (however that is expressed) is also a
line-ending?  

Unless the text piece is used as a fragment (to be inserted/appended
to something else), end-of-document without explicit line-end should
end the (last) line, rather than be an error.

(And if there is a page paradigm, end-of-document (for a "complete"
document, i.e. not used as a fragment) also implies end-of-last-page,
even if there is no FF at the end.)

Unfortunately, as we have discovered many times
with email, an implied line-ending gets one into lots of trouble
about just when it is implied and, in particular, whether
digital signatures should be computed with the normal
line-ending sequence inserted as implied or over the document as
sent.   Again, these problems are much more easily dealt with by
specifying explicitly what is to be put on the wire, making the
sending system convert things to that format as needed, and
treating bugs as bugs rather than justification for making the
standard forms more complex.
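
The signature concern is easy to reproduce. In the Python sketch below, `terminate_last_line` and `digest` are hypothetical helpers, SHA-256 stands in for whatever a real signature scheme would hash, and the text is assumed to use CR+LF line ends already.

```python
import hashlib

def terminate_last_line(text: str) -> str:
    """Append CR+LF if a non-empty text does not already end with it."""
    if text and not text.endswith("\r\n"):
        return text + "\r\n"
    return text

def digest(text: str) -> str:
    """SHA-256 over the UTF-8 octets, as a stand-in for a signature input."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

as_sent = "first line\r\nlast line"
as_implied = terminate_last_line(as_sent)

# The two forms display the same lines but hash differently, so signer and
# verifier must agree on which form the digest is computed over.
print(digest(as_sent) == digest(as_implied))  # False
```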


As for the receiving side, the same considerations as for the
(SHOULD) requirement (point numbered 4 on page 4) for NFC in
Net-UTF-8 apply. The receiver cannot be sure that NFC has
been applied. Nor can it be sure that conversion of all line
endings to CR+LF (thereby losing information about their
differences) has been applied.

This is, at least to me, a more interesting problem.  On the one
hand, there are no constraints due to backward compatibility
with NVT.  On the other, there are at least two real constraints:

(i) There is not a single normalization form.  Four are
standardized and others, for more or less specific purposes, are
floating around (e.g., without getting tied up in terminology
niceties about what is a normalization form and what is
something else, nameprep uses, to a first order approximation,
NFKC+lowercasing).  There has never been a clear recommendation
as to which one should be used globally (The Unicode Standard
discusses considerations and tradeoffs... quite appropriately,
IMO).  In order to avoid chaos, some systems and packages force
particular normalizations on whatever passes through them (by
contrast, I'm not aware of anything that goes out of its way to
convert CRLF into NEL.  From a Unicode standpoint, it would make

Conversion to EBCDIC usually converts line endings (like CRLF) to NEL.
It is not absolutely certain that NEL is converted to another
line ending/separation upon conversion to a non-EBCDIC encoding.

Unicode normalisation conversion is also, in principle and as yet,
much less likely than line ending conversion.

more sense to convert CRLF to U+2028 (which, strangely to me,
doesn't appear on your list above) but, again, AFAIK, no one does
that as a matter of routine either).   The net result of this is

U+2028 does occur in my list above. IIRC it is used as "native"
line separator in at least one system (SymbianOS). Some programs
in other systems can also save files using LS as line separator.

that, if we have a string that starts out in some normalization
form (even NFC) that is then passed across the network, it may
then end up in the hands of the receiving subsystem in, e.g.,
NFD.  So it is important, pragmatically and whether we like it
or not, that the receiver check or apply normalization
regardless of what requirements we make on the sender.  The
digital signature issues are similar -- if one wants two bodies
of text that are considered equivalent to have the same
signature value, normalization is required to get even near
equivalency.  Put differently, treating a body of text that is
unnormalized on receipt as a bug to be rejected just doesn't
make practical sense, while treating text strewn with assorted
characters that might be line-ends (or, in some cases, might be
something else) as such a bug does.
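
A receiver-side check of this kind can be sketched in Python with the standard `unicodedata` module (`is_normalized` needs Python 3.8 or later). The names `receive` and `digest` are hypothetical, and SHA-256 is only a stand-in for a real signature input.

```python
import hashlib
import unicodedata

def receive(text: str) -> str:
    """Return the text in NFC, normalising only when the check fails."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

def digest(text: str) -> str:
    """SHA-256 over the UTF-8 octets."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

sent_nfc = "caf\u00e9"       # precomposed e-acute, as the sender emitted it
arrived_nfd = "cafe\u0301"   # the same text after an NFD pass in transit

print(digest(sent_nfc) == digest(arrived_nfd))           # False
print(digest(sent_nfc) == digest(receive(arrived_nfd)))  # True
```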

thanks for the comments, thoughts, and careful reading.

As I mentioned, I think the document we are discussing needs a few
more drafts before it may be in good enough shape to be reissued
as "Last Call".


        /Kent Karlsson


_______________________________________________
Ietf mailing list
Ietf(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/ietf
