RE: Last Call: draft-klensin-net-utf8 (Unicode Format for Network Interchange) to Proposed Standard

2008-01-14 06:26:45
John C Klensin wrote:

--Frank Ellermann wrote:
...
Hopefully somebody can confirm that IND is correct, or not.
For HT and FF I hope the final version will somehow express
that both are not really bad, and as far as they're bad FF is
worse than HT. 

See http://www.itscj.ipsj.or.jp/ISO-IR/077.pdf, which, somewhat
to my surprise, says that IND is an LF clone. However, IND has
long been deprecated, and never got any noticeable use, and is even
REMOVED from ECMA 48. So I think it is safe to ignore IND. Indeed,
I would prefer it not be mentioned in the document we're discussing.

(I would like to say the same about NEL, but NEL is alive
and the native line separator/terminator in EBCDIC-based systems,
and may escape as NEL rather than be converted to something else.)

I'm open to consensus about changes for either HT or FF, but the
theory of "bad" that was used to construct the spec was:

(i) If a "spacing" control has the effect of setting the
position of the next character, it is "bad" unless that position
is unambiguous.   In addition, things are "bad" unless they are
necessary in running text (as distinct from faking things that
are better handled in markup, followed by either device-specific
output or standard page representations, neither of which are
normal text).

There is also another issue. If HT is converted to (presumably)
a sequence of SP, you will mess up bidi text. (See one of the
other mails I sent at about the same time as this one.)

It is unambiguous for SP.  It is unambiguous for CRLF.
Independent of the "what is a line-end" problem, it is somewhat
ambiguous for CR or LF alone and for IND.  It is ambiguous for

Even though IND was, for some strange reason, defined as an LF
clone, it has long been deprecated, and AFAIK never saw any
popular use. I think it is best left forgotten and left in silence.
Note also that it is not only deprecated, but even REMOVED from
ISO/IEC 6429 (ECMA 48).

HT.  It would be ambiguous for FF except that FF is assigned
fairly clear semantics in NVT -- "FF" is not a line ending

Of course it is a line ending. So is "raw" LF. That the new line
(under some circumstances) may be strangely indented is irrelevant.

(CRLF FF is needed)

That is a combination I haven't heard of before and I DON'T
think it should be regarded as one NLF. There are TWO NLFs there,
CRLF and then FF.
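
A minimal sketch of that counting, in Python. The set of characters
treated as NLFs here (CRLF, CR, LF, NEL, FF) is an assumption made for
illustration, not something taken from the draft; split_nlfs is a
hypothetical helper name.

```python
import re

# Assumption: the NLFs recognised are CRLF, CR, LF, NEL (U+0085) and
# FF (U+000C). CRLF is listed first so the pair counts as ONE NLF.
NLF = re.compile("\r\n|[\r\n\x85\x0c]")

def split_nlfs(text):
    """Return the NLF sequences found in text, in order of appearance."""
    return NLF.findall(text)

# CRLF followed by FF yields two separate NLFs, not one combined NLF.
print(split_nlfs("end of page one\r\n\x0cstart of page two"))
```

Under these assumptions the sample string contains two NLFs: the CRLF
that ends the line, then the FF.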

and as Bob Braden noted, there is a fairly clear
rule that FF is to be interpreted as "top of next page" if one

Sure. But the line before it is also ended (no matter where the
top of next page line begins).

knows what a page is and as "blank line" otherwise.  But that
rule is sufficiently often ignored to call for considerable
caution about FF, and the text now contains a cautionary note
for that reason.

I agree that there should be caution, but not in the shape and
form it has in the draft we are discussing.

There is an interesting demonstration of the law of unintended
consequences here.  If we could tell that a string was
unambiguously UTF-8 (or whatever) by looking at it, even if it
contains nothing but ASCII characters, then there would be no
reason to try to make net-utf8 a proper superset of NVT.  If we

I don't see why you really need to carry on the (unworkable in
a more general setting than ASCII, in particular it is unworkable
for the UCS) idea of using carriage return and BS for strange
overstriking. Even for ASCII, the ONLY aspect of that that worked
moderately well was using <BS, _> (or similar) for underlining.
But note also that underlining can be achieved also in the UCS
(without using kludges like <BS, _> for that) without the use
of a higher level protocol by instead using U+0332, COMBINING LOW
LINE. Though using a higher level protocol for getting underlining
is preferable (consider searching),  COMBINING LOW LINE would
still be much preferable over <BS, _> (or similar).
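
To make the comparison concrete, here is a rough Python sketch. The
helper names are hypothetical, and it assumes the overstrike idiom is
exactly "character, BS, underscore"; it also ignores grapheme clusters,
so it is an illustration rather than a complete converter.

```python
import re

COMBINING_LOW_LINE = "\u0332"  # U+0332 COMBINING LOW LINE

def underline(text):
    """Underline by appending U+0332 after every code point of text."""
    return "".join(ch + COMBINING_LOW_LINE for ch in text)

def overstrike_to_combining(text):
    """Rewrite <c, BS, _> as <c, U+0332>; other text is left alone."""
    return re.sub(r"([^\x08])\x08_",
                  lambda m: m.group(1) + COMBINING_LOW_LINE,
                  text)

def strip_underline(text):
    """Remove U+0332 so that plain-text searching works again."""
    return text.replace(COMBINING_LOW_LINE, "")

legacy = "a\x08_b\x08_"                    # 'ab' underlined with <BS, _>
modern = overstrike_to_combining(legacy)   # same text using U+0332
print(modern == underline("ab"), strip_underline(modern))
```

The searching concern still applies to the U+0332 form, but stripping
one combining mark is simpler than interpreting BS sequences.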

could do that, we could also do away with the entire "next line"
debate by prohibiting even CRLF and requiring the use of LS

LS would be a bad idea. See my other email (sent at approx. the
same time as this one). You would get (to you) unexpected effects
from bidi processing.
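
As background only (this is not a reconstruction of the argument in the
other mail): the characters under discussion carry different
Bidi_Class values in the Unicode Character Database, which can be
inspected from Python's standard unicodedata module. The table of
names below is my own selection.

```python
import unicodedata

# Assumption: these are the controls and separators relevant to the thread.
CONTROLS = {
    "HT": "\t", "SP": " ", "CR": "\r", "LF": "\n",
    "FF": "\x0c", "NEL": "\x85", "LS": "\u2028",
}

def bidi_classes(chars=CONTROLS):
    """Map each name to the Bidi_Class of its character."""
    return {name: unicodedata.bidirectional(ch) for name, ch in chars.items()}

for name, cls in bidi_classes().items():
    print(name, cls)
```

HT is a segment separator (S) while SP is whitespace (WS), so replacing
one with the other changes the input to the bidi algorithm. Likewise
CR and LF are paragraph separators (B) while LS is whitespace (WS).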

                /Kent Karlsson


(U+2028).  In retrospect, there might have been considerable
advantages to forcing the ASCII/UTF-8 distinction by requiring
that UTF-8 strings all start with a BOM, but it is far too late
for that (and probably not, on balance, a good idea despite its
advantages).  So I don't see how to get there from here -- we
are stuck, for historical reasons, with CRLF on the wire as what
The Unicode Standard calls NLF (incidentally, Unicode 5.0,
Section 5.8, provides significant insight into the complexity of
this problem and probably should have been referenced.  It would
be even more helpful had Table 5-2 included identifying CRLF as
a standard Internet "wire" form of NLF, not just binding that
form to Windows.)