
Re: Last Call: draft-klensin-unicode-escapes (ASCII Escaping of Unicode Characters) to BCP

2007-10-21 22:22:35
I have a terminological objection to this draft, mainly in section 2. I also 
have some other comments on section 2, which follow after that.

First, terminology: the heading for section 2 has "...Table Position...", and 
the body refers to "code point position in the table". While the term "code 
table" could have been used in the Unicode Standard to refer to the encoded 
entities and their encoding, it is not.

The Unicode Standard uses these terms:

- It uses "character set" and "character repertoire" for the collection of 
elements being encoded, and "coded character set" for the set of pairs of such 
elements and their encoded representations.

- It uses "codespace" to refer to a range of numeric values used as encoded 
representations, and specifically "Unicode codespace" for the range 0 to 10FFFF 
(hex).

- It uses "code point" or "code position" (synonyms) for values in the Unicode 
codespace.

Thus, the appropriate term here is simply "code point" or "code position". 
"Table position" and "position in the table" are not appropriate since the 
Standard never uses "table" in this regard. And "code point position" is 
redundant. Perhaps the wording was attempting to differentiate between code 
points and various encoded representations of code points. But the latter are 
not code points per se, so there isn't really any ambiguity.

A possible refinement might be to use "Unicode Scalar Value": this refers to 
code points other than surrogate code points. By definition in the Standard, 
encoded characters can only be assigned to a Unicode Scalar Value. I don't see 
this as a necessary change in the draft, however.
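
To make the distinction between "code point" and "Unicode Scalar Value" 
concrete, here is a minimal Python sketch. The function and constant names are 
my own and purely illustrative; only the numeric ranges come from the 
definitions summarized above (codespace 0 to 10FFFF, surrogate code points 
D800 to DFFF).

```python
# Hypothetical helper names; only the ranges reflect the Unicode definitions.
MAX_CODE_POINT = 0x10FFFF   # upper bound of the Unicode codespace
SURROGATE_FIRST = 0xD800    # first surrogate code point
SURROGATE_LAST = 0xDFFF     # last surrogate code point


def is_code_point(value: int) -> bool:
    """True if value lies in the Unicode codespace (0 to 10FFFF)."""
    return 0 <= value <= MAX_CODE_POINT


def is_scalar_value(value: int) -> bool:
    """True if value is a code point other than a surrogate code point."""
    return is_code_point(value) and not (
        SURROGATE_FIRST <= value <= SURROGATE_LAST
    )
```

Under these definitions D800 is a code point but not a scalar value, while 
110000 is neither.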


Now for other comments on section 2.

The draft has:

  "However, when
   information about characters is to be processed by people,
   information about the Unicode code point is preferable to a further
   encoding of the encoded form of the character."

Information about the code point? (The code point of that character is numeric 
/ is an integer / is non-negative / is in the range 0 to 10FFFF / is even / is 
divisible by 17 / is the same value as the number of days the song "Hey Jude" 
was on the Top 40 list.) I think it is the code point itself that is to be 
preferred, not information about it.

Also, "a further encoding of the encoded form" isn't going to be clear to 
readers. (I'm not sure myself what these words mean; I can guess at what the 
author meant, but am not positive.)

Thus, I'd change this text to:

  "However, when
   information about characters is to be processed by people,
   reference to the Unicode code point is preferable to encoded
   representations of the code point."


Now, section 2 is talking about alternate representations of an encoded 
character, but the flow is a bit mixed up, IMO. The first paragraph says that 
there are different equivalent representations but that the Unicode code point 
is preferred. Then the next paragraph revisits the same thing in more detail. 
The sentence from the first paragraph discussed above, once revised so that it 
makes a clear statement, already says what paragraph two says in greater 
detail. Whether a more succinct or more detailed statement is preferred, just 
say it once.

Of course, if the more detailed paragraph two is kept, "code point position in 
the table" should be changed to "code point".

Also from paragraph two:

   "the UTF-8
   encoding or some other short-form encoding"

The term "short-form encoding" isn't explained here and may not be understood. 
I can only guess what is meant. If the intended meaning is what I think (a 
reference to shortest-form versus non-shortest-form UTF-8), then I don't think 
it's really relevant. Either way, I'd change the wording to:

   "the UTF-8 encoding or some other encoding form"

(Encoding form is a term defined in the Unicode Standard.)
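
In case it helps readers unfamiliar with the shortest-form issue I am guessing 
at, here is a small Python sketch. It assumes only the standard library's 
UTF-8 decoder, which rejects non-shortest-form sequences; the helper name and 
the sample octets are my own illustration, not anything taken from the draft.

```python
def decodes_as_utf8(octets: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the octet sequence."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+002F SOLIDUS in its shortest (one-octet) UTF-8 form.
shortest = b"\x2f"

# The same value padded into two octets: a non-shortest form,
# which a conforming decoder must not accept.
overlong = b"\xc0\xaf"
```

Here `decodes_as_utf8(shortest)` is true and `decodes_as_utf8(overlong)` is 
false. Either way the code point being referred to is 002F, which is why I 
don't think the distinction matters for this paragraph of the draft.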

Also:

   "the other encodes the octets of"

I don't think octets are encoded; they are simply referenced using some 
notational system. Thus, change to:

   "the other uses the octets of ... in some representation."

(This gives parallel wording for the two kinds of reference.)
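
As an illustration of the two kinds of reference, the following Python sketch 
prints both for a single character. The function names and the output 
notations ("U+" plus hex digits, and space-separated hex octets) are 
assumptions chosen for readability; they are not the escape syntax the draft 
proposes.

```python
def code_point_reference(ch: str) -> str:
    """Refer to a character by its Unicode code point, e.g. 'U+00E9'."""
    return f"U+{ord(ch):04X}"


def utf8_octet_reference(ch: str) -> str:
    """Refer to a character by the octets of its UTF-8 encoding form."""
    return " ".join(f"{octet:02X}" for octet in ch.encode("utf-8"))


if __name__ == "__main__":
    ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
    print(code_point_reference(ch))   # U+00E9
    print(utf8_octet_reference(ch))   # C3 A9
```

The first form names the code point directly; the second depends on which 
encoding form was chosen, which is the reason the code point is the better 
thing to show to people.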

Finally:

   "the Unicode code point forms"

Drop "forms":

   "the Unicode code points"




Peter Constable

_______________________________________________
Ietf mailing list
Ietf(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/ietf
