
Re: Troubles with UTF-8

2006-01-03 23:00:28
On 12/23/05, Tom.Petch <sisyphus(_at_)dial(_dot_)pipex(_dot_)com> wrote:

A) Character set.  UTF-8 implicitly specifies the use of Unicode/IS10646
which contains 97,000 - and rising - characters.  Some (proposed)
standards limit themselves to 0000..007F, which is not at all
international, others to 0000-00FF, essentially Latin-1, which suits many
Western languages but is not truly international.  Is 97,000 really
appropriate or should there be a defined subset?


Why should there be a subset? You really, really don't want to get into a
debate about which script is more important than another.

B) Code point.  Many standards are defined in ABNF [RFC4234], which allows
code points to be specified as, e.g., %b00010011, %d13 or %x0D, none of
which is terribly Unicode-like (U+000D).  The result is standards that use
one notation in the ABNF and a different one in the body of the document;
should ABNF allow something closer to Unicode (as XML has done with
&#x000D;)?


Following RFC4234, Unicode code point U+ABCD will just be represented as
%xABCD.

I do not see the problem you mention, or am I missing something?
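
For what it is worth, the mapping between the two notations is purely
mechanical.  Here is a minimal sketch in Perl (the helper name
abnf_to_uplus is mine, not anything defined by RFC 4234):

    use strict;
    use warnings;

    # Convert an ABNF hex terminal value such as %x0D into the
    # Unicode U+ notation for the same code point.
    sub abnf_to_uplus {
        my ($abnf) = @_;
        my ($hex) = $abnf =~ /^%x([0-9A-Fa-f]+)$/
            or die "not an ABNF hex value: $abnf";
        return sprintf 'U+%04X', hex $hex;
    }

    print abnf_to_uplus('%x0D'),   "\n";   # U+000D
    print abnf_to_uplus('%xABCD'), "\n";   # U+ABCD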


C) Length.  Text is often variable in length so the length must be
determined.  This may be implicit from the underlying protocol or explicit
as in a TLV.  The latter is troublesome if the protocol passes through an
application gateway which wants to normalise the encoding so as to improve
security and wants to convert UTF to its shortest form with corresponding
length changes (Unicode lacks a no-op, a meaningless octet, one that could
be added or removed without causing any change to the meaning of the
text).


While simple byte counting obviously won't give you the accurate length of
the text (since one Unicode character may be represented by one or more
bytes), it is fairly trivial to write a script to count the length of the
text accurately. Heck, Perl 5.6 onwards even supports Unicode natively.
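
For example, with a reasonably modern Perl and the core Encode module,
counting characters instead of octets is only a couple of lines (the
sample string below is mine):

    use strict;
    use warnings;
    use Encode qw(decode);

    my $octets = "na\xC3\xAFve";             # UTF-8 octets for "naive" with a diaeresis
    my $chars  = decode('UTF-8', $octets);   # decode the octets into characters
    printf "%d octets, %d characters\n",
        length($octets), length($chars);     # prints: 6 octets, 5 characters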


Other protocols use a terminating sequence.  NUL is widely used in *ix;
some protocols specify that NUL must terminate the text, some specify that
it must not, one at least specifies that embedded NUL means that text
after a NUL must not be displayed (interesting for security).  Since UTF-8
encompasses so much, there is no natural terminating sequence.


NUL is defined in Unicode, btw, but I am digressing. You already started
off on the wrong foot if you think of UTF-8 as some sort of programming
encoding scheme rather than what it is: an encoding scheme for a character
repertoire.
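
As an aside, one property of UTF-8 that helps here is that the octet 0x00
only ever encodes U+0000 itself; it never occurs inside a multi-octet
sequence, so a NUL terminator cannot collide with part of another
character.  A quick Perl sketch (the sample text is mine):

    use strict;
    use warnings;
    use Encode qw(encode);

    # One-, two-, three- and four-octet characters; none of them
    # produces a 0x00 octet when encoded as UTF-8.
    my $text  = "A\x{00E9}\x{0905}\x{10400}";
    my $bytes = encode('UTF-8', $text);
    printf "contains a 0x00 octet? %s\n",
        ($bytes =~ /\x00/) ? "yes" : "no";   # prints: no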


D) Transparency.  An issue linked to C), protocols may have reserved
characters, used to parse the data, which must not then appear in text.
Some protocols prohibit these characters (or at least the single octet
encoding of them), others have a transfer syntax, such as base64,
quoted-printable, %xx or an escape character ( " \ %).  We could do with a
standard syntax.


In those cases, Unicode U+ABCD or ABNF %xABCD do nicely. Why do we need
another one?
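
If a concrete illustration helps, this is roughly what the %xx transfer
syntax Tom mentions looks like when applied to the UTF-8 octets of a
string.  A small Perl sketch (the unreserved set follows RFC 3986, and the
helper name is mine):

    use strict;
    use warnings;
    use Encode qw(encode);

    # Percent-encode every octet outside the RFC 3986 unreserved set,
    # operating on the UTF-8 encoding of the string.
    sub pct_escape {
        my $octets = encode('UTF-8', shift);
        $octets =~ s/([^A-Za-z0-9._~-])/sprintf('%%%02X', ord($1))/ge;
        return $octets;
    }

    print pct_escape("caf\x{00E9}"), "\n";   # prints: caf%C3%A9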

E) Accessibility.  The character encoding is specified in UTF-8 [RFC3629]
which is readily accessible (of course:-) but to use it properly needs
reference to IS10646, which is not.  I would like to check the correct
name of eg hyphen-minus (Hyphen-minus, Hyphen-Minus, ???) and in the
absence of IS10646 am unable to do so.


In the absence of a dictionary, I couldn't understand most of the words
you used in an RFC. OMG, what should I do?

http://www.unicode.org/charts/
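
And if you would rather look it up programmatically than from the charts,
Perl's core charnames module will give you the formal name (which, as it
happens, is HYPHEN-MINUS):

    use charnames ();

    print charnames::viacode(0x002D), "\n";   # prints: HYPHEN-MINUS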

-James Seng