perl-unicode

Re: Warnings on illegal UTF8

1998-10-22 16:51:43
Gisle Aas writes:
Problem with this proposal is that it would make length() much slower.
Today it only looks at the first byte of each UTF-8 byte-sequence and
then skips all the 10xx xxxx bytes without looking at them.  
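
To make the cost argument concrete, here is a minimal Python sketch (not Perl's actual implementation) of the two ways of counting. It assumes the proposal amounts to checking every continuation byte; the names seq_len, fast_length and checked_length are made up for illustration.

```python
def seq_len(lead: int) -> int:
    """Sequence length implied by the lead byte alone (1..7 bytes)."""
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        raise ValueError("continuation byte where a lead byte was expected")
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5
    if lead < 0xFE:
        return 6
    if lead == 0xFE:
        return 7
    raise ValueError("0xFF is not a lead byte")


def fast_length(buf: bytes) -> int:
    """Count chars by reading lead bytes only; trailing bytes are skipped unseen."""
    count = i = 0
    while i < len(buf):
        i += seq_len(buf[i])
        count += 1
    return count


def checked_length(buf: bytes) -> int:
    """Count chars, but also verify every trailing byte is 10xx xxxx."""
    count = i = 0
    while i < len(buf):
        n = seq_len(buf[i])
        tail = buf[i + 1:i + n]
        if len(tail) != n - 1 or any(b & 0xC0 != 0x80 for b in tail):
            raise ValueError(f"malformed sequence at byte {i}")
        i += n
        count += 1
    return count


text = "na\u00efve \u20ac".encode("utf-8")   # 7 chars, 10 bytes
print(fast_length(text), checked_length(text))
print(fast_length(b"\xc3("))   # counts the bad pair as one char without noticing
```

The checked variant touches every byte of the string, while the fast one touches one byte per char. Note that even the checked variant does not catch multiply-encoded chars.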

Which reminds me: what is the *official* position on the bit width of
utf8?  As opposed to the Perl one, which treats utf8 as encoding 36-bit
values?
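
For reference, the two numbers in play (31 and 36) fall out of the byte layout: each trailing byte carries 6 bits, and the lead byte carries whatever is left after its length prefix. A small Python sketch of the arithmetic follows; payload_bits is a made-up name, and this does not settle what the official position is.

```python
def payload_bits(n_bytes: int) -> int:
    """Value bits carried by an n-byte sequence, for n in 1..7."""
    if not 1 <= n_bytes <= 7:
        raise ValueError("sequence length must be 1..7")
    if n_bytes == 1:
        return 7                      # 0xxxxxxx
    lead_bits = 7 - n_bytes           # 110xxxxx has 5, ..., 1111110x has 1, 0xFE has 0
    return lead_bits + 6 * (n_bytes - 1)


for n in range(1, 8):
    print(n, "bytes:", payload_bits(n), "bits")
```

Six bytes give 1 + 5*6 = 31 bits; allowing the 0xFE lead byte and six trailing bytes gives 0 + 6*6 = 36 bits.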

36 is a pathologically low number of bits to have: say, some iso2022
escape sequences may carry an unbounded number of bits (compare with
ANSI color escape sequences), and one cannot encode inband data using
utf8, since it has no loopholes for extension.  (Well, no easy
loopholes, since there are multiply-encoded combinations: say, you can
encode char(32) using two bytes.)
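
To illustrate the multiply-encoded case: the two bytes 0xC0 0xA0 carry the value 32 if one applies the usual bit-unpacking without rejecting redundant forms. The Python sketch below uses a made-up lax_decode_2byte helper for this; Python's own strict decoder refuses the same bytes.

```python
def lax_decode_2byte(seq: bytes) -> int:
    """Decode a 110xxxxx 10xxxxxx pair without rejecting overlong forms."""
    lead, trail = seq
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a two-byte sequence")
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


overlong_space = bytes([0xC0, 0xA0])
print(lax_decode_2byte(overlong_space))        # same value as the one-byte b" "

try:
    overlong_space.decode("utf-8")             # strict decoder
except UnicodeDecodeError as err:
    print("strict decoder rejects it:", err.reason)
```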

Say, if the position is that utf8 is in fact 31-bit, then we could usurp
the utf8 leading byte which currently encodes 7-byte-long chars to mean
an "infinite" number of bytes (say, the usual interpretation of the next
5 bytes gives the *length* of the following sequence of bytes - each
byte encoding 6 bits of the value, as usual).
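
One possible reading of this scheme, as a rough Python sketch. Everything concrete here is an assumption made for illustration: that the lead byte is 0xFE, that the 5 length bytes are ordinary 10xx xxxx bytes holding a 30-bit count, and that the value is stored most significant bits first. The names encode_ext and decode_ext are made up.

```python
EXT_LEAD = 0xFE   # assumed: the lead byte that would otherwise start a 7-byte char


def _pack(value: int, count: int) -> bytes:
    """Emit `count` trailing bytes (10xxxxxx), most significant 6 bits first."""
    return bytes(0x80 | ((value >> (6 * i)) & 0x3F) for i in range(count - 1, -1, -1))


def _unpack(chunk: bytes) -> int:
    value = 0
    for b in chunk:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a 10xx xxxx byte")
        value = (value << 6) | (b & 0x3F)
    return value


def encode_ext(value: int) -> bytes:
    """Lead byte, 5 length bytes, then as many 6-bit bytes as the value needs."""
    if value < 0:
        raise ValueError("value must be non-negative")
    n = max(1, (value.bit_length() + 5) // 6)
    if n >= 1 << 30:
        raise ValueError("length does not fit in 5 bytes of 6 bits")
    return bytes([EXT_LEAD]) + _pack(n, 5) + _pack(value, n)


def decode_ext(buf: bytes, pos: int = 0):
    """Return (value, position just past the extended char)."""
    if len(buf) < pos + 6 or buf[pos] != EXT_LEAD:
        raise ValueError("no extended char at this position")
    n = _unpack(buf[pos + 1:pos + 6])
    end = pos + 6 + n
    if end > len(buf):
        raise ValueError("truncated extended char")
    return _unpack(buf[pos + 6:end]), end


big = (1 << 100) | 12345          # far beyond 36 bits
enc = encode_ext(big)
print(len(enc), decode_ext(enc) == (big, len(enc)))
```

Under these assumptions every byte after the lead is still a 10xx xxxx byte, so a length() that simply skips such bytes would still count the whole thing as one char.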

Since these "extended" chars are not going to come from the outside world
(I assume that officially utf8 is 31-bit for the sake of discussion),
the fact that Perl treats them *internally* using a variation of the
algorithm will not create any problems.

This way we could have inband data (since one bit of this unbounded
supply may denote that the char should be interpreted not as a char,
but as an address/id of some external data, say color/font for a text
processing application).  Once we have a screen-width for chars, these
inband "pseudo-chars" may be interpreted as zero-width data, and thus
treated correctly by Perl.

Ilya