Re: utf8::valid and \x14_000

On Tue, 11 Mar 2008 you wrote

Chris Hall skribis 2008-03-11 18:48 (+0000):

I'm comfortable with the notion that perl characters are unsigned
integers that overlap UCS, and happen to be held internally as a
superset of UTF-8.
I wonder if perl is completely comfortable.

It isn't. There are some very unfortunate "features".

chr(n) throws various runtime warnings where 'n' isn't kosher UCS, and
"\x{h...h}" throws the same ones at compile time.
(...) I'm not sure I see the point of picking on a few values to warn
about.

I don't see the point, but Perl's warnings are arbitrary in several
ways. Abigail has a lightning talk about the "interpreted as function"
warning, that illustrates this.

OK. In the meantime IMHO chr(n) should be handling utf8 and has no
business worrying about things which UTF-8 or UCS think aren't
characters.


Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
(UTF-8) are happy with. Unicode defines 0xFFFE and 0xFFFF as
non-characters, not just 0xFFFF (which Encode::en/decode do deem
invalid).

In any case, is chr(n) supposed to be utf8 or UTF-8 ?  AFAIKS, it's
neither.

It's supposed to be neither on the outside. Internally, it's utf8.

One can turn off the warnings and then chr(n) will happily take any +ve
integer and give you the equivalent character -- so the result is utf8,
but the warnings are some (very) small subset of checking for UTF-8 :-(
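
The behaviour described above can be sketched as follows. This is only a
minimal illustration: the value 0x110000 is an assumption chosen here as
an example of an integer beyond the Unicode range, and the warnings are
silenced through the 'utf8' warnings category.

```perl
use strict;
use warnings;

# Example value outside the Unicode range (chosen for illustration only).
my $n = 0x110000;

my $c;
{
    # Silence the utf8 warning category for this block only.
    no warnings 'utf8';
    $c = chr($n);
}

# The result is one character whose ordinal is the integer passed in.
printf "length=%d ord=0x%X\n", length($c), ord($c);
```

The character itself is never printed, so no output-layer warning about
wide characters comes into play.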


I wonder what happens for n >= 2^64.  The encoding runs out at 2^72 !
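
The 2^72 figure can be checked with a little arithmetic. The sketch
below assumes, as that figure implies, that the longest sequence in
perl's extended encoding is one lead byte carrying no payload bits
followed by 12 continuation bytes of 6 bits each; that layout is an
assumption stated here, not something spelled out in this thread.

```perl
use strict;
use warnings;

# Assumed layout: 1 lead byte (no payload) + 12 continuation bytes.
my $continuation_bytes    = 12;
my $bits_per_continuation = 6;

# Total number of payload bits the longest sequence can carry.
my $payload_bits = $continuation_bytes * $bits_per_continuation;

# Bits left over beyond what a 64-bit integer needs.
my $headroom = $payload_bits - 64;

print "payload bits: $payload_bits, headroom over 64 bits: $headroom\n";
```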

     If chr(-1) doesn't exist, then undef looks like a reasonable
     return value -- returning "\x{FFFD}" makes chr(-1)
     indistinguishable from chr(0xFFFD) -- where the first is
     nonsense and the second is entirely proper.

0xFFFD is the Unicode equivalent of undef. I think it makes sense in
this case.


Well...

Unicode says: "REPLACEMENT CHARACTER: used to represent an incoming
character whose value is unknown or unrepresentable in Unicode".

...so it has plenty to do without being used to represent a value which
is completely beyond the range for characters, and for which perl has a
perfectly good convention already.

...besides, if I want to see if chr(n) has worked I have to check that
(a) the result is not "\x{FFFD}" and (b) that n is not 0xFFFD.
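
The ambiguity can be seen in a short sketch. The helper checked_chr is
hypothetical, introduced here only for illustration; it is not part of
perl. It returns undef for a negative argument, so that a failure can be
told apart from a genuine U+FFFD.

```perl
use strict;
use warnings;

# Plain chr(-1) yields the replacement character, so the result alone
# cannot be told apart from chr(0xFFFD).
my $raw = do { no warnings 'utf8'; chr(-1) };

# Hypothetical helper: undef for a negative argument, chr() otherwise.
sub checked_chr {
    my ($n) = @_;
    return undef if !defined $n || $n < 0;
    return chr($n);
}

my $bad  = checked_chr(-1);      # undef rather than "\x{FFFD}"
my $good = checked_chr(0xFFFD);  # the real REPLACEMENT CHARACTER

printf "raw is U+%04X, bad is %s, good is U+%04X\n",
    ord($raw), (defined $bad ? 'defined' : 'undef'), ord($good);
```

With a wrapper of this kind the caller needs only a single defined()
test instead of comparing both the result and the argument.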


So we'll have to differ on this :-)

Chris
--
Chris Hall               highwayman.com            +44 7970 277 383

