Re: utf8::valid and \x14_000

Chris Hall skribis 2008-03-11 21:09 (+0000):

OK.  In the meantime IMHO chr(n) should be handling utf8 and has no 
business worrying about things which UTF-8 or UCS think aren't 
characters.


It should do Unicode, not any specific byte encoding, like UTF-8.

Internally, a byte encoding is needed. As a programmer I don't want to
be bothered with such implementation details.

Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
(UTF-8) are happy with.  Unicode defines 0xFFFE and 0xFFFF as 
non-characters, not just 0xFFFF (which Encode::en/decode do deem 
invalid).


Personally, I think Perl should accept these characters without warning,
except when the strict UTF-8 encoding is requested (which differs from
the non-strict UTF8 encoding).
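
For lurkers, a rough sketch of how the two encodings can be told apart
with the Encode module. The code point 0x110000 (one past the Unicode
maximum) is my own choice of test value, not something from this thread,
and the variable names are only illustrative. What each encoding does
with non-characters such as 0xFFFE may depend on the Encode version, so
this only shows that the two names are distinct encodings whose output
differs for an out-of-range code point:

```perl
use strict;
use warnings;
no warnings 'utf8';    # keep chr() quiet about the out-of-range code point
use Encode qw(encode find_encoding);

# "UTF-8" and "utf8" resolve to two different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

# Assumption for illustration: 0x110000 lies just past U+10FFFF.
my $beyond = chr(0x110000);

# Encode the same character string with the lax and the strict encoding.
my $lax_bytes    = encode('utf8',  $beyond);
my $strict_bytes = encode('UTF-8', $beyond);

print "UTF-8 resolves to: $strict_name\n";
print "utf8  resolves to: $lax_name\n";
print "outputs differ\n" if $lax_bytes ne $strict_bytes;
```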

In any case, is chr(n) supposed to be utf8 or UTF-8 ?  AFAIKS, it's
neither.

It's supposed to be neither on the outside. Internally, it's utf8.

One can turn off the warnings and then chr(n) will happily take any +ve 
integer and give you the equivalent character -- so the result is utf8,


The result is Unicode. The difference between Unicode and UTF8 is not
always clear, but in this case it is: the character is Unicode, a single
codepoint; the internal implementation is UTF8.

Unicode: U+20AC    (one character: €)
UTF-8:   E2 82 AC  (three bytes)
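
The same distinction as a small Perl sketch, assuming the core Encode
module is available (the variable names are only illustrative): the
string returned by chr() has a length of one character, and only the
explicitly encoded copy has a length of three bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One Unicode character: U+20AC EURO SIGN.
my $char = chr(0x20AC);

# Encoding turns the character string into a byte string.
my $bytes = encode('UTF-8', $char);

# Hex dump of the encoded bytes.
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;

printf "characters: %d\n", length $char;     # 1
printf "bytes:      %d\n", length $bytes;    # 3
printf "encoded:    %s\n", $hex;             # E2 82 AC
```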

I am under the impression that you know the difference and made an
honest mistake. My detailed expansion is also for lurkers and archives.

[replacement character]
So we'll have to differ on this :-)


Yes, although my opinion on this is not strong. undef or replacement
character - both are good options. One argument in favor of the
replacement character would be backwards compatibility.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####(_at_)juerd(_dot_)nl>  
<http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy 
<sales(_at_)convolution(_dot_)nl>
