Chris Hall wrote 2008-03-11 21:09 (+0000):
> OK. In the meantime IMHO chr(n) should be handling utf8 and has no
> business worrying about things which UTF-8 or UCS think aren't
> characters.
It should do Unicode, not any specific byte encoding, like UTF-?8.
Internally, a byte encoding is needed. As a programmer I don't want to
be bothered with such implementation details.
> Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
> (UTF-8) are happy with. Unicode defines 0xFFFE and 0xFFFF as
> non-characters, not just 0xFFFF (which Encode::en/decode do deem
> invalid).
Personally, I think Perl should accept these characters without warning,
unless the strict UTF-8 encoding is requested (which differs from the
non-strict UTF8 encoding).
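
A rough sketch of the strict/non-strict distinction, assuming a perl with
the Encode module installed. The encoding names come from the Encode
documentation; whether the strict encoding accepts U+FFFE depends on the
Encode version, so the loop only reports what happens and no particular
outcome is claimed. The variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Encode tells the two spellings apart by their canonical names.
my $strict_name = find_encoding('UTF-8')->name;   # documented as 'utf-8-strict'
my $lax_name    = find_encoding('utf8')->name;    # documented as 'utf8'
print "UTF-8 -> $strict_name, utf8 -> $lax_name\n";

# chr() of a noncharacter. The 'utf8' warning category is switched off
# because some perl versions warn here.
my $nonchar = do { no warnings 'utf8'; chr(0xFFFE) };

for my $name ('UTF-8', 'utf8') {
    my $copy  = $nonchar;   # encode() may modify its input when CHECK is set
    my $bytes = eval {
        no warnings 'utf8';
        encode($name, $copy, Encode::FB_CROAK());
    };
    if (defined $bytes) {
        printf "%-5s accepted U+FFFE as %d bytes\n", $name, length $bytes;
    }
    else {
        print "$name rejected U+FFFE\n";
    }
}
```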
> In any case, is chr(n) supposed to be utf8 or UTF-8 ? AFAIKS, it's
> neither.
It's supposed to be neither on the outside. Internally, it's utf8.
> One can turn off the warnings and then chr(n) will happily take any +ve
> integer and give you the equivalent character -- so the result is utf8,
The result is Unicode. The difference between Unicode and UTF8 is not
always clear, but in this case it is: the character is Unicode, a single
codepoint; the internal implementation is UTF8.
Unicode: U+20AC (one character: €)
UTF-8: E2 82 AC (three bytes)
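
The same example as a small Perl sketch, assuming the Encode module is
available: chr() returns one character, and the three bytes only show up
once that character is explicitly encoded. The variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = chr(0x20AC);             # one Unicode character: U+20AC
my $bytes = encode('UTF-8', $char);  # its UTF-8 encoding, as a byte string

# Render each byte as two uppercase hex digits.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

printf "characters: %d, bytes: %d (%s)\n",
    length($char), length($bytes), $hex;
# prints "characters: 1, bytes: 3 (E2 82 AC)"
```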
I am under the impression that you know the difference and made an
honest mistake. My detailed expansion is also for lurkers and archives.
[replacement character]
> So we'll have to differ on this :-)
Yes, although my opinion on this is not strong. undef or replacement
character - both are good options. One argument in favor of the
replacement character would be backwards compatibility.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,
Juerd Waalboer: Perl hacker <#####(_at_)juerd(_dot_)nl>
<http://juerd.nl/sig>
Convolution: ICT solutions and consultancy
<sales(_at_)convolution(_dot_)nl>