On Wed, 12 Mar 2008 Juerd Waalboer wrote
Chris Hall skribis 2008-03-12 13:20 (+0000):
....
String literals are represented by UCS code points. Which
reinforces the feeling that characters in Perl are Unicode.
Yes!
OK. For the avoidance of doubt:
a. are you saying that characters in Perl are Unicode ?
b. or are you agreeing that characters in Perl take values
0..0x7FFF_FFFF (or beyond), which are generally interpreted as
UCS, where required and possible ?
If (a) then characters with ordinals beyond 0x10_FFFF should throw
warnings (at least) since they clearly are not Unicode !
....[in the context of U+D800..U+DFFF]
"Isolated surrogate code units have no interpretation on
their own."
(...)
Clearly these are illegal in UTF-8.
They have no interpretation, but this also doesn't say it's illegal.
The Unicode Standard defines the set of 'Unicode scalar values' which
consists of U+0000..U+D7FF and U+E000..U+10_FFFF. All Unicode
encodings, including UTF-8, encode only the 'Unicode scalar values'.
The code points U+D800..U+DFFF exist, but do "not contain any character
assignments". Given that no Unicode encoding exists that allows these
code points, it's unclear how one would ever end up with one of these
things on one's hands !
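As a minimal sketch of that definition (is_unicode_scalar_value is a
hypothetical helper name, not something Perl or the Standard provides),
the test amounts to:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $n is a Unicode scalar value, i.e. lies in
# U+0000..U+D7FF or U+E000..U+10FFFF.
sub is_unicode_scalar_value {
    my ($n) = @_;
    return 0 if $n < 0 || $n > 0x10FFFF;          # outside the Unicode code space
    return 0 if $n >= 0xD800 && $n <= 0xDFFF;     # surrogate code points
    return 1;
}

for my $n (0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000) {
    printf "U+%04X %s\n", $n,
        is_unicode_scalar_value($n) ? "scalar value" : "not a scalar value";
}
```

Note that this says nothing about the noncharacters: U+FFFE, U+FFFF and
friends are scalar values by this definition.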
....[in the context of U+FFFE, U+FFFF etc.]
"Applications are free to use any of these noncharacter code
points internally but should never attempt to exchange
them."
I think it's not Perl's job to prevent exchange. Simply because the
exchange could be internal, but between processes of the same program.
Well UTF-8 is jumping all over U+FFFF (at least). The warnings thrown
by chr() and "\x{h...h}" suggest that Perl feels that exchanging these
values ain't kosher.
I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
friends) in the same way as U+FFFF (and friends).
My gut says it's out of ignorance of the "rules", and certainly not an
intentional deviation.
Well... I'm running some more tests on UTF-8 to see what it thinks is
illegal.
.....................................
The result is Unicode.
IMHO the result of chr(n) should just be a character.
We call that a unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but it is still
called unicode by Perl's documentation.
OK. This is the hair which I am splitting.
IMHO the things in strings and the things that chr() and ord() return or
process should be plain characters (ordinal U_INT) -- so that these are
general purpose. Only when it's necessary to attach meaning to the
characters in a string, should Perl treat them as Unicode code points --
I accept that this is most of the time (but not *all* the time).
FWIW I note that printf "%vX" is suggested as a means to render IPv6
addresses. This implies the use of a string containing eight characters
0..0xFFFF as the packed form of IPv6. Building one of those using
chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !
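To make the case concrete, here is a rough sketch. The address groups
are made-up illustrative values, and the "%*vX" form with ":" as the
join string is the one given in the sprintf documentation. The
no warnings 'utf8' is an assumption on my part: it is only there to
silence the complaints about 0xFFFE and 0xFFFF on perls that raise
them from chr().

```perl
use strict;
use warnings;

# Eight 16-bit groups of an illustrative (made-up) IPv6 address.
my @groups = (0x2001, 0x0DB8, 0, 0, 0, 0, 0xFFFE, 0xFFFF);

# Packed form: one character per group, each with an ordinal in 0..0xFFFF.
my $packed = do {
    no warnings 'utf8';    # assumption: hides the 0xFFFE/0xFFFF warnings where chr() emits them
    join '', map { chr($_) } @groups;
};

# "%*vX" formats each character's ordinal in upper-case hex, joined with ":".
my $text = sprintf "%*vX", ":", $packed;
print "$text\n";
```

Nothing here treats the string as text; the characters are just
convenient 16-bit containers.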
Interesting point.
What's more, the Unicode standard suggests various *internal* uses for
U+FFFE and U+FFFF (and friends), including, but not limited to,
terminators and separators. This will also generate spurious warnings
from chr() or "\x{...}" !
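For reference, "U+FFFE and U+FFFF (and friends)" are the 66
noncharacters: U+FDD0..U+FDEF plus the last two code points of each of
the 17 planes. A small sketch of the test (is_noncharacter is a
hypothetical helper name of mine, not a Perl builtin):

```perl
use strict;
use warnings;

# Hypothetical helper: true if $n is one of the Unicode noncharacters.
sub is_noncharacter {
    my ($n) = @_;
    return 0 if $n < 0 || $n > 0x10FFFF;             # outside the Unicode code space
    return 1 if $n >= 0xFDD0 && $n <= 0xFDEF;        # the contiguous block of 32
    return (($n & 0xFFFE) == 0xFFFE) ? 1 : 0;        # U+nFFFE and U+nFFFF in every plane
}

# Count them across the whole code space as a sanity check.
my $count = grep { is_noncharacter($_) } 0 .. 0x10FFFF;
print "noncharacters: $count\n";
```

A string that uses one of these as an internal terminator or separator
is exactly the kind of thing that should not provoke a warning until
somebody tries to exchange it.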
Chris
--
Chris Hall highwayman.com