perl-unicode

Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-12 09:53:45
Chris Hall wrote 2008-03-12 13:20 (+0000):
OK.  In the meantime IMHO chr(n) should be handling utf8 and has no
business worrying about things which UTF-8 or UCS think aren't
characters.
It should do Unicode, not any specific byte encoding, like UTF-8.
IMHO chr(n) should do characters, which may be interpreted as per
Unicode, but may not.
When I said utf8 I was following the (sloppy) convention that utf8 means
how Perl handles characters in strings...

I'm working hard to break this convention. I've changed a lot of Perl
documentation, and the result was released with Perl 5.10.

If in any place in Perl's official documentation, it still reads UTF-8
or UTF8 for *characters in text strings*, it's wrong. Let me know and I
will fix it :)

  b. in a Perl string, characters are held in a UTF-8 like form.

I'd say *inside* a Perl string. This is the C implementation, but a Perl
programmer should not have to know the specific *internal* encoding of a
Perl string.

Likewise, in Perl you don't have to know whether your number is
internally encoded as a long integer or a double.
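A minimal sketch of that point, assuming a stock Perl 5.8 or later: the same four characters are stored in two different internal forms, yet nothing visible at the Perl level changes. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $native   = "caf\xE9";     # 4 characters, typically stored one byte each
my $upgraded = $native;
utf8::upgrade($upgraded);     # same characters, now in the UTF-8-like internal form

# Only the internal flag differs between the two copies.
printf "internal flag: native=%s upgraded=%s\n",
    (utf8::is_utf8($native)   ? "on" : "off"),
    (utf8::is_utf8($upgraded) ? "on" : "off");

# Seen as text strings they are indistinguishable.
printf "equal: %s, lengths: %d and %d\n",
    ($native eq $upgraded ? "yes" : "no"),
    length($native), length($upgraded);
```

The flag is the only thing that tells the two copies apart, and ordinary code should not need to look at it.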

     Where UTF-8 (upper case, with hyphen) means the RFC 3629 &
     Unicode Consortium defined byte-wise encoding.

That's the theory, but in practice it often doesn't entirely follow the spec.

     This form is referred to as utf8 (lower case, no hyphen).

Yes, but note that encoding names in Perl are case insensitive. I tend
to call it UTF8 sometimes.
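One caveat is worth a quick sketch, assuming the Encode module that ships with Perl: lookup ignores case, but the hyphen still matters. With the hyphen you get the strict encoding, without it the lax one. The expected names in the comments are those given in Encode's own documentation.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Case is ignored when looking up an encoding; the hyphen is what
# selects strict ("utf-8-strict") versus lax ("utf8").
for my $name (qw(UTF-8 utf-8 UTF8 utf8)) {
    printf "%-5s => %s\n", $name, find_encoding($name)->name;
}
# UTF-8 => utf-8-strict
# utf-8 => utf-8-strict
# UTF8  => utf8
# utf8  => utf8
```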

     There is really no need to discuss this, except in the context of
     messing around in guts of Perl.

Exactly.

     String literals are represented by UCS code points.  Which
     reinforces the feeling that characters in Perl are Unicode.

Yes!

     'C' uses 'wide' to refer to characters that may have values
     > 255.  IMHO it's a shame that Perl did not follow this.

It does in some places, most notably warnings about "wide characters".
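A small sketch of where that warning shows up, using in-memory filehandles so nothing is written to disk. The handler and buffer names are only for illustration.

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $smiley = "\x{263A}";    # one character with an ordinal above 255

# No encoding layer: Perl warns and falls back to its internal bytes.
open my $raw, '>', \my $raw_buf or die "open: $!";
print {$raw} $smiley;
close $raw;

# With an explicit layer the character is encoded on purpose; no warning.
open my $enc, '>:encoding(UTF-8)', \my $enc_buf or die "open: $!";
print {$enc} $smiley;
close $enc;

print scalar(@warnings), " warning(s) caught: @warnings";
```

Only the first print should trigger "Wide character in print".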

  d. when exchanging character data with other systems one needs to
     deal with character set and encoding issues.

Not just other systems. All I/O is done in bytes, even when you are
talking to yourself, for example after a fork.
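As a rough illustration, assuming UTF-8 is the encoding both ends agree on: the text string has to become bytes before it leaves the process, and the reader has to turn those bytes back into characters, even if the reader is a child of the same program.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "\x{263A} caf\x{E9}";        # 6 characters
my $bytes = encode('UTF-8', $text);      # what actually travels over the pipe
my $back  = decode('UTF-8', $bytes);     # what the reader reconstructs

printf "%d characters -> %d bytes -> %d characters\n",
    length($text), length($bytes), length($back);
# 6 characters -> 9 bytes -> 6 characters
```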

            "Isolated surrogate code units have no interpretation on
             their own."
(...)
           Clearly these are illegal in UTF-8.

They have no interpretation, but this doesn't say they're illegal either.

Compare it with the undefined behavior of multiple ++ in a single
expression. There's no specification of what should happen, but it's not
illegal to do it.

            "Applications are free to use any of these noncharacter code
             points internally but should never attempt to exchange
             them."

I think it's not Perl's job to prevent exchange, simply because the
exchange could be internal: between processes of the same program.

I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
friends) in the same way as U+FFFF (and friends).

My gut says it's out of ignorance of the "rules", and certainly not an
intentional deviation.
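To see what a given installation actually does, a probe along these lines may help. It is only a sketch: accepts() is a hypothetical helper, and no output is shown because the answers for surrogates and noncharacters depend on the Encode version installed.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $encoding will encode code point $cp.
sub accepts {
    my ($encoding, $cp) = @_;
    no warnings 'utf8';    # chr() itself may warn on some Perl versions
    my $char = chr $cp;
    return eval { encode($encoding, $char, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# A plain letter, a surrogate, two noncharacters, the last Unicode
# code point, and one value beyond the Unicode range.
for my $cp (0x41, 0xD800, 0xFFFE, 0xFFFF, 0x10FFFF, 0x110000) {
    printf "U+%04X  UTF-8: %-6s  utf8: %s\n", $cp,
        (accepts('UTF-8', $cp) ? 'ok' : 'reject'),
        (accepts('utf8',  $cp) ? 'ok' : 'reject');
}
```

Comparing the rows for U+FFFE and U+FFFF shows whether strict UTF-8 treats them alike on that system.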

The result is Unicode.
IMHO the result of chr(n) should just be a character.

We call that a Unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but it is still
called Unicode by Perl's documentation.
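A short sketch of that: chr() gives back a single character even for an ordinal one past U+10FFFF. Warnings in the utf8 category are switched off here because whether and when Perl warns about such values differs between versions.

```perl
use strict;
use warnings;
no warnings 'utf8';    # version-dependent warnings about non-Unicode ordinals

my $beyond = chr(0x110000);    # one past the last Unicode code point

printf "length %d, ordinal 0x%X\n", length($beyond), ord($beyond);
# length 1, ordinal 0x110000
```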

OK, sure.  I was using utf8 to mean any character value you like, and
UTF-8 to imply a value which is recognised in UCS -- rather than the
encoding.

Please use utf8 only for naming the byte encoding that allows any
character value you like, not for the ordinal values themselves.

FWIW I note that printf "%vX" is suggested as a means to render IPv6
addresses.  This implies the use of a string containing eight characters
0..0xFFFF as the packed form of IPv6.  Building one of those using
chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !

Interesting point.
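For reference, a sketch of that IPv6 idiom. The address is made up for illustration, and the "%*vX" form with ":" as the join string is the one shown in the sprintf documentation. The warnings are silenced locally because the last two groups are exactly the values that chr() complains about on the Perl versions discussed here.

```perl
use strict;
use warnings;

# Eight 16-bit groups of a made-up address, ending in 0xFFFE and 0xFFFF.
my @groups = (0x2001, 0x0DB8, 0, 0, 0, 0, 0xFFFE, 0xFFFF);

# Pack them as one character per group.
my $packed = do {
    no warnings 'utf8';    # chr(0xFFFE) and chr(0xFFFF) may warn otherwise
    join '', map { chr } @groups;
};

# The vector flag formats each character's ordinal; "*" takes the join string.
my $address = sprintf "%*vX", ":", $packed;
print "$address\n";
# 2001:DB8:0:0:0:0:FFFE:FFFF
```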
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####(_at_)juerd(_dot_)nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales(_at_)convolution(_dot_)nl>