perl-unicode

Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-12 06:22:46
On Tue, 11 Mar 2008 Juerd Waalboer wrote:
Chris Hall wrote, 2008-03-11 21:09 (+0000):
OK.  In the meantime IMHO chr(n) should be handling utf8 and has no
business worrying about things which UTF-8 or UCS think aren't
characters.

It should do Unicode, not any specific byte encoding, like UTF-?8.

IMHO chr(n) should do characters, which may be interpreted as per
Unicode, but may not.

When I said utf8 I was following the (sloppy) convention that utf8 means
how Perl handles characters in strings...

...the naming is a cause of confusion.  For the avoidance of doubt, this
is what I understand the position to be:

  a. characters in Perl have integer values in 0..0x7FFF_FFFF (or more).

     It appears that what is actually going on is that the limit is
     the local unsigned Perl integer.  One can debate the marginal
     utility of that vs the scope for confusion.

  b. in a Perl string, characters are held in a UTF-8-like form.

     Where UTF-8 (upper case, with hyphen) means the RFC 3629 &
     Unicode Consortium defined byte-wise encoding.

     Current UTF-8 defines encoding for values 0..0xD7FF and
     0xE000..0x10_FFFF, which is exactly the current UCS range (less
     the 'surrogates').

     Note that this limits UTF-8 to 4-byte sequences, explicitly
     excluding:

       * sequences that have shorter equivalents ('redundant')
       * 0xD800..0xDFFF -- the 'surrogates'
       * 0x11_0000..0x1F_FFFF -- beyond the UCS range

     Older versions of the standard allowed for values 0..0x7FFF_FFFF,
     but also excluded the 'redundant' sequences and (I believe) the
     'surrogates'.

     The encoding used by Perl stretches the range to 2^72-1.  This
     is incompatible with even the older versions of UTF-8.

     This form is referred to as utf8 (lower case, no hyphen).

     There is really no need to discuss this, except in the context of
     messing around in the guts of Perl.

  c. when Perl wishes to assign some meaning to a character value
     it interprets it as a Unicode Code Point, if it can.

     There are huge areas of the Unicode space that have no current
     meaning.  There are areas which may have local meaning ("Private
     Use").  In addition Perl allows character values that are beyond
     current Unicode space.

     In the abstract, characters in Perl are not Unicode (UCS).  But
     most of the time one treats them as if they were.

     String literals are represented by UCS code points, which
     reinforces the feeling that characters in Perl are Unicode.

     'C' uses 'wide' to refer to characters that may have values
     > 255.  IMHO it's a shame that Perl did not follow this.

  d. when exchanging character data with other systems one needs to
     deal with character set and encoding issues.

     The 'UTF-8' encoding (character set) covers the UCS character set
     (values 0..0x10_FFFF, currently) and the (current) standard UTF-8
     encoding.  'UTF-8' also worries about some 'suspect' (my term) UCS
     values, see below.

     The 'utf8' encoding (character set) is a superset of current UTF-8
     (values 0..0x7FFF_FFFF) -- corresponding to earlier UTF-8.  'utf8'
     does not concern itself about any 'suspect' UCS values.  (A short
     sketch after this list illustrates the strict/lax difference in
     practice.)

     [Actually, that's not entirely true.  'utf8' encode happily deals
      with characters all the way up to 2^64-1 (and perhaps beyond),
      using Perl's extended encoding.  However, 'utf8' decode treats
      anything > 0x7FFF_FFFF as invalid.]

   e. The 'suspect' UCS values.

      These are:

         * U+D800..U+DBFF and U+DC00..U+DFFF (High- and Low-surrogate,
           respectively).  Where these are used they should appear in
           pairs, High followed by Low.

           Unicode 5.0.0 says:

            "Surrogate pairs are used only in UTF-16."

            "Isolated surrogate code units have no interpretation on
             their own."

            "Surrogate code points cannot be conformantly interchanged
             using Unicode encoding forms."

            "Unicode scalar value: Any Unicode code point except high-
             surrogate and low-surrogate code points."

           All the Unicode encodings are defined in terms of Unicode
           scalar value.  There is by definition no way to exchange
           these characters, and no meaning is attached to them.

           Clearly these are illegal in UTF-8.

         * U+FFFE and U+FFFF and the last two code points in every
           other Unicode plane are noncharacters.

           [Unicode code space is divided into 17 'planes' of 65,536
            characters, each.  So characters U+01_FFFE, U+01_FFFF,
            U+02_FFFE, U+02_FFFF, ... U+10_FFFE and U+10_FFFF are all
            noncharacters.]

           The code points U+FDD0..U+FDEF are also noncharacters.

           Unicode 5.0.0 says:

            "Applications are free to use any of these noncharacter code
             points internally but should never attempt to exchange
             them. If a noncharacter is received in open interchange, an
             application is not required to interpret it in any way. It
             is good practice, however, to recognize it as a
             noncharacter and to take appropriate action, such as
             removing it from the text."

            "Noncharacter code points are reserved for internal use,
             such as for sentinel values. They should never be
             interchanged. They do, however, have well-formed
             representations in Unicode encoding forms and survive
             conversions between encoding forms. This allows sentinel
             values to be preserved internally across Unicode encoding
             forms, even though they are not designed to be used in open
             interchange."

           So, assuming UTF-8 is used for "open interchange", these are
           also invalid.

         * U+FFFD -- the Replacement Character

           Unicode 5.0.0 says:

            "U+FFFD replacement character is the general substitute
             character in the Unicode Standard. It can be substituted
             for any 'unknown' character in another encoding that cannot
             be mapped in terms of known Unicode characters."

           This is generally legal.

           However on the topic of "Reserved and Private-Use Character
           Codes" the standard also counsels:

            "An implementation should not blindly delete such
             characters, nor should it unintentionally transform them
             into something else."
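
As a concrete footnote to (b), (d) and (e): the sketch below (mine, not
from any standard) compares what Encode's lax 'utf8' and strict 'UTF-8'
do with an isolated surrogate and with a value beyond the UCS range.
The exact messages, and the default CHECK behaviour, vary between Perl
and Encode versions, so treat it as a probe rather than a specification.

    use strict;
    use warnings;
    no warnings 'utf8';      # silence surrogate/non-char warnings on older perls
    use Encode qw(encode);

    for my $cp (0xD800, 0x20_0000) {   # an isolated surrogate; a value beyond UCS
        my $char = chr $cp;

        # lax 'utf8': encodes the code point regardless, using Perl's
        # extended scheme where necessary
        my $lax = encode('utf8', $char);
        printf "U+%X lax    : %s\n", $cp,
               join ' ', map { sprintf '%02X', ord } split //, $lax;

        # strict 'UTF-8': with FB_CROAK it refuses anything that is not a
        # Unicode scalar value (surrogates, and anything above 0x10_FFFF)
        my $strict = eval { encode('UTF-8', $char, Encode::FB_CROAK) };
        my $err    = $@; chomp $err;
        printf "U+%X strict : %s\n", $cp,
               defined $strict ? 'accepted' : "refused ($err)";
    }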

Any corrections required would be appreciated, and may also inform any
"lurkers".

Internally, a byte encoding is needed. As a programmer I don't want to
be bothered with such implementation details.

Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
(UTF-8) are happy with.  Unicode defines 0xFFFE and 0xFFFF as
non-characters, not just 0xFFFF (which Encode::en/decode do deem
invalid).
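
The quickest way to see exactly which values chr(n) whinges about on a
given Perl is a little probe like this (my own sketch; the set of
warnings, and their wording, has changed between Perl versions):

    use strict;
    use warnings;

    for my $cp (0xD800, 0xFDD0, 0xFFFD, 0xFFFE, 0xFFFF, 0x11_0000) {
        my @warnings;
        my $char = do {
            local $SIG{__WARN__} = sub { push @warnings, $_[0] };
            chr $cp;              # build the character, trapping any warning
        };
        chomp @warnings;
        printf "chr(0x%X): %s\n", $cp,
               @warnings ? "warned: $warnings[0]" : "no warning";
    }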

Personally, I think Perl should accept these characters without warning,
except when the strict UTF-8 encoding is requested (which differs from
the non-strict utf8 encoding).

I agree -- chr(n) and 'utf8' (lax) should happily process anything
0..0x7FFF_FFFF as characters -- which may or may not be UCS.

I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
friends) in the same way as U+FFFF (and friends).

In any case, is chr(n) supposed to be utf8 or UTF-8?  AFAICS, it's
neither.
It's supposed to be neither on the outside. Internally, it's utf8.
One can turn off the warnings and then chr(n) will happily take any +ve
integer and give you the equivalent character -- so the result is utf8,

The result is Unicode.

IMHO the result of chr(n) should just be a character.

The difference between Unicode and UTF8 is not
always clear, but in this case it is: the character is Unicode, a single
code point; the internal implementation is UTF8.

Unicode: U+20AC    (one character: €)
UTF-8:   E2 82 AC  (three bytes)

I am under the impression that you know the difference and made an
honest mistake. My detailed expansion is also for lurkers and archives.

OK, sure.  I was using utf8 to mean any character value you like, and
UTF-8 to imply a value which is recognised in UCS -- rather than the
encoding.
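
For what it's worth, the distinction is easy to demonstrate from Perl
itself -- a minimal sketch using the same U+20AC example:

    use strict;
    use warnings;
    use Encode qw(encode);

    my $char  = chr 0x20AC;                # one character, U+20AC
    my $bytes = encode('UTF-8', $char);    # its UTF-8 encoding, as a byte string

    printf "characters: %d\n", length $char;   # 1
    printf "bytes:      %d (%s)\n", length $bytes,
           join ' ', map { sprintf '%02X', ord } split //, $bytes;   # 3 (E2 82 AC)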

[replacement character]
So we'll have to differ on this :-)

Yes, although my opinion on this is not strong. undef or replacement
character - both are good options. One argument in favor of the
replacement character would be backwards compatibility.

Well, having concluded that the result of chr(n) should be just a
character -- to be interpreted one way or another, later -- returning
"\xFFFD" for chr(-1) looks perverse !

FWIW I note that printf "%vX" is suggested as a means to render IPv6
addresses.  This implies the use of a string containing eight characters
0..0xFFFF as the packed form of IPv6.  Building one of those using
chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !
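
For example (my own illustration, using the 2001:db8:: documentation
prefix padded out with two noncharacter values to show the point):

    use strict;
    use warnings;

    # eight 16-bit groups packed as eight characters in one string,
    # then rendered with the version-string format %vX
    my @groups = (0x2001, 0x0DB8, 0, 0, 0, 0, 0xFFFE, 0xFFFF);
    my $addr   = join '', map { chr } @groups;   # may warn about 0xFFFE/0xFFFF
                                                 # on some Perl versions

    printf "%vX\n", $addr;   # 2001.DB8.0.0.0.0.FFFE.FFFF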

Chris
-- 
Chris Hall               highwayman.com            +44 7970 277 383
