perl-unicode

Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-13 10:16:13
On Wed, 12 Mar 2008 Juerd Waalboer wrote
Chris Hall wrote 2008-03-12 20:49 (+0000):
  a. are you saying that characters in Perl are Unicode ?

Yes. They are called Unicode, at least. This has my preference for
explanation and documentation.

  b. or are you agreeing that characters in Perl take values
     0..0x7FFF_FFFF (or beyond), which are generally interpreted as
     UCS, where required and possible ?

This too. This is the more technically accurate explanation, and has my
preference for implementation.

'This too' ? Goodness, superposition ! Perl and quantum mechanics ? Suddenly it all becomes clear. Or at least as clear as the uncertainty principle will allow !-)

FWIW, I have tried some of the HTTP, HTML and XML modules. The warnings that pop out every now and then about Unicode or UTF-8 or whatever are less than useful and more than irritating !

If (a) then characters with ordinals beyond 0x10_FFFF should throw
warnings (at least) since they clearly are not Unicode !

Perl just has a somewhat broad definition of "unicode", that is not
the same as the official unicode character set.

BTW, in "2.14 Conforming to the Unicode Standard" I found this gem:

  Unacceptable Behavior

  It is unacceptable for a conforming implementation:

   - To use unassigned codes.

       • U+2073 is unassigned and not usable for ‘3’ (superscript 3) or
         any other character.

This appears to say that unassigned codes should not be transmitted out, just like non-characters ! Which looks like hard work. (On the other hand, applications are supposed to cope with future defined code points...)

Should 'UTF-8' be strict about unassigned codes as well ? What should chr() and "\x{...}" etc. do ?
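As it happens, the Encode module already draws a strict/lax line here, though not for unassigned codes: strict 'UTF-8' polices surrogates and the 0x10_FFFF ceiling, while lax 'utf8' takes anything Perl can represent. A small sketch of the difference, assuming a reasonably recent Encode (with the default CHECK, strict encoding substitutes U+FFFD for what it rejects, rather than dying):

```perl
#!/usr/bin/perl
use strict;
use warnings;
no warnings 'surrogate';     # we build a lone surrogate on purpose
use Encode qw(encode);

my $surrogate = chr(0xD800); # a lone UTF-16 surrogate

# Lax 'utf8' happily emits Perl's internal bytes for it: ED A0 80 ...
my $lax = encode('utf8', $surrogate);
printf "lax:    %vX\n", $lax;      # ED.A0.80

# ... while strict 'UTF-8' refuses, substituting U+FFFD by default
# (pass Encode::FB_CROAK as CHECK to make it die instead).
my $strict = encode('UTF-8', $surrogate);
printf "strict: %vX\n", $strict;
```

Note that neither flavour says anything at all about unassigned code points -- so that answers the question one way, at least.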

This reinforces my view that chr(n) is (a) wrong to whinge about surrogates and non-characters, and (b) wrong to return a character for n outside 0..0x7FFF_FFFF. IMO:

  - chr() shouldn't worry about strict UCS ...

  - ... and doesn't, in any case, do a complete job
    [it does spot all non-characters and surrogates, but ignores
     unassigned codes.]

  - ... however, non-characters are perfectly legal UCS, at least for
    internal use.  One can argue for jumping all over these when
    outputting (strict) UTF-8 for external exchange.

  - ... and 0x11_FFFE is not defined by UCS to be a non-character,
    it's not defined in UCS at all, any more than any other character
    code > U+10_FFFF !

  - chr(n) doesn't whinge about characters > U+10_FFFF !  (Except for
    the non-characters it has invented !)

  - the answer to chr(-1) is 'not a character at all' -- it isn't 'the
    character that stands in place of some unknown character'

  - the utility of characters > 0x7FFF_FFFF is not worth (a) the kludge
    required to extend utf8, or (b) the interoperability issues.

    Even encode/decode 'utf8' take a dim view of chars > 0x7FFF_FFFF.

    I note that utf8::valid() rejects characters > 0x7FFF_FFFF !

  - chr(n) accepts characters > 0x7FFF_FFFF, even though the result
    is not valid per utf8::valid() !!

  - chr(n) warns about p + 0xFFFE and p + 0xFFFF for every value of 'p',
    even those which are beyond the Unicode range !
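To make the whingeing concrete, here is a sketch that simply captures whatever chr() complains about. (Exactly what warns, and with what message, varies by perl version: the perls of this era warned from chr() itself, while later perls defer most of the complaints to I/O time -- so the captured count is printed rather than claimed.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect whatever chr() chooses to whinge about.
my @whinges;
$SIG{__WARN__} = sub { push @whinges, $_[0] };

my $surrogate = chr(0xD800);        # UTF-16 surrogate
my $nonchar   = chr(0xFFFE);        # a genuine UCS non-character
my $beyond    = chr(0x11_FFFE);     # not UCS at all, yet treated as one
my $huge      = chr(0x2000_0000);   # far beyond Unicode

# Whinges or not, chr() returned a one-character string every time:
printf "U+%X U+%X U+%X U+%X\n",
    map { ord } $surrogate, $nonchar, $beyond, $huge;
printf "%d whinge(s) captured\n", scalar @whinges;
```

The point being: the warnings (where they appear) are advisory only -- chr() hands back the character regardless.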

It has its own utf8, it can have its own unicode too :)

And there was I thinking that things were already sufficiently confused :-}

The 'utf8' decode does the Right Thing -- it decodes well-formed UTF-8 up to 0x7FFF_FFFF and handles errors and incomplete sequences and doesn't concern itself with the minutiae of UCS (surrogates, non-characters and unassigned codes).

This is nicely consistent with utf8::valid().

[The only thing I would argue about is the separate treatment of each byte of an invalid sequence -- I'd be tempted to treat 0x00..0x7F and 0xC0..0xFF as terminators of an invalid sequence and 0x80..0xBF as members of an invalid sequence.]

If 'unicode' were to follow that model, then chr() and friends could stop throwing (spurious) warnings around the place.

Sadly, 'utf8' encode doesn't care, and outputs whatever is in the string -- including redundant sequences, invalid sequences, incomplete sequences and Perl's extended sequences for characters > 0x7FFF_FFFF. That is, it will happily output something that utf8::valid() would reject. Note that this "encoding" is outputting something that 'utf8' decode won't accept.

If you really want what 'utf8' encode currently does you can force characters to octets (wax off) and output. The reverse is to input the octets and force to characters (wax on).
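Concretely, the 'wax off'/'wax on' steps are (as I read it) utf8::encode() and utf8::decode(), which convert a string between its character and octet views in place:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = "\x{263A}";          # one character: WHITE SMILING FACE

utf8::encode($s);            # wax off: characters -> octets, in place
# $s is now the three octets E2 98 BA
printf "octets: %d (%vX)\n", length($s), $s;

utf8::decode($s);            # wax on: octets -> characters, in place
printf "chars:  %d (U+%X)\n", length($s), ord($s);
```

Both are always available (no 'use utf8;' needed), and utf8::decode() returns false rather than dying if the octets are not well-formed -- which is what makes the round trip an honest substitute for the current 'utf8' encode.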

Summary of Observations
-----------------------

  * chr(n) and friends are broken:

    - they whinge about things that are none of their business, which is
      not consistent with the notion of (lax) 'unicode'.

    - the whingeing about not-(strict)-Unicode is, moreover, incomplete
      (unassigned codes and codes beyond the UCS range are allowed !)

    - non-characters are perfectly legal -- just not suitable for
      external exchange.

    - projecting non-characters beyond the UCS range is plain odd.

    - they create invalid (per utf8::valid()) strings

    - invalid 'n' should return an 'invalid' (i.e. undef) response

  * 'utf8' encode is broken:

    - it should not output stuff that is not at least utf8::valid()

    - it should be symmetrical with 'utf8' decode

  * characters > 0x7FFF_FFFF are not utf8::valid.  I think that's a
    good call -- but Perl is not consistent, and will happily produce
    invalid strings...

  * 'UTF-8' is broken:

    - it doesn't know about all the defined non-characters.

    - there should be an option to allow non-characters for internal
      exchange of otherwise strict UTF-8.

    - BTW: the Unicode reference code for UTF-8 to UTF-32 does not
           trouble itself about non-characters.  Nor does UTF-32 to UTF-8.

Chris
--
Chris Hall               highwayman.com
