perl-unicode

Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-13 10:16:13
On Wed, 12 Mar 2008 Juerd Waalboer wrote
Chris Hall wrote 2008-03-12 20:49 (+0000):
  a. are you saying that characters in Perl are Unicode ?

Yes. They are called Unicode, at least. This has my preference for
explanation and documentation.

  b. or are you agreeing that characters in Perl take values
     0..0x7FFF_FFFF (or beyond), which are generally interpreted as
     UCS, where required and possible ?

This too. This is the more technically accurate explanation, and has my
preference for implementation.

'This too' ? Goodness, superposition ! Perl and quantum mechanics ? Suddenly it all becomes clear. Or at least as clear as the uncertainty principle will allow !-)

FWIW, I have tried some of the HTTP, HTML and XML modules. The warnings that pop out every now and then about Unicode or UTF-8 or whatever are less than useful and more than irritating !

If (a) then characters with ordinals beyond 0x10_FFFF should throw
warnings (at least) since they clearly are not Unicode !

Perl just has a somewhat broad definition of "unicode", that is not
the same as the official unicode character set.

BTW, in "2.14 Conforming to the Unicode Standard" I found this gem:

  Unacceptable Behavior

  It is unacceptable for a conforming implementation:

   - To use unassigned codes.

       • U+2073 is unassigned and not usable for ‘3’ (superscript 3) or
         any other character.

This appears to say that unassigned codes should not be transmitted out, just like non-characters ! Which looks like hard work. (On the other hand, applications are supposed to cope with future defined code points...)

Should 'UTF-8' be strict about unassigned codes as well ? What should chr() and "\x{...}" etc. do ?
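As it happens, the Encode module already draws a strict/lax line here, though not for unassigned codes: strict 'UTF-8' polices surrogates and the 0x10_FFFF ceiling, while lax 'utf8' takes anything Perl can represent. A small sketch of the difference, assuming a reasonably recent Encode (with the default CHECK, strict encoding substitutes U+FFFD for what it rejects, rather than dying):

```perl
#!/usr/bin/perl
use strict;
use warnings;
no warnings 'surrogate';     # we build a lone surrogate on purpose
use Encode qw(encode);

my $surrogate = chr(0xD800); # a lone UTF-16 surrogate

# Lax 'utf8' happily emits Perl's internal bytes for it: ED A0 80 ...
my $lax = encode('utf8', $surrogate);
printf "lax:    %vX\n", $lax;      # ED.A0.80

# ... while strict 'UTF-8' refuses, substituting U+FFFD by default
# (pass Encode::FB_CROAK as CHECK to make it die instead).
my $strict = encode('UTF-8', $surrogate);
printf "strict: %vX\n", $strict;
```

Note that neither flavour says anything at all about unassigned code points -- so that answers the question one way, at least.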

This reinforces my view that chr(n) is (a) wrong to whinge about surrogates and non-characters, and (b) wrong to return a character for n outside 0..0x7FFF_FFFF. IMO:

  - chr() shouldn't worry about strict UCS ...

  - ... and doesn't, in any case, do a complete job
    [it does spot all non-characters and surrogates, but ignores
     unassigned codes.]

  - ... however, non-characters are perfectly legal UCS, at least for
    internal use.  One can argue for jumping all over these when
    outputting (strict) UTF-8 for external exchange.

  - ... and 0x11_FFFE is not defined by UCS to be a non-character,
    it's not defined in UCS at all, any more than any other character
    code > U+10_FFFF !

  - chr(n) doesn't whinge about characters > U+10_FFFF !  (Except for
    the non-characters it has invented !)

  - the answer to chr(-1) is 'not a character at all' -- it isn't 'the
    character that stands in place of some unknown character'

  - the utility of characters > 0x7FFF_FFFF is not worth (a) the kludge
    required to extend utf8, or (b) the interoperability issues.

    Even encode/decode 'utf8' take a dim view of chars > 0x7FFF_FFFF.

    I note that utf8::valid() rejects characters > 0x7FFF_FFFF !

  - chr(n) accepts characters > 0x7FFF_FFFF, even though the result
    is not valid per utf8::valid() !!

  - chr(n) warns about p + 0xFFFE and p + 0xFFFF for every value of 'p',
    even those which are beyond the Unicode range !
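To make the whingeing concrete, here is a sketch that simply captures whatever chr() complains about. (Exactly what warns, and with what message, varies by perl version: the perls of this era warned from chr() itself, while later perls defer most of the complaints to I/O time -- so the captured count is printed rather than claimed.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect whatever chr() chooses to whinge about.
my @whinges;
$SIG{__WARN__} = sub { push @whinges, $_[0] };

my $surrogate = chr(0xD800);        # UTF-16 surrogate
my $nonchar   = chr(0xFFFE);        # a genuine UCS non-character
my $beyond    = chr(0x11_FFFE);     # not UCS at all, yet treated as one
my $huge      = chr(0x2000_0000);   # far beyond Unicode

# Whinges or not, chr() returned a one-character string every time:
printf "U+%X U+%X U+%X U+%X\n",
    map { ord } $surrogate, $nonchar, $beyond, $huge;
printf "%d whinge(s) captured\n", scalar @whinges;
```

The point being: the warnings (where they appear) are advisory only -- chr() hands back the character regardless.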

It has its own utf8, it can have its own unicode too :)

And there was I thinking that things were already sufficiently confused :-}

The 'utf8' decode does the Right Thing -- it decodes well-formed UTF-8 up to 0x7FFF_FFFF and handles errors and incomplete sequences and doesn't concern itself with the minutiae of UCS (surrogates, non-characters and unassigned codes).

This is nicely consistent with utf8::valid().

[The only thing I would argue about is the separate treatment of each byte of an invalid sequence -- I'd be tempted to treat 0x00..0x7F and 0xC0..0xFF as terminators of an invalid sequence and 0x80..0xBF as members of an invalid sequence.]

If 'unicode' were to follow that model, then chr() and friends could stop throwing (spurious) warnings around the place.

Sadly, 'utf8' encode doesn't care, and outputs whatever is in the string -- including redundant sequences, invalid sequences, incomplete sequences and Perl's extended sequences for characters > 0x7FFF_FFFF. That is, it will happily output something that utf8::valid() would reject. Note that this "encoding" is outputting something that 'utf8' decode won't accept.

If you really want what 'utf8' encode currently does you can force characters to octets (wax off) and output. The reverse is to input the octets and force to characters (wax on).
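Concretely, the 'wax off'/'wax on' steps are (as I read it) utf8::encode() and utf8::decode(), which convert a string between its character and octet views in place:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = "\x{263A}";          # one character: WHITE SMILING FACE

utf8::encode($s);            # wax off: characters -> octets, in place
# $s is now the three octets E2 98 BA
printf "octets: %d (%vX)\n", length($s), $s;

utf8::decode($s);            # wax on: octets -> characters, in place
printf "chars:  %d (U+%X)\n", length($s), ord($s);
```

Both are always available (no 'use utf8;' needed), and utf8::decode() returns false rather than dying if the octets are not well-formed -- which is what makes the round trip an honest substitute for the current 'utf8' encode.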

Summary of Observations
-----------------------

  * chr(n) and friends are broken:

    - they whinge about things that are none of their business, which is
      not consistent with the notion of (lax) 'unicode'.

    - the whingeing about not-(strict)-Unicode is, moreover, incomplete
      (unassigned codes and codes beyond the UCS range are allowed !)

    - non-characters are perfectly legal -- just not suitable for
      external exchange.

    - projecting non-characters beyond the UCS range is plain odd.

    - they create invalid (per utf8::valid()) strings

    - invalid 'n' should return an 'invalid' (i.e. undef) response

  * 'utf8' encode is broken:

    - it should not output stuff that is not at least utf8::valid()

    - it should be symmetrical with 'utf8' decode

  * characters > 0x7FFF_FFFF are not utf8::valid.  I think that's a
    good call -- but Perl is not consistent, and will happily produce
    invalid strings...

  * 'UTF-8' is broken:

    - it doesn't know about all the defined non-characters.

    - there should be an option to allow non-characters for internal
      exchange of otherwise strict UTF-8.

    - BTW: the Unicode reference code for UTF-8 to UTF-32 does not
           trouble itself about non-characters.  Nor does UTF-32 to UTF-8.

Chris
--
Chris Hall               highwayman.com
