On Wed, 12 Mar 2008, Juerd Waalboer wrote:
> Chris Hall wrote 2008-03-12 20:49 (+0000):
>> a. are you saying that characters in Perl are Unicode ?
>
> Yes. They are called Unicode, at least. This has my preference for
> explanation and documentation.
>
>> b. or are you agreeing that characters in Perl take values
>> 0..0x7FFF_FFFF (or beyond), which are generally interpreted as
>> UCS, where required and possible ?
>
> This too. This is the more technically accurate explanation, and has my
> preference for implementation.
'This too' ? Goodness, superposition ! Perl and quantum mechanics ?
Suddenly it all becomes clear. Or at least as clear as the uncertainty
principle will allow !-)
FWIW, I have tried some of the HTTP, HTML and XML modules. The warnings
that pop out every now and then about Unicode or UTF-8 or whatever are
less than useful and more than irritating !
If (a) then characters with ordinals beyond 0x10_FFFF should throw
warnings (at least) since they clearly are not Unicode !
> Perl just has a somewhat broad definition of "unicode", that is not
> the same as the official unicode character set.
BTW, in "2.14 Conforming to the Unicode Standard" I found this gem:
Unacceptable Behavior
It is unacceptable for a conforming implementation:
- To use unassigned codes.
• U+2073 is unassigned and not usable for ‘3’ (superscript 3) or
any other character.
This appears to say that unassigned codes should not be transmitted, just
like non-characters ! Which looks like hard work. (On the other hand,
applications are supposed to cope with code points that are assigned in
future versions of the standard...)
Should 'UTF-8' be strict about unassigned codes as well ? What should
chr() and "\x{...}" etc. do ?
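As a concrete probe (a sketch only -- which of these warn, and what they
warn about, has varied between perl versions):

```perl
#!/usr/bin/perl
use strict;
no warnings;    # several of these chr() calls warn; silenced for the demo

# Probe chr() across the interesting boundaries: ASCII, a surrogate,
# a non-character, the top of Unicode, just past it, and the top of
# the old 31-bit UCS range.
for my $n (0x61, 0xD800, 0xFFFE, 0x10_FFFF, 0x11_0000, 0x7FFF_FFFF) {
    my $c = chr($n);
    printf "chr(0x%X) -> 1 char, ord 0x%X\n", $n, ord($c);
}
```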
This reinforces my view that chr(n) is (a) wrong to whinge about
surrogates and non-characters, and (b) wrong to return a character for n
outside 0..0x7FFF_FFFF. IMO:
- chr() shouldn't worry about strict UCS ...
- ... and doesn't, in any case, do a complete job
[it does spot all non-characters and surrogates, but ignores
unassigned codes.]
- ... however, non-characters are perfectly legal UCS, at least for
internal use. One can argue for jumping all over these when
outputting (strict) UTF-8 for external exchange.
- ... and 0x11_FFFE is not defined by UCS to be a non-character,
it's not defined in UCS at all, any more than any other character
code > U+10_FFFF !
- chr(n) doesn't whinge about characters > U+10_FFFF ! (Except for
the non-characters it has invented !)
- the answer to chr(-1) is 'not a character at all' -- it isn't 'the
character that stands in place of some unknown character'
- the utility of characters > 0x7FFF_FFFF is not worth (a) the kludge
required to extend utf8, or (b) the interoperability issues.
Even encode/decode 'utf8' take a dim view of chars > 0x7FFF_FFFF.
I note that utf8::valid() rejects characters > 0x7FFF_FFFF !
- chr(n) accepts characters > 0x7FFF_FFFF, even though the result
is not valid per utf8::valid() !!
- chr(n) warns about p + 0xFFFE and p + 0xFFFF for every value of 'p',
even those which are beyond the Unicode range !
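The asymmetry is easy to see (a sketch; the answers depend on which perl
version and build you run it under):

```perl
use strict;
no warnings;    # chr() of these values warns; silenced for the demo

# Does chr() hand back strings that utf8::valid() then disowns?
for my $n (0x10_FFFF, 0x7FFF_FFFF, 0xFFFF_FFFF) {
    my $s = chr($n);
    printf "chr(0x%X): utf8::valid() says %s\n",
        $n, utf8::valid($s) ? "valid" : "NOT valid";
}
```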
> It has its own utf8, it can have its own unicode too :)
And there was I thinking that things were already sufficiently confused
:-}
The 'utf8' decode does the Right Thing -- it decodes well-formed UTF-8
up to 0x7FFF_FFFF and handles errors and incomplete sequences and
doesn't concern itself with the minutiae of UCS (surrogates,
non-characters and unassigned codes).
This is nicely consistent with utf8::valid().
[The only thing I would argue about is the separate treatment of each
byte of an invalid sequence -- I'd be tempted to treat 0x00..0x7F and
0xC0..0xFF as terminators of an invalid sequence and 0x80..0xBF as
members of an invalid sequence.]
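In code, the rule I have in mind is roughly this (my own sketch, not
anything Perl currently does):

```perl
use strict;
use warnings;

# Split a byte string into candidate sequences: 0x80..0xBF attach to
# the current sequence as members; anything else (0x00..0x7F and
# 0xC0..0xFF) terminates it and starts a new one.  Validating each
# candidate would then be a separate step.
sub split_sequences {
    my ($octets) = @_;
    my @seqs;
    for my $byte (unpack 'C*', $octets) {
        if (@seqs && $byte >= 0x80 && $byte <= 0xBF) {
            push @{ $seqs[-1] }, $byte;     # member of current sequence
        }
        else {
            push @seqs, [$byte];            # terminator: new sequence
        }
    }
    return @seqs;
}

# "A" followed by two stray continuation bytes, then a well-formed
# U+00E9: groups into two candidate sequences.
my @seqs = split_sequences("\x41\x80\x80\xC3\xA9");
printf "%d candidate sequences\n", scalar @seqs;   # prints "2 candidate sequences"
```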
If 'unicode' were to follow that model, then chr() and friends could
stop throwing (spurious) warnings around the place.
Sadly, 'utf8' encode doesn't care, and outputs whatever is in the
string -- including redundant sequences, invalid sequences, incomplete
sequences and Perl's extended sequences for > 0x7FFF_FFFF. That is, it
will happily output something that utf8::valid() would reject. Note that
this "encoding" can output something that 'utf8' decode won't accept.
If you really want what 'utf8' encode currently does you can force
characters to octets (wax off) and output. The reverse is to input the
octets and force to characters (wax on).
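The in-place utf8::encode()/utf8::decode() pair is one way to do that
forcing (a sketch):

```perl
use strict;
use warnings;

# "wax off": force a character string down to its UTF-8 octets.
my $chars  = "caf\x{E9}";         # four characters, the last is U+00E9
my $octets = $chars;
utf8::encode($octets);            # in place: now five octets
printf "%d chars, %d octets\n", length($chars), length($octets);

# "wax on": force the octets back up to characters.  utf8::decode()
# returns false if the octets are not well-formed, which gives a
# cheap validity check on input.
utf8::decode($octets) or die "not well-formed UTF-8";
print "round trip ok\n" if $octets eq $chars;
```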
Summary of Observations
-----------------------
* chr(n) and friends are broken:
- they whinge about things that are none of their business, which is
not consistent with the notion of (lax) 'unicode'.
- the whingeing about not-(strict)-Unicode is, moreover, incomplete
(unassigned codes and codes beyond the UCS range are allowed !)
- non-characters are perfectly legal -- just not suitable for
external exchange.
- projecting non-characters beyond the UCS range is plain odd.
- they create invalid (per utf8::valid()) strings
- invalid 'n' should return an 'invalid' (i.e. undef) response
* 'utf8' encode is broken:
- it should not output stuff that is not at least utf8::valid()
- it should be symmetrical with 'utf8' decode
* characters > 0x7FFF_FFFF are not utf8::valid. I think that's a
good call -- but Perl is not consistent, and will happily produce
invalid strings...
* 'UTF-8' is broken:
- it doesn't know about all the defined non-characters.
- there should be an option to allow non-characters for internal
exchange of otherwise strict UTF-8.
- BTW: the Unicode reference code for UTF8 to UTF32 does not trouble
itself about non-characters. Nor does UTF32 to UTF8.
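The lax/strict split shows up clearly in Encode (a sketch; the exact
behaviour, particularly for non-characters, has shifted between Encode
versions):

```perl
use strict;
use warnings;
use Encode qw(encode);

# A lone surrogate: lax 'utf8' will happily serialise it, while strict
# 'UTF-8' refuses (FB_CROAK makes the refusal fatal rather than a
# substitution).
my $s = do { no warnings 'surrogate'; chr(0xD800) };

my $lax = encode('utf8', $s);
printf "lax 'utf8' gave %d octets\n", length($lax);   # 3 octets

my $strict = eval { encode('UTF-8', $s, Encode::FB_CROAK) };
print defined $strict ? "strict 'UTF-8' accepted it ?!\n"
                      : "strict 'UTF-8' refused: $@";
```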
Chris
--
Chris Hall highwayman.com