On Tue, 11 Mar 2008 Juerd Waalboer wrote
Chris Hall wrote 2008-03-11 21:09 (+0000):
OK. In the meantime, IMHO chr(n) should handle utf8 and has no
business worrying about things which UTF-8 or UCS think aren't
characters.
It should do Unicode, not any specific byte encoding, like UTF-8.
IMHO chr(n) should do characters, which may be interpreted as per
Unicode, but may not.
When I said utf8 I was following the (sloppy) convention that utf8 means
how Perl handles characters in strings...
...the naming is a cause of confusion. For the avoidance of doubt, this
is what I understand the position to be:
a. characters in Perl have integer values in 0..0x7FFF_FFFF (or more).
It appears that what is actually going on is that the limit is
the size of the local unsigned Perl integer. One can debate the
marginal utility of that vs the scope for confusion.
b. in a Perl string, characters are held in a UTF-8 like form.
Where UTF-8 (upper case, with hyphen) means the RFC 3629 &
Unicode Consortium defined byte-wise encoding.
Current UTF-8 defines encoding for values 0..0xD7FF and
0xE000..0x10_FFFF, which is exactly the current UCS range (less
the 'surrogates').
Note that this limits UTF-8 to 4 byte sequences, explicitly
excluding:
* sequences that have shorter equivalents ('redundant')
* 0xD800..0xDFFF -- the 'surrogates'
* 0x11_0000..0x1F_FFFF -- beyond UCS range
Older versions of the standard allowed for values 0..0x7FFF_FFFF,
but also excluded the 'redundant' sequences and (I believe) the
'surrogates'.
The encoding used by Perl stretches the range to 2^72-1. This
is incompatible with even the older versions of UTF-8.
This form is referred to as utf8 (lower case, no hyphen).
There is really no need to discuss this, except in the context of
messing around in the guts of Perl.
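To see the extended form in action (a sketch, assuming a 64-bit perl with the Encode module to hand), the lax 'utf8' encoding will happily serialise the top of the old UTF-8 range, using one of the long sequences that current UTF-8 forbids:

```perl
use strict;
use warnings;
use Encode qw(encode);

# 0x7FFF_FFFF is the top of the old UTF-8 range: it needs a
# six-byte sequence, which strict UTF-8 no longer permits but
# Perl's lax 'utf8' still produces.
my $c     = do { no warnings; chr(0x7FFF_FFFF) };
my $bytes = encode('utf8', $c);
printf "%d bytes: %vX\n", length($bytes), $bytes;
# 6 bytes: FD.BF.BF.BF.BF.BF
```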
c. when Perl wishes to assign some meaning to a character value
it interprets it as a Unicode Code Point, if it can.
There are huge areas of the Unicode space that have no current
meaning. There are areas which may have local meaning ("Private
Use"). In addition Perl allows character values that are beyond
current Unicode space.
In the abstract, characters in Perl are not Unicode (UCS). But
most of the time one treats them as if they were.
String literals are represented by UCS code points, which
reinforces the feeling that characters in Perl are Unicode.
'C' uses 'wide' to refer to characters that may have values
> 255. IMHO it's a shame that Perl did not follow this.
d. when exchanging character data with other systems one needs to
deal with character set and encoding issues.
The 'UTF-8' encoding (character set) covers the UCS character set
(values 0..0x10_FFFF, currently) and the (current) standard UTF-8
encoding. 'UTF-8' also worries about some 'suspect' (my term) UCS
values, see below.
The 'utf8' encoding (character set) is a super set of current UTF-8
(values 0..0x7FFF_FFFF) -- corresponding to earlier UTF-8. 'utf8'
does not concern itself about any 'suspect' UCS values.
[Actually, that's not entirely true. 'utf8' encode happily deals
with characters all the way up to 2^64-1 (and perhaps, beyond),
using Perl's extended encoding. However, 'utf8' decode treats
anything > 0x7FFF_FFFF as invalid.]
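The difference shows up readily with a value just beyond UCS (a sketch, assuming a reasonably recent Encode; with the default check, strict encode substitutes U+FFFD rather than dying):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $c = do { no warnings; chr(0x11_0000) };  # one past the UCS maximum

my $lax    = encode('utf8',  $c);  # F4.90.80.80 -- four bytes, ill-formed UTF-8
my $strict = encode('UTF-8', $c);  # EF.BF.BD -- substituted with U+FFFD
printf "lax=%vX strict=%vX\n", $lax, $strict;
```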
e. The 'suspect' UCS values.
These are:
* U+D800..U+DBFF and U+DC00..U+DFFF (High- and Low-surrogate,
respectively). Where these are used they should appear in
pairs, High followed by Low.
Unicode 5.0.0 says:
"Surrogate pairs are used only in UTF-16."
"Isolated surrogate code units have no interpretation on
their own."
"Surrogate code points cannot be conformantly interchanged
using Unicode encoding forms."
"Unicode scalar value: Any Unicode code point except high-
surrogate and low-surrogate code points."
All the Unicode encodings are defined in terms of Unicode
scalar value. There is by definition no way to exchange
these characters, and no meaning is attached to them.
Clearly these are illegal in UTF-8.
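That can be demonstrated with Encode's strict/lax pair (a sketch; the lax 'utf8' encoding emits the three bytes a surrogate would occupy, which strict 'UTF-8' refuses to produce):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $hi = do { no warnings; chr(0xD800) };   # an isolated high surrogate

my $lax    = encode('utf8',  $hi);  # ED.A0.80 -- ill-formed as UTF-8
my $strict = encode('UTF-8', $hi);  # EF.BF.BD -- replaced by U+FFFD
printf "lax=%vX strict=%vX\n", $lax, $strict;
```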
* U+FFFE and U+FFFF and the last two code points in every
other Unicode plane are noncharacters.
[Unicode code space is divided into 17 'planes' of 65,536
characters, each. So characters U+01_FFFE, U+01_FFFF,
U+02_FFFE, U+02_FFFF, ... U+10_FFFE and U+10_FFFF are all
noncharacters.]
The code points U+FDD0..U+FDEF are also noncharacters.
Unicode 5.0.0 says:
"Applications are free to use any of these noncharacter code
points internally but should never attempt to exchange
them. If a noncharacter is received in open interchange, an
application is not required to interpret it in any way. It
is good practice, however, to recognize it as a
noncharacter and to take appropriate action, such as
removing it from the text."
"Noncharacter code points are reserved for internal use,
such as for sentinel values. They should never be
interchanged. They do, however, have well-formed
representations in Unicode encoding forms and survive
conversions between encoding forms. This allows sentinel
values to be preserved internally across Unicode encoding
forms, even though they are not designed to be used in open
interchange."
So, assuming UTF-8 is used for "open interchange", these are
also invalid.
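The "survive conversions" point is easy to check (a sketch using Encode's lax 'utf8', which applies no interchange checks):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+FFFE is a noncharacter, but its encoded form is well-formed,
# so a lax round trip preserves it exactly.
my $bytes = encode('utf8', "\x{FFFE}");     # EF.BF.BE
my $back  = decode('utf8', $bytes);
printf "U+%04X\n", ord($back);              # U+FFFE
```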
* U+FFFD -- the Replacement Character
Unicode 5.0.0 says:
"U+FFFD replacement character is the general substitute
character in the Unicode Standard. It can be substituted
for any 'unknown' character in another encoding that cannot
be mapped in terms of known Unicode characters."
This is generally legal.
However on the topic of "Reserved and Private-Use Character
Codes" the standard also counsels:
"An implementation should not blindly delete such
characters, nor should it unintentionally transform them
into something else."
Any corrections required would be appreciated, and may also inform any
"lurkers".
Internally, a byte encoding is needed. As a programmer I don't want to
be bothered with such implementation details.
Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
('UTF-8') are happy with. Unicode defines both 0xFFFE and 0xFFFF as
noncharacters, not just 0xFFFF (which Encode::en/decode do deem
invalid).
Personally, I think Perl should accept these characters without warning,
except when the strict UTF-8 encoding is requested (which differs from
the non-strict utf8 encoding).
I agree -- chr(n) and 'utf8' (lax) should happily process anything
0..0x7FFF_FFFF as characters -- which may or may not be UCS.
I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
friends) in the same way as U+FFFF (and friends).
In any case, is chr(n) supposed to be utf8 or UTF-8? AFAICS, it's
neither.
It's supposed to be neither on the outside. Internally, it's utf8.
One can turn off the warnings and then chr(n) will happily take any +ve
integer and give you the equivalent character -- so the result is utf8.
The result is Unicode.
IMHO the result of chr(n) should just be a character.
The difference between Unicode and UTF8 is not
always clear, but in this case is: the character is Unicode, a single
codepoint, the internal implementation is UTF8.
Unicode: U+20AC (one character: €)
UTF-8: E2 82 AC (three bytes)
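In Perl terms the distinction looks like this (a sketch, assuming the Encode module):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro  = "\x{20AC}";             # Unicode: one character, U+20AC
my $bytes = encode('UTF-8', $euro); # UTF-8: three bytes, E2 82 AC
printf "chars=%d bytes=%d (%vX)\n",
    length($euro), length($bytes), $bytes;   # chars=1 bytes=3 (E2.82.AC)
```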
I am under the impression that you know the difference and made an
honest mistake. My detailed expansion is also for lurkers and archives.
OK, sure. I was using utf8 to mean any character value you like, and
UTF-8 to imply a value which is recognised in UCS -- rather than the
encoding.
[replacement character]
So we'll have to differ on this :-)
Yes, although my opinion on this is not strong. undef or replacement
character - both are good options. One argument in favor of the
replacement character would be backwards compatibility.
Well, having concluded that the result of chr(n) should be just a
character -- to be interpreted one way or another, later -- returning
"\xFFFD" for chr(-1) looks perverse!
FWIW I note that printf "%vX" is suggested as a means to render IPv6
addresses. This implies the use of a string containing eight characters
0..0xFFFF as the packed form of IPv6. Building one of those using
chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF!
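For instance (a sketch; the address is an arbitrary documentation-range value):

```perl
use strict;
use warnings;

# Pack an IPv6 address as eight characters, each holding one
# 16-bit group, then render it with the version-string format.
my $addr = join '', map { chr hex }
           qw(2001 0DB8 0000 0000 0000 0000 0000 0001);
my $text = sprintf "%vX", $addr;
print "$text\n";   # 2001.DB8.0.0.0.0.0.1
```

A group of 0xFFFE or 0xFFFF in the address would pass through chr() just the same, warnings aside.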
Chris
--
Chris Hall highwayman.com +44 7970 277 383