perl-unicode

Re: Source data for perl encodings

2001-01-08 07:10:54
James points out that Unicode have tightened up the definition of UTF-8.

So perl's internal encoding may need careful definition to keep 
the pedants at bay.

(And IIRC Tcl was using a non-minimal length encoding of '\0' so that
strlen() does not stop at an "embedded NUL" - so they have a problem ;-) )
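
To make the non-minimal NUL concrete, here is a minimal sketch in Python
(not Perl's or Tcl's actual code; the helper names are made up for
illustration). It assumes the encoding in question is the two-byte form
0xC0 0x80, which is what a non-shortest U+0000 looks like:

```python
# U+0000 encoded the shortest way is the single byte 0x00.
# The two-byte pattern 110xxxxx 10xxxxxx with all payload bits zero
# gives 0xC0 0x80: the same value carried in a longer form.
OVERLONG_NUL = bytes([0xC0, 0x80])


def payload_of_two_byte_form(seq):
    """Extract the 11 payload bits from a two-byte UTF-8-style sequence."""
    lead, trail = seq
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a two-byte sequence")
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def strict_decode_ok(seq):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The sequence contains no 0x00 byte, so a C strlen() would run straight
past it, yet its payload is zero. A decoder that enforces shortest form
refuses it, which is the problem hinted at above.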

James <james(_at_)rf(_dot_)net> writes:

This is the same question every i18N person will ask, because with
i18N the answer is always, "It depends."

Do you mean Unicode or do you mean ISO 10646?

I am not an expert on the differences. Perl characters are now "logically"
(up to at least) 32-bit values held internally as UTF-8 encoded strings.
The language-visible properties (case, alpha-ness, digit-ness, ...)
are derived from the tables at ftp.unicode.org - the 3.0.1 version.

Unicode and the ISO folks have agreed to use the same codepoints.

However, the ISO folks have not adopted Unicode's extensive property
and algorithm enhancements to the raw codepoint tables.

So for now as long as Perl obeys the Unicode standard and TRs, then
ISO 10646 can be ignored.

And if it doesn't obey all the rules - can we claim ISO 10646 conformance
even though we don't reach Unicode status?
Not that we are going to deliberately break the rules, but unless
someone does an "audit" we will not be sure...


I have the impression that the ISO people favor UTF-32 more and more, so
enabling conversion from UTF-8 to UTF-32 is a helpful consideration.
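
As a rough illustration of that conversion, a sketch in Python using only
its built-in codecs (the function name utf8_to_utf32be is hypothetical,
and big-endian output without a byte order mark is an assumption, not
something the thread specifies):

```python
def utf8_to_utf32be(data):
    """Convert UTF-8 bytes to big-endian UTF-32 bytes, four per character."""
    # Strict decoding: malformed or non-shortest input raises
    # UnicodeDecodeError instead of being passed through.
    text = data.decode("utf-8")
    # Naming the byte order explicitly means no BOM is prepended.
    return text.encode("utf-32-be")
```

Each character comes out as exactly four bytes, so the result can be
indexed by character position directly.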

Unicode Consortium FAQ: Unicode & ISO 10646
http://www.unicode.org/unicode/faq/unicode_iso.html

Also, you're probably already aware of the UTF-8 Corrigendum announcement
from Nov. 30:

"The Unicode Technical Committee has modified the definition of UTF-8 to
forbid conformant implementations from interpreting non-shortest forms for
BMP characters, and clarified some of the conformance clauses. For more
information, see
http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html"
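
What "forbid interpreting non-shortest forms" amounts to in a decoder can
be sketched as follows. This is an illustrative Python sketch, not the
corrigendum's text and not Perl's implementation; the names are invented,
and it assumes sequences of at most four bytes and does not check for
surrogates or an upper code point limit:

```python
# Smallest code point that legitimately needs each sequence length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_utf8_strict(data):
    """Decode bytes to a list of code points, rejecting non-shortest forms."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, cp = 1, lead
        elif lead & 0xE0 == 0xC0:
            length, cp = 2, lead & 0x1F
        elif lead & 0xF0 == 0xE0:
            length, cp = 3, lead & 0x0F
        elif lead & 0xF8 == 0xF0:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + length > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        # Fold in six payload bits from each continuation byte.
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte at offset %d" % i)
            cp = (cp << 6) | (b & 0x3F)
        # The shortest-form rule: the value must be too large to fit
        # in any shorter sequence.
        if cp < MIN_FOR_LENGTH[length]:
            raise ValueError("non-shortest form at offset %d" % i)
        out.append(cp)
        i += length
    return out
```

The only addition over a naive decoder is the final comparison against
the minimum value for the sequence length, so the check is cheap to add.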

James.
-- 
Nick Ing-Simmons <nik(_at_)tiuk(_dot_)ti(_dot_)com>
Via, but not speaking for: Texas Instruments Ltd.