perl-unicode

Re: Announce: Perl, Unicode and I18N FAQ

1999-12-17 18:03:09
First, Q8 seems to have moved down to between Q10 and Q11 for some reason.

: Unicode is stored as UTF-8 in Perl for 2 reasons: 
: 
:  1.UTF-8 is more compact for Western European languages than UTF-16,
:    because non-accented characters only require 1 byte in UTF-8.
:
:  2.Perl is written in C. Since C interprets the null character as end
:    of string, UTF-16, which often contains 0x00 with European
:    languages, cannot easily be used.

Hmm.  #1 is not the important reason, and #2 isn't strictly true, since
the C code in Perl doesn't much care whether Perl strings contain nulls.

The most important reason Perl supports UTF-8 is that I expect almost
all data to be exchanged as UTF-8, so why do an extra conversion?
And since Perl is mostly used for flexible text processing anyway,
it doesn't matter much that the character encoding is of variable
length.

: A6. Lots of things to watch. 
: 
:     * use ctype character routines, not direct character comparison
:        (islower(), isalpha(), etc.) 
:     * use constants from limits.h 
:     * use unsigned char's and avoid sign extension problems 

This advice seems to be based on the notion that we're using ints to hold
characters.  People writing to the Perl API will need to deal instead
with UTF-8 strings, which means they'll have to use routines like

    isALPHA_utf8(p)

where p is a string pointer, not an int.  (There are also routines for
dealing with characters as integers, but they don't get used much.)

Since we're dealing with characters embedded in a UTF-8 string, there
are also special ways of advancing (and retreating) one character at
a time.  Typically you'll see loops like this:

    while (s < send && isALNUM_utf8(s))
        s += UTF8SKIP(s);

This code presumes you know you're dealing with a UTF-8 string.  It
would misbehave on ISO-8859-1, for instance, where any byte above 0x7F
would be misread as part of a multibyte sequence.  It does work on
ASCII, since ASCII text is byte-for-byte valid UTF-8.
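
To make the stepping concrete, here is a minimal standalone sketch of
advancing and retreating by one character.  It does not use the Perl
headers: utf8_skip(), utf8_count() and utf8_prev() are hypothetical
names invented for this illustration, not Perl API routines, and the
lead-byte table assumes sequences of at most 4 bytes.  Perl's own
UTF8SKIP may be implemented quite differently.

```c
#include <stddef.h>

/* Hypothetical stand-in for the idea behind UTF8SKIP: the length in
 * bytes of the sequence whose lead byte is *s.  Assumes sequences of
 * at most 4 bytes; this is not Perl's implementation. */
size_t utf8_skip(const unsigned char *s)
{
    if (*s < 0x80)           return 1;  /* 0xxxxxxx: plain ASCII   */
    if ((*s & 0xE0) == 0xC0) return 2;  /* 110xxxxx: 2-byte lead   */
    if ((*s & 0xF0) == 0xE0) return 3;  /* 1110xxxx: 3-byte lead   */
    if ((*s & 0xF8) == 0xF0) return 4;  /* 11110xxx: 4-byte lead   */
    return 1;  /* continuation or invalid byte: step a single byte */
}

/* Count characters in [s, send) by hopping from lead byte to lead
 * byte.  The hop is clamped so the pointer never passes send. */
size_t utf8_count(const unsigned char *s, const unsigned char *send)
{
    size_t n = 0;
    while (s < send) {
        size_t skip = utf8_skip(s);
        if (skip > (size_t)(send - s))
            skip = (size_t)(send - s);
        s += skip;
        n++;
    }
    return n;
}

/* Retreat one character: step back one byte, then keep backing up
 * over continuation bytes (10xxxxxx) until a lead byte is reached. */
const unsigned char *utf8_prev(const unsigned char *start,
                               const unsigned char *s)
{
    if (s > start)
        s--;
    while (s > start && (*s & 0xC0) == 0x80)
        s--;
    return s;
}
```

Given the five UTF-8 bytes 'c' 'a' 'f' 0xC3 0xA9, utf8_count() reports
4 characters, and utf8_prev() from the end lands on the 0xC3 lead byte.
Given the six ISO-8859-1 bytes 'c' 'a' 'f' 0xE9 '!' '!', it also
reports 4 rather than 6, because 0xE9 looks like a 3-byte lead and
swallows the two bytes after it.  That is the sort of misbehavior to
expect when the UTF-8 assumption is wrong.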

Larry