perl-unicode

[PATCH perlunicode.pod, encoding.pm] Implicit upgrading docs

2003-12-09 10:30:08
As a result of a recent discussion in perl-unicode, it was apparent
that these facts were generally unknown to most folks working with
Perl's Unicode model:

    - Byte strings are upgraded to Unicode strings with "Latin-1".
    - Unicode strings are downgraded to Byte strings with "UTF-8".
    - One can change the "Latin-1" part above with "use encoding". 

The two patches below attempt to document them better, by adding a
Caveat item to perlunicode.pod and adding this information to the
encoding.pm module.

Also, perlunicode.pod used to say:

    If strings operating under byte semantics and strings with Unicode
    character data are concatenated, the new string will be upgraded to
    I<ISO 8859-1 (Latin-1)>

but this is wrong.  The new string will be upgraded to Unicode --
it's the old byte string that will be upgraded as Latin-1.  The patch
below also addresses this.
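
As a rough illustration of the behaviour described above (a minimal
sketch, not part of the patch): the byte string here is the UTF-8
encoding of U+4E20, and the variable names are made up for the example.
The second half uses Encode's decode() as an explicit alternative to
the C<encoding> pragma; that substitution is an assumption of this
sketch, not something the patch proposes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Three octets: the UTF-8 encoding of U+4E20, held as a plain byte string.
my $bytes = "\xE4\xB8\xA0";

# A character string with a code point above 255.
my $chars = chr(20000);

# Implicit upgrade: each octet of $bytes is treated as one Latin-1
# character, so the result has 3 + 1 characters.
my $joined = $bytes . $chars;
print length($joined), "\n";             # 4

# Explicit decode (assumption: used here instead of "use encoding"):
# the three octets become a single character before concatenation.
my $decoded = decode('UTF-8', $bytes);
print length($decoded . $chars), "\n";   # 2
```

The first result shows the Latin-1 assumption at work; the second shows
what one gets when the bytes are interpreted as UTF-8 instead.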

Thanks,
/Autrijus/

--- perlunicode.pod.orig        Tue Dec  9 19:50:32 2003
+++ perlunicode.pod     Tue Dec  9 20:22:37 2003
@@ -42,6 +42,21 @@
 You can also use the C<encoding> pragma to change the default encoding
 of the data in your script; see L<encoding>.
 
+=item C<use encoding> needed to upgrade non-Latin-1 byte strings
+
+By default, there is a fundamental asymmetry in Perl's Unicode model:
+implicit upgrading from byte strings to Unicode strings assumes that
+they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
+downgraded with UTF-8 encoding.  This happens because the first 256
+codepoints in Unicode happen to agree with Latin-1.
+
+If you wish to interpret byte strings as UTF-8 instead, use the
+C<encoding> pragma:
+
+    use encoding 'utf8';
+
+See L</"Byte and Character Semantics"> for more details.
+
 =back
 
 =head2 Byte and Character Semantics
@@ -86,12 +101,12 @@
 be used to force byte semantics on Unicode data.
 
 If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be upgraded to
-I<ISO 8859-1 (Latin-1)>, even if the old Unicode string used EBCDIC.
-This translation is done without regard to the system's native 8-bit
-encoding, so to change this for systems with non-Latin-1 and 
-non-EBCDIC native encodings use the C<encoding> pragma.  See
-L<encoding>.
+character data are concatenated, the new string will be created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC.  This translation is done without
+regard to the system's native 8-bit encoding.  To change this for
+systems with non-Latin-1 and non-EBCDIC native encodings, use the
+C<encoding> pragma.  See L<encoding>.
 
 Under character semantics, many operations that formerly operated on
 bytes now operate on characters. A character in Perl is
--- encoding.pm.orig    Tue Dec  9 19:50:37 2003
+++ encoding.pm Tue Dec  9 20:32:15 2003
@@ -192,6 +192,25 @@
 
 You can override this by giving extra arguments; see below.
 
+=head2 Implicit upgrading for byte strings
+
+By default, if strings operating under byte semantics and strings
+with Unicode character data are concatenated, the new string will
+be created by decoding the byte strings as I<ISO 8859-1 (Latin-1)>.
+
+The B<encoding> pragma changes this to use the specified encoding
+instead.  For example:
+
+    use encoding 'utf8';
+    my $string = chr(20000); # a Unicode string
+    utf8::encode($string);   # now it's a UTF-8 encoded byte string
+    # concatenate with another Unicode string
+    print length($string . chr(20000));
+
+This will print C<2>, because C<$string> is upgraded as UTF-8.  Without
+C<use encoding 'utf8';>, it will print C<4> instead, since C<$string>
+is three octets when interpreted as Latin-1.
+
 =head1 FEATURES THAT REQUIRE 5.8.1
 
 Some of the features offered by this pragma requires perl 5.8.1.  Most
