perl-unicode

utf8(3pm) man page :-(

2007-05-31 15:54:50
I have just read the utf8(3pm) man page as it comes with perl v5.8.8 and
I'm afraid I found it *very* confusing and well below the generally
very high standards of clarity found in most of the Perl documentation.

There is a wild mixture of terminology that is never properly defined
anywhere. For example, no clear distinction is made whether "a string is
in UTF-8" means that the UTF-8 flag has been set (character semantics
versus byte semantics), or whether the string's internal representation
does not contain any malformed UTF-8 byte sequences, or both, or neither.
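
To make the two readings concrete, here is a minimal sketch of the
difference as I would expect it to behave, assuming perl 5.8.1 or later
(the first version documented to have utf8::is_utf8) on an ASCII
platform. The variable names are my own:

```perl
use strict;
use warnings;

# The two octets C3 A9 are the UTF-8 encoding of U+00E9 (e with acute),
# written with \x escapes so that Perl holds them as plain bytes.
my $bytes = "\xC3\xA9";

# utf8::is_utf8 reports only the internal flag; it does not inspect
# whether the bytes happen to form well-formed UTF-8.
my $flag_before = utf8::is_utf8($bytes) ? 1 : 0;
my $len_before  = length $bytes;

# utf8::decode reinterprets the octets as UTF-8, in place.
my $chars = $bytes;
my $ok    = utf8::decode($chars) ? 1 : 0;

my $flag_after = utf8::is_utf8($chars) ? 1 : 0;
my $len_after  = length $chars;

print "before: flag=$flag_before length=$len_before\n";           # flag=0 length=2
print "after:  flag=$flag_after length=$len_after ok=$ok\n";      # flag=1 length=1 ok=1
```

So the same two octets can be "in UTF-8" in the well-formedness sense
while the flag is off, which is exactly the distinction the man page
never spells out.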

Basically, I could not work out with any confidence what any of the
"utility functions" described really do, that is, whether they only
affect the byte/character flag of a string or whether (and under which
conditions exactly) they also change the byte sequence itself.

There are a number of applications in which a Perl developer is
continuously dealing with a mixture of both byte and character
sequences, and these will not go away. Think about binary file formats
or machine code (byte sequences) that contain embedded UTF-8 strings
(character sequences) that each need to be treated as such, but that
also need to be concatenated or separated in various ways. Or think
about Perl code that robustly searches for and prints diagnostics about
malformed UTF-8 sequences. In such applications, the low-level control
over the byte-versus-character nature of a Perl string that the utf8::
functions provide is extremely important, and a clearer writeup of what
exactly they do would be very helpful.
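
As a rough sketch of the kind of code I have in mind, consider a
made-up binary record consisting of a 2-byte big-endian length followed
by that many octets of UTF-8. The record layout and the names
pack_record and unpack_record are hypothetical, invented only for this
illustration:

```perl
use strict;
use warnings;

# Hypothetical record layout (my assumption, not a real file format):
# 2-byte big-endian length, then that many octets of UTF-8 text.
sub pack_record {
    my ($text) = @_;
    my $octets = $text;
    utf8::encode($octets);    # characters -> UTF-8 octets, in place
    return pack("n", length $octets) . $octets;
}

sub unpack_record {
    my ($record) = @_;
    my ($len)   = unpack("n", $record);
    my $payload = substr($record, 2, $len);
    # utf8::decode returns false if the octets are not valid UTF-8.
    utf8::decode($payload) or return;
    return $payload;
}

my $name   = "caf\x{E9}";             # 4 characters
my $record = pack_record($name);      # 2 length octets + 5 UTF-8 octets
my $back   = unpack_record($record);

# C3 must be followed by a continuation byte (80..BF); 28 is not one.
my $bad = unpack_record(pack("n", 2) . "\xC3\x28");

print "record length: ", length($record), "\n";
print "round trip ok\n" if $back eq $name;
print defined $bad ? "decoded\n" : "malformed UTF-8 detected\n";
```

Here the byte-level parts (pack, substr) and the character-level parts
have to coexist in one program, and the only thing telling them apart
is what utf8::encode and utf8::decode do to the string.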

Given how important these functions are for such applications, the many
references to "this may change in the future" also create a lot of
fear, uncertainty and doubt for anyone who wants to use them. :-(

Example:

       Utility functions

       The following functions are defined in the "utf8::" package by the Perl
       core.  You do not need to say "use utf8" to use these and in fact you
       should not say that  unless you really want to have UTF-8 source code.

       * $num_octets = utf8::upgrade($string)
           Converts in-place the octet sequence in the native encoding
           (Latin-1 or EBCDIC) to the equivalent character sequence in UTF-X.
           [What exactly is meant by "native encoding" these days?]
           $string already encoded as characters does no harm.
           [What does "no harm" mean exactly?]
           Returns the number of octets necessary to represent the string
           as UTF-X.
           [Examples of all the major cases how this function can behave?]

           Can be used to make sure that the UTF-8 flag is on [is that all
           it does?], so that "\w" or "lc()" work as Unicode [?] on strings
           containing [UTF-8?] characters in the range 0x80-0xFF
           (on ASCII and derivatives [?]).

           Note that this function does not handle arbitrary encodings.
           [Which cases does it handle?]
           Therefore Encode.pm is recommended for the general purposes.
           [Example?]

           Affected by the encoding pragma. [How?]

       * $success = utf8::downgrade($string[, FAIL_OK])
           Converts in-place the character sequence in UTF-X to the equivalent
           octet sequence in the native encoding (Latin-1 or EBCDIC).  $string
           already encoded as octets does no harm.  Returns true on success.
           On failure dies or, if the value of "FAIL_OK" is true, returns
           false.  Can be used to make sure that the UTF-8 flag is off, e.g.
           when you want to make sure that the substr() or length() function
           works with the usually faster byte algorithm.

           Note that this function does not handle arbitrary encodings.
           Therefore Encode.pm is recommended for the general purposes.

           [Same problems as above]

           Not affected by the encoding pragma.

           NOTE: this function is experimental and may change or be removed
           without notice. [:-(]

       * utf8::encode($string)
           Converts in-place the character sequence to the corresponding octet
           sequence in UTF-X.  The UTF-8 flag is turned off.  Returns nothing.
           [Does this mean that the byte sequence is never touched and
           all this function does is turn off the UTF-8 flag?]

           Note that this function does not handle arbitrary encodings.
           Therefore Encode.pm is recommended for the general purposes [?].

       * utf8::decode($string)
           Attempts to convert in-place the octet sequence in UTF-X to the
           corresponding character sequence.  The UTF-8 flag is turned on only
           if the source string contains multiple-byte UTF-X characters.  If
           $string is invalid as UTF-X, returns false; otherwise returns true.

           Note that this function does not handle arbitrary encodings.
           Therefore Encode.pm is recommended for the general purposes.

           NOTE: this function is experimental and may change or be removed
           without notice. [:-( why?]

       * $flag = utf8::is_utf8(STRING)
           (Since Perl 5.8.1)  Test whether STRING is in UTF-8.  Functionally
           the same as Encode::is_utf8().
           [Does this just return the UTF-8 flag, or does it test the
           string, and if the latter, against what exact regexp?]

       * $flag = utf8::valid(STRING)
           [INTERNAL] Test whether STRING is in a consistent state regarding
           UTF-8. [What exactly does this mean?]
           Will return true is [sic!] well-formed UTF-8 and has the UTF-8
           flag on or if string is held as bytes (both these states are
           'consistent').  Main reason for this routine is to allow Perl's
           test-suite to check that operations have left strings in a
           consistent state.  You most probably want to use utf8::is_utf8()
           instead.

       "utf8::encode" is like "utf8::upgrade", but the UTF8 flag is cleared.
       [Also, one requires a character sequence, the other an octet sequence!]

       See perlunicode for more on the UTF8 flag and the C API functions
       "sv_utf8_upgrade", "sv_utf8_downgrade", "sv_utf8_encode", and
       "sv_utf8_decode", which are wrapped by the Perl functions
       "utf8::upgrade", "utf8::downgrade", "utf8::encode" and "utf8::decode".
       Note that in the Perl 5.8.0 and 5.8.1 implementation the functions
       utf8::is_utf8, utf8::valid, utf8::encode, utf8::decode, utf8::upgrade,
       and utf8::downgrade are always available, without a "require utf8"
       statement-- this may change in future releases.
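
The kind of side-by-side example I would like to see in the man page
might look like the following sketch. It assumes an ASCII (not EBCDIC)
platform and perl 5.8.1 or later; the helper describe() is my own
invention and simply prints the flag together with the character codes
as Perl sees them:

```perl
use strict;
use warnings;

# Hypothetical helper: show the UTF-8 flag and the code of each character.
sub describe {
    my ($s) = @_;
    return sprintf "flag=%d chars=[%s]",
        (utf8::is_utf8($s) ? 1 : 0),
        join(" ", map { sprintf "%02X", ord } split //, $s);
}

# upgrade: same single character E9, but the flag is now on.
my $up = "\xE9";
utf8::upgrade($up);
print "upgrade:   ", describe($up), "\n";      # flag=1 chars=[E9]

# downgrade: same character again, flag off.
my $down = $up;
utf8::downgrade($down);
print "downgrade: ", describe($down), "\n";    # flag=0 chars=[E9]

# encode: the one character becomes two octets, flag off.
my $enc = "\xE9";
utf8::encode($enc);
print "encode:    ", describe($enc), "\n";     # flag=0 chars=[C3 A9]

# decode: the two octets become one character again, flag on.
my $dec = $enc;
utf8::decode($dec);
print "decode:    ", describe($dec), "\n";     # flag=1 chars=[E9]

# downgrade cannot represent U+263A as a single octet; with FAIL_OK it
# returns false instead of dying.
my $wide = "\x{263A}";
my $fits = utf8::downgrade($wide, 1) ? 1 : 0;
print "wide fits: $fits\n";                    # 0
```

If this reading is right, upgrade and downgrade change only how the
string is stored, never which characters length() and substr() see,
while encode and decode change the characters themselves.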

It would be great if there were an expert here who really understands
this API and who could clarify the writing somewhat. Some other parts of
the Perl Unicode documentation are also not yet shining examples of
clear writing. Thanks!

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
