perl-unicode

Encode::FB_QUIET and incomplete characters

2004-12-02 08:30:13
The Encode manpage says this about FB_QUIET:

|     CHECK = Encode::FB_QUIET
|
|       If CHECK is set to Encode::FB_QUIET, (en|de)code will
|       immediately return the portion of the data that has been
|       processed so far when an error occurs. The data argument will
|       be overwritten with everything after that point (that is, the
|       unprocessed part of data).  This is handy when you have to
|       call decode repeatedly in the case where your source data may
|       contain partial multi-byte character sequences, for example
|       because you are reading with a fixed-width buffer. Here is
|       some sample code that does exactly this:
|
|         my $data = ''; my $utf8 = '';
|         while(defined(read $fh, $buffer, 256)){
|           # buffer may end in a partial character so we append
|           $data .= $buffer;
|           $utf8 .= decode($encoding, $data, Encode::FB_QUIET);
|           # $data now contains the unprocessed partial character
|         }

First off, this sample code is no good: the loop will normally never
terminate, because read() returns undef only on failure, and EOF is not
a failure.  At EOF read() returns 0, which is still a defined value.
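
For illustration, here is a minimal sketch of the manpage loop with only
the termination problem fixed.  The name slurp_decoded and the choice to
return the leftover bytes are my own assumptions, not anything from the
Encode documentation, and the sketch still does nothing about bad bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode everything readable from $fh, carrying
# undecoded bytes over from one read() to the next.
sub slurp_decoded {
    my ($fh, $encoding) = @_;
    my ($data, $text) = ('', '');
    while (1) {
        my $n = read($fh, my $buffer, 256);
        die "Can't read: $!" unless defined $n;  # undef signals a real error
        last if $n == 0;                         # 0 signals EOF
        # the buffer may end in a partial character, so we append
        $data .= $buffer;
        $text .= Encode::decode($encoding, $data, Encode::FB_QUIET);
        # $data now holds whatever decode() could not process
    }
    return ($text, $data);
}
```

Anything still left in the second return value at EOF was never decoded.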

Second, we will end up accumulating the rest of the file in $data as
soon as we encounter a bad byte in the stream, because every later
decode() call stops at that same byte.  We need to distinguish
between bad stuff and incomplete sequences.  Also note that an
incomplete sequence at EOF is bad stuff.
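
The two cases look identical from the caller's side, as this small
sketch shows.  The wrapper decode_quiet is a name I made up for the
demonstration; it just returns the decoded text together with the bytes
that decode() left behind.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: returns (decoded text, unprocessed bytes).
sub decode_quiet {
    my ($bytes) = @_;
    my $text = Encode::decode("UTF-8", $bytes, Encode::FB_QUIET);
    return ($text, $bytes);  # $bytes was trimmed in place by decode()
}

# Bad byte: 0xFF can never appear in UTF-8, so reading more will not help.
my ($text1, $rest1) = decode_quiet("abc\xFFdef");
# $text1 is "abc", $rest1 is "\xFFdef"

# Incomplete sequence: the first two bytes of U+20AC (E2 82 AC).
my ($text2, $rest2) = decode_quiet("abc\xE2\x82");
# $text2 is "abc", $rest2 is "\xE2\x82"
```

In both cases we get "abc" back and a non-empty remainder, with nothing
to say which kind of remainder it is.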

I believe this function will do the right thing:

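    use strict;
    use warnings;
    use Encode;

    sub read_utf8 {
        my($fh, $bad_byte_cb) = @_;

        my $str = "";  # where we accumulate the result
        my $buf = "";  # bytes read but not yet decoded
        my $n;

        do {
            # append to $buf so leftover bytes from the last round are kept
            $n = read($fh, $buf, 16, length($buf));
            die "Can't read: $!" unless defined $n;
            while (length $buf) {
                # decode() strips the decoded prefix from $buf
                $str .= Encode::decode("UTF-8", $buf, Encode::FB_QUIET);
                # not at EOF and fewer than 4 bytes left: possibly an
                # incomplete char, so go and read some more
                last if $n && length($buf) < 4;
                if (length($buf)) {
                    # $buf starts with a bad byte; drop it and report it
                    my $bad_byte = substr($buf, 0, 1, "");
                    $str .= $bad_byte_cb->(ord($bad_byte)) if $bad_byte_cb;
                }
            }
        } while $n;

        return $str;
    }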

    # test it
    use Data::Dump;
    print Data::Dump::dump(read_utf8(*STDIN, sub { sprintf "%%%02X", shift })),
        "\n";

So I suggest adding this as an example to the documentation.  What I
don't like here is the test for an incomplete char.  What I really want
is for Encode::decode() to tell me what the situation is, so I want to
extend its API.  The simplest way seems to be to add another argument
that is updated to reflect this status.

    Encode::decode("UTF-8", $buf, Encode::FB_QUIET, $incomplete);

where $incomplete will be TRUE iff there is stuff left in $buf and the
reason is that more data is needed to decode properly.  Is this an
acceptable extension?
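
To make the intended meaning of $incomplete concrete, here is a rough
pure-Perl approximation of the check.  The function looks_incomplete is
hypothetical and not part of Encode.  It assumes $buf is the byte string
left over after a decode() call with FB_QUIET, and it reports whether
those bytes are a proper prefix of a single well-formed UTF-8 sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $buf is a truncated but so far valid
# UTF-8 sequence, i.e. more input could still complete it.
sub looks_incomplete {
    my ($buf) = @_;
    return $buf =~ /\A(?:
          [\xC2-\xDF]                           # start of a 2-byte sequence
        | \xE0 [\xA0-\xBF]?                     # start of a 3-byte sequence
        | [\xE1-\xEC\xEE\xEF] [\x80-\xBF]?
        | \xED [\x80-\x9F]?                     # excludes surrogates
        | \xF0 (?: [\x90-\xBF] [\x80-\xBF]? )?  # start of a 4-byte sequence
        | [\xF1-\xF3] [\x80-\xBF]{0,2}
        | \xF4 (?: [\x80-\x8F] [\x80-\xBF]? )?  # nothing above U+10FFFF
    )\z/x ? 1 : 0;
}
```

With something like this, the length($buf) < 4 test in read_utf8 could
become looks_incomplete($buf), but decode() already knows the answer and
is the better place for it.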

Regards,
Gisle
