perl-unicode

Re: UTF-16LE fails in substitution

2005-09-22 03:19:43

stevlars(_at_)microsoft(_dot_)com said:
I may be a little confused here still.  The help that I included in the
first post said that "UTF-16 itself can be used for in-memory
computations, but if storage or transfer is required either UTF-16BE
(big-endian) or UTF-16LE (little-endian) encodings must be chosen"

I also find this extract from the perlunicode man page a bit confusing, 
given that Perl 5.8's native internal representation for unicode 
characters is always utf8, and everyone would normally want to have their 
text in utf8 form for doing things like regex match/replacement, substr, 
index, and so on -- all that stuff works transparently on utf8 strings, 
and this fact will greatly simplify the code you're trying to write.

As for line termination patterns on output, you probably need to control
that separately, either by setting "$\" or using the ":crlf" IO-layer.

I don't understand :>)...

In addition to a modified version of your code snippet, I'll attach a 
little "tutorial/test" script that I hope will be instructive -- play with 
that to see what different variations you can get with line terminations.

As for improving your script, here's one approach: I have recast it as a
subroutine to be called with the file name and "build number" value to be
inserted into the file.  By making it a sub that returns "1" for success or
undef for failure, I leave it up to the caller to decide whether a failure 
should cause the whole process to die -- i.e. you'd call it like this:

    if ( updateFile( $working_file_list[0], $build_number )) {
        print "update succeeded on $working_file_list[0]\n";
        # do other stuff...
    }
    else {
        print "update failed on $working_file_list[0]\n";
        # do other stuff ?
    }

My point about improving the error handling is that in your version, if
open failed on a file (input or output), you just showed a warning and kept
going through steps that depend on the open being successful.  In the
subroutine version, we can return undef as soon as there's a problem, and
don't compound error conditions on top of one another.

        Dave Graff

------------- version-updating snippet (subroutine version) ----------

use strict;
use Encode;
use Encode::Guess qw/UTF-16LE UTF16-BE ascii utf8/;

sub updateFile
{
    my ($filename, $bldnumber) = @_;
    my $filedata = '';

# read file to be updated

    if ( ! open( F, $filename )) {
        warn "$filename: open for read failed: $!\n";
        return;
    }
    else {
        local $/ = undef;  # use "slurp mode" to read full file content
        $filedata = <F>;
        close F;
    }

    my $decoder = guess_encoding( $filedata );
    if ( ! ref($decoder)) {
        warn "$filename: can't guess encoding: $decoder";
        return;
    }
    my $writemode = ">:encoding(" . $decoder->name . ")";

    my $editdata = $decoder->decode( $filedata );
    my $comment = "\n<!-- Build Version:$bldnumber -->\n";

    $editdata =~ s/((?:<\?.*?\?>\s*)*)/$1$comment/s; 
       # that substitution cannot fail --
       # $1 may be blank, putting $comment at beginning of string

    if ( ! open( F, $writemode, $filename )) {
        warn "$filename: open for write failed: $!\n";
        return;
    }
    print F $editdata;
    close F;

    return 1;
}

---------- end of subroutine ---- next: a toy test script ----------

#!/usr/bin/perl

use strict;
use Encode;
use Encode::Guess;

# create some test data -- the non-ascii strings are actually all utf8:

my %testStrings = ('ascii'    => "an ascii string",
                   'utf8'     => "utf8 having a \x{20ac}",
                   'UTF-16LE' => "UTF-16LE with \x{20ac}",
                   'UTF-16BE' => "UTF-16BE with \x{20ac}",
                   'UTF-16'   => "UTF-16 native \x{20ac}",
                   );

for my $enc ( sort keys %testStrings ) {
    my $mode = ">";
    $mode .= ( $enc =~ /UTF-16/ ) ? ":encoding($enc)" :
             ( $enc eq "utf8" ) ? ":$enc" : "";

    open my $fh, $mode, "test.$enc.txt"
        or die "$enc: open failed for write: $!";
    binmode( $fh, ":crlf" );

    warn "writing $enc via $mode\n";
    print $fh "$testStrings{$enc}\n";
      # print causes "Wide character" warnings (ignorable)
    close $fh;
}

# now read back the test data, and see how guess_encoding works

for my $chk ( sort keys %testStrings ) {
    open my $fh, "<", "test.$chk.txt"
        or die "$chk: open failed for read: $!";
    {
        local $/;   # undef input record separator for "slurp" mode
        $_ = <$fh>;
    }
    close $fh;

    my $enc = guess_encoding( $_ );
    if ( ! ref( $enc )) {
        warn "guess_encoding failed on $chk data\n";
        next;
    }
    warn "Guessed ".$enc->name." for the $chk file\n";

    my $chkstr = $enc->decode( $_ );  # convert file data back to utf8
    $chkstr =~ s/[\r\n]+//;

    my $result = ( $chkstr eq $testStrings{$chk} ) ? "matched" : "failed";
    warn "Comparing original $chk to ".$enc->name." $result\n";
}

__END__


<Prev in Thread] Current Thread [Next in Thread>