perl-unicode

Re: UTF-16LE fails in substitution

2005-09-16 17:44:05
"Dan Kogai" <dankogai(_at_)dan(_dot_)co(_dot_)jp> wrote in message
news:0E3FD654-FCD0-4B7D-945A-D921BD19A886(_at_)dan(_dot_)co(_dot_)jp(_dot_)(_dot_)(_dot_)
On Sep 15, 2005, at 07:05 , Steve Larson wrote:

What I want to do is add a version string comment at the beginning of
.xml files.  I test to see if the file is UNICODE (Encode::Unicode) or
ASCII (Encode::XS) using guess_encoding.  My ASCII case works fine but
the regexp for the UNICODE case fails.  The snippet below is the code
for the UNICODE case.

The answer is that PerlIO does not go well with BOMed UTFs.  What you
should do instead is to read the whole file first, like this:

use Encode qw(decode encode);

open my $in, "<:raw", $filename or die "$filename : $!";
read $in, my $buffer, -s $filename; # one of many ways to slurp file.
close $in;
my $content = decode("UTF-16", $buffer); # LE or BE is not required; the BOM decides.
#
# do whatever you want to $content and....
#
open my $out, ">:raw", $filename or die "$filename : $!";
print $out encode("UTF-16LE", $content); # now be explicit on endianness
close $out;

Remember UTF-(16|32) does not go well with stream models.  Treat it
as a binary file.

Dan the Encode Maintainer

Thanks Dan.
I still get no BOM when using UTF-16LE for output (using UTF16 I get a BOM
and BE output).  I need UTF-16LE byte order with a BOM just like the input
when the input has a BOM.  I also get 0x0A for \n instead of the 0x0D 0x0A I
should be getting.
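
A minimal sketch of one way to get little-endian output with a BOM and CRLF
line ends, offered as an assumption rather than a confirmed fix.  Encode's
"UTF-16LE" does not write a BOM, so U+FEFF is prepended to the text before
encoding; and since a :raw handle does no newline translation, the line ends
are converted by hand.  The helper name encode_utf16le_with_bom is
hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode text as UTF-16LE with a leading BOM
# and CRLF line ends.
sub encode_utf16le_with_bom {
    my ($text) = @_;
    $text =~ s/\r?\n/\r\n/g;    # normalise every line end to CRLF
    # U+FEFF encoded as UTF-16LE yields the bytes FF FE, i.e. the LE BOM.
    return encode("UTF-16LE", "\x{FEFF}" . $text);
}

my $bytes = encode_utf16le_with_bom("a\n");
# Dump the result as hex to inspect the BOM and the line end.
print join(" ", map { sprintf "%02X", $_ } unpack("C*", $bytes)), "\n";
# prints: FF FE 61 00 0D 00 0A 00
```

The result would then be printed to a handle opened with ">:raw", as in the
code above.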

When I run across a file without a BOM with "...decode(UTF16,$buffer)", I
get an error "UTF-16:Unrecognised BOM 3c00 at..." on reading the file in.
So UTF16 is apparently looking for a BOM.
With "...decode(UTF-16LE,$buffer)" and "$content =~
s/(\x{fffe}(<\?.*?\?>)*)\n?/\n<!-- Build Ver..." I am able to read in files
with or without a BOM.  However, I get a warning "Unicode character 0xfffe
is illegal at ..." and the BOM (when it exists) does not stay at the
beginning of the file.
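
A sketch of how the BOM might be handled by hand so that files with and
without one both work and the BOM stays first on output.  Some assumptions
here: input without a BOM is taken to be little-endian, the helper names
decode_utf16_keep_bom and encode_utf16_keep_bom are hypothetical, and the
comment text is only a placeholder.  Note that the bytes FF FE decoded as
UTF-16LE give U+FEFF, not U+FFFE, which would explain the warning about
\x{fffe} in the regexp.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: strip a BOM from raw bytes if present and return
# (encoding name, had-BOM flag, decoded text).
sub decode_utf16_keep_bom {
    my ($bytes) = @_;
    # Assumption: input without a BOM is little-endian.
    my ($enc, $has_bom) = ("UTF-16LE", 0);
    if    ($bytes =~ s/\A\xFF\xFE//) { ($enc, $has_bom) = ("UTF-16LE", 1) }
    elsif ($bytes =~ s/\A\xFE\xFF//) { ($enc, $has_bom) = ("UTF-16BE", 1) }
    return ($enc, $has_bom, decode($enc, $bytes));
}

# Hypothetical helper: encode text again, restoring the BOM only if the
# input had one.
sub encode_utf16_keep_bom {
    my ($enc, $has_bom, $text) = @_;
    $text = "\x{FEFF}" . $text if $has_bom;
    return encode($enc, $text);
}

# Simulated file content: LE BOM followed by a small XML document.
my $raw = "\xFF\xFE" . encode("UTF-16LE", qq{<?xml version="1.0"?>\n<root/>\n});
my ($enc, $has_bom, $text) = decode_utf16_keep_bom($raw);

# Insert the comment after the XML declaration, or at the very start
# if there is no declaration.  The comment text is a placeholder.
$text =~ s/\A((?:<\?.*?\?>\s*)?)/$1<!-- placeholder -->\n/;

my $out = encode_utf16_keep_bom($enc, $has_bom, $text);
```

Because the BOM is removed before the substitution and put back afterwards,
the regexp never has to mention it.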

It still seems like I am not understanding something (maybe basic) about
processing UTF files.  I have read through the related docs in the help
several times and the behavior seems to be the opposite in several cases.
Any suggestions?

Steve

