perl-unicode

Re: UTF-16LE fails in substitution

2005-09-17 12:25:01

It might be worthwhile to look at a hex dump of your UTF-16 input file
before deciding what needs to be done to read it properly in Perl.
Presumably, if you have lots of files of this flavor, they'll be
consistent in the relevant details, so you only need to check one at the
outset to understand what's really going on.  Does the file have line
terminations like this:

   0d 00 0a 00
   <CR>  <LF>
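
If you don't have a hex dump tool handy, Perl can do the job itself.
This is only a minimal sketch: "hex_head" is a name I made up, and the
file to inspect is assumed to be given on the command line.

```perl
use strict;
use warnings;

# Return the first $count bytes of a file as space-separated hex pairs.
# The file is opened with :raw so that no decoding or newline
# translation hides what is really stored on disk.
sub hex_head {
    my ($path, $count) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $got = read($fh, my $buf, $count);
    die "Cannot read $path: $!" unless defined $got;
    close($fh);
    return join ' ', map { sprintf '%02x', $_ } unpack('C*', $buf);
}

# Dump the first 64 bytes of the file named on the command line, if any.
print hex_head($ARGV[0], 64), "\n" if @ARGV;
```

A leading "ff fe" in the output would be a little-endian BOM, and
"0d 00 0a 00" further on would be a UTF-16LE CR LF pair.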


Also, if you are using Perl to write UTF-16 data to a file handle, you'll
only get a BOM when you specify the encoding as "UTF-16", and in that case
Perl's Encode module writes big-endian data, whatever your machine's
_native_ byte order may be.  If you say "UTF-16LE", you get little-endian
output on any machine, and you don't get a BOM unless you explicitly
write it yourself.
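
A quick way to see what each encoding name produces is to encode a
single character and look at the bytes.  This is just a sketch using the
core Encode module; "hex_bytes" is a helper name of my own.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack('C*', $bytes);
}

# Plain "UTF-16": Encode prepends a BOM to the encoded data.
print 'UTF-16   : ', hex_bytes(encode('UTF-16',   'A')), "\n";

# "UTF-16LE" and "UTF-16BE": fixed byte order, and no BOM at all.
print 'UTF-16LE : ', hex_bytes(encode('UTF-16LE', 'A')), "\n";   # 41 00
print 'UTF-16BE : ', hex_bytes(encode('UTF-16BE', 'A')), "\n";   # 00 41
```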

As for line termination patterns on output, you probably need to control
that separately, either by setting "$\" or using the ":crlf" IO-layer.
(Are you trying to write platform-independent code, or are you just trying
to cope with a specific platform?)
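
For the "$\" route, something along these lines should give the same
bytes on any platform.  Treat it as a sketch: "write_utf16le_crlf" is a
made-up name, and the temp file only stands in for your real output
file.  The ":crlf" layer is the alternative, not shown here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write each line as BOM-less UTF-16LE, terminated by CR LF.
sub write_utf16le_crlf {
    my ($path, @lines) = @_;
    # :raw comes first so that no platform-default newline translation
    # gets applied to the bytes after they have been encoded.
    open(my $out, '>:raw:encoding(UTF-16LE)', $path)
        or die "Cannot open $path: $!";
    local $\ = "\x{0d}\x{0a}";   # appended to every print in this scope
    print {$out} $_ for @lines;
    close($out) or die "Cannot close $path: $!";
    return;
}

my (undef, $path) = tempfile(UNLINK => 1);
write_utf16le_crlf($path, 'ab');

# Read the file back as raw bytes to check what was actually written.
open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
my $bytes = do { local $/; <$in> };
close($in);
print join(' ', map { sprintf '%02x', $_ } unpack('C*', $bytes)), "\n";
# prints: 61 00 62 00 0d 00 0a 00
```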

As for the code you posted at the top of this thread, note that "\x{fffe}"
is a noncharacter -- i.e. it is a code point that is deliberately left
unassigned, so that a BOM ("\x{feff}") read in the wrong byte order can be
recognized for what it is and the BOM will always work the way it is
supposed to.

The "\x{HHHH}" notation in perl refers to code points, not 16-bit encodings
of characters.  To write a correct BOM, you have to use "\x{feff}", no 
matter what your output encoding layer may be.
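
To illustrate: the character is the same in both cases below, and only
the encoding decides which order the two bytes come out in.  This is a
small sketch with the core Encode module; "hex_of" is my own helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a character string and return its bytes as hex pairs.
sub hex_of {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf '%02x', $_ } unpack('C*', $bytes);
}

my $bom = "\x{feff}";    # the BOM as a code point, not as bytes

printf "code point U+%04X\n", ord($bom);           # code point U+FEFF
print 'UTF-16LE: ', hex_of('UTF-16LE', $bom), "\n"; # UTF-16LE: ff fe
print 'UTF-16BE: ', hex_of('UTF-16BE', $bom), "\n"; # UTF-16BE: fe ff
```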

There are other things I would suggest changing in the code you posted, 
like improving the way error conditions are handled, using "slurp" 
mode for reading the input data, and fixing the regex substitution, which 
looks pretty broken (BOM is wrong, captured strings are deleted rather 
than being included in the substitution string).
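
Since I'm not quoting your code here, the following is a purely
hypothetical illustration of those points, not a corrected version of
your script: the sub names, the sample data and the "key=value" pattern
are all invented.  It slurps a BOM-prefixed UTF-16 file, checks for
errors, and reuses the captured strings in the replacement.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole BOM-prefixed UTF-16 file into one string ("slurp" mode).
sub slurp_utf16 {
    my ($path) = @_;
    open(my $in, '<:raw:encoding(UTF-16)', $path)
        or die "Cannot open $path: $!";
    my $text = do { local $/; <$in> };   # undef $/ reads to end of file
    close($in);
    return $text;
}

# Invented substitution: turn "key=value" into "value=key".  Both
# captures appear in the replacement, so nothing matched is lost.
sub swap_pairs {
    my ($text) = @_;
    $text =~ s/(\w+)=(\w+)/$2=$1/g;
    return $text;
}

# Create a small sample file: BOM written explicitly as U+FEFF.
my (undef, $path) = tempfile(UNLINK => 1);
open(my $out, '>:raw:encoding(UTF-16LE)', $path)
    or die "Cannot open $path: $!";
print {$out} "\x{feff}name=value\x{0d}\x{0a}";
close($out) or die "Cannot close $path: $!";

my $text = slurp_utf16($path);
print swap_pairs($text);
```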

        David Graff

