perl-unicode

Encode::Guess fails on UTF-16BE string w/ newline characters

2003-04-12 12:30:06

I am trying to decode strings of suspect UTF origins - Encode::Guess seems to be the way to go....

So I am opening a file "normally" and just reading line by line. I will pass the line of text through Encode::Guess which I have used thusly:

        use Encode::Guess qw(UTF-8 UTF-16BE);  #I may add more in future

Now what I read in is *usually* UTF-8 and all is good. But if a UTF-16BE string comes along here is what happens:

Encode/Guess.pm: 92
  DB<2> x $octet
0  "\c@B\c@E\c@G\c@I\c@N\c@:\c@V\c@C\c@A\c@R\c@D\c@\cM\c@\cJ"

Encode/Guess.pm: 94
  DB<3> x $line
0  "\c@B\c@E\c@G\c@I\c@N\c@:\c@V\c@C\c@A\c@R\c@D\c@"

*NOW* when it tests the decode of a UTF-16BE string, the line will _always_ end with a stray \c@ (the first byte of the two-byte carriage return), so it has an odd number of bytes and will never decode successfully, even though the data really is UTF-16BE.

We should have:
0  "\c@B\c@E\c@G\c@I\c@N\c@:\c@V\c@C\c@A\c@R\c@D"

Changing the split pattern to include "\000+" fixes this problem. But it would break for UTF-16LE, right?
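A minimal sketch that reproduces the problem outside the debugger. The split regex here is my own stand-in for a byte-level split on CR/LF, not copied from Guess.pm, and the sample string is rebuilt from the dump above:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Rebuild the kind of input shown in the dump: UTF-16BE text ending in CRLF.
my $octet = encode('UTF-16BE', "BEGIN:VCARD\r\n");

# Byte-level split on line endings (assumed stand-in for the split in Guess.pm).
# The "\r" byte matches on its own, so the NUL that precedes it stays attached
# to the first line and the NUL that precedes "\n" becomes a line by itself.
my @lines = split /\r\n?|\n/, $octet;
printf "bytes in first line: %d\n", length $lines[0];

# Decoding the whole buffer first and splitting the characters afterwards
# keeps every 16-bit unit intact.
my $text = decode('UTF-16BE', $octet);
my @good = split /\r\n?|\n/, $text;
print "decoded line: $good[0]\n";
```

Splitting after the decode works the same way for UTF-16LE, which is why it looks safer to me than patching "\000+" into the byte-level pattern.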

Points
- what is the best way to open and read data that might be UTF-8, UTF-16, UTF-16BE, or UTF-16LE?
- is there a good way to chop the line endings reliably for the above 4 encodings?
- maybe detecting the flavour of Unicode is better left to a different process? Encode::Guess::Unicode?
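To make the first two points concrete, here is a rough sketch of one possible approach: look at the whole buffer instead of single lines, pick the encoding from a BOM if there is one, decode, and only then chop the line endings. decode_lines is a hypothetical helper of mine, not part of Encode, and the fallback for data without a BOM assumes the text starts with an ASCII character, which will not hold for all input:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decide the encoding for a whole buffer, decode it,
# then split the decoded characters into lines.
sub decode_lines {
    my ($octets) = @_;
    my $enc;
    if    ($octets =~ s/\A\xFE\xFF//)     { $enc = 'UTF-16BE' }  # BOM, big endian
    elsif ($octets =~ s/\A\xFF\xFE//)     { $enc = 'UTF-16LE' }  # BOM, little endian
    elsif ($octets =~ s/\A\xEF\xBB\xBF//) { $enc = 'UTF-8'    }  # UTF-8 signature
    elsif ($octets =~ /\A\0[^\0]/)        { $enc = 'UTF-16BE' }  # assumption: ASCII first char
    elsif ($octets =~ /\A[^\0]\0/)        { $enc = 'UTF-16LE' }  # assumption: ASCII first char
    else                                  { $enc = 'UTF-8'    }  # the usual case here

    # FB_CROAK makes a wrong choice die instead of silently producing garbage.
    my $text = decode($enc, $octets, Encode::FB_CROAK);
    return ($enc, split /\r\n?|\n/, $text);
}

# Usage with in-memory data; for a real file, read it with a ':raw' layer
# and slurp it (local $/) so nothing touches the bytes before this point.
my ($enc, @lines) = decode_lines(encode('UTF-16BE', "BEGIN:VCARD\r\nEND:VCARD\r\n"));
print "$enc: $_\n" for @lines;
```

This gives up line-by-line reading, which may not be acceptable for large files; it is only meant to show that the line-ending question goes away once the split happens on characters.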

Please advise. Perhaps only a documentation expansion is necessary, and I can help with that based on this matter.
Jay