I am trying to decode strings of suspect UTF origin, and Encode::Guess
seems to be the way to go.
So I am opening a file "normally" and reading it line by line, then
passing each line through Encode::Guess, which I have used like this:
use Encode::Guess qw(UTF-8 UTF-16BE); #I may add more in future
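For reference, a minimal sketch of that loop (the in-memory filehandle
stands in for my real file; per the Encode::Guess docs, guess() returns
an encoding object on success and a plain error string on failure or
ambiguity, hence the ref check):

```perl
use strict;
use warnings;
use Encode::Guess qw(UTF-8 UTF-16BE);    # my suspect list; I may add more

# In-memory stand-in for the real file; read raw octets, no decoding layer.
my $bytes = "line one\nline two\n";
open my $fh, '<:raw', \$bytes or die "open: $!";

while ( my $octets = <$fh> ) {
    my $enc = Encode::Guess->guess($octets);
    if ( ref $enc ) {                    # an object means a unique winner
        my $line = $enc->decode($octets);
        chomp $line;
        print "$line\n";
    }
    else {                               # a string is an error/ambiguity message
        warn "guess failed: $enc\n";
    }
}
```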
Now, what I read in is *usually* UTF-8, and all is good. But when a
UTF-16BE string comes along, here is what happens:
Encode/Guess.pm: 92
DB<2> x $octet
0
"\c@B\c@E\c@G\c@I\c@N\c@:\c@V\c@C\c@A\c@R\c@D\c@\cM\c@\cJ"
Encode/Guess.pm: 94
DB<3> x $line
0
"\c@B\c@E\c@G\c@I\c@N\c@:\c@V\c@C\c@A\c@R\c@D\c@"
*NOW*, when it is testing the decode of a UTF-16BE string, the line
will _always_ come up one byte short, so it never registers a
successful decode even though that is what the data really is.
We should have:
0 "\c@B\c@E\c@G\c@I\c@N\c@:\c@V\c@C\c@A\c@R\c@D"
Changing the split to include "\000+" fixes this problem, but it would
break for UTF-16LE, right?
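Right, because the NUL high byte sits on opposite sides of the newline
bytes in the two byte orders. A small demonstration with the core
Encode module (the %v flag to printf just prints the octets in hex):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "BEGIN:VCARD\r\n";
my $be = encode( 'UTF-16BE', $text );   # NUL *before* each low byte: 00 42 00 45 ...
my $le = encode( 'UTF-16LE', $text );   # NUL *after*  each low byte: 42 00 45 00 ...

printf "BE: %v02x\n", $be;
printf "LE: %v02x\n", $le;

# Splitting the BE octets on a bare "\x0a" strands the newline's 00 high
# byte at the end of the line; in LE the 00 trails the "\x0a" instead, so
# a split that also eats "\000+" on one side breaks the other byte order.
```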
Points:
- what is the best way to open and read data that might be: UTF-8,
UTF-16, UTF-16BE, or UTF-16LE?
- is there a good way to chop the line endings reliably for the above
four encodings?
- maybe detecting the flavour of Unicode is better left to a different
process?
Encode::Guess::Unicode?
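For what it is worth, the direction I am leaning for the first three
points: sniff a BOM from the first raw bytes (deterministic when one is
present), fall back to a default or to Encode::Guess otherwise, then
reopen the handle with the matching :encoding() layer so readline and
end-of-line chopping happen on characters instead of octets. A sketch
under those assumptions (the temp-file part exists only to make it
self-contained; UTF-32 BOMs are omitted for brevity, and CPAN's
File::BOM does this more thoroughly):

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Map a file's leading raw bytes to an encoding name; undef means no BOM.
sub sniff_bom {
    my ($head) = @_;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return undef;
}

# Self-contained demo input: a BOM-prefixed UTF-16BE file.
my ( $out, $path ) = tempfile();
binmode $out, ':raw';
print {$out} "\xFE\xFF", encode( 'UTF-16BE', "BEGIN:VCARD\r\nEND:VCARD\r\n" );
close $out;

open my $fh, '<:raw', $path or die "open: $!";
read $fh, my $head, 4;
my $name = sniff_bom($head) || 'UTF-8';  # no BOM: assume UTF-8, or ask Encode::Guess
seek $fh, 0, 0 or die "seek: $!";
binmode $fh, ":encoding($name)";         # decode happens in the I/O layer now

while ( my $line = <$fh> ) {
    $line =~ s/^\x{FEFF}//;              # the layer leaves the BOM as a character
    $line =~ s/\r?\n\z//;                # chops CRLF or LF for all four encodings
    print "$line\n";
}
```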
Please advise. Perhaps just a documentation expansion is necessary; I
can help with that based on this matter.
Jay