Re: Question regarding Unicode handling in perl: auto-sensing

Andreas Jaekel <jaekel(_at_)cablecats(_dot_)de> writes:

Dear Perl Deities!

I've been trying to figure this out for myself for a couple
of hours now, but I got to the point where I gave up and decided
that I'll have to bother you.  Hope you don't mind.

My task is the following, and I'm running out of ideas:

// what I want to do //

I want to read in a GNU tar file from STDIN.  The tar file
contains roughly 1.2 million files, and each of them is
encoded in either ASCII, UTF-8 or ISO-8859-1.  The trick is,
I don't know which file is encoded in which encoding.

So, I only have one file descriptor (the tar archive), from which
I successfully retrieve each file into a scalar, one at a time,
and then I call my "guess_encoding()" subroutine.

// what I tried //

From this point on I'll describe in a few words what I already
found out, and why it didn't help me.

I found out I can set the file descriptor of the tar file to
binmode(), or open it with <:bytes.  I do that.  But
all it does is tell perl that the data is 8-bit raw.  That resolved
a few confusions, but not the final problem.

I found out how to detect ASCII.  I can do it with
      eval {
              Encode::from_to($buf, "ascii", "utf-8", Encode::FB_CROAK);
      };
      if($@) { ...

But that leaves me with knowing UTF-8 from ISO-8859-1.


or 

       if ($buf =~ /^[\x00-\x7f]*$/)


Obviously, every UTF-8 file is also a valid ISO-8859-1 file. So my
only hope is to check for "valid UTF-8", and if that fails it has to be
ISO-8859-1.
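
The three-way decision described above could be sketched roughly as follows. The sub name guess_encoding() comes from the question; its body here is only an assumption about how it might be written, using the core utf8::decode() function on a copy of the buffer. Note that utf8::decode() applies Perl's own, fairly lenient, notion of well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical body for guess_encoding(): classify a buffer of raw octets.
sub guess_encoding {
    my ($octets) = @_;

    # Nothing outside 0x00-0x7F means the buffer is plain ASCII.
    return 'ascii' if $octets !~ /[^\x00-\x7f]/;

    # utf8::decode() works in place, so test a copy. It returns
    # false when the octets are not well-formed UTF-8.
    my $copy = $octets;
    return 'utf-8' if utf8::decode($copy);

    # Any byte sequence is valid ISO-8859-1, so it is the fallback.
    return 'iso-8859-1';
}
```

The order matters: ASCII is tested first because pure ASCII is also valid UTF-8, and ISO-8859-1 comes last because it can never be ruled out by inspection alone.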


How about 

        Encode::from_to($buf, "utf-8", "utf-8", Encode::FB_CROAK);

But that is doing a lot of work.


The "perluniintro" man page gives example code on how to do that:

      use Encode 'encode_utf8';
      if(encode_utf8($buf)) { ...


That is the wrong way round. You have raw octets and you want to see
whether they form valid characters.
So you want to _decode_ them and see if it works.
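
A minimal sketch of that decode-and-see approach, assuming the Encode module's decode() with the FB_CROAK check wrapped in an eval. The helper name is_valid_utf8() is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $octets decodes cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;

    # FB_CROAK makes decode() die on a malformed sequence;
    # LEAVE_SRC asks it not to modify the buffer it was given.
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}
```

The trailing 1 inside the eval block is what distinguishes success from failure: if decode() croaks, eval returns undef and the helper reports 0.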


Unfortunately, this just plain doesn't work.  The same man page mentions a
second method:

      use warnings;
      @chars = unpack("U0U*", $buf);

This WORKS (hurray!) but all I get is a warning, and I have not been
able to find any way of detecting this warning inside my script.
(short of parsing my own stderr, which would be creative, but
I'd be shot if anyone saw my code - and rightly so)
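
Warnings can be caught inside the script by installing a local $SIG{__WARN__} handler, with no need to parse stderr. The helper below, warnings_from(), is a hypothetical name for a sketch of that idea. Whether the unpack() call above warns on a given malformed buffer depends on the perl version, so treat this as a way to observe warnings rather than as a reliable validity test.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code and return every warning it emitted.
sub warnings_from {
    my ($code) = @_;
    my @caught;

    # local restores the previous handler when this sub returns.
    local $SIG{__WARN__} = sub { push @caught, @_ };
    $code->();
    return @caught;
}
```

It might be used as my @w = warnings_from(sub { my @chars = unpack("U0U*", $buf) }); an empty @w then means nothing was reported for that buffer.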

I tried all other ways I could think of using encode(), decode(),
from_to() and unpack(). I tried Encode::FB_CROAK wherever I could.

Side note:  despite the module's documentation stating otherwise,
the function encode_utf8() would not accept a CHECK parameter.


The current docs say decode_utf8() accepts CHECK - as decode can
be handed a sequence of octets which doesn't form valid chars.
But encode cannot fail. If I have chars I _can_ encode them as UTF-8.


So, I think it boils down to: "Is this string valid UTF-8?  The
methods given in the module documentation and man pages do not
seem to work for me."

I would appreciate any help.

Thank you!

Regards,
   Andy.