perl-unicode

Re: Encode::Tcl Mistery Solved!

2002-01-29 06:48:35
On 2002.01.29, at 21:52, Nick Ing-Simmons wrote:
my $utf8_file = "t/table.utf8"; # Valid UTF8 text file
my $utf8_data;
open my $fh, $utf8_file or die "$utf8_file:$!";

That is supposed to be :

open my $fh,"<:utf8", $utf8_file;

To tell perl that data is UTF-8.

There's gotta be more ways to tell perl which scalar is UTF-8. How about CGIs? You can't make PerlIO the standard way of setting the UTF-8 flag.
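One route that does not go through PerlIO, sketched here on the assumption that the bytes are already known to be UTF-8 (say, a form value from a page served as UTF-8), is to decode the byte string explicitly with Encode::decode. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: these bytes came from somewhere other than a file handle
# (a CGI parameter, a socket, ...) and are known to be UTF-8.
my $bytes = "\xE3\x81\x82";          # UTF-8 encoding of U+3042

# decode() returns a character string with the UTF-8 flag set.
my $chars = decode('UTF-8', $bytes);

printf "%d bytes in, %d character(s) out\n", length $bytes, length $chars;
```

This only helps when the encoding is known beforehand, which is exactly the limitation raised above for CGIs.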


  The answer is:  $utf8 is not utf8 UNLESS YOU EXPLICITLY SAY SO
SOMEWHERE!

Yes - things are sequences of iso-8859-1 until told otherwise.

  Or a chunk of bytes, or whatever, unless the UTF-8 flag is set.

        open my $fh, "<:utf8", $utf8_file or die "$utf8_file:$!";

Which is the preferred way.

Iff you know the encoding a priori, which is usually not the case for CGIs and many others.


  Encoding engines themselves appear ok.
  I repeat. ENCODING ENGINES THEMSELVES APPEAR OK!

I think we knew that ;-)

Well, they are okay in the sense that they convert. They are still too slow. The example above took some two seconds to show the result on my FreeBSD box (Pentium III 800 MHz, 512MB RAM), though performance is not too bad once the internal table is full. Performance and memory usage are still issues. But I am darn glad to see it does what it is supposed to do, however inefficient it may be.

  If encode() demands an SV explicitly marked as UTF8, it should carp
BEFORE it attempts to encode in the first place.

It doesn't. If it is not marked as UTF-8 it assumes it isn't. So
(Jarkko's locale stuff aside) it is a sequence of iso-8859-1 chars
for legacy compatibility. You then ask it to convert those bytes to
EUC-JP and lots of high-bit iso-8859-1's (which is what UTF8 encoded
stuff looks like) don't map so you get undefs.
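A minimal sketch of the effect described above, using the present-day Encode API with made-up variable names: the same three UTF-8 bytes are handed to encode() once as they are and once after an explicit decode(). Only the decoded string yields the expected EUC-JP bytes. What the unflagged call returns for characters it cannot map depends on the Encode version and its fallback mode, so the sketch only checks that the two results differ.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $utf8_bytes = "\xE3\x81\x82";     # UTF-8 encoding of U+3042 (HIRAGANA LETTER A)

# Unflagged: encode() sees three iso-8859-1 characters
# (U+00E3, U+0081, U+0082), not one Japanese character.
my $wrong = encode('euc-jp', $utf8_bytes);

# Decoded first: encode() sees the single character U+3042.
my $right = encode('euc-jp', decode('UTF-8', $utf8_bytes));

printf "right: %s\n", unpack 'H*', $right;
printf "same result? %s\n",
    (defined $wrong && $wrong eq $right) ? 'yes' : 'no';
```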

  On what occasion is legacy compatibility required for encode()?

Back to locale ... The idea of the locale stuff is to say "aha - user is in a Japanese locale
so in absence of instructions to the contrary I will assume that files
are full of iso2022-jp encoded stuff" (or whatever is the right thing).
So you will still need to explicitly tell it when you are breaking
that assumption.

However, encode($encoding, $string, $check) does assume $string to be a UTF-8-marked string. At least that was what the POD was saying.


  There are other places that croak() where they should carp(), but
I'll wait for the next bleadperl to commit these changes.

The idea of the croak is you can catch it silently with

eval { $string = decode($trythis,... }
(or better yet call find_encoding yourself before getting that far).

Are you going to tell millions of novice CGI users/writers to use eval for error handling? Exception handling via eval is way too much for a module like this, which will be used very frequently.

The carp is going to leak out to the user and look messy.

So my suggestion is to silently return undef and set $Encode::Error or whatever. I HATE TO USE EVAL TO CATCH EXCEPTIONS! As for errors, we should let the caller decide how to handle them.
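For illustration, the "return undef and set an error variable" style can be layered on top of the croaking API with a small wrapper. This is only a sketch: try_decode() and $DecodeError are hypothetical names, and Encode itself does not provide an $Encode::Error variable. The eval is still there, but it lives in one place instead of in every caller.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical error slot, standing in for the suggested $Encode::Error.
our $DecodeError = '';

# Returns the decoded string, or undef with $DecodeError set.
sub try_decode {
    my ($name, $bytes) = @_;         # $bytes is a copy; caller's data is untouched
    $DecodeError = '';

    # find_encoding() returns undef for an unknown name instead of dying.
    my $enc = find_encoding($name);
    unless ($enc) {
        $DecodeError = "unknown encoding '$name'";
        return;
    }

    # FB_CROAK makes malformed input die; the eval stays inside the wrapper.
    my $chars = eval { $enc->decode($bytes, Encode::FB_CROAK()) };
    unless (defined $chars) {
        $DecodeError = $@ || 'decode failed';
        return;
    }
    return $chars;
}

my $ok  = try_decode('UTF-8', "\xE3\x81\x82");
my $bad = try_decode('no-such-encoding', "abc");
print defined $bad ? "decoded\n" : "error: $DecodeError\n";
```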

  As much as I feel relieved now, I still feel uncomfortable with the
API of Encode.  The UTF8 flag must be explicitly set, yet the use of
_utf8_on() is deprecated.

Yes you are supposed to set it on the file handle. Setting it on
may be appropriate if data comes in magically from somewhere else.

  Once again, a file handle is only one of many ways to get data in and out.
  Let's summarize.

* Encode::Tcl works but with care and time
  --  New module for CJK is still called for
      -- big yikes for the use by mod_perl and such
* Charset detection/guessing is still missing
  --  Imperative for use on the network
* UTF-8 flag control is still limited and hairy
  -- line discipline and _utf8_on()
  -- how about param() of CGI module, for instance?
* Better and more correct documentation
  -- One of the reasons I split the module.  I wanted perldoc to say
     something for "perldoc Encode::Encoding".  "perldoc Encode" is
     too long, with too many misleading, if not false, remarks.
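On the charset-guessing item in the list above: current Encode distributions ship an Encode::Guess module that covers part of this. The sketch below assumes that module is installed and uses only its default suspects; the sample data is made up. Japanese legacy encodings would have to be named explicitly as extra suspects, and short inputs can be ambiguous between them, so treat the result as a hint rather than an answer.

```perl
use strict;
use warnings;
use Encode::Guess;                   # exports guess_encoding()

my $data = "\xE3\x81\x82";           # bytes of unknown origin (here: UTF-8 for U+3042)

# With no extra suspects, only a small default set (ASCII, UTF-8,
# BOM-marked UTF-16/32) is tried.  The result is an encoding object
# on success and a plain error string on failure.
my $guess = guess_encoding($data);

if (ref $guess) {
    my $chars = $guess->decode($data);
    printf "guessed %s, %d character(s)\n", $guess->name, length $chars;
}
else {
    print "could not guess: $guess\n";
}
```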

  These are going to be the areas I will work on.

Dan the Man with Too Many Charsets to Handle