perl-unicode

Re: Encode::Tcl Mistery Solved!

2002-01-29 07:47:17
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
On 2002.01.29, at 21:52, Nick Ing-Simmons wrote:
my $utf8_file = "t/table.utf8"; # Valid UTF8 text file
my $utf8_data;
open my $fh, $utf8_file or die "$utf8_file:$!";

That is supposed to be :

open my $fh,"<:utf8", $utf8_file;

To tell perl that data is UTF-8.
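
A minimal sketch of the difference the layer makes, assuming a scratch file made with File::Temp rather than the t/table.utf8 from the test above (the variable names are only for illustration):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 octets for U+3042 (HIRAGANA LETTER A) to a scratch file.
my ($out, $utf8_file) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xe3\x81\x82\n";
close $out or die "$utf8_file:$!";

# With no decoding layer, perl sees three separate octets.
open my $raw, "<:raw", $utf8_file or die "$utf8_file:$!";
my $raw_line = <$raw>;
close $raw;
chomp $raw_line;

# With the :utf8 layer, perl sees one character and marks the scalar.
open my $fh, "<:utf8", $utf8_file or die "$utf8_file:$!";
my $utf8_line = <$fh>;
close $fh;
chomp $utf8_line;

printf "raw:  length %d, flagged %s\n",
    length $raw_line, utf8::is_utf8($raw_line) ? "yes" : "no";
printf "utf8: length %d, flagged %s\n",
    length $utf8_line, utf8::is_utf8($utf8_line) ? "yes" : "no";
```

The same file gives a three-character unmarked scalar one way and a one-character marked scalar the other, which is all the layer is doing.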

  There have got to be more ways to tell perl which scalar is UTF-8.

There are - but they are discouraged.

How about
CGIs?

Last time I looked CGI communicated via Sockets, and Sockets are IO.

You can't make PerlIO the standard way of setting UTF-8 flag.

I not only _can_ I _have_ (made it the _standard_ way) - ;-)

We know there are exceptions though so there is the back door.

    open my $fh, "<:utf8", $utf8_file or die "$utf8_file:$!";

Which is the preferred way.

  iff you know the coding a priori, which is usually not the case for
CGIs and many others.

Again last time I looked HTTP expected there to be a
Content-Transfer-Encoding:



  Encoding engines themselves appear ok.
  I repeat. ENCODING ENGINES THEMSELVES APPEAR OK!

I think we knew that ;-)

  Well they are okay in terms that they convert.  They are still too
slow.

Encode::Tcl is too slow - even for 8-bit - which is why I wrote the
engine which works from the "compiled" form.

Have you tried using ext/Encode/compile to build an XS module for
EUC ?

The example above on my FreeBSD box, Pentium III 800 MHz and
512MB RAM took some two seconds to show the result (Its performance is
not too bad once the internal table is full).

If I had _ANY_ test data I would run the compiled test and give you
the comparative number.

  If encode() demands an SV explicitly marked as UTF8, it should carp
BEFORE it attempts to encode in the first place.

It doesn't. If it is not marked as UTF-8 it assumes it isn't. So
(Jarkko's locale stuff aside) it is a sequence of iso-8859-1 chars
for legacy compatibility. You then ask it to convert those bytes to
EUC-JP and lots of high-bit iso-8859-1's (which is what UTF8 encoded
stuff looks like) don't map so you get undefs.
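
A small sketch of that effect, assuming the octets arrived from a handle with no :utf8 layer. The exact bytes you get for the unmarked case depend on the Encode version and its default CHECK behaviour, so only the decoded case is worth relying on:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# UTF-8 octets for U+3042, not marked as UTF-8.
my $octets = "\xe3\x81\x82";

# Unmarked: encode() sees three iso-8859-1 characters, not one kana.
my $wrong = encode("euc-jp", $octets);

# Decoded first: encode() sees U+3042 and maps it to EUC-JP.
my $string = decode("utf8", $octets);
my $right  = encode("euc-jp", $string);

printf "without decode: %s\n", unpack("H*", $wrong);
printf "with decode:    %s\n", unpack("H*", $right);
```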

  On what occasion is legacy compatibility required for encode()?

Encode is built on perl. It takes perl strings. Perl has legacy reasons
to treat strings a certain way. Encode just works with what it gets.

As for the need for legacy and Encode: it makes sense for Europeans and
Americans for the things we type to be interpreted as iso8859-1 - if we then
ask for those strings to be encoded in iso2022 or Big5 then that is
a sensible thing for encode() to do - if only to put in the \x1b...
escapes...


  However encode($encoding, $string, $check) does assume $string to be
UTF-8 marked string.  At least that was what POD was saying.

"        $bytes  = encode(ENCODING, $string[, CHECK])

Encodes string from Perl's internal form into I<ENCODING> and returns
a sequence of octets.  For CHECK see L</"Handling Malformed Data">.

"

"Perl's internal form" means exactly what it says. It _may_ be UTF-8
encoded or as raw bytes (on mainframes it may be UTF-EBCDIC encoded).
encode takes that form in its full glory SvUTF8 mode bits and all
and converts it to the specified encoding.

It is a great pity in hindsight that the next paragraph says:

"
For example to convert (internally UTF-8 encoded) Unicode data
to octets:

        $octets = encode("utf8", $unicode);

"

That is just an example. (It happens to be an example which gave
people that wanted the UTF-8 for the internal form a lot of grief
so it was spelt out in the POD.)

It would be better phrased as :

For example to convert $unicode (however it happens to be internally) to
octets of its UTF-8 encoding:

        $octets = encode("utf8", $unicode);
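
To illustrate "however it happens to be internally", here is a sketch that holds the same one-character string both ways and encodes each (the variable names are mine, not from the POD):

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00E9 held internally as one raw byte (SvUTF8 off).
my $as_bytes = "\xe9";

# The same character held internally as UTF-8 (SvUTF8 on).
my $as_utf8 = "\xe9";
utf8::upgrade($as_utf8);

# Same string value, so the same UTF-8 octets come out of both.
my $from_bytes = encode("utf8", $as_bytes);
my $from_utf8  = encode("utf8", $as_utf8);
print unpack("H*", $_), "\n" for $from_bytes, $from_utf8;
```

Both lines print c3a9; the SvUTF8 bit changes how the string is stored, not what encode() produces from it.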


  Are you going to tell millions of novice CGI users/writers to use eval
for error handling?

A. Yes - they really should.
B. No  - I am expecting perl5-unicode(_at_)perl(_dot_)org folk to give them an
   Encode module which can "Do What I Mean".
   That _Module_ should do the eval {} if necessary.

Exception handling via eval is way too much for a
module like this which will be used very frequently.

eval {} is quite cheap - certainly a lot cheaper than lots of

if ($Encode::error) tests.

However I agree - one should not use eval {}
as a substitute for sensible coding at the layer above - like doing
a find_encoding yourself.
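
Roughly, the two styles look like this. It is a sketch only; "no-such-encoding" is a made-up name used to force the failure:

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

my $name = "no-such-encoding";   # hypothetical bad name

# Style 1: let encode() croak and catch it with eval {}.
my $bytes  = eval { encode($name, "text") };
my $caught = $@;
print "eval caught: $caught" if !defined $bytes;

# Style 2: look the encoding up first, so nothing needs catching.
my $enc = find_encoding($name);
print "find_encoding returned undef\n" unless defined $enc;

# With a known name the lookup yields an object to encode with.
my $latin1 = find_encoding("iso-8859-1") or die "iso-8859-1 missing";
my $text   = "caf\xe9";
my $octets = $latin1->encode($text);
```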


The carp is going to leak out to the user and look messy.

  So my suggestion is to silently return undef and set $Encode::Error or
whatever.  I HATE TO USE EVAL TO CATCH EXCEPTIONS!

Well that is sad, because it is the way the perl core works and Encode is
a core module and is likely to stay that way.

As for errors we
should let the caller decide how to handle them.

We can provide both.
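
One way "both" could look from the caller's side is a thin wrapper over the croaking interface. This is only a sketch: encode_quietly and $Error are hypothetical names modelled on the $Encode::Error proposal, not anything in Encode today:

```perl
use strict;
use warnings;
use Encode qw(encode);

our $Error = "";   # hypothetical stand-in for the proposed $Encode::Error

# Hypothetical wrapper: returns undef and sets $Error instead of dying.
sub encode_quietly {
    my ($name, $string) = @_;
    $Error = "";
    # FB_CROAK makes unmappable characters fatal; eval {} contains that.
    my $octets = eval { encode($name, $string, Encode::FB_CROAK) };
    if (!defined $octets) {
        $Error = $@ || "encode failed";
        return;
    }
    return $octets;
}

my $ok  = encode_quietly("iso-8859-1", "caf\xe9");
my $bad = encode_quietly("iso-8859-1", "\x{3042}");
print "failed: $Error" unless defined $bad;
```

Callers who prefer exceptions keep calling encode() directly; callers who prefer a status variable use the wrapper.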

* Charset detection/guessing is still missing
  --  Imperative for the use on network

So please donate code which given an octet stream returns a string suggesting
its encoding name...
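
To make the shape of such code concrete, a deliberately naive sketch: try a fixed list of candidates strictly and name the first that decodes cleanly. The helper name and the candidate list are assumptions for illustration, and telling the Japanese encodings apart would need real heuristics beyond this:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Candidates in order of strictness; iso-8859-1 accepts any octets,
# so it can only serve as a last resort. The list is an assumption.
my @candidates = ("ascii", "UTF-8", "iso-8859-1");

# Hypothetical helper: name of the first candidate that decodes cleanly.
sub suggest_encoding {
    my ($octets) = @_;
    for my $name (@candidates) {
        my $copy = $octets;   # decode() with CHECK may modify its input
        my $ok = eval { decode($name, $copy, Encode::FB_CROAK); 1 };
        return $name if $ok;
    }
    return;
}

print suggest_encoding("plain text"), "\n";
print suggest_encoding("caf\xc3\xa9"), "\n";
print suggest_encoding("caf\xe9"), "\n";
```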

[The 'E' key on my keyboard just broke :-{]

* UTF-8 flag control is still limited and hairy
  -- line discipline and _utf8_on()
  -- how about param() of CGI module, for instance?
* Better and more correct documentation
  -- One of the reasons I split the module.  I wanted perldoc
     to say something when "perldoc Encode::Encoding".
     "perldoc Encoding" is too long and with too many misleading,
     if not false, remarks.

  These are going to be the areas I will work on.

Dan the Man with Too Many Charsets to Handle
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/