perl-unicode

Re: Encode::Tcl Mystery Solved!

2002-01-29 09:32:00
On 2002.01.29, at 23:46, Nick Ing-Simmons wrote:
  There gotta be more ways to tell perl which scalar is UTF-8.

There are - but they are discouraged.

The point is I don't want to see yet another 'discouraged but frequently used' idiomatic phrase.

How about CGIs?

Last time I looked CGI communicated via Sockets, and Sockets are IO.

The CGI module itself reads via PerlIO, right. But most users fetch the result via param(). The CGI module decodes URL escapes (or MIME-encoded queries such as file uploads). Now you have to know in what character set param() returns its values. There the real (en|de)coding business is done AFTER the PerlIO phase is over.
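To make that concrete, here is a minimal sketch with today's Encode. The hard-coded octets stand in for what a param() call would hand back; the point is that the decode step happens in your code, after all IO is done:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for CGI.pm's param(): raw octets, here the UTF-8
# encoding of HIRAGANA LETTER A.  param() cannot know the charset;
# the script author has to supply it.
my $raw = "\xe3\x81\x82";

# Decode the octets into Perl's internal form, naming the charset
# the client actually sent.
my $string = decode('utf8', $raw);

printf "octets: %d, characters: %d\n", length($raw), length($string);
# octets: 3, characters: 1
```

Note that nothing in this path goes through a PerlIO layer; the conversion is purely an in-memory affair.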

You can't make PerlIO the standard way of setting UTF-8 flag.

Not only _can_ I, I _have_ (made it the _standard_ way) - ;-)

And force billions of open() calls in existing code to be rewritten? And force the use of binmode() on Unix users?

We know there are exceptions though so there is the back door.

  Like _utf8_on().

  iff you know the coding a priori, which is usually not the case for
CGIs and many others.

Again last time I looked HTTP expected there to be a
Content-Transfer-Encoding:

The sad fact is that the use of the web, at least in Japan, started BEFORE this header appeared. As a result many web browsers do not trust this header. At one time I needed to downgrade CGI.pm because it started supporting this header. Check the 'Character Set' menu of any web browser; you can always find 'Japanese (Auto Detect)' there. If Content-Transfer-Encoding: were trustworthy enough, this 'Auto Detect' item would not have existed in the first place.

  Well, they are okay in the sense that they convert.  They are still
too slow.

Encode::Tcl is too slow - even for 8-bit - which is why I wrote the
engine which works from the "compiled" form.

Have you tried using ext/Encode/compile to build an XS module for
EUC ?

Not yet. I'll work on it. But *.ucm for CJK is not likely; it simply gets TOO BIG. One of the alternatives I am considering is the approach mentioned in Encode::XS.

If I had _ANY_ test data I would run the compiled test and give you
the comparative number.

You can use t/table.euc under the Jcode module, for instance. table.utf8 in my code example is just a utf8 version thereof. That's data which contains all characters defined in EUC (well, actually JISX0212 is not included, but very few environments can display JISX0212 anyway).

  On what occasion is legacy compatibility required for encode()?

Encode is built on perl. It takes perl strings. Perl has legacy reasons
to treat strings a certain way. Encode just works with what it gets.

To me perl has no 'string'. It's just a PV that happens to store strings. In the age of Unicode we have to be careful about not only the term 'char' but 'string' as well....
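That distinction is observable from pure Perl: the core utf8::is_utf8() reports whether a scalar's PV carries the SvUTF8 flag. A small sketch, assuming a 5.8-era perl with Encode:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xe3\x81\x82";          # three octets sitting in a PV
my $chars = decode('utf8', $bytes);  # one character, SvUTF8 flag set

# Same logical "text", two different internal representations.
print utf8::is_utf8($bytes) ? "flagged\n" : "plain PV\n";  # plain PV
print utf8::is_utf8($chars) ? "flagged\n" : "plain PV\n";  # flagged
```

So a "string" in perl is really a PV plus a flag, and everything downstream (length, regexes, encode) behaves differently depending on that flag.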

As for the need for legacy and Encode: it makes sense for Europeans and
Americans for the things we type to be interpreted as iso8859-1 - if we
then ask for those strings to be encoded in iso2022 or Big5 then that is
a sensible thing for encode() to do - if only to put in the \x1b...
escapes...

Right. Very fortunately the Tcl table does preserve ASCII, while the original table by the Unicode consortium did not.
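For example, here is a sketch with the stock Encode::JP tables: encoding to iso-2022-jp leaves the ASCII run untouched and inserts the \x1b (ESC) shift sequences the 7-bit wire format requires around the Japanese character:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "\x{3042}" is HIRAGANA LETTER A in Perl's internal form.
my $string = "abc\x{3042}";

# iso-2022-jp is a 7-bit encoding: multibyte runs are bracketed by
# ESC sequences, while pure ASCII passes through unchanged.
my $octets = encode('iso-2022-jp', $string);

print $octets =~ /\x1b/ ? "contains ESC\n" : "no ESC\n";
# contains ESC
```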

"        $bytes  = encode(ENCODING, $string[, CHECK])

Encodes string from Perl's internal form into I<ENCODING> and returns
a sequence of octets.  For CHECK see L</"Handling Malformed Data">.

Yes, that "Perl's internal form" was the key. We should be more explicit about that.

"Perl's internal form" means exactly what it says. It _may_ be UTF-8
encoded or raw bytes (on mainframes it may be UTF-EBCDIC encoded).
encode takes that form in its full glory, SvUTF8 mode bits and all,
and converts it to the specified encoding.

The problem is that you have to make sure whether $string is UTF8 or ascii, or you get totally unexpected results like those I showed in my previous articles.
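The failure mode is easy to demonstrate (a sketch with modern Encode, not the exact case from the earlier articles): the same octets produce different encode() output depending on whether they were decoded first, because encode() trusts the internal form it is given.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $bytes = "\xc3\xa9";              # UTF-8 octets for e-acute, flag OFF
my $chars = decode('utf8', $bytes);  # one character, flag ON

# The undecoded scalar is treated as two Latin-1 characters
# (U+00C3, U+00A9), so it gets double-encoded.
my $double = encode('UTF-8', $bytes);
my $once   = encode('UTF-8', $chars);

printf "%d vs %d octets\n", length($double), length($once);
# 4 vs 2 octets
```

This is the classic "mojibake by double encoding": no error is raised, the result is simply wrong.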

  Are you going to tell millions of novice CGI users/writers to use
eval for error handling?

A. Yes - they really should.
B. No  - I am expecting perl5-unicode(_at_)perl(_dot_)org folk to give them an
   Encode module which can "Do What I Mean".
   That _Module_ should do the eval {} if necessary.

IMHO eval is abused as an exception handler. It is after all eval, not try and catch.

eval {} is quite cheap - certainly a lot cheaper than lots of
if ($Encode::error) tests.

Or is it? I pretty much doubt that recompilation is cheaper than assigning to an SV. I once wrote a module that does something like today's HTML::Template for internal use. It was at first implemented using eval {} but was replaced with other methods because of the cost. I agree it is not prohibitively expensive in terms of performance and memory usage, but it is pricey enough to matter under mod_perl, for instance.

However I agree - one should not use eval {}
as a substitute for sensible coding at the layer above - like doing
a find_encoding yourself.

  Should we implement Encode::Carp like CGI::Carp ?

So my suggestion is to silently return undef and set $Encode::Error or
whatever.  I HATE TO USE EVAL TO CATCH EXCEPTIONS!

Well that is sad, because it is the way the perl core works and Encode is
a core module and is likely to stay that way.

Oh, I love eval, and it is one of the big reasons why I use perl. At the same time I know its cost. eval is so versatile that it is too heavy for most cases. If your statement were true, why don't we write

eval { open FH, "<file" or die $! }; die "Can't open file: $@" if $@;

  instead of the ever-popular idiom

open FH, "<file" or die "Can't open file: $!";

As for errors, we should let the caller decide how to handle them.

We can provide both.

We definitely should. But to what extent is a good question. Encode::Carp?
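A sketch of what such a "both" interface could look like. To be clear, decode_or_undef() and $Encode::Carp::Error below are hypothetical names invented for illustration, not part of Encode; the eval {} lives inside the module, so the caller gets the undef-plus-error-variable style:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# HYPOTHETICAL wrapper: returns undef and records the error instead
# of throwing, so callers never have to write eval {} themselves.
our $Error;

sub decode_or_undef {
    my ($enc, $octets) = @_;
    my $string = eval { decode($enc, $octets, FB_CROAK) };
    if ($@) { $Error = $@; return undef }
    $Error = '';
    return $string;
}

# Malformed UTF-8: the caller just checks the return value.
my $s = decode_or_undef('utf8', "\xff\xfe\xff");
print defined $s ? "ok\n" : "failed: error recorded\n";
# failed: error recorded
```

This is essentially the CGI::Carp pattern applied to Encode: the exception machinery still exists underneath, but only the module pays for it.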

So please donate code which, given an octet stream, returns a string suggesting
its encoding name...

  I definitely will.

Dan
