perl-unicode

Re: Encode::Tcl Mistery Solved!

2002-01-29 11:18:41
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
Last time I looked CGI communicated via Sockets, and Sockets are IO.

  CGI module itself reads via PerlIO, right.  But most users fetch the
result via param().  The CGI module decodes URL escapes (or mime-coded
queries such as file uploads).  Now you have to say in what character
set param() returns its values.

No you don't - by this point we are the perl world so we know what
encoding it is - perl's internal one.

There the real (en|de)coding business is done
AFTER the PerlIO phase is over.
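A minimal sketch of that post-PerlIO step (the EUC-JP byte string below stands in for raw param() output; the encoding choice is an assumption for illustration):

```perl
use Encode qw(decode);

# Raw bytes as they might come back from param() - EUC-JP here is an
# assumption for illustration.
my $raw  = "\xA4\xCB\xA4\xDB\xA4\xF3";
my $text = decode('euc-jp', $raw);   # now in Perl's internal form
# $text is a character string: length() counts characters, not octets.
print length($text), "\n";           # 3
```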

You can't make PerlIO the standard way of setting UTF-8 flag.

Not only _can_ I, I _have_ (made it the _standard_ way) - ;-)

  And force billions of open() in existing codes to be rewritten?

No. We have chosen the defaults so that almost all of them can stay the
way they are (we think).
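A small sketch of how those defaults play out - the temp file and UTF-8 contents are assumptions for illustration:

```perl
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "\xC3\xA9";              # the UTF-8 octets for e-acute
close $fh;

# Legacy open() keeps its old meaning: raw octets, no decoding.
open my $raw_fh, '<', $path or die "open: $!";
my $bytes = do { local $/; <$raw_fh> };
print length($bytes), "\n";          # 2 octets

# The :encoding layer makes open() the place where decoding happens,
# setting the UTF-8 flag on what you read.
open my $dec_fh, '<:encoding(UTF-8)', $path or die "open: $!";
my $chars = do { local $/; <$dec_fh> };
print length($chars), "\n";          # 1 character
```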


  Not yet.  I'll work on it.  But *.ucm for CJK is not likely;   It
simply gets TOO BIG.

Which is why ext/Encode/compile exists - to convert .ucm into a managable binary.

One of the alternatives I am thinking is the
approach mentioned in Encode::XS.

If I had _ANY_ test data I would run the compiled test and give you
the comparative number.

  You can use t/table.euc under the Jcode module, for instance.  table.utf8
in my code example is just a utf8 version thereof.  That's data which
contains all the characters defined in EUC (well, actually JISX0212 is not
included, but very few environments can display JISX0212).

Excellent!


  On what occasion is legacy compatibility required for encode()?

Encode is built on perl. It takes perl strings. Perl has legacy reason
to treat strings a certain way. Encode just works with what it gets.

  To me perl has no 'string'.  It is just a PV that happens to store
strings.

But it does store strings - i.e. sequences of characters in perl's
internal set. XS code in perl5.6+ has to be careful to remember that
and not just blindly assume that a byte is a byte - but that is another story.

In the age of Unicode we have to be careful about not only the
term 'char' but 'string' as well....

Yes.


As for the need for legacy behaviour in Encode: it makes sense for
Europeans and Americans for the things we type to be interpreted as
iso8859-1 - if we then ask for those strings to be encoded in iso2022 or
Big5 then that is a sensible thing for encode() to do - if only to put in
the \x1b... escapes...

  Right.  Very fortunately the Tcl table does preserve ASCII while the
original table by the Unicode consortium did not.
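A minimal sketch of that behaviour (the sample string is an assumption for illustration):

```perl
use Encode qw(encode);

# ASCII passes through untouched; the non-ASCII character pulls in the
# ISO-2022 escape sequences.
my $jis = encode('iso-2022-jp', "abc \x{3042}");   # HIRAGANA LETTER A
print substr($jis, 0, 4), "\n";                    # "abc " - unchanged
print $jis =~ /\x1b/ ? "contains ESC\n" : "no ESC\n";
```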

"        $bytes  = encode(ENCODING, $string[, CHECK])

Encodes string from Perl's internal form into I<ENCODING> and returns
a sequence of octets.  For CHECK see L</"Handling Malformed Data">.

  Yes, that "Perl's internal form" was the key.  We should be more
explicit on that.

"Perl's internal form" means exactly what it says. It _may_ be UTF-8
encoded or as raw bytes (on mainframes it may be UTF-EBCDIC encoded).
encode takes that form in its full glory SvUTF8 mode bits and all
and converts it to the specified encoding.
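A short sketch of that conversion (the sample string is an assumption for illustration):

```perl
use Encode qw(encode);

# encode() takes a string in Perl's internal form - UTF8-flagged or not -
# and returns octets in the named encoding.
my $string = "caf\x{e9}";                    # four characters
my $octets = encode('iso-8859-1', $string);  # four octets, last is 0xE9
printf "%d %02X\n", length($octets), ord(substr($octets, -1));   # 4 E9
```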

  The problem is that you have to make sure whether $string is UTF-8 or
ascii, or you get totally unexpected results like those I showed in my
previous articles.

No. Not "unexpected" - exactly what it is specified to do.
All 256 possible byte values have a defined meaning in the non-UTF-8 case.
(Which is iso8859-1 on ASCII machines and native EBCDIC on EBCDIC
mainframes.)
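A minimal sketch of that defined meaning on an ASCII machine (the single byte chosen is an assumption for illustration):

```perl
use Encode qw(encode);

# A byte string without the UTF8 flag: every one of the 256 possible
# byte values already has a meaning - iso8859-1 on an ASCII machine.
my $byte = "\xE9";                    # LATIN SMALL LETTER E WITH ACUTE
my $utf8 = encode('UTF-8', $byte);    # upgrades via that meaning
printf "%vX\n", $utf8;                # C3.A9
```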



  Are you going to tell millions of novice CGI users/writers to use eval
for error handling?

A. Yes - they really should.
B. No  - I am expecting perl5-unicode(_at_)perl(_dot_)org folk to give them
   an Encode module which can "Do What I Mean".
   That _Module_ should do the eval {} if necessary.

  IMHO eval is abused as an exception handler.  It is after all eval,
not try and catch.

Let us not go there - the debate is in the perl5-porters archives.


eval {} is quite cheap - certainly a lot cheaper than lots of
if ($Encode::error) tests.

  Or is it?  I pretty much doubt if recompilation

I said eval {} not eval "" - no recompilation there at all.

is cheaper than
assigning to SV.

But eval {} is comparable to setting SV and then testing the SV.
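A small sketch of the catch-with-block-eval pattern (FB_CROAK asks encode() to croak on unmappable input; the sample character is an assumption for illustration):

```perl
use Encode qw(encode);

# Block eval is compiled along with the surrounding code - catching the
# croak here does not recompile anything.
my $octets = eval { encode('iso-8859-1', "\x{3042}", Encode::FB_CROAK) };
print defined $octets ? "encoded\n" : "caught: error path taken\n";
```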


  Should we implement Encode::Carp like CGI::Carp ?

No.


  Oh, I love eval, and that is one of the big reasons why I use perl.  At
the same time I know the cost thereof.  eval is so versatile that it is
too heavy for most cases.  If your statement is true, why don't we write

eval{ open FH, "<file"; }; die "Can't open file: $@" if $@ ?

  Instead of ever-popular idiom;

open FH, "<file" or die "Can't open file: $!"

That is a 'throw' not a 'catch'. You do it that way precisely so an
outer eval {} can 'catch' it. If open had been designed to throw in the
first place one would just write:

  open FH, "<file";  # will automatically die with the right message.

for the normal die case - in those few cases where one writes

  if (open(FH,...)) {
  }
  else {
  }

One would write

  eval { open(FH,...) };
  unless ($@) {
  }
  else {
  }

But it is not a fair comparison - files get misnamed far more often
than guess_encoding() should try for a non-existent character encoding.
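For reference, the interface being discussed here eventually took shape as the Encode::Guess module that now ships with Encode - an anachronism relative to this thread, so treat this as a sketch of that later module:

```perl
use Encode::Guess;

# guess_encoding() returns an Encode::Encoding object on success, or a
# diagnostic string when the guess fails or is ambiguous.
my $enc = guess_encoding("plain ascii text");
print ref($enc) ? "guessed: " . $enc->name . "\n" : "no guess: $enc\n";
```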



As for errors, we should let the caller decide how to handle them.

We can provide both.

  We definitely should.  But to what extent is a good question.
Encode::Carp ?

Not if we can avoid it.


So please donate code which, given an octet stream, returns a string
suggesting its encoding name...

  I definitely will.

Dan
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

