perl-unicode

Re: Encode::Tcl Mystery Solved!

2002-01-29 09:32:00
On 2002.01.29, at 23:46, Nick Ing-Simmons wrote:
  There gotta be more ways to tell perl which scalar is UTF-8.

There are - but they are discouraged.

The point is I don't want to see yet another 'discouraged but frequently used' idiomatic phrase.

How about CGIs?

Last time I looked CGI communicated via Sockets, and Sockets are IO.

The CGI module itself reads via PerlIO, right. But most users fetch the result via param(). The CGI module decodes URL escapes (or MIME-encoded queries such as file uploads). Now you have to know in what character set param() returns its values. There the real (en|de)coding business is done AFTER the PerlIO phase is over.
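To make that concrete, here is a minimal sketch with today's Encode. The hard-coded octets stand in for what a param() call would hand back; the point is that the decode step happens in your code, after all IO is done:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for CGI.pm's param(): raw octets, here the UTF-8
# encoding of HIRAGANA LETTER A.  param() cannot know the charset;
# the script author has to supply it.
my $raw = "\xe3\x81\x82";

# Decode the octets into Perl's internal form, naming the charset
# the client actually sent.
my $string = decode('utf8', $raw);

printf "octets: %d, characters: %d\n", length($raw), length($string);
# octets: 3, characters: 1
```

Note that nothing in this path goes through a PerlIO layer; the conversion is purely an in-memory affair.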

You can't make PerlIO the standard way of setting UTF-8 flag.

Not only _can_ I, I _have_ (made it the _standard_ way) - ;-)

And force billions of open() calls in existing code to be rewritten? And force the use of binmode() on Unix users?

We know there are exceptions though so there is the back door.

  Like _utf8_on().

  iff you know the coding a priori, which is usually not the case for
CGIs and many others.

Again last time I looked HTTP expected there to be a
Content-Transfer-Encoding:

The sad fact is that the use of the web, at least in Japan, started BEFORE this header appeared. As a result many web browsers do not trust this header. At one time I needed to downgrade CGI.pm because it started supporting this header. Check the 'Character Set' menu of any web browser; you can always find 'Japanese (Auto Detect)' there. If Content-Transfer-Encoding: were trustworthy enough, this 'Auto Detect' item would not have existed in the first place.

  Well, they are okay in the sense that they convert.  They are still
too slow.

Encode::Tcl is too slow - even for 8-bit - which is why I wrote the
engine which works from the "compiled" form.

Have you tried using ext/Encode/compile to build an XS module for
EUC ?

Not yet. I'll work on it. But *.ucm for CJK is not likely; it simply gets TOO BIG. One of the alternatives I am considering is the approach mentioned in Encode::XS.

If I had _ANY_ test data I would run the compiled test and give you
the comparative number.

You can use t/table.euc under the Jcode module, for instance. table.utf8 in my code example is just a utf8 version thereof. That's data which contains all characters defined in EUC (well, actually JISX0212 is not included, but very few environments can display JISX0212 anyway).

  On what occasion is legacy compatibility required for encode()?

Encode is built on perl. It takes perl strings. Perl has legacy reasons
to treat strings a certain way. Encode just works with what it gets.

To me perl has no 'string'. It's just a PV that happens to store strings. In the age of Unicode we have to be careful about not only the term 'char' but 'string' as well....
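That distinction is observable from pure Perl: the core utf8::is_utf8() reports whether a scalar's PV carries the SvUTF8 flag. A small sketch, assuming a 5.8-era perl with Encode:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xe3\x81\x82";          # three octets sitting in a PV
my $chars = decode('utf8', $bytes);  # one character, SvUTF8 flag set

# Same logical "text", two different internal representations.
print utf8::is_utf8($bytes) ? "flagged\n" : "plain PV\n";  # plain PV
print utf8::is_utf8($chars) ? "flagged\n" : "plain PV\n";  # flagged
```

So a "string" in perl is really a PV plus a flag, and everything downstream (length, regexes, encode) behaves differently depending on that flag.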

As for the need for legacy and Encode: it makes sense for Europeans and
Americans for the things we type to be interpreted as iso8859-1 - if we
then ask for those strings to be encoded in iso2022 or Big5 then that is
a sensible thing for encode() to do - if only to put in the \x1b...
escapes...

Right. Very fortunately the Tcl table does preserve ASCII, while the original table by the Unicode consortium did not.
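For example, here is a sketch with the stock Encode::JP tables: encoding to iso-2022-jp leaves the ASCII run untouched and inserts the \x1b (ESC) shift sequences the 7-bit wire format requires around the Japanese character:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "\x{3042}" is HIRAGANA LETTER A in Perl's internal form.
my $string = "abc\x{3042}";

# iso-2022-jp is a 7-bit encoding: multibyte runs are bracketed by
# ESC sequences, while pure ASCII passes through unchanged.
my $octets = encode('iso-2022-jp', $string);

print $octets =~ /\x1b/ ? "contains ESC\n" : "no ESC\n";
# contains ESC
```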

"        $bytes  = encode(ENCODING, $string[, CHECK])

Encodes string from Perl's internal form into I<ENCODING> and returns
a sequence of octets.  For CHECK see L</"Handling Malformed Data">.

Yes, that "Perl's internal form" was the key. We should be more explicit about that.

"Perl's internal form" means exactly what it says. It _may_ be UTF-8
encoded or raw bytes (on mainframes it may be UTF-EBCDIC encoded).
encode takes that form in its full glory, SvUTF8 mode bits and all,
and converts it to the specified encoding.

The problem is that you have to make sure whether $string is UTF8 or ascii, or you get totally unexpected results like those I showed in my previous articles.
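The failure mode is easy to demonstrate (a sketch with modern Encode, not the exact case from the earlier articles): the same octets produce different encode() output depending on whether they were decoded first, because encode() trusts the internal form it is given.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $bytes = "\xc3\xa9";              # UTF-8 octets for e-acute, flag OFF
my $chars = decode('utf8', $bytes);  # one character, flag ON

# The undecoded scalar is treated as two Latin-1 characters
# (U+00C3, U+00A9), so it gets double-encoded.
my $double = encode('UTF-8', $bytes);
my $once   = encode('UTF-8', $chars);

printf "%d vs %d octets\n", length($double), length($once);
# 4 vs 2 octets
```

This is the classic "mojibake by double encoding": no error is raised, the result is simply wrong.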

  Are you going to tell millions of novice CGI users/writers to use
eval for error handling?

A. Yes - they really should.
B. No  - I am expecting perl5-unicode(_at_)perl(_dot_)org folk to give them an
   Encode module which can "Do What I Mean".
   That _Module_ should do the eval {} if necessary.

IMHO eval is abused as an exception handler. It is after all eval, not try and catch.

eval {} is quite cheap - certainly a lot cheaper than lots of
if ($Encode::error) tests.

Or is it? I pretty much doubt that recompilation is cheaper than assigning to an SV. I once wrote a module that does something like today's HTML::Template for internal use. It was at first implemented using eval {} but was replaced with other methods because of the cost. I agree it is not prohibitively expensive in terms of performance and memory usage, but it is pricey enough to matter under mod_perl, for instance.

However I agree - one should not use eval {}
as a substitute for sensible coding at the layer above - like doing
a find_encoding yourself.

  Should we implement Encode::Carp like CGI::Carp ?

So my suggestion is to silently return undef and set $Encode::Error or
whatever.  I HATE TO USE EVAL TO CATCH EXCEPTIONS!

Well that is sad, because it is the way the perl core works and Encode is
a core module and is likely to stay that way.

Oh, I love eval, and it is one of the big reasons why I use perl. At the same time I know its cost. eval is so versatile that it is too heavy for most cases. If your statement were true, why don't we write

eval { open FH, "<file" or die $! }; die "Can't open file: $@" if $@;

  instead of the ever-popular idiom

open FH, "<file" or die "Can't open file: $!";

As for errors, we should let the caller decide how to handle them.

We can provide both.

We definitely should. But to what extent is a good question. Encode::Carp?
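A sketch of what such a "both" interface could look like. To be clear, decode_or_undef() and $Encode::Carp::Error below are hypothetical names invented for illustration, not part of Encode; the eval {} lives inside the module, so the caller gets the undef-plus-error-variable style:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# HYPOTHETICAL wrapper: returns undef and records the error instead
# of throwing, so callers never have to write eval {} themselves.
our $Error;

sub decode_or_undef {
    my ($enc, $octets) = @_;
    my $string = eval { decode($enc, $octets, FB_CROAK) };
    if ($@) { $Error = $@; return undef }
    $Error = '';
    return $string;
}

# Malformed UTF-8: the caller just checks the return value.
my $s = decode_or_undef('utf8', "\xff\xfe\xff");
print defined $s ? "ok\n" : "failed: error recorded\n";
# failed: error recorded
```

This is essentially the CGI::Carp pattern applied to Encode: the exception machinery still exists underneath, but only the module pays for it.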

So please donate code which, given an octet stream, returns a string suggesting
its encoding name...

  I definitely will.

Dan
