perl-unicode

Encode and CGI

2002-01-29 10:14:08
I changed the subject to be more appropriate.

On 2002.01.30, at 01:27, Nick Ing-Simmons wrote:
> In an ideal world CGI.pm (which is also bundled with perl these days)
> will have done any _utf8_on() magic that is required - usually by
> looking at the charset attribute of the media type and then calling
> Encode::decode() to convert data into perl's internal form. Other CGI
> assist modules should do the same - what they need is a well-defined
> Encode module that allows them to do what the standards say without
> having to re-invent everything themselves.

It is up to Lincoln Stein to decide which way to go, but the demand to keep CGI.pm compatible with older versions of Perl is so high that we should not count on that. Alternatively, someone could implement a more modern version under a different namespace, as Lincoln himself admits.
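
For concreteness, the decoding step Nick describes might look like the sketch below. The charset regexp and the ISO-8859-1 fallback are my assumptions, not anything CGI.pm actually does today:

    use CGI;
    use Encode qw(decode);

    my $q = CGI->new;

    # Sketch: pull the charset attribute out of the request's
    # Content-Type header (CONTENT_TYPE in the CGI environment) and
    # decode the raw form data into perl's internal form.  The
    # iso-8859-1 fallback is an assumption for when no charset= is sent.
    my ($charset) = ($ENV{CONTENT_TYPE} || '') =~ /charset="?([^\s";]+)"?/i;
    $charset ||= 'iso-8859-1';

    my $comment = decode($charset, $q->param('comment'));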

> So the CGI scripter just has to work with perl's strings (encoded as
> perl sees fit), and then just "hint" (if necessary) to the CGI module
> how it should be encoded for transport back. I would expect the CGI.pm
> code to make sensible choices without hints in most cases - e.g. reply
> in the same encoding as the request was received in.

That "same encoding" is somewhat problematic especially when Japanese is involved. One problem that later version of CGI.pm caused was exactly that (I forgot which version it was). Before the change charset= part of Content-Type: was not sent so it was up to HTML body to tell the browser which charset to use. Now charset="ISO-8859-1" is appended by default while the users of CGI.pm keep sending in Shift JIS, EUC or ISO-2022-JP. Actually charset is in Japan has gotten even more complicated when NTT DOCOMO introduced (in)famous i-Mode. i-Mode not only uses Shift JIS (the most popular yet most problematic charset used in Japan), they also added their own extension (mostly dingbats that are used like icons). Oh well....

> But we cannot do this yet as Encode does not really support some key
> MIME charsets - notably the iso2022 family of escape encodings.
> I don't have the standards - they are paper copy things one buys for ¥
> (cannot find the Yen sign on this keyboard) and may understandably be
> written in Japanese - which I cannot read, nor do I have any test data.
> (Other than the piles of assumed-Chinese SPAM that I seem to
> accumulate - but I don't know whether that is "valid".)

Right. We need more testers for that. I know the Japanese charsets, but I don't know much about the others.
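
For anyone who wants to help, the basic check we need is a simple round trip. A sketch, assuming an "iso-2022-jp" codec is (or eventually will be) registered with Encode:

    use Encode qw(encode decode);
    use Test::More tests => 1;

    # Round-trip check: characters -> iso-2022-jp octets -> characters.
    my $chars = "\x{65e5}\x{672c}\x{8a9e}";    # "Japanese" in kanji
    my $bytes = encode('iso-2022-jp', $chars);
    is(decode('iso-2022-jp', $bytes), $chars, 'iso-2022-jp round-trips');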

> That is the ideal - well-formed HTTP requests. We also need to handle
> legacy stuff and "guess" appropriately. But it seems to me that until
> we have a solution (with acceptable performance) for the well-formed
> case, it is pointless to worry about the "guess" case.

I think "guess" case is needed only for Japanese. Other CJK situation is not this complicated. Usually "Legacy + UTF8" (That is, GB2312 or UTF8 for Simplified Chinese, for instance).

> I will gladly re-word the sections you consider misleading and
> check and correct if necessary the ones you consider false.
> Can you give me a list as you spot them?

I will.

> So long as you check your facts as you go, that is a welcome
> contribution. But do not, for example, suggest "you should always do
> _utf8_on() before calling encode()", because it isn't true.

No, I won't, but at the same time I still don't know when to and when not to. I think we need more working examples before we come up with an idiom....
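
One piece of the idiom I am fairly confident about already: treat decode() and encode() as the byte/character boundary and leave the UTF-8 flag alone, since _utf8_on() merely asserts that a scalar already holds UTF-8 octets and lies when it does not. A sketch, with EUC-JP as the example legacy charset:

    use Encode qw(decode encode);

    my $euc_bytes = "\xC6\xFC\xCB\xDC\xB8\xEC";   # "Japanese" in EUC-JP

    # bytes -> characters: this is where decoding belongs,
    # not a blind _utf8_on().
    my $chars = decode('euc-jp', $euc_bytes);

    # characters -> bytes for transport, in whatever charset we want.
    my $utf8_bytes = encode('utf8', $chars);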

Dan
