perl-unicode

Re: Encode::Tcl Mistery Solved!

2002-01-29 09:28:40
Nick Ing-Simmons <nick(_dot_)ing-simmons(_at_)elixent(_dot_)com> writes:

[The 'E' key on my keyboard just broke :-{]


Now mended.


Again last time I looked HTTP expected there to be a
Content-Transfer-Encoding:

Right idea, wrong keyword, I meant

Content-Type: ... charset=

RFC 2616 (HTTP) Section 3.7.1

"
   The "charset" parameter is used with some media types to define the
   character set (section 3.4) of the data. When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP. Data in character sets other than "ISO-8859-1" or
   its subsets MUST be labeled with an appropriate charset value. See
   section 3.4.1 for compatibility problems.
"

Perl's default behaviour is designed to be compatible with that requirement.
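
To make the rule quoted above concrete, here is a minimal sketch of picking a charset from a Content-Type value. The helper name charset_for is hypothetical (it is not part of Encode or CGI.pm), and the regex is a simplification of the full media-type grammar.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the charset for an HTTP body from its
# Content-Type header value, applying the RFC 2616 default for text/*.
sub charset_for {
    my ($content_type) = @_;
    return undef unless defined $content_type;

    # Explicit charset parameter, optionally quoted.
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i) {
        return $1;
    }

    # No explicit charset: text/* defaults to ISO-8859-1 over HTTP.
    return 'ISO-8859-1' if $content_type =~ m{^\s*text/}i;

    # Non-text type with no charset parameter: nothing to assume.
    return undef;
}
```

The returned name can then be handed to Encode::decode(); a non-text type without a charset returns undef so the caller can decide what to do.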

As for errors we
should let the caller decide how to handle them.

We can provide both.

* Charset detection/guessing is still missing
  --  Imperative for use on the network

So please donate code which, given an octet stream, returns a string suggesting
its encoding name...

* UTF-8 flag control is still limited and hairy
  -- line discipline and _utf8_on()
  -- how about param() of CGI module, for instance?

In an ideal world CGI.pm (which is also bundled with perl these days)
will have done any _utf8_on() magic that is required - usually
by looking at the charset attribute of the media type and then
calling Encode::decode() to convert data into perl's internal form.
Other CGI assist modules should do likewise - what they need
is a well defined Encode module that allows them to do what the standards
say without having to re-invent everything themselves.

So the CGI scripter just has to work with perl's strings (encoded as
perl sees fit), and then just "hint" (if necessary) to CGI module how
it should be encoded for transport back. I would expect the CGI.pm code
to make sensible choices without hints in most cases - e.g. reply
in same encoding as request was received in.
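
A rough sketch of that round trip, assuming the charset has already been extracted from the request. The helpers decode_param and encode_reply are hypothetical names for illustration only; this is not what CGI.pm does today.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw parameter octets into a perl string,
# using the charset announced by the request.
sub decode_param {
    my ($charset, $octets) = @_;
    return decode($charset, $octets);
}

# Hypothetical helper: turn a perl string back into octets for transport.
sub encode_reply {
    my ($charset, $string) = @_;
    return encode($charset, $string);
}

my $charset = 'UTF-8';               # assumed to come from the request
my $raw     = "caf\xC3\xA9";         # 5 octets as received
my $name    = decode_param($charset, $raw);
my $chars   = length $name;          # the script counts characters, not octets

# Reply in the same encoding the request arrived in.
my $reply = encode_reply($charset, "Hello, $name");
```

The script in the middle never needs to know which encoding was on the wire; only the two boundary calls do.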

But we cannot do this yet as Encode does not really support some key
MIME charsets - notably the iso2022 family of escape encodings.
I don't have the standards - they are paper copy things one buys for £
(cannot find Yen sign on this keyboard) and may understandably be written
in Japanese - which I cannot read, nor do I have any test data.
(Other than the piles of assumed-Chinese SPAM that I seem to accumulate - but
I don't know that it is "valid".)

That is the ideal - well formed HTTP requests. We also need to handle legacy
stuff and "guess" appropriately. But it seems to me that until we have
a solution (with acceptable performance) to the well formed case,
it is pointless to worry about the "guess" case.
(Given infinite speed, the guess could simply try all possible encodings and choose
the one with the fewest failures...)
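
A naive sketch of that brute-force guess, to show the shape of the interface: octets in, encoding name out. The function name guess_encoding_name is hypothetical, and counting U+FFFD substitution characters as "failures" is my assumption about how to score a candidate. Note that single-byte encodings such as ISO-8859-1 accept every octet, so candidate order matters and ties go to the earlier name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode the octets with every candidate encoding and
# return the name that produced the fewest substitution characters.
sub guess_encoding_name {
    my ($octets, @candidates) = @_;
    my ($best, $best_failures);

    for my $name (@candidates) {
        my $copy = $octets;          # keep the caller's data untouched
        my $text = eval { decode($name, $copy, Encode::FB_DEFAULT) };
        next unless defined $text;   # unknown encoding name: skip it

        # Count U+FFFD characters inserted for undecodable input.
        my $failures = () = $text =~ /\x{FFFD}/g;

        if (!defined $best_failures || $failures < $best_failures) {
            ($best, $best_failures) = ($name, $failures);
        }
    }
    return $best;                    # undef if no candidate was usable
}
```

For example, guess_encoding_name($octets, 'UTF-8', 'ISO-8859-1') prefers UTF-8 whenever the octets are well-formed UTF-8 and falls back to ISO-8859-1 otherwise. It cannot tell two single-byte encodings apart.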

* Better and more correct documentation
  -- One of the reasons I split the module.  I wanted perldoc
     to say something when "perldoc Encode::Encoding".  "perldoc Encoding"
     is too long and with too many misleading, if not false, remarks.

I will gladly re-word the sections you consider misleading and
check and correct if necessary the ones you consider false.
Can you give me a list as you spot them?


  These are going to be the areas I will work on.

So long as you check your facts as you go that is a welcome contribution.
But do not for example suggest "you should always do _utf8_on() before
calling encode()" because it isn't true.
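
A small illustration of why, as a sketch: encode() works on the characters of a string, whichever way perl happens to be storing them internally, so the flag does not need to be forced first. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $plain = "caf\xE9";               # 4 characters, stored as bytes (flag off)
my $flag_before = utf8::is_utf8($plain) ? 1 : 0;

my $upgraded = $plain;
utf8::upgrade($upgraded);            # same 4 characters, stored as UTF-8 (flag on)

# Both encode to the same octets; no _utf8_on() call is involved.
my $from_plain    = encode('UTF-8', $plain);
my $from_upgraded = encode('UTF-8', $upgraded);
```

Forcing the flag on $plain with _utf8_on() instead would not convert anything; it would merely relabel the byte \xE9 as UTF-8, which it is not.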

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/