perl-unicode

Re: Encode::MIME::Header my 2¢

2002-10-06 20:30:05
On Monday, Oct 7, 2002, at 06:14 Asia/Tokyo, Nick Ing-Simmons wrote:
I have re-started work on Unicode aware perl/Tk - and I am playing
with it in "tkmail" (as a test app). Obviously Encode::MIME is just
the thing for a mail tool.

However the encode ops are not ideal:

I know it is not. One reason is that it has to follow the Encode API. MIME header encoding is in essence double encoding, so a plain encode()/decode() call lacks the fine-grained controls you may want.

If you encode('MIME-Header',...) it _seems_ to always use the 'B' form
maybe my tests are not extensive enough.

That one is already documented:

perldoc Encode::MIME::Header
ABSTRACT
       This module implements RFC 2047 Mime Header Encoding.
       There are 3 variant encoding names; "MIME-Header",
       "MIME-B" and "MIME-Q".  The difference is described below

                     decode()          encode()
         ----------------------------------------------
         MIME-Header Both B and Q      =?UTF-8?B?....?=
         MIME-B      B only; Q croaks  =?UTF-8?B?....?=
         MIME-Q      Q only; B croaks  =?UTF-8?Q?....?=
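As a rough illustration of the table above, the three names can be exercised as follows. This is only a minimal sketch: the sample string and the variable names are assumptions chosen for the example, and the exact output may vary between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sample text (an assumption for illustration): three kanji, given as
# escapes so the script does not depend on the source file encoding.
my $subject = "\x{65e5}\x{672c}\x{8a9e}";

# The encoding name selects the form used by encode().
my $b_form = encode('MIME-B',      $subject);   # B form
my $q_form = encode('MIME-Q',      $subject);   # Q form
my $h_form = encode('MIME-Header', $subject);   # documented to use the B form

# decode('MIME-Header', ...) accepts both forms.
my $from_b = decode('MIME-Header', $b_form);
my $from_q = decode('MIME-Header', $q_form);

print "$b_form\n$q_form\n$h_form\n";
print "round trip ok\n" if $from_b eq $subject && $from_q eq $subject;
```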

The problem is that finer control would require extra arguments to (en|de)code, and that would make ordinary (en|de)coding too hard to use.

If I encode('MIME-Q',...) (as currently) then I seem to get all the
' ' inside the =?UTF-8?Q?...?= and so they become =20 it also seems
to wrap all the ASCII parts too. While this is not wrong, it makes
things less readable for mail clients which don't understand and so
leave the markup for user to see.

I am not sure about that one. I have received mail arguing the opposite, asking for strict RFC 2047 compliance (in Jcode), especially where line folding was concerned. So I made Encode::MIME::Header RFC 2047 compliant. But I agree that =20 instead of '_' may be too much. Nevertheless, =20 is exactly what RFC 2047 itself uses in its example:

RFC 2047
 As a consequence, unencoded white space
   characters (such as SPACE and HTAB) are FORBIDDEN within an
   'encoded-word'.  For example, the character sequence

      =?iso-8859-1?q?this is some text?=

   would be parsed as four 'atom's, rather than as a single 'atom' (by
   an RFC 822 parser) or 'encoded-word' (by a parser which understands
   'encoded-words').  The correct way to encode the string "this is some
   text" is to encode the SPACE characters as well, e.g.

      =?iso-8859-1?q?this=20is=20some=20text?=

And more on the "Q" encoding:

4.2. The "Q" encoding

   The "Q" encoding is similar to the "Quoted-Printable" content-
   transfer-encoding defined in RFC 2045.  It is designed to allow text
   containing mostly ASCII characters to be decipherable on an ASCII
   terminal without decoding.

   (1) Any 8-bit value may be represented by a "=" followed by two
       hexadecimal digits.  For example, if the character set in use
       were ISO-8859-1, the "=" character would thus be encoded as
       "=3D", and a SPACE by "=20".  (Upper case should be used for
       hexadecimal digits "A" through "F".)

   (2) The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be
       represented as "_" (underscore, ASCII 95.).  (This character may
       not pass through some internetwork mail gateways, but its use
       will greatly enhance readability of "Q" encoded data with mail
       readers that do not support this encoding.)  Note that the "_"
       always represents hexadecimal 20, even if the SPACE character
       occupies a different code position in the character set in use.

   (3) 8-bit values which correspond to printable ASCII characters other
       than "=", "?", and "_" (underscore), MAY be represented as those
       characters.  (But see section 5 for restrictions.)  In
       particular, SPACE and TAB MUST NOT be represented as themselves
       within encoded words.
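Rules (1) to (3) above can be sketched in a few lines of Perl. Note that q_encode_word() is a hypothetical helper written for this illustration, not the code inside Encode::MIME::Header; it assumes the input is short enough to fit in one encoded-word and does no line folding. Its third argument switches between the "=20" and "_" spellings of SPACE.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of Encode): Q-encode one short string as a
# single encoded-word.  No folding to the 75-character limit is attempted.
sub q_encode_word {
    my ($text, $charset, $use_underscore) = @_;

    # Characters to octets in the target charset.
    my $bytes = encode($charset, $text);

    # Keep only ASCII letters and digits literal (always safe under rule 3).
    # SPACE becomes "_" (rule 2) or "=20" (rule 1); every other octet
    # becomes "=XX" with upper-case hex digits (rule 1).
    $bytes =~ s{([^A-Za-z0-9])}{
        ($1 eq ' ' && $use_underscore) ? '_' : sprintf('=%02X', ord $1)
    }ge;

    return "=?$charset?Q?$bytes?=";
}

print q_encode_word('this is some text', 'iso-8859-1', 0), "\n";
print q_encode_word('this is some text', 'iso-8859-1', 1), "\n";
```

Both results are valid under the rules quoted above; the second is the more readable one on a client that does not decode encoded-words.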

With this understood,

Suggestions:
 - leave ASCII or even iso-8859-1 sequences as such

RFC 2047 allows only printable ASCII to appear unencoded, so I have to decline this one. 'MIME-Q' is already implemented that way. The bottom line is that I do not want to give up RFC 2047 conformance.

 - wrap sequences of ch > 0xff in whichever of 'Q' or 'B' is shorter
   (do both encodings and throw one away).

I'll consider this one instead. This one at least does not breach RFC 2047.
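A minimal sketch of the "do both and throw one away" idea, built on the existing 'MIME-B' and 'MIME-Q' names. The helper encode_mime_shorter() is hypothetical and is not part of Encode; it also compares whole header values rather than individual non-ASCII sequences, which is a simplification of the suggestion.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of Encode): encode the whole string both
# ways and keep the shorter result.  Ties go to Q, which stays more
# readable on clients that do not decode encoded-words.
sub encode_mime_shorter {
    my ($str) = @_;
    my $b_form = encode('MIME-B', $str);
    my $q_form = encode('MIME-Q', $str);
    return length($q_form) <= length($b_form) ? $q_form : $b_form;
}

# Sample input (an assumption for illustration): three kanji.
my $kanji = "\x{65e5}\x{672c}\x{8a9e}";
print encode_mime_shorter($kanji), "\n";
```

Text that is mostly non-ASCII tends to come out shorter in B, since Q spends three characters on every such octet.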

Are patches in that direction likely to be accepted or do I build
a MIME-Smart on top ?

As I said, Encode::MIME::Header has these restrictions:

* the Encode API
* RFC 2047

This is very restrictive considering the nature of MIME header encoding. Surprisingly, the Encode::MIME namespace itself remains empty, and maybe we can make use of it....

Dan the Encode Maintainer