(ccing the list since it may be of interest)
I'd like to thank Rune for the patch, but unfortunately the quick
fix was inadequate for proper (generic) 1522 support in mhonarc.
> Apart from the fact that it only works with iso-8859-1, what are the
> limitations that you're referring to?
The problem from a conversion standpoint is that you lose the reference
to which charset the data is encoded in. The patch you give decodes
the data, but then mhonarc has no information on how to properly
convert the data into HTML, since the charset information is lost.
Mhonarc would not know if it was dealing with iso-8859-1, iso-8859-2,
a Japanese charset, etc. Character 0xA5 can represent a lot of different
characters depending on which charset is in use.
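To illustrate the point, here is a minimal sketch (not readmail.pl code;
the routine name is just for illustration) showing that an RFC 1522
encoded-word carries its charset label alongside the data, so a decoder
that returns only the raw bytes discards exactly the information needed
for conversion:

```perl
#!/usr/bin/perl
# Split an encoded-word into its (charset, encoding, data) parts.
# Illustrative only; not the actual readmail.pl interface.
sub parse_encoded_word {
    my($word) = @_;
    if ($word =~ /^=\?([^?]+)\?([QqBb])\?([^?]*)\?=$/) {
        return ($1, $2, $3);
    }
    return ();
}

# 0xA5 is a different character in iso-8859-1 than in iso-8859-2;
# only the charset label in the encoded-word tells us which.
my($charset, $enc, $data) = parse_encoded_word('=?iso-8859-2?Q?=A5?=');
print "charset=$charset encoding=$enc data=$data\n";
```

A decoder that hands back only `$data` has thrown away `$charset`, which
is the piece a converter actually needs.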
In order to facilitate proper conversion, the charset information must
be preserved in some way to allow mhonarc (or any other filter using
readmail.pl) to properly convert the data. Hence, I have taken an
approach similar to 1521 processing. To perform 1522 processing,
one must register functions by charset with readmail.pl in order to
tell readmail.pl how to properly convert the data according to
the needs of the application (prototype of callback functions shown
in my first message).
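The registration idea can be sketched as follows. This is only an
illustration of the approach, under assumed names (%charset_filters,
register_charset_filter, convert_to_html are not the actual readmail.pl
interface):

```perl
#!/usr/bin/perl
# Hypothetical sketch of registering per-charset conversion callbacks.
my %charset_filters;    # charset name -> conversion routine

sub register_charset_filter {
    my($charset, $func) = @_;
    $charset_filters{lc $charset} = $func;
}

# Convert decoded data using the routine registered for its charset;
# pass the data through untouched if no routine is registered.
sub convert_to_html {
    my($charset, $data) = @_;
    my $func = $charset_filters{lc $charset};
    return defined($func) ? &$func($data) : $data;
}

# Example: a trivial iso-8859-1 filter that escapes HTML specials.
register_charset_filter('iso-8859-1', sub {
    my($s) = @_;
    $s =~ s/&/&amp;/g;  $s =~ s/</&lt;/g;  $s =~ s/>/&gt;/g;
    return $s;
});

print convert_to_html('ISO-8859-1', '<a & b>'), "\n";
```

The application decides which charsets to register, so anything it
cannot handle simply passes through unconverted.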
The other issue is that 1522 processing should be optional. In some
cases, 1522 processing cannot be done properly due to environmental
issues, or there is no simple mapping from the charset to the
destination format. For example, I may want to convert iso-8859
charsets, but not Japanese charsets, since my system cannot deal with
them.
A mhonarc-related issue is the problem of storing the data in the
database. Decoded data may contain characters that cause problems
when generating proper Perl code (remember, the database is just
a Perl program), especially with multi-byte charsets. The approach
I'm taking with mhonarc is to store the data in its original 1522
encoded form and convert it only when deemed necessary.
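The store-encoded, decode-on-demand idea looks roughly like this. A
minimal sketch, assuming Q encoding only; decode_q and the database
keys are illustrative, not mhonarc's actual code:

```perl
#!/usr/bin/perl
# The database keeps the header in its 7-bit 1522-encoded form, so no
# raw 8-bit or multi-byte characters ever land in the generated Perl.
my %db;
$db{'subject'} = '=?iso-8859-1?Q?Caf=E9?=';   # stored as-is

# Decode RFC 1522 Q encoding: '_' is space, =XX is a hex-encoded byte.
sub decode_q {
    my($s) = @_;
    $s =~ s/_/ /g;
    $s =~ s/=([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;
    return $s;
}

# Decode lazily, only at output-generation time, with the charset
# label still available to pick the right converter.
if ($db{'subject'} =~ /^=\?([^?]+)\?[Qq]\?([^?]*)\?=$/) {
    my($charset, $raw) = ($1, $2);
    printf "charset=%s text=%s\n", $charset, decode_q($raw);
}
```

Since the encoded form is plain 7-bit ASCII, it is always safe to write
into the database regardless of what charset the original data used.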
> I guess that the simplest way to perform the character-set translation
> is by using an array for converting each charset to some universal
> format, and perform the inverse operation to get the characters to the
> format that you want (guess you probably already knew this).
Defining a universal format is very difficult. Unicode is an attempt
at it, but it still does not cover all possible languages. I'm not
in the business of defining the all-encompassing charset. Also, it is
nearly impossible to find all the information on every charset that
exists and to define an all-encompassing format that can represent them
all. That's a lot to ask of someone doing this in their spare time.
From a design point of view, it is simpler to rely on callback routines
to perform the conversion. It is easier (and quicker) to implement
and support, and it gets the job done. Plus, it gives the user a lot
of flexibility to customize the filtering process.
Earl Hood | ISOGEN INTERNATIONAL CORP
ehood(_at_)isogen(_dot_)com | dba Highland Consulting
Phone: 214-953-0004 x127 | 2200 North Lamar #230
FAX: 214-953-3152 | Dallas, TX 75202