mhonarc-users

Re: how to change mhonarc's behaviour in qp decoding

1998-04-21 12:51:50
On April 16, 1998 at 21:43, Hasan Karahasan, DJ2xt wrote:

What you are talking about has nothing to do with quoted-printable
decoding, but character set conversion.
Imho it has to do with qp, sir. But let me give you an example from my

No it doesn't.  I suggest you review the MIME standard.
Content-Transfer-Encoding is separate from content-type.

The German language has some special letters called the "umlauts".
Perhaps you have seen a U, an O or an A with two dots above?

Yes I have.

These umlauts are coded as 8-bit characters. Unfortunately there are two
different ASCII table mappings, where these letters appear at different
positions. One of them is the well-known iso-8859-1. The other is the
IBM codepage 437.

A part of my resource file looks like:
<DECODEHEADS>
<CharsetConverters>>
---------------------^
plain;iso_8859::str2sgml;iso8859.pl
us-ascii;iso_8859::str2sgml;iso8859.pl
iso-8859-1;iso_8859::str2sgml;iso8859.pl
</CharsetConverters>>
----------------------^
Bogus '>'s.


Say we have this mail header:
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

This kind of MIME declaration is often used to send the 8-bit German
characters through 7-bit gateways.

In this message we have the word:

G=81nstiger
---^^^
This is *NOT* defined by the iso-8859-1 character set.  As
I mentioned in my previous message, there is a range of characters
that are not defined (0x80 - 0x9F).

Since this falls in that range, what gets displayed is system
dependent.  In iso-8859-1, the u-umlauts are 0xDC (capital) and 0xFC
(lower).  Hence, the above should be:

    G=FCnstiger

Whatever composed the message is using the wrong character value.


A program that converts qp back to 8-bit chars can output a 0x81 or a
0xfc for "=81".

NO.  All a qp decoder does is translate the escape sequences to the raw
characters.  That's it.  Any other processing that may occur is based
upon content-type, and sometimes (incorrect) system-specific
goofiness.
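
To make the separation concrete, here is a minimal sketch in Python (not
MHonArc's own Perl code; the function name decode_body is made up for
illustration).  Step one undoes the transfer encoding and yields raw
octets; step two is a separate decision that interprets those octets
using whatever charset the Content-Type header names.

```python
import quopri

def decode_body(payload: bytes, charset: str) -> str:
    # Step 1: transfer decoding. "=81" becomes the octet 0x81, nothing more.
    octets = quopri.decodestring(payload)
    # Step 2: charset interpretation, driven only by the Content-Type label.
    return octets.decode(charset)

# The qp decoder alone never turns 0x81 into 0xFC:
print(quopri.decodestring(b"G=81nstiger"))

# A correctly labeled message and the mislabeled one only agree once the
# mislabeled one is read with the charset it was really composed in.
same = decode_body(b"G=FCnstiger", "iso-8859-1") == decode_body(b"G=81nstiger", "cp437")
print(same)
```

The point of the split is that quopri.decodestring takes no charset
argument at all; it has no way to know or care what the octets mean.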

0x81 represents the U-umlaut in ibm codepage 437 (dos) and 0xfc is its
position in iso-8859-1.

Correct.  So if iso-8859-1 is the labeled character set in the
content-type, then 0xfc should have been used.
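
For reference, the positions of the German letters in both character
sets can be listed with a few lines of Python.  This is only a sketch;
octet_in is a hypothetical helper name, and the tables come from
Python's bundled "cp437" and "iso-8859-1" codecs.

```python
def octet_in(char: str, charset: str) -> int:
    """Return the single octet that encodes `char` in `charset`."""
    encoded = char.encode(charset)
    if len(encoded) != 1:
        raise ValueError("expected a single-byte charset")
    return encoded[0]

# a-, o-, u-umlaut (lower and capital) and sharp s, written as escapes
letters = "\u00e4\u00c4\u00f6\u00d6\u00fc\u00dc\u00df"

for char in letters:
    print(ascii(char),
          "cp437:", hex(octet_in(char, "cp437")),
          "iso-8859-1:", hex(octet_in(char, "iso-8859-1")))
```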

What you expect is some "magical" conversion.  How is one to know what
the original character set was when it got mislabeled in the message
header?  Another thing you should realize is that character sets do NOT
equal languages.  iso-8859-1 is used to encode characters of several
languages.  So, although one can apply a heuristic to convert 0x81 to
0xfc if it is somehow known that German is the effective language, the
heuristic cannot be used in the general case of iso-8859-1 data.
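
A small Python sketch shows how far a guess can go.  Octets in
0x80 - 0x9F are undefined in iso-8859-1, so their presence hints at a
mislabeled message, but an octet outside that range gives no hint at
all.  The helper name undefined_latin1_octets is invented for this
example.

```python
def undefined_latin1_octets(data: bytes) -> list:
    """Collect octets that iso-8859-1 leaves undefined (0x80 - 0x9F)."""
    return [octet for octet in data if 0x80 <= octet <= 0x9F]

# 0x81 is suspicious: iso-8859-1 assigns no printable character to it.
print(undefined_latin1_octets(b"G\x81nstiger"))

# 0xE1 is the sharp s in codepage 437 but a-acute in iso-8859-1, so a
# mislabeled "Strasse" with sharp s looks like perfectly valid Latin-1.
print(undefined_latin1_octets(b"Stra\xe1e"))
```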

So when we run the word G=81nstiger through mhonarc it changes =81 to
0x81. This is Ascii 129. The browsers ignore this char, because it is
within the unused part of the 8th bit. Remember, 0x80 through 0x9f is
not used in iso-8859-1. The word is garbled.

I stated this in my previous message.

The str2sgml routine also does not process this 0x81. It remains
unchanged and results in missing characters in a word.

Let's summarize:
We have for example =81 and we need &uuml; in the html file.

There are two ways to get this.

1. We can change the qp conversion, so that it outputs a 0xfc when it
gets =81. This 0xfc then will be htmlized correctly by str2sgml to its
entity.

No.  Quoted-printable *IS INDEPENDENT* of content-type.  Changing
how qp decoding is done is incorrect and may cause data to be
corrupted (i.e., where =81 is meant to be 0x81).


2. We can possibly extend the table in iso8859.pl with chars

   0x84, 0x8e, 0x94, 0x99, 0x81, 0x9a and 0xe1

Maybe.  Probably better to create a new similar conversion function
and register it.  This way your changes will not get overwritten with
future upgrades.
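
The logic of such a work-around converter might look like the following
Python sketch.  It is NOT the iso8859.pl code and not MHonArc's
converter interface (a real converter would be a Perl function
registered via CharsetConverters); the name str2sgml_cp437_fallback is
hypothetical.  The assumption is that any octet in the undefined
0x80 - 0x9F range was really composed in codepage 437, while everything
else is taken as labeled.

```python
from html import escape
from html.entities import codepoint2name

def str2sgml_cp437_fallback(data: bytes) -> str:
    """Convert octets labeled iso-8859-1 to HTML entities, treating the
    undefined range 0x80 - 0x9F as codepage 437 (a guess, see above)."""
    out = []
    for octet in data:
        if octet < 0x80:
            # Plain ASCII: only escape HTML specials such as < and &.
            out.append(escape(chr(octet)))
            continue
        charset = "cp437" if octet <= 0x9F else "iso-8859-1"
        code = ord(bytes([octet]).decode(charset))
        name = codepoint2name.get(code)
        # Use a named entity when one exists, else a numeric reference.
        out.append("&%s;" % name if name else "&#%d;" % code)
    return "".join(out)

print(str2sgml_cp437_fallback(b"G\x81nstiger"))  # mislabeled cp437 input
print(str2sgml_cp437_fallback(b"G\xfcnstiger"))  # genuine iso-8859-1 input
print(str2sgml_cp437_fallback(b"Stra\xe1e"))     # 0xE1 stays Latin-1 a-acute
```

Note that 0xE1 from the list above cannot be handled this way: it is a
defined iso-8859-1 character, so remapping it would corrupt correctly
labeled messages.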

It should then htmlize these also correctly. I have tried this with no
success. But maybe I did something wrong. I am absolutely not familiar
with perl.

The =81 in this example becomes a 0x81, but the 0x81 does not become a
&uuml; as it should.

Should not.  The problem is with the MUA that composed the message.
Apparently, the message was composed using the ibm character set
and it incorrectly labeled the message with the iso-8859-1 character
set.

I hope you understand my problem now. I cannot believe that other
Germans have solved it, because there seems to be no way to do so except
by modifying the code. Therefore I wanted to know, where exactly qp
conversion is done.

NOT QP.  What you are looking for is a fix for broken MUAs.  You
can either get people to fix their MUAs, or develop a custom
charset converter that tries to work around the problem as mentioned
above.

BTW, if you still believe it is a QP issue, I suggest moving the
discussion to the comp.mail.mime newsgroup (where you will find that it
is not a QP issue, but a problem with an MUA mislabeling the applicable
character set for the message).  Also, review the MIME RFCs if you are
still not convinced.  Note the following quotes from RFC 2045
to help you out:

    The transformation part of any Content-Transfer-Encodings
    specifies, either explicitly or implicitly, a single, well-defined
    decoding algorithm, which for any sequence of encoded octets either
    transforms it to the original sequence of octets which was encoded,
    or shows that it is illegal as an encoded sequence.
--> Content-Transfer-Encodings transformations never depend on any
--> additional external profile information for proper operation.  Note
--> that while decoders must produce a single, well-defined output for
    a valid encoding no such restrictions exist for encoders: Encoding
    a given sequence of octets to different, equivalent encoded
    sequences is perfectly legal.

And,

    The quoted-printable and base64 encodings transform their input
--> from an arbitrary domain into material in the "7bit" range, thus
    making it safe to carry over restricted transports. The specific
    definition of the transformations are given below.

And,

    NOTE: The five values defined for the Content-Transfer-Encoding
--> field imply nothing about the media type other than the algorithm
--> by which it was encoded or the transport system requirements if
    unencoded.


The first quote above implies that it is the MUA that should have
translated the 0x81 to 0xFC since it planned to label the text/plain
data with a charset of iso-8859-1, but the data was composed in the
IBM charset.
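
In other words, the sending side should transcode first and only then
apply the transfer encoding.  A rough Python sketch of that order of
operations (compose_latin1_qp is an invented name, and codepage 437 as
the composing charset is the assumption from this thread):

```python
import quopri

def compose_latin1_qp(composed: bytes, source_charset: str = "cp437") -> bytes:
    # 1. Interpret the octets in the charset they were really typed in.
    text = composed.decode(source_charset)
    # 2. Re-encode into the charset the header is going to announce.
    latin1 = text.encode("iso-8859-1")
    # 3. Only now make the result 7-bit safe with quoted-printable.
    return quopri.encodestring(latin1)

body = compose_latin1_qp(b"G\x81nstiger")
print(body)
```

With that order, a receiver that simply qp-decodes and trusts the
iso-8859-1 label gets 0xFC, with no guessing required.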

        --ewh

----
             Earl Hood              | University of California: Irvine
      ehood(_at_)medusa(_dot_)acs(_dot_)uci(_dot_)edu      |      Electronic Loiterer
http://www.oac.uci.edu/indiv/ehood/ | Dabbler of SGML/WWW/Perl/MIME
