On August 12, 1999 at 18:18, "Peter Seitz jun." wrote:
I am archiving a German-language discussion group, and there are lots
of umlauts in these mails.
I'd like to convert the umlauts into entities so that they can be
read correctly on various platforms (Windows, Macintosh). I was not
able to find out what I have to put into my resource files.
Can someone please help?
Sure. The answer depends on whether you are dealing
with message header data or message body data.
Header:
CHARSETCONVERTERS are invoked when non-ASCII extension encoding
is encountered in message headers, i.e. the =?...?.?...?=
stuff. If the umlauts are encoded that way, you can
get the effect you want.
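To make the =?...?.?...?= form concrete, here is a minimal Python sketch that decodes such an encoded header with the standard library. It is not MHonArc code; the function name decode_encoded_words and the sample subject are made up for illustration.

```python
from email.header import decode_header

def decode_encoded_words(raw_header):
    """Decode =?charset?encoding?text?= words in a header into a str."""
    parts = []
    for data, charset in decode_header(raw_header):
        if isinstance(data, bytes):
            # Chunks without a charset label are treated as ASCII here
            # (an assumption of this sketch).
            parts.append(data.decode(charset or "ascii", errors="replace"))
        else:
            # Headers without any encoded words come back as plain str.
            parts.append(data)
    return "".join(parts)

# Hypothetical subject line containing two 8-bit iso-8859-1 characters.
subject = decode_encoded_words("=?iso-8859-1?Q?Gr=FC=DFe?=")
```

Only headers encoded this way carry a charset label, which is why a converter can be selected for them.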
By default, MHonArc converts 8-bit characters into entity
references, with the exception of iso-8859-1 character data.
The reason is that most browsers default to iso-8859-1.
To change this, put something like the following in your
resource file:
<CharsetConverters>
iso-8859-1; iso_8859::str2sgml; iso8859.pl
</CharsetConverters>
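As a rough illustration of what converting 8-bit iso-8859-1 data to entity references means, here is a small Python sketch. It is not the implementation of iso_8859::str2sgml; the function name latin1_to_entities is hypothetical, and the entity names come from Python's own HTML entity table.

```python
from html.entities import codepoint2name

def latin1_to_entities(raw):
    """Replace each 8-bit iso-8859-1 byte with an entity reference."""
    out = []
    for byte in raw:
        if byte < 0x80:
            # 7-bit ASCII passes through unchanged. Escaping of &, < and >
            # is deliberately left out of this sketch.
            out.append(chr(byte))
        elif byte in codepoint2name:
            # iso-8859-1 byte values equal their Unicode code points,
            # so the byte can be looked up directly.
            out.append("&%s;" % codepoint2name[byte])
        else:
            # No named entity available: use a numeric reference.
            out.append("&#%d;" % byte)
    return "".join(out)

converted = latin1_to_entities(b"Gr\xfc\xdfe")
```

The result contains only ASCII, so it displays the same regardless of the platform's default character set.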
If you have a non-encoded (raw) 8-bit character in the message
header, MHonArc keeps it as-is. Forcing a conversion to
an entity reference would require code changes to MHonArc
itself.
Body:
You'll have to tweak the text/plain filter to call
iso_8859::str2sgml when iso-8859-1 character data is
specified (it is already invoked for iso-8859-[2-10]). You will
probably also want it to call iso_8859::str2sgml by default if
you know there are messages that contain 8-bit characters but
do not specify a charset parameter in the Content-Type
field.
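The selection logic described above (pick a converter by charset, and fall back to a default when no charset parameter is given) could look roughly like the following Python sketch. It does not reflect the actual text/plain filter; CONVERTERS, DEFAULT_CHARSET, to_entities and convert_body are hypothetical names, and the choice of iso-8859-1 as the fallback is an assumption.

```python
from email import message_from_bytes

def to_entities(raw):
    # Minimal stand-in converter: numeric references for 8-bit bytes.
    return "".join(chr(b) if b < 0x80 else "&#%d;" % b for b in raw)

# Hypothetical charset-to-converter mapping.
CONVERTERS = {"iso-8859-1": to_entities}

# Assumed fallback for messages with no charset parameter.
DEFAULT_CHARSET = "iso-8859-1"

def convert_body(raw_message):
    """Convert a single-part message body using a per-charset converter."""
    msg = message_from_bytes(raw_message)
    # get_content_charset() returns None when Content-Type has no charset.
    charset = msg.get_content_charset() or DEFAULT_CHARSET
    payload = msg.get_payload(decode=True)
    converter = CONVERTERS.get(charset)
    if converter is None:
        # No converter registered for this charset: keep the ASCII part only.
        return payload.decode("ascii", errors="replace")
    return converter(payload)
```

Keeping the mapping in a table rather than in the filter's code is what makes it possible to change the behaviour from a resource file.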
I should probably modify the text/plain filter to use
the functions specified in CHARSETCONVERTERS instead
of having a hard-coded mapping. Currently, CHARSETCONVERTERS is
only checked for "-decode-" settings.
Note that using iso_8859::str2sgml does incur a performance
penalty. See
<http://www.xray.mpe.mpg.de/mailing-lists/mhonarc/1998-02/msg00083.html>
(message-id
<199802210058.QAA05071@medusa.acs.uci.edu>) for
more information.
--ewh