Re: Converting special characters into entities.

On August 12, 1999 at 18:18, "Peter Seitz jun." wrote:

I am archiving a German language discussion group and there are lots 
of umlauts in these mails.

I'd like to convert the umlauts into entities so these umlauts can be 
read on various platforms (Windows, Macintosh) correctly. I was not 
able to find out what I have to put into my resource files.

Can someone please help?


Sure.  The answer differs depending on whether you are dealing
with message header data or message body data.

Header:
     CHARSETCONVERTERS are invoked when a non-ASCII extension encoding
     is encountered in message headers, that is, the =?...?.?...?=
     stuff.  If the umlauts are encoded this way, you can
     get the effect you want.

     By default MHonArc will convert 8-bit characters into entity
     references, with the exception of iso-8859-1 character data.
     The reason is that most browsers default to iso-8859-1.
     To change this, have something like the following in your
     resource file:

     <CharsetConverters>
     iso-8859-1;     iso_8859::str2sgml;     iso8859.pl
     </CharsetConverters>

     If you have a non-encoded (raw) 8-bit character in the message
     header, MHonArc keeps it as-is.  Forcing a conversion to
     an entity reference would require code changes to MHonArc
     itself.
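
     To illustrate the header case outside of MHonArc, here is a
     rough Python sketch of what happens conceptually: the encoded
     words are decoded using their declared charset, and every
     non-ASCII character is replaced with an entity reference.
     This is not MHonArc's code; the function names to_entities and
     header_to_entities are made up for this example, and only the
     Python standard library is assumed.

```python
from email.header import decode_header
from html.entities import codepoint2name


def to_entities(text):
    """Replace every non-ASCII character with an SGML/HTML entity reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                              # plain ASCII stays as-is
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")  # named entity, e.g. &uuml;
        else:
            out.append("&#%d;" % cp)                    # numeric fallback
    return "".join(out)


def header_to_entities(raw_header):
    """Decode =?charset?Q?...?= words in a header, then emit entities.

    Hypothetical helper for illustration only.
    """
    parts = []
    for chunk, charset in decode_header(raw_header):
        if isinstance(chunk, bytes):
            # Assumption: undeclared chunks are treated as ASCII.
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return to_entities("".join(parts))


print(header_to_entities("=?iso-8859-1?Q?M=FCller?="))  # prints M&uuml;ller
```

     Note that a raw 8-bit byte in a header carries no charset
     declaration, which is why the sketch, like MHonArc, has nothing
     reliable to go on in that case.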

Body:
     You'll have to tweak the text/plain filter to call
     iso_8859::str2sgml when iso-8859-1 character data is
     specified (it is already invoked for iso-8859-[2-10]).
     You will probably also want to call iso_8859::str2sgml by
     default if you know there are messages that contain 8-bit
     characters but do not specify a charset parameter in
     the Content-Type field.
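
     The body logic described above can be sketched in Python as
     follows.  This is only an illustration of the idea, not the
     text/plain filter itself: the function name body_to_entities
     is hypothetical, and the fallback to iso-8859-1 when no charset
     is given is the assumption discussed above, which is only safe
     if you know your archive's messages are in that charset.

```python
from html.entities import codepoint2name


def _entity(ch):
    """Return ch unchanged if ASCII, otherwise an entity reference."""
    cp = ord(ch)
    if cp < 128:
        return ch
    if cp in codepoint2name:
        return "&" + codepoint2name[cp] + ";"   # named entity, e.g. &ouml;
    return "&#%d;" % cp                         # numeric fallback


def body_to_entities(raw_body, charset=None):
    """Convert raw message body bytes to ASCII text with entity references.

    charset is the value of the Content-Type charset parameter, or None
    when the message did not specify one.
    """
    # Assumption: messages without a charset parameter are iso-8859-1.
    text = raw_body.decode(charset or "iso-8859-1", errors="replace")
    return "".join(_entity(ch) for ch in text)


print(body_to_entities(b"Sch\xf6ne Gr\xfc\xdfe"))  # prints Sch&ouml;ne Gr&uuml;&szlig;e
```

     The per-character loop also hints at why this kind of
     conversion costs more than passing the data through untouched.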

     I should probably modify the text/plain filter to use
     the functions specified in CHARSETCONVERTERS instead
     of having a hard-coded mapping.  Currently, the filter
     checks CHARSETCONVERTERS only for "-decode-" settings.

     Note that the use of iso_8859::str2sgml does incur a performance
     penalty.  See
     <http://www.xray.mpe.mpg.de/mailing-lists/mhonarc/1998-02/msg00083.html>
     (message-id <199802210058.QAA05071@medusa.acs.uci.edu>) for
     more information.

--ewh