mharc-users

Re: Polish characters in Namazu

2003-04-15 09:46:32
On April 15, 2003 at 14:40, Bartosz Feński wrote:

By the way, the actual article in the mharc-users list seems to use
quoted-printable encoding. Namazu couldn't handle such data.
So the point is that Namazu can't work with ISO-8859-2 characters?

The quoted-printable is irrelevant.  Namazu is indexing the mhonarc
message pages, so the quoted-printable data would have been decoded
by mhonarc before namazu indexes the file.
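
As a rough illustration of that decoding step (a Python sketch, not
mhonarc's own Perl code; the sample word is made up for the example),
quoted-printable spells each 8-bit byte as =XX, and decoding restores
the raw ISO-8859-2 bytes before any indexer sees them:

```python
import quopri

# Hypothetical sample: the Polish word "język" as it might appear in a
# quoted-printable message body declared as charset iso-8859-2.
encoded = b"j=EAzyk"

# Undo the transfer encoding; the result is raw 8-bit ISO-8859-2 data.
raw = quopri.decodestring(encoded)

# Interpret the bytes in the declared charset to get the actual text.
text = raw.decode("iso-8859-2")
```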

And character entity references in HTML files are also supported by
Namazu. I think the problem is in this case.
Is there any way to fix it?
I've got the locale set to pl_PL (ISO-8859-2).
This is the only hint I've found in the Namazu documentation.

A simple experiment suggests that namazu is 8-bit charset agnostic.
Looking at the Perl filters for Namazu, however, shows some potential
problems with numeric character entity references, like &#281;,
latin small letter e with ogonek (the code point 0xEA in ISO-8859-2
maps to the Unicode code point 0x119).

In Namazu's html.pl filter, the routine decode_numbered_entity() does
not appear to support numeric entities greater than 127.  Therefore,
something like &#281; gets mapped to the empty string.

The routine could (should?) be changed to allow values up to decimal
255, but in this case, it will not help since 0x119 is 281 decimal,
making it greater than an 8-bit value.
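
The effect can be sketched in Python.  This is only a model of the
behaviour described above, not Namazu's actual Perl code: the function
name is borrowed from html.pl, while the limit parameter and the sample
string are assumptions made for the illustration.

```python
import re

def decode_numbered_entity(text, limit=127):
    """Decode &#N; references up to `limit`; drop anything larger."""
    def replace(match):
        code = int(match.group(1))
        # Assumed behaviour: out-of-range entities become the empty string.
        return chr(code) if code <= limit else ""
    return re.sub(r"&#(\d+);", replace, text)

sample = "j&#281;zyk"                                # "język" with an entity
current = decode_numbered_entity(sample)             # limit of 127
raised = decode_numbered_entity(sample, limit=255)   # limit raised to 255

# 281 (0x119) is above both limits, so the letter is lost either way.
```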

Therefore, you could configure mhonarc so that it does not convert 8-bit
iso-8859-2 characters into entity references, which makes iso-8859-2 the
default character set of the archive pages.  For example:

<CharsetConverters>
iso-8859-2; mhonarc::htmlize
</CharsetConverters>

If you do this, you should change the IDXPGBEGIN, TIDXPGBEGIN, and
MSGPGBEGIN resources to include the following:

  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">

This lets browsers know that iso-8859-2 is the default document
character set.
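
To see why the declaration matters, here is a small Python sketch (an
illustration only, independent of mhonarc) of how the same 8-bit byte
reads under two different charset assumptions:

```python
# The byte that would now pass through unconverted for "ę".
raw = b"\xea"

# Intended reading: latin small letter e with ogonek (U+0119).
as_latin2 = raw.decode("iso-8859-2")

# Wrong guess: latin small letter e with circumflex (U+00EA).
as_latin1 = raw.decode("iso-8859-1")

code_points = (hex(ord(as_latin2)), hex(ord(as_latin1)))
```

A browser that falls back to iso-8859-1 when no charset is declared
would display the second form.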

--ewh
