Re: RFC: Japanese Text Conversion and other language issues

After some thought, I'm inclined to change the default converter for
iso-2022-jp to MHonArc::CharEnt::str2sgml.

Reason: By default, MHonArc should be as locale-neutral as possible.
The iso2022jp.pl filter is specific to a particular locale.  For the
same reason, I will also change the default converter for iso-8859-1
to MHonArc::CharEnt::str2sgml.  Using mhonarc::htmlize assumes a
Latin-1-based locale, since only HTML specials are converted.
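
To illustrate the point about converting only HTML specials, here is a
rough Python sketch.  It is not MHonArc's Perl code; the function name
escape_specials_only is made up for this example, and I am assuming
"HTML specials" means &, < and >.  It shows that an 8-bit byte passes
through untouched, so what the reader sees depends on the charset
assumed on the reading side:

```python
def escape_specials_only(raw: bytes) -> bytes:
    """Escape &, < and > and pass every other byte through unchanged."""
    # '&' must be handled first so the entities added below are not re-escaped.
    return (raw.replace(b"&", b"&amp;")
               .replace(b"<", b"&lt;")
               .replace(b">", b"&gt;"))

# One 8-bit byte (0xE9) followed by an HTML special.
escaped = escape_specials_only(b"\xe9 < 1")

# The high byte is still raw, so the reader's assumed charset decides
# which character it is.
as_latin1 = escaped.decode("iso8859_1")    # first char is U+00E9
as_cyrillic = escaped.decode("iso8859_5")  # first char is U+0449
print(as_latin1 != as_cyrillic)            # True
```

A numeric character entity reference does not have this ambiguity,
which is the motivation for using str2sgml as the default.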

iso2022jp.pl will still be available.  I will add a note under the
"Compatibility Notes" section of the release notes about the change.
The wording will be as follows:


  UPGRADING FROM v2.5.x OR EARLIER: Default iso-2022-jp Converter Changed

  In v2.6, the default charset converter for iso-2022-jp has changed to
  the following:

  <CharsetConverters>
  iso-2022-jp; MHonArc::CharEnt::str2sgml; MHonArc/CharEnt.pm
  </CharsetConverters>

  This filter converts all Japanese characters into Unicode character
  entity references (e.g. &#x7279;), removing the iso-2022-jp encoding.
  For some Japanese locales, this type of conversion may not be desired,
  since some Japanese-aware processing tools may not support Unicode
  character entity references.  If you want to preserve the iso-2022-jp
  encoding, you must explicitly specify the use of
  iso_2022_jp::str2html via the CHARSETCONVERTERS resource as follows:

  <CharsetConverters>
  iso-2022-jp; iso_2022_jp::str2html; iso2022jp.pl
  </CharsetConverters>

  The default converter for iso-2022-jp was changed to
  MHonArc::CharEnt::str2sgml to make MHonArc as locale-neutral as
  possible in its default configuration.

  For more information about using MHonArc in a Japanese locale, see
  (documents in Japanese):
  <http://www.shiratori.riec.tohoku.ac.jp/~p-katoh/Hack/Docs/mhonarc-jp/
   index.html>
  <http://www.shiratori.riec.tohoku.ac.jp/~p-katoh/Hack/Docs/mhonarc-jp/
   mhonarc-jp-2_4.html>
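
For anyone unsure what the conversion described in the note amounts to,
here is a minimal Python sketch.  It is not how
MHonArc::CharEnt::str2sgml is implemented (that is Perl); to_char_refs
is a hypothetical name, and the sketch relies on Python's built-in
iso2022_jp codec to do the decoding:

```python
def to_char_refs(raw: bytes, charset: str = "iso2022_jp") -> str:
    """Decode raw bytes and return ASCII-only text.

    HTML specials become named entities and every non-ASCII character
    becomes a hexadecimal numeric character reference.
    """
    specials = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}
    out = []
    for ch in raw.decode(charset):
        if ch in specials:
            out.append(specials[ch])
        elif ord(ch) > 0x7F:
            out.append(f"&#x{ord(ch):X};")
        else:
            out.append(ch)
    return "".join(out)

# U+7279 encoded as iso-2022-jp (escape sequences included).
sample = "\u7279".encode("iso2022_jp")
print(to_char_refs(sample))  # &#x7279;
```

The output no longer contains any iso-2022-jp escape sequences, which
is why tools that expect the original encoding may not handle it.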


I figure there will be some objections to the change, but the
principle of locale neutrality is important IMO.  Remember, this
is only the default setting.  Users in other locales who want to
avoid Unicode character entity references will likewise have to
change CHARSETCONVERTERS.  For 8-bit character sets,
mhonarc::htmlize can be used.

BTW, I plan to document the various charset converter functions
available in the CHARSETCONVERTERS resource page, in a manner similar
to how MIMEFILTERS documents the various filters that are available.

Feedback is welcome.  v2.6 is still some time away, so there is
time to provide counterarguments to my decision.

--ewh
