RFC: Character sets in MHonArc (was private Re: MHonArc DOC )

1999-03-26 14:40:11
[Courtesy cc to MHonArc mailing list]

On March 26, 1999 at 15:58, "Alexander Voropay" wrote:

 P.S. Could you include the Russian charset "koi8-r" in the supported
charset list in CHARSETCONVERTERS by default? You can read more about
"koi8-r" at .

Someone may need to provide a charset converter for MHonArc to have it
included.  For example, iso-2022-jp support was contributed.

Note, the converter may be trivial depending on the characteristics of
koi8-r, but I know nothing about it.  Plus, it will probably be hard
for me to do something myself since I could not verify that what I am
doing is right (if I knew Russian, it might not be a problem).  However,
I am willing to help anyone who is familiar with koi8-r get a converter
written for MHonArc.

Looking at the site you gave, it appears koi8-r is 8-bit, and the 7-bit
characters coincide with US-ASCII.  Maybe the mhonarc::htmlize routine
will suffice as a base converter.
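
To make the idea concrete, here is a minimal sketch of what such a
converter has to do.  It is written in Python purely for illustration;
it is not MHonArc code, and the function name koi8r_to_html is
hypothetical.  It assumes the input really is koi8-r: the 7-bit range
only needs the usual HTML escaping, and the 8-bit range is emitted as
numeric character references so the output stays plain ASCII.

```python
import html


def koi8r_to_html(raw: bytes) -> str:
    """Convert koi8-r bytes to ASCII-only HTML text (illustrative sketch)."""
    # Map the koi8-r bytes to Unicode characters; bytes below 0x80
    # decode to the same characters as US-ASCII.
    text = raw.decode("koi8_r")
    # Escape the HTML specials first, so that the "&" of the character
    # references added in the next step is not escaped again.
    escaped = html.escape(text, quote=False)
    # Replace every non-ASCII character with a numeric character reference.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # Six koi8-r bytes (a Russian word) followed by ASCII markup characters.
    print(koi8r_to_html(b"\xf0\xd2\xc9\xd7\xc5\xd4 <b>"))
```

Because the output is pure ASCII, it does not depend on whatever
charset the surrounding page declares.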

BTW, a potential problem with charsets in general is that HTML is not
good about supporting mixed charsets within a document.  For example,
with respect to MIME, I can have multiple charset specifications in a
single message.  However, it appears that HTML only supports a global
charset specification for the entire document.  The CHARSET attribute
(as defined in HTML 4.0) is only used in elements that refer to external
entities and not on a per-element basis.  For example, the following is
not possible:

<p charset="koi8-r">Some Russian text here ...</p>
<p charset="iso-8859-2">Latin 2 text here ...</p>
<p charset="iso-2022-jp">Japanese text here ...</p>

I guess one will run into problems dealing with encoding issues,
i.e., a charset specifies how a given character is represented, but
does not deal with encoding.  I guess if a document adheres to a single
8-bit encoding scheme throughout, conflicts may not be a problem.
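
A small illustration of why mixing is a problem, again in Python and
only as a sketch (the helper name interpretations is hypothetical): the
same 8-bit value stands for different characters depending on which
charset the reader assumes, so a page with a single global charset
cannot render both correctly.

```python
def interpretations(raw: bytes, charsets):
    """Decode the same bytes under each charset and return the results."""
    return {name: raw.decode(name) for name in charsets}


if __name__ == "__main__":
    # One byte, two assumed charsets, two different characters.
    for name, text in interpretations(b"\xc1", ["koi8_r", "iso8859_2"]).items():
        print(name, "U+%04X" % ord(text))
```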

In summary, a problem arises when one has something like the following:


In a single message header.

Unicode is a potential solution, but I am unsure of the WWW client
support for Unicode (and my technical knowledge of Unicode is limited).
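
As a rough sketch of what the Unicode route could look like (Python for
illustration only, not MHonArc code; the function name
merge_to_utf8_html and the sample parts are hypothetical, and it
assumes clients accept a page declared as UTF-8): each piece of text is
decoded from its own charset and everything is written out in one
encoding, so a single global charset declaration is enough.

```python
import html


def merge_to_utf8_html(parts):
    """Build one UTF-8 HTML fragment from (charset, bytes) pairs."""
    paragraphs = []
    for charset, raw in parts:
        # Decode each part with the charset it was labeled with.
        text = raw.decode(charset)
        paragraphs.append("<p>" + html.escape(text, quote=False) + "</p>")
    meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    # One encoding for the whole page, whatever the sources were.
    return (meta + "\n" + "\n".join(paragraphs)).encode("utf-8")


# Sample input.  The bytes are produced by encoding here only to keep
# the example self-contained; a real archive would get them from mail.
SAMPLE_PARTS = [
    ("koi8-r", "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")),
    ("iso-8859-2", "\u0142\u00f3d\u017a".encode("iso-8859-2")),
    ("iso-2022-jp", "\u65e5\u672c\u8a9e".encode("iso-2022-jp")),
]

if __name__ == "__main__":
    print(merge_to_utf8_html(SAMPLE_PARTS).decode("utf-8"))
```

Whether this is practical depends on the client support question
raised above.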

Comments?  Especially from Japanese-based users of MHonArc?

Are any users setting the <META http-equiv="Content-Type"
content="text/html; charset=XXXX"> in their MHonArc generated pages?
Or specifying a particular charset through the HTTP server?


             Earl Hood              | University of California: Irvine
      ehood(_at_)medusa(_dot_)acs(_dot_)uci(_dot_)edu      |      Electronic Loiterer
                                    | Dabbler of SGML/WWW/Perl/MIME
