Re: RFC: Japanese Text Conversion and other language issues

From: Earl Hood <earl(_at_)earlhood(_dot_)com>
Subject: RFC: Japanese Text Conversion and other language issues
Date: Sat, 30 Nov 2002 22:23:13 -0600

    Should MHonArc::CharEnt replace iso2022.pl as the default
    CHARSETCONVERTER for iso-2022-jp text?


I prefer not to replace it.
The reasons are:

(1) I think Namazu cannot handle Unicode character entity
    references as they are, so changing the default might
    confuse MHonArc+Namazu users.
    (In fact, this statement is not quite accurate. I'll
     give more details later in this mail.)

(2) Not human-readable (i.e., poor maintainability)
    Imagine if `Hello' were written as
    `&#x48;&#x65;&#x6c;&#x6c;&#x6f;'.
    You might say `The files generated by MHonArc don't need
    to be viewed except via web browsers'.
    Nevertheless, I have sometimes needed to read them
    directly for maintenance.

(3) Some software cannot read it.
    This also affects maintainability.

(4) File size is bigger.
    This might be a trivial problem.
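
To make points (2) and (4) concrete, here is a minimal Python
sketch. It is only an illustration: the helper name
to_char_refs is made up for this example, and the hexadecimal
form is assumed from the `Hello' example in (2); it is not
taken from MHonArc::CharEnt itself.

```python
def to_char_refs(text: str) -> str:
    """Rewrite every character as a hexadecimal character reference.

    Hypothetical helper for illustration only; it is not part of
    MHonArc, and a real converter may leave ASCII untouched.
    """
    return "".join(f"&#x{ord(ch):x};" for ch in text)


# Point (2): even plain ASCII becomes hard to read.
hello = to_char_refs("Hello")
print(hello)

# Point (4): compare sizes for a short Japanese phrase
# (written with \u escapes to keep this listing ASCII).
phrase = "\u6b21\u306e\u30e1\u30c3\u30bb\u30fc\u30b8"
as_jis = phrase.encode("iso2022_jp")            # escape sequences + 2 bytes per character
as_refs = to_char_refs(phrase).encode("ascii")  # 8 bytes per character here
print(len(as_jis), len(as_refs))
```

In this sketch every Kanji or Kana costs 8 bytes as a
reference, against 2 bytes (plus the surrounding escape
sequences) in iso-2022-jp.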

    MHonArc::CharEnt tries to map everything to HTML entity references,
    allowing for the ability of multiple languages to co-exist.


Yes, this is a great (and admirable) advantage.
But, fortunately or unfortunately, we have few messages in
which multiple languages co-exist.

I noticed another advantage of using entity references: we
can use Kanji characters in the rc file.
For example, we might want to write `Next' in Japanese like
this:

<NextButton chop>
[<a href="$MSG(NEXT)$">ESC-$-B < ! ESC-(-B</a> ($MSG(NEXT)$)]
</NextButton>

but this does not work (the second resource variable won't
be expanded) because a `$' is included in the escape
sequence that introduces the Kanji.
(This example is somewhat contrived because I needed a
 resource variable AFTER the Kanji.)

I have not checked yet, but I think we can use Kanji
characters in the rc file if we use MHonArc::CharEnt.
(I don't know whether we would need to write them as entity
references, though.)
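
The `$' collision can be reproduced with a small Python
sketch. Note the assumptions: the regular expression is only
a simplified model of how `$...$' resource variables might be
scanned, not MHonArc's actual parser, and find_vars is a name
invented for this example. It says nothing about whether
MHonArc really accepts entity references in the rc file,
which is still unchecked.

```python
import re

# Simplified model: a variable is whatever sits between two `$' bytes.
VAR = re.compile(rb"\$([^$]*)\$")


def find_vars(line: bytes) -> list[bytes]:
    """Return the names a naive `$...$' scanner would pick out."""
    return VAR.findall(line)


kanji = "\u6b21"  # the Kanji used for `Next' in the rc example
jis = kanji.encode("iso2022_jp")             # ESC $ B < ! ESC ( B
ref = f"&#x{ord(kanji):x};".encode("ascii")  # &#x6b21;

template = b'[<a href="$MSG(NEXT)$">%s</a> ($MSG(NEXT)$)]'

# Raw iso-2022-jp: the `$' of ESC $ B pairs with the `$' that should
# open the second variable, so the second $MSG(NEXT)$ is lost.
print(find_vars(template % jis))

# Entity reference: no stray `$', so both variables are found.
print(find_vars(template % ref))
```

Under this model the raw iso-2022-jp line yields one real
variable plus a bogus one, while the entity-reference line
yields MSG(NEXT) twice.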

    A related question that impacts mharc: Does Namazu handle Unicode
    character entity references?


I think not.
More precisely (at least for Japanese), it seems that what
must handle Unicode character entity references is nkf,
because Namazu invokes nkf internally to convert the Kanji
code of the text.
And nkf does not support Unicode character entity
references.
In other words, I think Namazu will be able to handle
Unicode character entity references once nkf supports them.
Anyway, I need to investigate this further.
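
For illustration only, here is a Python sketch of what
`handling' such references would involve before indexing:
turning each reference back into a character and then into a
Kanji code. This is not something nkf or Namazu does;
refs_to_euc_jp is a hypothetical helper, and EUC-JP is chosen
here only as an example target encoding.

```python
import html


def refs_to_euc_jp(text: str) -> bytes:
    """Decode character references, then encode the result as EUC-JP.

    Hypothetical preprocessing step for illustration; it is not part
    of nkf, Namazu, or MHonArc. Note that html.unescape also decodes
    named references such as &amp;.
    """
    decoded = html.unescape(text)  # "&#x6b21;" becomes a single Kanji
    return decoded.encode("euc_jp")


sample = "&#x6b21; (next)"
print(refs_to_euc_jp(sample))
```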



Finally, I should tell you that these are my personal
opinions, and I don't know what other Japanese users think.

I'm planning to post Earl's RFC on my web page (in
Japanese) to ask for other users' opinions.
However, please wait a few days because I have a bad
cold now...

-- 
Takashi P.KATOH
