Re: RFC: Japanese Text Conversion and other language issues

On December 3, 2002 at 13:16, "Takashi P.KATOH" wrote:

From: Earl Hood <earl(_at_)earlhood(_dot_)com>
Subject: RFC: Japanese Text Conversion and other language issues
Date: Sat, 30 Nov 2002 22:23:13 -0600

    Should MHonArc::CharEnt replace iso2022jp.pl as the default
    CHARSETCONVERTER for iso-2022-jp text?


I prefer not to replace it.
The reasons are:

(1) I think Namazu cannot handle Unicode character entity
    references as-is, so changing the default might
    confuse MHonArc+Namazu users.
    (In fact, this statement is not quite accurate. I'll give
     more details later in this mail.)


Thanks for looking into it.  I think Unicode character entity
references are important for all languages.  For example, it is quite
common to use Unicode and/or numeric character entity references
for latin-based languages.  HTML 4.0 only defines a small set of
named entities.

(2) Human unreadable (i.e., poor maintainability)
    Imagine if `Hello' were written as
    `&#x48;&#x65;&#x6c;&#x6c;&#x6f;'.
    You might say `The files generated by MHonArc don't need
    to be viewed except via web browsers'.
    Nevertheless, it is also true that sometimes I needed to
    see them for maintenance.


I understand the need to view the raw HTML, but I think this is
an issue with a select few, and only admin/tech types.  Your comment
would also apply if all data is in UTF-8 (unless of course you
have access to a UTF-8-aware editor/viewer).

ASCII text will be left as-is, but you are correct that Japanese
characters will all be represented as &#xHHHH;, making it hard
to read the raw data.
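
To make the trade-off concrete, here is a minimal Python sketch of the
idea (an illustration only, not MHonArc::CharEnt itself; the function
name `to_char_refs` is hypothetical). It decodes iso-2022-jp bytes,
leaves ASCII untouched, and writes every other character as a
hexadecimal character reference. Escaping of HTML specials such as `<`
and `&` is deliberately left out.

```python
def to_char_refs(raw, charset="iso2022_jp"):
    """Decode raw bytes, then emit non-ASCII characters as &#xHHHH;.

    Hypothetical helper for illustration; ASCII passes through unchanged
    and HTML special characters are NOT escaped here.
    """
    text = raw.decode(charset)
    return "".join(
        ch if ord(ch) < 128 else "&#x{:X};".format(ord(ch))
        for ch in text
    )


if __name__ == "__main__":
    sample = "Hello, 日本語".encode("iso2022_jp")
    # ASCII stays readable; each Japanese character becomes a reference.
    print(to_char_refs(sample))
```

The readable part of a mostly-ASCII message survives, but a mostly
Japanese message becomes a run of references, which is the
maintainability cost raised in point (2).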

(3) Some software cannot read it.
    This is also concerning maintainability.


Yep, but it may be a hit that needs to be taken in order to
solve charset soup.

BTW, can you provide some real-world example software (besides Namazu)?

MHonArc::CharEnt tries to map everything to HTML entity references,
allowing for the ability of multiple languages to co-exist.


Yes, this is a great (and admirable) advantage.
But, fortunately or unfortunately, we have few messages in
which multiple languages co-exist.


I agree that for many locales, archives tend to contain messages
of that locale.  However, I'm also trying to consider users that
run large archives of multiple lists comprised of multiple languages.

I recognized another advantage of using entity
references: we can use Kanji characters in the rc file.
For example, we might want to write `Next' in Japanese like
this:

<NextButton chop>
[<a href="$MSG(NEXT)$">ESC-$-B < ! ESC-(-B</a> ($MSG(NEXT)$)]
</NextButton>

but this does not work (second resource variable won't be
expanded) because `$' is included in Kanji.
(This example is somewhat contrived because I needed a
 resource variable AFTER Kanji.)


Have you tried using the VARREGEX resource to minimize rc file
conflicts?
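
For reference, the conflict in the quoted example can be reproduced
with a small Python sketch. It is not MHonArc code. It assumes a
variable pattern of the form `\$([^$]*)\$` (my understanding of the
default VARREGEX, stated here as an assumption), and it uses the
character 次 as a stand-in for the Kanji in the example. The
iso-2022-jp escape sequence that switches to Kanji contains a literal
`$`, which pairs up with the opening `$` of the following variable.

```python
import re

# Assumed variable pattern: text between two "$" characters.
VAR_PATTERN = re.compile(r"\$([^$]*)\$")


def find_variables(template):
    """Return the variable names the assumed pattern would pick up."""
    return VAR_PATTERN.findall(template)


# Raw iso-2022-jp bytes, viewed one byte per character (latin-1) so the
# "$" inside the escape sequence ESC $ B is visible to the regex.
raw_kanji = "次".encode("iso2022_jp").decode("latin-1")
raw_template = '[<a href="$MSG(NEXT)$">' + raw_kanji + '</a> ($MSG(NEXT)$)]'

# Same template with the Kanji written as a character reference instead.
entity_kanji = "&#x{:X};".format(ord("次"))
entity_template = '[<a href="$MSG(NEXT)$">' + entity_kanji + '</a> ($MSG(NEXT)$)]'

if __name__ == "__main__":
    print(find_variables(raw_template))     # second variable is not found intact
    print(find_variables(entity_template))  # both variables are found
```

With the raw bytes, the second match starts at the `$` inside the
escape sequence, so the trailing `$MSG(NEXT)$` is never seen as a
variable. Either a stricter VARREGEX or a character reference in the
rc file avoids the stray `$`.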

I've not checked yet, but I think we can use Kanji
characters in the rc file if we use MHonArc::CharEnt.
(I don't know if we need to write them as entity references,
though.)


MHonArc::CharEnt and rc files are independent.  I.e. MHonArc::CharEnt
knows nothing about processing rc files and vice-versa.  Hence, you
could use character entity references in your rc files and still use
iso2022jp.pl for converting message text.

Finally, I should tell you that this is my personal
opinion, and I don't know what other Japanese users think.


All opinions count, and I appreciate your response.  I have no real
problem leaving iso2022jp.pl as the default.  I'll just have to add
something in the docs about it and that MHonArc::CharEnt can be used
if desired.  Something I can add to the release notes.

I'm planning to post Earl's RFC on my web page (in
Japanese) to ask for other users' opinions.


Thanks.  Make sure to note that iso2022jp.pl is NOT going away.
Hence, users can specify explicitly if they want to make sure
it is used.


On a related note, I'm going to see about adding a new resource called
MSGTXTENCODE that would give users the ability to pre-convert all
message text entities to the specified encoding (an idea suggested
by Moofie).  The resource will only work if Unicode::MapUTF8 or
the Encode module is installed, which means only those using Perl >=
5.6 will be able to use the feature.

The reason for the resource is that other text/* types can introduce
foreign character encodings (e.g. text/html).  Therefore, a user
would still not be able to completely control the character encoding
of archive pages (unless they only allow text/plain messages).
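
As a rough illustration of the pre-conversion idea (a Python sketch
under my own assumptions, not the eventual MSGTXTENCODE
implementation; `reencode` and its parameters are hypothetical names):
decode the text from whatever charset it arrived in, then re-encode it
into the single target encoding. Characters the target cannot
represent fall back to numeric character references.

```python
def reencode(raw, source_charset, target_charset):
    """Convert raw bytes from source_charset to target_charset.

    Hypothetical helper for illustration. Characters that do not exist
    in the target encoding are written as decimal character references
    (Python's "xmlcharrefreplace" error handler).
    """
    text = raw.decode(source_charset)
    return text.encode(target_charset, errors="xmlcharrefreplace")


if __name__ == "__main__":
    jis = "日本語".encode("iso2022_jp")
    # Everything lands in one encoding, regardless of the source charset.
    print(reencode(jis, "iso2022_jp", "utf-8"))
    # A narrow target still works, at the cost of character references.
    print(reencode(jis, "iso2022_jp", "ascii"))
```

The point of the sketch is only that every part of a message, whatever
its declared charset, ends up in one encoding chosen by the archive
owner.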

However, please wait for a few days because I have a bad
cold now...


Hope you get better,

--ewh
