perl-unicode

[Encode/ISO-2022] KR is done. CN to go.

2002-03-27 17:30:42
Jungshik,

First, thank you so much (as much as the number of code points for all Korean charsets combined!) for submitting a patch so quickly. It applied cleanly. I am now hopeful that 1.00 will ship in the next 24 hours. Coincidentally, it is 09:00 JST, meaning 00:00 Zulu.

On Thursday, March 28, 2002, at 08:01 , Jungshik Shin wrote:
  Yeah, that's a common mistake made by (Japanese) programmers when
they didn't bother to read RFC 1557 (or Ken Lunde's book) :-)

I wonder where that 2022.enc came from. Has nobody touched the *.euc that came from Tcl? If so, NI-S, you should issue a warning to the Tcl/Tk community!

  You're very welcome :-) But 2022-kr.enc is not used any more,
is it? I patched lib/Encode/KR/2022_KR.pm instead. It's not perfect for
encoding, but nobody really needs it any more...  ISO-2022-KR decoder
is still of use because there are some old emails floating around in
people's mailboxes and some (outdated) programs still generate it.
(this is why Mozilla has ISO-2022-KR decoder but doesn't have the encoder)
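
To illustrate why the decoder is the easy direction, here is a minimal sketch in Python (not the Encode code). It assumes the RFC 1557 layout: an ESC $ ) C designator, then SO/SI to shift into and out of KS C 5601. The name decode_iso2022kr is made up for this example, and routing through EUC-KR is just one possible approach.

```python
ESC_DESIGNATOR = b"\x1b$)C"   # RFC 1557: designate KS C 5601 to G1
SO, SI = 0x0E, 0x0F           # shift out (to KS C 5601) / shift in (to ASCII)

def decode_iso2022kr(data: bytes) -> str:
    """Hypothetical helper: rewrite ISO-2022-KR as EUC-KR, then decode."""
    data = data.replace(ESC_DESIGNATOR, b"")
    euc = bytearray()
    shifted = False                      # True while inside SO ... SI
    for byte in data:
        if byte == SO:
            shifted = True
        elif byte == SI:
            shifted = False
        elif shifted and 0x21 <= byte <= 0x7E:
            euc.append(byte | 0x80)      # 7-bit KS C 5601 byte -> EUC-KR byte
        else:
            euc.append(byte)             # plain ASCII passes through
    return euc.decode("euc_kr")

# Python's own iso2022_kr codec is used here only to produce sample input.
sample = "Hi 한글!".encode("iso2022_kr")
print(decode_iso2022kr(sample))
```

The decoder only has to track one bit of state. The encoder has to decide when to emit the shifts, which is where the trouble starts.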

For ISO-2022 in general, the decoder is easy but the encoder is a pain in the neck. The problem is where to insert the correct escape sequence, and to do so you have to know which character set you want to designate. As we know very well, a Unicode character by itself tells you nothing of its origin. This is the very reason I am abstaining from implementing ISO-2022-JP-2, which has to handle JIS X 0208, JIS X 0212, GB2312, and KS C 5601 together. Decoding to UTF-8 is easy via EUC-X (I have JP, KR, and CN already). But to encode back from UTF-8, you need to somehow tell which charset each character belongs to, and Han unification makes that impossible. At the very least, round-trip is impossible. You need a database of which charsets have a code point for a given Unicode character, give precedence to the charsets, and pick one accordingly. Since this is JP-2 we are talking about, I would try JIS X first, then GB, then KS C, or something like that....

Fortunately (at least for me; (in)?famous morta-san may think otherwise) ISO-2022-JP-2 is not prevalent yet, but a quick glance at Google finds several occurrences of =?ISO-2022-JP2?b..., obviously from ML archives. So it is still in use, unlike ISO-2022-KR.
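
The charset-precedence idea can be sketched roughly as follows. This is Python rather than Encode, and it makes some assumptions: Python's euc_jp, gb2312 and euc_kr codecs stand in for "has a code point in JIS X / GB2312 / KS C 5601", and the names candidates and pick_charset are invented for the example.

```python
# Assumed precedence for a JP-2 style encoder: JIS X first, then GB, then KS C.
PRECEDENCE = ["euc_jp", "gb2312", "euc_kr"]

def candidates(ch: str) -> list:
    """Return every charset (in precedence order) that can represent ch."""
    found = []
    for codec in PRECEDENCE:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue
        found.append(codec)
    return found

def pick_charset(ch: str):
    """Pick the highest-precedence charset, or None if nothing fits."""
    found = candidates(ch)
    return found[0] if found else None

print(candidates("中"))    # a unified ideograph: all three charsets claim it
print(pick_charset("한"))  # a Hangul syllable: only the Korean charset has it
```

Because a unified ideograph is claimed by all three charsets, the choice made here need not match the charset the text originally came from, which is why round-trip cannot be guaranteed.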

  ISO-2022-KR is very rarely used these days. It MUST NOT be
used for outgoing messages any more. However, the decoder is still handy
to have (see above.)

You capped MUST NOT. Not even *deprecated*. Is this de facto or de jure?

   One (rather drastic) way to reduce the number of spam mails
is to just filter out email messages with MIME charset 'ks_c_5601-1987'
and C-T 'text/html'.

Well, a moderate amount of spam is okay with me; I even enjoy it sometimes, and it was useful in the course of forging Encode :)

 Spammers are much more likely to use non-standard
and broken mail programs than non-spammers (at least in Korea).

Glad to hear that. What is the socially accepted way to include Korean text in MIME headers? Is =?euc-kr?b... good enough? Or do you guys prefer quoted-printable? Or is Korea so far into the future that =?UTF-8?b... is the standard? :?
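
For reference, the two base64 forms asked about look like this. It is a small Python sketch, not Encode: encoded_word and decode_word are hypothetical helpers, and the RFC 2047 length limit on encoded-words is ignored.

```python
import base64
from email.header import decode_header

def encoded_word(text: str, charset: str) -> str:
    """Build a single RFC 2047 'b' (base64) encoded-word by hand."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?b?{payload}?="

def decode_word(word: str) -> str:
    """Decode one encoded-word back to text using the stdlib parser."""
    raw, charset = decode_header(word)[0]
    return raw.decode(charset)

for charset in ("euc-kr", "utf-8"):
    word = encoded_word("한글", charset)
    print(word, "->", decode_word(word))
```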

  In case of ISO-2022-KR, you could have used 'ksc5601-raw' just like
HZ.pm uses 'gb2312-raw'.  That's not the case in ISO-2022-CN encoding,
though. For ISO-2022-CN decoding, I believe you can still go without
mock encoding but can use cns11643p1 and cns11643p2 along with gb2312-raw.

Right. As far as decoding to UTF-8 is concerned, you don't need EUC so long as you have the raw encodings. Maybe I was too obsessed with the idea of bidirectionality. Your words are a relief. But I still feel somewhat arrogant leaving the door half-open, or in this particular case, a trap door. If I were a die-hard Unicode activist, I would have made only decode() available and forced UTF-8 for all output :)
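
The relation between a raw (7-bit) encoding and its EUC form is just the high bit, which a short Python sketch can show using HZ. Python has no codec named gb2312-raw, so the helpers euc_to_raw and raw_to_euc below are hypothetical stand-ins, and they only make sense for runs of double-byte characters with no ASCII mixed in.

```python
def euc_to_raw(euc: bytes) -> bytes:
    """Clear the high bit: EUC-CN bytes -> raw 7-bit GB2312 bytes."""
    return bytes(b & 0x7F for b in euc)

def raw_to_euc(raw: bytes) -> bytes:
    """Set the high bit: raw 7-bit GB2312 bytes -> EUC-CN bytes."""
    return bytes(b | 0x80 for b in raw)

raw = euc_to_raw("中文".encode("gb2312"))
print(raw)                                # the raw form is printable ASCII

# HZ simply wraps the raw bytes between ~{ and ~}.
hz = b"~{" + raw + b"~}"
print(hz.decode("hz"))

# Going the other way needs only the high bit set again.
print(raw_to_euc(raw).decode("gb2312"))
```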

  Has ISO-2022-CN ever been used for email exchanges? The lead
engineer of Pine dev. team at U. of Washington (whose name is escaping
me at the moment) and one of the authors of RFC 1922 once wrote that he
had received a handful of emails in ISO-2022-CN, but I have yet
to receive a single message in ISO-2022-CN with both GB2312
and CNS 11643-[12].

  I have no idea either.  Let's wait for Autrijus on this....

  Alternative way to deal with it at the moment is just support US-ASCII
and GB2312.

That I have done. But it is too well documented in the RFC, and there is no such thing as ISO-2022-CN-0, or a stripped-down version thereof.

  Pls, take a look at my patch for ISO-2022-KR and modify it as you see fit.
(I haven't set up my perl-testing env. yet so that I didn't test it).

I have. Another welcome thing is test data. See t/*.euc and t/*.ref. t/(JP|KR).t does a round-trip matching test to see if it is okay.

  Anyway, Kamsahamnida!

Dan the Encode Maintainer
