[Encode/ISO-2022] KR is done. CN to go.

Jungshik,

First, thank you so much (as much as the number of code points for all Korean charsets combined!) for submitting a patch so quickly. It applied cleanly. I am now hopeful that 1.00 will be shipped in the next 24 hours. Coincidentally, it is 09:00 JST, meaning 00:00 Zulu.


On Thursday, March 28, 2002, at 08:01 , Jungshik Shin wrote:

  Yeah, that's a common mistake made by (Japanese) programmers when
they didn't bother to read RFC 1557 (or Ken Lunde's book) :-)

I wonder where that 2022.enc came from. Has nobody touched *.euc that came from Tcl? If so, NI-S, you should issue a warning to the Tcl/Tk community!

  You're very welcome :-) But 2022-kr.enc is not used any more,
is it? I patched lib/Encode/KR/2022_KR.pm instead. It's not perfect for
encoding, but nobody really needs it any more...  ISO-2022-KR decoder
is still of use because there are some old emails floating around in
people's mailboxes and some (outdated) programs still generate it.
(this is why Mozilla has ISO-2022-KR decoder but doesn't have the encoder)
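
To show why the decoding side is the easy half, here is a minimal sketch in Python (not the Perl code in lib/Encode/KR/2022_KR.pm). It assumes the RFC 1557 layout: a one-time ESC $ ) C header, then SO/SI to switch between KS C 5601 and ASCII. The function name decode_iso2022kr is made up for illustration.

```python
ESC_HEADER = b"\x1b$)C"   # designates KS C 5601 to G1 (RFC 1557)
SO, SI = 0x0E, 0x0F       # shift out to KS C 5601 / shift in to ASCII


def decode_iso2022kr(data: bytes) -> str:
    """Decode ISO-2022-KR by rewriting it as EUC-KR first (hypothetical helper)."""
    data = data.replace(ESC_HEADER, b"")
    euc = bytearray()
    shifted = False
    for byte in data:
        if byte == SO:
            shifted = True
        elif byte == SI:
            shifted = False
        elif shifted and 0x21 <= byte <= 0x7E:
            euc.append(byte | 0x80)   # 7-bit KS C 5601 byte -> EUC-KR byte
        else:
            euc.append(byte)          # plain ASCII passes through unchanged
    return euc.decode("euc_kr")


sample = b"\x1b$)CHello \x0eGQ1[\x0f!"
print(decode_iso2022kr(sample))   # prints "Hello 한글!"
```

Only one double-byte set is ever designated, so the decoder needs no more state than a single shifted flag.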

For ISO-2022 in general, the decoder is easy but the encoder is a pain in the neck. The problem is where to insert the correct escape sequence, and in order to do so you have to know which character set you want to designate. And as we know very well, a Unicode character by itself tells nothing of its origin. This is the very reason that I am abstaining from implementing ISO-2022-JP-2, which has to handle JIS X 0208, 0212, GB2312, and KS C 5601 together. Decoding to UTF-8 is easy via EUC-X (I have JP, KR, and CN already). But to encode back from UTF-8, you need to somehow tell which charset each character belongs to, and character unification makes that impossible. At the very least, round-trip is impossible. You need a database of which charsets have a code point for a given Unicode character, give precedence to the charsets, and pick one accordingly. Since this is JP-2 we are talking about, I would try JIS X first, then GB, then KS C or something like that....

Fortunately (at least for me; (in)?famous morta-san may think otherwise) ISO-2022-JP-2 is not prevalent yet, but a quick glance at Google finds several remarks on =?ISO-2022-JP2?b..., obviously from ML archives. So it is still in use, unlike ISO-2022-KR.
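
The precedence idea above can be sketched as follows. This is only an illustration in Python, not anything in Encode: it uses Python's euc_jp, gb2312, and euc_kr codecs as a stand-in for the "database", leaves out JIS X 0212 for brevity, and the names PRECEDENCE, in_charset, and pick_charset are made up.

```python
# (label, Python codec), in the order JIS X first, then GB, then KS C
PRECEDENCE = [
    ("JIS X 0208", "euc_jp"),
    ("GB2312", "gb2312"),
    ("KS C 5601", "euc_kr"),
]


def in_charset(ch: str, codec: str) -> bool:
    """True if ch has a code point in the 94x94 set behind this EUC codec."""
    try:
        raw = ch.encode(codec)
    except UnicodeEncodeError:
        return False
    # Exactly two bytes in 0xA1-0xFE; this skips EUC-JP's three-byte
    # JIS X 0212 form and its 0x8E half-width kana form.
    return len(raw) == 2 and all(0xA1 <= b <= 0xFE for b in raw)


def pick_charset(ch: str):
    """Return the first charset in PRECEDENCE that can represent ch."""
    if ord(ch) < 0x80:
        return "ASCII"
    for label, codec in PRECEDENCE:
        if in_charset(ch, codec):
            return label
    return None   # not representable in any of the three sets


for ch in "A中们한":
    print(ch, pick_charset(ch))
```

Note what happens to a unified ideograph such as 中: it is designated as JIS X 0208 even if it originally arrived in a GB2312 or KS C 5601 run, which is exactly why the round-trip cannot be guaranteed.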

  ISO-2022-KR is very rarely used these days. It MUST NOT be
used for outgoing messages any more. However, the decoder is still handy
to have (see above.)

You capped MUST NOT. Not even *deprecated*. Is this de facto or de jure?

   One (rather drastic) way to reduce the number of spam mails
is to just filter out email messages with MIME charset 'ks_c_5601-1987'
and C-T 'text/html'.

Well, a moderate number of spams are okay to me; I even enjoy them sometimes and they were useful in the course of forging Encode :)

 Spammers are much more likely to use non-standard
and broken mail programs than non-spammers (at least in Korea).

Glad to hear that. What is the socially accepted way to include Korean messages in a MIME header? Is =?euc-kr?b... good enough? Or do you guys prefer quoted-printable? Or is Korea so far into the future that =?UTF-8?b= is the standard :?
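
For concreteness, here is what the base64 ("B") form of such a header word looks like, as a minimal Python sketch. The helper encoded_word is made up for illustration; it builds a single RFC 2047 encoded-word by hand and does not handle the 75-character limit or folding. decode_header is the standard library's parser.

```python
import base64
from email.header import decode_header


def encoded_word(text: str, charset: str, codec: str) -> str:
    """Build one RFC 2047 'B' encoded-word (hypothetical helper, no folding)."""
    payload = base64.b64encode(text.encode(codec)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


word = encoded_word("한글", "euc-kr", "euc_kr")
raw, charset = decode_header(word)[0]   # -> (payload bytes, charset name)
print(word)
print(raw.decode(charset))
```

The same helper produces the UTF-8 variant with encoded_word("한글", "UTF-8", "utf-8"); only the charset label and the payload bytes differ.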

  In case of ISO-2022-KR, you could have used 'ksc5601-raw' just like
HZ.pm uses 'gb2312-raw'.  That's not the case in ISO-2022-CN encoding,
though. For ISO-2022-CN decoding, I believe you can still go without
mock encoding but can use cns11643p1 and cns11643p2 along with
gb2312-raw.

Right. So far as decoding to UTF-8 is concerned, you don't need EUC so long as you have raw encodings. Maybe I was too obsessed with the idea of bidirectionality. I feel more relieved now with your words. But I still feel somewhat arrogant to leave the door half-open, or in this particular case, a trap door. If I were a die-hard Unicode activist, I would have made only decode() available and forced UTF-8 for all output :)

  Has ISO-2022-CN ever been used for email exchanges? The lead
engineer of Pine dev. team at U. of Washington (whose name is escaping
me at the moment) and one of the authors of RFC 1922 once wrote that he
had received a handful of emails in ISO-2022-CN, but I have yet
to receive a single message in ISO-2022-CN with both GB2312
and CNS 11643-[12].


  I have no idea either.  Let's wait for Autrijus on this....

  Alternative way to deal with it at the moment is just support US-ASCII
and GB2312.

That I have done. But it is too well documented in the RFC, and there is no such thing as ISO-2022-CN-0, or a watered-down version thereof.

Pls, take a look at my patch for ISO-2022-KR and modify it as you see fit.
(I haven't set up my perl-testing env. yet so that I didn't test it).

I have. Another welcome thing is test data. See t/*.euc and t/*.ref. t/(JP|KR).t does a round-trip matching test to see if it is okay.


  Anyway, Kamsahamnida!

Dan the Encode Maintainer