Jungshik,
First, thank you so much (as much as the number of code points for all
Korean charsets combined!) for submitting a patch so quickly. It was
applied hairlessly.
I am now hopeful that 1.00 will be shipped in the next 24 hours.
Coincidentally, it is 09:00 JST, meaning 00:00 Zulu.
On Thursday, March 28, 2002, at 08:01 , Jungshik Shin wrote:
Yeah, that's a common mistake made by (Japanese) programmers when
they didn't bother to read RFC 1557 (or Ken Lunde's book) :-)
I wonder where that 2022.enc came from. Has nobody touched *.euc that
came from Tcl? If so, NI-S, you should issue a warning to the Tcl/Tk
community!
You're very welcome :-) But 2022-kr.enc is not used any more,
is it? I patched lib/Encode/KR/2022_KR.pm instead. It's not perfect for
encoding, but nobody really needs it any more... An ISO-2022-KR decoder
is still of use because there are some old emails floating around in
people's mailboxes and some (outdated) programs still generate it.
(This is why Mozilla has an ISO-2022-KR decoder but doesn't have the
encoder.)
For ISO-2022 in general, the decoder is easy but the encoder is a pain in
the neck. The problem is where to insert the correct escape sequence, and
in order to do so you have to know which character set you want to
designate. And as we know very well, a Unicode character by itself tells
nothing of its origin. This is the very reason that I am abstaining
from implementing ISO-2022-JP-2, which has to handle JIS X 0208, 0212,
GB2312, and KS C 5601 together. Decoding to UTF-8 is easy via EUC-X (I
have JP, KR, and CN already). But to encode back from UTF-8, you need to
somehow tell which charset the character belongs to, and Character
Unification makes that impossible. At the very least, round-trip is
impossible. You need a database of which charsets, if any, have a code
point for a given Unicode character, then give the charsets an order of
precedence and pick one accordingly. Since this is JP-2 we are talking
about, I would try JIS X first, then GB, then KS C or something like that....
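As a rough illustration of the precedence idea above, here is a minimal Python sketch (not Encode code). It assumes Python's stock euc_jp, gb2312 and euc_kr codecs as stand-ins for the JIS X 0208/0212, GB 2312 and KS C 5601 tables; the names PRECEDENCE, candidate_charsets and pick_charset are made up for this example, and it is only meant for non-ASCII characters.

```python
# Hypothetical sketch: choose a charset for one Unicode character by
# trying tables in a fixed order of precedence (JIS X, then GB, then KS C).
# Python's EUC codecs are used here only as convenient lookup tables.
PRECEDENCE = [
    ("JIS X", "euc_jp"),       # covers JIS X 0208 and JIS X 0212
    ("GB 2312", "gb2312"),
    ("KS C 5601", "euc_kr"),
]


def candidate_charsets(ch):
    """Return every charset (in precedence order) that has a code point for ch."""
    found = []
    for name, codec in PRECEDENCE:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue
        found.append(name)
    return found


def pick_charset(ch):
    """Return the highest-precedence charset containing ch, or None."""
    candidates = candidate_charsets(ch)
    return candidates[0] if candidates else None
```

A unified ideograph such as 中 turns up in more than one table, so the precedence list alone decides which escape sequence gets emitted, whatever charset the text originally came from. That is exactly why the round trip cannot be guaranteed.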
Fortunately (at least for me; (in)?famous morta-san may think otherwise)
ISO-2022-JP-2 is not prevalent yet, but a quick glance at Google finds
several references to =?ISO-2022-JP2?b..., obviously from ML archives. So
it is still in use, unlike ISO-2022-KR.
ISO-2022-KR is very rarely used these days. It MUST NOT be
used for outgoing messages any more. However, the decoder is still handy
to have (see above.)
You capped MUST NOT. Not even *deprecated*. Is this de facto or de
jure?
One (rather drastic) way to reduce the number of spam mails
is to just filter out email messages with MIME charset 'ks_c_5601-1987'
and C-T 'text/html'.
Well, a moderate number of spams are okay with me; I even enjoy them
sometimes and they were useful in the course of forging Encode :)
Spammers are much more likely to use non-standard
and broken mail programs than non-spammers (at least in Korea).
Glad to hear that. What is the socially accepted way to include
Korean messages in a MIME header? Is =?euc-kr?b... good enough? Or do you
guys prefer quoted-printable? Or is Korea so far into the future that
=?UTF-8?b= is the standard :?
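For reference, the encoded-word forms asked about here can be produced and read back with Python's standard email.header module. This is only a sketch for illustration and has nothing to do with Encode itself; the helper names encoded_word and decoded_text are made up, and Python chooses between the B and Q encodings on its own.

```python
from email.header import Header, decode_header


def encoded_word(text, charset):
    """Return text as an RFC 2047 encoded-word in the given MIME charset."""
    return Header(text, charset).encode()


def decoded_text(header_value):
    """Decode an RFC 2047 header value back to a Unicode string."""
    parts = []
    for raw, charset in decode_header(header_value):
        if isinstance(raw, bytes):
            parts.append(raw.decode(charset or "ascii"))
        else:
            parts.append(raw)
    return "".join(parts)


# The same Korean text in the two charsets discussed above.
euc_kr_form = encoded_word("한글", "euc-kr")
utf8_form = encoded_word("한글", "utf-8")
```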
In case of ISO-2022-KR, you could have used 'ksc5601-raw' just like
HZ.pm uses 'gb2312-raw'. That's not the case in ISO-2022-CN encoding,
though. For ISO-2022-CN decoding, I believe you can still go without
mock encoding but can use cns11643p1 and cns11643p2 along with
gb2312-raw.
Right. So far as decoding to UTF-8 is concerned, you don't need EUC
so long as you have raw encodings. Maybe I was too obsessed with an
idea of bidirectionality. I feel more relieved now with your words.
But I still feel somewhat arrogant to leave the door half-open, or in
this particular case, a trap door.
If I were a die-hard Unicode activist, I would have made only decode()
available and coerce UTF-8 for all output :)
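To make the "raw versus EUC" point concrete, here is a minimal Python sketch of decoding ISO-2022-KR by way of EUC-KR. It is not the code in 2022_KR.pm. It assumes the RFC 1557 framing (the ESC $ ) C designator, then SO and SI to shift in and out of KS C 5601) and uses Python's euc_kr codec as the lookup table; the function names are made up for this example.

```python
# Hypothetical sketch: ISO-2022-KR carries KS C 5601 as 7-bit ("raw") byte
# pairs between SO and SI. Setting the high bit on those bytes yields
# EUC-KR, which can then be decoded with an ordinary table.
DESIGNATOR = b"\x1b$)C"   # ESC $ ) C : designate KS C 5601 to G1
SO, SI = 0x0E, 0x0F       # shift out (to KS C 5601) / shift in (to ASCII)


def iso2022kr_to_euckr(data):
    """Convert ISO-2022-KR bytes to EUC-KR bytes."""
    data = data.replace(DESIGNATOR, b"")
    out = bytearray()
    shifted = False
    for byte in data:
        if byte == SO:
            shifted = True
        elif byte == SI:
            shifted = False
        elif shifted and 0x21 <= byte <= 0x7E:
            out.append(byte | 0x80)   # raw KS C 5601 byte -> EUC-KR byte
        else:
            out.append(byte)          # plain ASCII passes through
    return bytes(out)


def decode_iso2022kr(data):
    """Decode ISO-2022-KR bytes to a Unicode string via EUC-KR."""
    return iso2022kr_to_euckr(data).decode("euc_kr")
```

Because ISO-2022-KR has only one multibyte charset to shift into, this direction needs no guessing. The trouble described earlier only starts when the encoder has several charsets to choose from.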
Has ISO-2022-CN ever been used for email exchanges? The lead
engineer of Pine dev. team at U. of Washington (whose name is escaping
me at the moment) and one of the authors of RFC 1922 once wrote that he
had received a handful of emails in ISO-2022-CN, but I have yet
to receive a single message in ISO-2022-CN with both GB2312
and CNS 11643-[12].
I have no idea either. Let's wait for Autrijus on this....
An alternative way to deal with it at the moment is just to support
US-ASCII and GB2312.
That I have done. But it is too well documented in the RFC, and there is
no such thing as ISO-2022-CN-0, or a stripped-down version thereof.
Pls, take a look at my patch for ISO-2022-KR and modify it as you see
fit.
(I haven't set up my perl-testing env. yet, so I didn't test it.)
I have. Another welcome thing is test data. See t/*.euc and
t/*.ref. t/(JP|KR).t does a round-trip matching test to see if it is
okay.
Anyway, Kamsahamnida!
Dan the Encode Maintainer