perl-unicode

Re: [Encode/ISO-2022] KR is done. CN to go.

2002-03-27 18:42:34
On Thu, 28 Mar 2002, Dan Kogai wrote:

 Dan,

   First, thank you so much (as much as the number of code points for all 
Korean charsets combined!) for submitting a patch so quickly.  It was 
applied hairlessly.

  Oh, that's too much more than I really deserve :-)

   I am now hopeful that 1.00 will be shipped in the next 24 hours.  
Coincidentally, it is 09:00 JST, meaning 00:00 Zulu.

   I'd rather say UTC instead of Zulu :-) 

On Thursday, March 28, 2002, at 08:01 , Jungshik Shin wrote:
  Yeah, that's a common mistake made by (Japanese) programmers when
they didn't bother to read RFC 1557 (or Ken Lunde's book) :-)

   I wonder where that 2022.enc came from.  Has nobody touched *.euc that 
came from Tcl?  If so, NI-S, you should issue a warning to Tcl/Tk 
community!

  That'll be great. NI-S, could you alert the Tcl/Tk community to this
issue?  Although I understand why every programming language or dev. library
has to reinvent the wheel (for the sake of portability), it really is a
big headache to monitor every single one of them for potential mistakes
like this. Recently, I talked to the author of a popular web-bbs program
in Korea written in PHP. The status of multibyte encoding support in
PHP is at best primitive (as is usually the case, Japanese encodings
are more or less supported but not other East Asian encodings). It
can't even handle multibyte encodings that use the GL range for the
second or third octet (SJIS, Big5, GBK, CP949, etc.) because it began
by supporting only ISO-8859-1.
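
To make the GL-range point concrete, here is a minimal sketch in Python (used only for illustration; it is not PHP's or Encode's code). The helper name `has_gl_trail_byte` is hypothetical. It reports whether any character encodes to a multibyte sequence whose second or later octet falls below 0x80, which is what trips up software that treats every such octet as ASCII. It is only meaningful for stateless encodings.

```python
def has_gl_trail_byte(text: str, encoding: str) -> bool:
    """Return True if any character encodes to a multibyte sequence whose
    second or later octet is below 0x80 (so it looks like plain ASCII)."""
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) > 1 and any(b < 0x80 for b in encoded[1:]):
            return True
    return False


if __name__ == "__main__":
    # U+8868 in Shift_JIS ends in 0x5C, the ASCII backslash.
    print(has_gl_trail_byte("表", "shift_jis"))
    # EUC-KR keeps both octets of a KS X 1001 character in the GR range.
    print(has_gl_trail_byte("한글", "euc_kr"))
```

A program that scans such text byte by byte for quotes or backslashes will misread the trail octet, which is the class of bug described above.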


encoding, but nobody really needs it any more...  ISO-2022-KR decoder
is still of use because there are some old emails floating around in
people's mailboxes and some (outdated) programs still generate it.
(this is why Mozilla has ISO-2022-KR decoder but doesn't have the 
encoder)

  ISO-2022-KR is very rarely used these days. It MUST NOT be
used for outgoing messages any more. However, the decoder is still handy
to have (see above.)
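
As an illustration of the "decode only" stance, the following Python sketch builds a 7-bit ISO-2022-KR byte string by hand (designator, SO, the KS X 1001 octets with the high bit cleared, SI, as laid out in RFC 1557) and decodes it with Python's `iso2022_kr` codec. The function names are hypothetical, and the sample builder assumes every character is a two-octet KS X 1001 character.

```python
ESC_DESIGNATE_KSX1001 = b"\x1b$)C"  # designates KS X 1001 to G1
SO, SI = b"\x0e", b"\x0f"           # shift out to KS X 1001 / shift in to ASCII


def make_iso2022kr_sample(text: str) -> bytes:
    """Build a 7-bit ISO-2022-KR byte string by hand from EUC-KR bytes.

    Assumes every character of `text` is a two-octet KS X 1001 character.
    """
    seven_bit = bytes(b & 0x7F for b in text.encode("euc_kr"))
    return ESC_DESIGNATE_KSX1001 + SO + seven_bit + SI


def decode_legacy_korean_mail(raw: bytes) -> str:
    """Decode an old ISO-2022-KR body; there is deliberately no encoder."""
    return raw.decode("iso2022_kr")


if __name__ == "__main__":
    sample = make_iso2022kr_sample("한글")
    print(sample)
    print(decode_legacy_korean_mail(sample))
```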

   You capped MUST NOT.  Not even *deprecated*.  Is this de facto or de 
jure?

   
  Perhaps, de facto because we never revised RFC 1557 (see below).
However, it can be argued that there's no need because it's not
standards-track but just informational. Maybe I used too strong a
wording. Major mail programs have retained the ability to decode ISO-2022-KR.
However, most web mail services cannot handle ISO-2022-KR and that's
why ISO-2022-KR should never be used for outgoing emails.

  As for revising RFC 1557, we tried to draft a new RFC on Korean email
exchange around 1997 and 1998 because it's obvious that ISO-2022-KR(7bit)
had seen its day
and it's time for it to rest :-) with major email programs (MUA and
MTA) supporting MIME (base64/q-p) and 8BITMIME extension negotiation
mechanism. However, the effort got nowhere because people from Microsoft
insisted that we should give up EUC-KR in favor of ks_c_5601-1987 or
something similar. Most other people including Ken Lunde, Erik van der
Poel (the author of RFC 14xx for ISO-2022-JP), Frank Tang (Netscape),
Woohyung CHOI (the author of RFC 1557 for ISO-2022-KR), and Kyungseok GIM
(Korean representative to ISO/IEC JTC1/SC2/WG2 and JTC1/SC22/WG22)
pitched in and made their cases for EUC-KR. That debate even made a
couple of articles in major Korean newspapers and even a public hearing
was held in Seoul with an official from MoIC (Ministry of Information and
Communication). Anyway, the flaw of MS designation became crystal clear
when KSA changed the name of KS C 5601 to KS X 1001 and a tentative
conclusion was that we couldn't use ks_c_5601*. However, MS went on to
use it nonetheless. Now with the browser market completely dominated by
MS IE and the OS market still dominated by MS-Windows, 'ks_c_5601-1987'
is everywhere. Mozilla and Linux/Unix/Mac users still adhere to EUC-KR.

   One (rather drastic) way to reduce the number of spam mails
is to just filter out email messages with MIME charset 'ks_c_5601-1987'
and C-T 'text/html'.
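
The crude rule above can be sketched with Python's standard `email` package. This is only an illustration of the matching logic, not the procmail recipe mentioned in the thread. The function name `looks_like_suspect_mail` is hypothetical, and the sketch inspects only the top-level Content-Type header, not nested MIME parts.

```python
from email import message_from_string
from email.message import Message


def looks_like_suspect_mail(msg: Message) -> bool:
    """Match HTML mail whose MIME charset is labeled ks_c_5601-1987.

    get_content_charset() returns the charset in lower case, or None
    when the header carries no charset parameter.
    """
    return (
        msg.get_content_type() == "text/html"
        and msg.get_content_charset() == "ks_c_5601-1987"
    )


if __name__ == "__main__":
    raw = 'Content-Type: text/html; charset="ks_c_5601-1987"\n\n<html></html>\n'
    print(looks_like_suspect_mail(message_from_string(raw)))
```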

   Well, a moderate number of spams are okay to me;  I even enjoy them 
sometimes and they were useful in the course of forging Encode :)

 What I do is to use procmail to collect potential spams in a separate
folder (of course, I have a more fine-tuned filter than the above) and
drop in there from time to time to look for anything interesting.


 Spammers are much more likely to use non-standard
and broken mail programs than non-spammers (at least in Korea).

   Glad to hear that.  What is the socially accepted way to include 
Korean messages in MIME header?  =?euc-kr?b...  good enough?  Or do you 
guys prefer quoted-printable?  Or Korea is so much into the future and 
=?UTF-8?b= is the standard :?

   Right now, EUC-KR with B or Q encoding is widely used by
Linux/Unix/Mozilla/MacOS users and some web mail services.  (Last time I
checked the most popular web mail service in Korea did not RFC 2047-encode
message headers. I wrote them several times, but I gave up. It took me
several email messages to make them replace 'KST' with '+0900 (KST)'
in the Date: header).  However, I think we have to move on to UTF-8 as soon
as possible and that's a consensus among the Korean user community.
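
For reference, an RFC 2047 encoded-word with the B encoding can be sketched in Python as follows. The helper names `encode_word_b` and `decode_word` are hypothetical. The encoder builds a single encoded-word by hand and does not split long text to respect the 75-character limit on one encoded-word, so treat it as a sketch, not a mail-ready implementation.

```python
import base64
from email.header import decode_header


def encode_word_b(text: str, charset: str = "EUC-KR") -> str:
    """Build one RFC 2047 encoded-word using the B (base64) encoding."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


def decode_word(encoded: str) -> str:
    """Decode a header value made of encoded-words back to text."""
    parts = []
    for data, charset in decode_header(encoded):
        if isinstance(data, bytes):
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)


if __name__ == "__main__":
    for cs in ("EUC-KR", "UTF-8"):
        word = encode_word_b("한글", cs)
        print(word, "->", decode_word(word))
```

Switching from EUC-KR to UTF-8 only changes the charset label and the octets inside the payload; the encoded-word framing stays the same.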


   I have.  Another welcome thing is test data.  See t/*.euc and 
t/*.ref.  t/(JP|KR).t does a round-trip matching test to see if it is 
okay.
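
The idea of a round-trip test can be sketched in Python like this. It is not the actual t/(JP|KR).t code: it uses an inline sample string instead of the t/*.euc and t/*.ref files, and the names `SAMPLE`, `ENCODINGS`, and `round_trips` are hypothetical.

```python
SAMPLE = "한글 테스트 Encode 1.00"
ENCODINGS = ("euc_kr", "cp949", "iso2022_kr", "utf_8")


def round_trips(text: str, encoding: str) -> bool:
    """Encode then decode, and check that the original text comes back."""
    return text.encode(encoding).decode(encoding) == text


if __name__ == "__main__":
    for enc in ENCODINGS:
        print(enc, round_trips(SAMPLE, enc))
```

A round trip only shows that the encoder and decoder agree with each other; comparing against independent reference data, as the .ref files do, is what catches a mapping that is wrong in both directions.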

  That's nice. I'll try to build Perl dev-snapshot (I finally squeezed
out some disk space. Hmm, hard disk is cheap and I should buy a 100GB
disk....) and see Encode in action.

   Anyway, Kamsahamnida!

  Chonmaneyo !

   Jungshik
