On Thu, 4 Apr 2002, Dan Kogai wrote:
Konnichiha !
(hope I got this one right).
On Thursday, April 4, 2002, at 03:06 , Jungshik Shin wrote:
o The MIME name as defined in IETF RFCs.
UCS-2 ucs2, iso-10646-1 [IANA, et al]
UCS-2le
UTF-8 utf8 [RFC2279]
----------------------------------------------------------------
How about UCS-2BE? Of course, if UCS-2 is network byte order
(big endian), it's not necessary. In that case, you may alias UCS-2
to UCS-2BE.
And UCS2-NB (Network Byte order)? Unicode terminology is confusing
sometimes.
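Since the whole BE/LE/NB tangle comes down to byte order on the wire, here is a quick sketch of the difference (in Python rather than Perl, purely for illustration):

```python
# U+3042 (HIRAGANA LETTER A) serialized under the two byte orders.
ch = "\u3042"
be = ch.encode("utf-16-be")   # big-endian == network byte order
le = ch.encode("utf-16-le")   # little-endian
print(be.hex())  # 3042
print(le.hex())  # 4230
```

The "network byte order" alias would simply point at the big-endian form.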
I've checked http://www.unicode.org/glossary/ and it seems that the
canonical name / alias pairs should be as follows.
UCS-2 ucs2, iso-10646-1, utf-16be
UTF-16LE ucs2-le
UTF-8 utf8
I left UCS-2 as is because it is IANA-registered. UCS-2 is indeed the
name of an encoding, as the URL above clearly states. It is also less
confusing than UTF-16.
ucs2-le will be fixed.
IETF RFC 2781 also 'defines' (for IETF purposes) UTF-16LE, UTF-16BE, and
UTF-16. It's at http://www.faqs.org/rfcs/rfc2781.html among other places.
BTW, how does Encode deal with the BOM in UTF-16? It's trivial to add
a BOM at the beginning by hand (with perl), but you may consider
adding an option to add/remove the BOM automatically when converting
to/from UTF-16(LE|BE).
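Encode's Perl interface aside, the behavior being asked for is what a BOM-aware codec does versus a byte-order-specific one; sketched in Python for illustration:

```python
text = "abc"
with_bom = text.encode("utf-16")    # native byte order, BOM prepended
print(with_bom[:2] in (b"\xff\xfe", b"\xfe\xff"))  # True

# A BOM-aware decoder consumes the BOM...
print(with_bom.decode("utf-16"))    # abc

# ...while the byte-order-specific codecs leave any BOM in the
# stream as an ordinary U+FEFF character.
no_bom = text.encode("utf-16-le")
print(no_bom.decode("utf-16-le"))   # abc
```

An Encode option along these lines would strip a leading U+FEFF on decode and prepend one on encode.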
Could you please just say 'Encoding vs. Character Set'
and remove the parentheticals 'charset for short' or 'just charset' following
'character set'? I agree with your distinction between 'encoding' and
'character set', but what bothers me is that you treat 'charset'
as a synonym for 'character set'.
Now I agree. 'charset' is more appropriate for "coded character set",
and that was the MIME header's original intention. EUC is indeed a coded
character set, but charset=ISO-2022-(JP|KP|CN)(-\d+)? is absolutely
confusing -- it is a character encoding scheme at best. I am thinking
of adding a small glossary to this document, as follows.
And here is a glossary I manually parsed out of
http://www.unicode.org/glossary/ , right after the signature.
Thank you. BTW, you may also want to take a look at W3C's charmod
TR at http://www.w3.org/TR/charmod and the 'charset' part of the HTML4
spec at http://www.w3.org/TR/REC-html40/charset.html
In a strict sense, the concept of 'raw' or 'as-is' (which you
apparently use to mean a coded character set invoked on GL) is not
appropriate, because JIS X 0208, JIS X 0212 and KS X 1001 don't map
characters to their GL positions when enumerating characters in their
charts. The numeric IDs used in JIS X 0208, JIS X 0212 and KS X 1001
are row (ku) and column (ten) numbers, while GB 2312-80 appears to use GL
codepoints. That's why I prefer gb2312-gl and ksx1001-gl to gb2312-raw
and ksx1001-raw: 'gl' doesn't run the risk of being mistaken for row and
column numbers.
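The two numberings differ only by a fixed offset: ku/ten values of 1-94 map to GL bytes 0x21-0x7E. A sketch of the conversion (Python for illustration; the helper names are my own):

```python
def kuten_to_gl(ku, ten):
    """Row/column numbers (1-94 each) -> the pair of GL bytes (0x21-0x7E)."""
    assert 1 <= ku <= 94 and 1 <= ten <= 94
    return bytes((ku + 0x20, ten + 0x20))

def gl_to_kuten(pair):
    """The pair of GL bytes -> (ku, ten)."""
    hi, lo = pair
    return hi - 0x20, lo - 0x20

# JIS X 0208 ku-ten 16-01 sits at GL bytes 0x30 0x21.
print(kuten_to_gl(16, 1).hex())   # 3021
print(gl_to_kuten(b"\x30\x21"))   # (16, 1)
```

Which form is "canonical" is exactly the question raised below; the arithmetic is trivial either way.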
I wonder whether the ku-ten form is canonical or derived. JIS X 0208 was
clearly designed to be ISO-2022 compliant. Technically speaking,
0x21-0x7E should be the original and 1-94 the derived form, meant to make
decimal-oriented people happier. But you've got a point.
Maybe you're right. It may have made 'decimal-oriented people'
happier, but it's a pain in the ass to 'hexadecimal-oriented people'
like us, isn't it?
Speaking of '-raw', that's the BSD sense of the word for unprocessed data,
and for a Daemon freak it came out naturally.
All right. It's your decision :-)
are IANA-registered (C<UTF-16> even as a preferred MIME name)
but probably should be avoided as encodings for web pages due to
the lack of browser support.
Not that I'd encourage people to use UTF-16 for their web pages,
but UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
and Mozilla.
The problem is not just browsers. As a network consultant I would
advise against UTF-16 or any text encoding that may croak cat(1) and
more(1). (We can be frank about "Mojibake": in cases like mojibake, the
text at least makes it to EOF.) After all, we already have UTF-8, which
that good old cat of ours can read through to EOF with no problem.
Sure, I like UTF-8 much more than UTF-16 or any byte-order-dependent
and 'cat-breaking' :-) transformation format of Unicode. I can assure
you that I'm certainly on your side! Microsoft products generate UTF-8
with a **totally redundant** BOM (byte order mark) at the beginning. I don't
know whether there's a conspiracy to break the time-honored Unix tradition
of command-line filtering, but it's certainly annoying to deal with UTF-8
files with a BOM. For example, 'cat f1 f2 f3' wouldn't work as-is; 'cat'
and many other Unix tools would need to be modified to remove the BOM.
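The filtering problem is easy to see: concatenating BOM-prefixed UTF-8 files leaves stray U+FEFF characters mid-stream unless each file's BOM is stripped first. A sketch (Python for illustration):

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Two files as Microsoft tools would write them, BOM and all.
parts = [codecs.BOM_UTF8 + "f1\n".encode("utf-8"),
         codecs.BOM_UTF8 + "f2\n".encode("utf-8")]

# A naive cat would keep the second BOM in the middle of the stream;
# stripping each part first gives clean output.
joined = b"".join(strip_utf8_bom(p) for p in parts)
print(joined.decode("utf-8"))  # no stray U+FEFF anywhere
```

This is exactly the fixup that 'cat' and friends would otherwise have to learn.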
L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
Somewhat obsolete (last update in 1996), but still useful. Also try
Is there any rule against mentioning a book in print as opposed
to online docs :-) ? Why don't you also refer to the successor to
CJK.inf, CJKV Information Processing, with its very comprehensive
coverage of character sets and encodings?
No. I was just too lazy to browse for ISBN number and such (I know it
In case you haven't done that, here's the bibliography for the book:
Ken Lunde, CJKV Information Processing, O'Reilly & Associates, 1999.
ISBN: 1-56592-224-7
Jungshik