perl-unicode

Re: [Encode] Encode::Supported revised

2002-04-04 13:07:33
On Thu, 4 Apr 2002, Dan Kogai wrote:


  Konnichiha !
  (hope I got this one right).

On Thursday, April 4, 2002, at 03:06 , Jungshik Shin wrote:
        o The MIME name as defined in IETF RFCs.
   UCS-2         ucs2, iso-10646-1                    [IANA, et al]
   UCS-2le
   UTF-8         utf8                                     [RFC2279]
   ----------------------------------------------------------------

  How about UCS-2BE? Of course, if UCS-2 is network byte order
(big endian), it's not necessary. In that case, you may alias UCS-2
to UCS-2BE.

   And UCS2-NB (Network Byte order)?  Unicode terminology is confusing 
sometimes.
   I've checked http://www.unicode.org/glossary/ and it seems that the 
canonical-name-to-alias mapping should be as follows.

    UCS-2         ucs2, iso-10646-1, utf-16be
    UTF-16LE      ucs2-le
    UTF-8         utf8

   I left UCS-2 as is because it is IANA-registered. UCS-2 is indeed the 
name of an encoding, as the URL above clearly states.  It is also less 
confusing than UTF-16.
   ucs2-le will be fixed.
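   The byte-order distinction behind these names is easy to see from Perl.
A quick sketch (it assumes an Encode build that accepts the UTF-16BE and
UTF-16LE names, as current Encode does):

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0041 "A": big-endian (network byte order) puts the high octet first.
my $be = encode("UTF-16BE", "A");   # octets 0x00 0x41
my $le = encode("UTF-16LE", "A");   # octets 0x41 0x00

printf "BE: %s\n", join " ", map { sprintf "%02X", $_ } unpack "C*", $be;
printf "LE: %s\n", join " ", map { sprintf "%02X", $_ } unpack "C*", $le;
```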

  IETF RFC 2781 also 'defines' (for IETF purposes) UTF-16LE, UTF-16BE, and
UTF-16. It's at http://www.faqs.org/rfcs/rfc2781.html  among other places.
BTW, how does Encode deal with the BOM in UTF-16? It's trivial to add the
BOM at the beginning by hand (with perl), but you may consider
adding an option (??) to add/remove the BOM automatically when converting
to/from UTF-16(LE|BE). 
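  Until such an option exists, adding and stripping the BOM by hand is
indeed trivial. A sketch (assuming only Encode's encode/decode and the
UTF-16BE name; the BOM is just the character U+FEFF):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $str = "Konnichiha";

# Writing: prepend U+FEFF so the encoded byte stream carries a BOM.
my $octets = encode("UTF-16BE", "\x{FEFF}" . $str);

# Reading: decode, then drop a leading BOM if one is present.
my $back = decode("UTF-16BE", $octets);
$back =~ s/\A\x{FEFF}//;

print $back eq $str ? "round trip ok\n" : "mismatch\n";
```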



  Could you please just say 'Encoding vs Character Set'
and remove parenthetical 'charset for short' or 'just charset' following
'character set'?  I agree with your distinction between 'encoding' and
'character set', but what bothers me is that you treat 'charset'
as a synonym for 'character set'.

    Now I agree.  charset is more appropriate for "coded character set", 
and that was the MIME header's original intention.  EUC is indeed a coded 
character set, but charset=ISO-2022-(JP|KP|CN)(-\d+)?  is absolutely 
confusing -- it is a character encoding scheme at best.  I am thinking 
of adding a small glossary to this document, as follows.

And here is a glossary I manually parsed out of 
http://www.unicode.org/glossary/ , right after the signature.

  Thank you. BTW, you may also want to take a look at W3C's charmod
TR at http://www.w3.org/TR/charmod and 'charset' part of html4 spec
at http://www.w3.org/TR/REC-html40/charset.html 


   In a strict sense, the concept of 'raw' or 'as-is' (which you
apparently use to mean a coded character set invoked on GL) is not
appropriate, because JIS X 0208, JIS X 0212 and KS X 1001 don't map
characters to their GL positions when enumerating characters in their
charts. The numeric IDs used in JIS X 0208, JIS X 0212 and KS X 1001
are row (ku) and column (ten?), while GB 2312-80 appears to use GL
codepoints. That's why I prefer gb2312-gl and ksx1001-gl to gb2312-raw
and ksx1001-raw: 'gl' doesn't run the risk of being mistaken for row and
column numbers.

   I wonder whether the ku-ten form is canonical or derived.  JIS X 0208 was 
clearly designed to be ISO-2022 compliant.  Technically speaking, 
0x21-0x7e should be the original form and 1-94 is derived, to make decimal 
people happier.  But you've got a point.
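   Since the two forms differ only by a constant offset of 0x20, converting
between them is a one-liner either way. A sketch (where 'GL' means the raw
0x21-0x7e byte pair):

```perl
use strict;
use warnings;

# ku-ten (1..94, 1..94) <-> GL bytes (0x21..0x7E): the offset is 0x20.
sub kuten_to_gl {
    my ($ku, $ten) = @_;
    return pack "C2", $ku + 0x20, $ten + 0x20;
}

sub gl_to_kuten {
    my ($b1, $b2) = unpack "C2", shift;
    return ($b1 - 0x20, $b2 - 0x20);
}

# e.g. ku-ten 16-1 (the first kanji row in JIS X 0208) is GL 0x30 0x21.
printf "%02X %02X\n", unpack "C2", kuten_to_gl(16, 1);   # prints "30 21"
```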

  Maybe you're right. It may have made 'decimal-oriented people'
happier, but it's a pain in the ass to 'hexadecimal-oriented people'
like us, isn't it?


   Speaking of '-raw', that's the BSD sense of the word for unprocessed 
data, and for a Daemon freak it came out naturally.

  All right. It's your decision :-)
  

are IANA-registered (C<UTF-16> even as a preferred MIME name)
but probably should be avoided as an encoding for web pages due to
the lack of browser support.

  Not that I'd encourage people to use UTF-16 for their web pages,
but  UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
and Mozilla.

    The problem is not just browsers.  As a network consultant I would 
advise against UTF-16 or any text encoding that may croak cat(1) and 
more(1).  (We can be frank about "Mojibake": even in cases of mojibake, 
the text still goes to EOF.)  After all, we already have UTF-8, which 
that good old cat of ours can read till EOF with no problem.

  Sure, I like UTF-8 much more than UTF-16 or any other byte-order-dependent
and 'cat-breaking' :-) transformation format of Unicode. I can assure
you that I'm certainly on your side !  Microsoft products generate UTF-8
with **totally redundant** BOM (byte order mark) at the beginning. I don't
know whether there's a conspiracy to break time-honored Unix tradition
of command line filtering, but it's certainly annoying to deal with UTF-8
files with BOM. For example, 'cat f1 f2 f3' wouldn't work as it is. 'cat'
and many other Unix tools need to be modified to remove 'BOM'.
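  Stripping it is just a three-octet substitution at the top of each file.
A minimal sketch (the helper name is my own):

```perl
use strict;
use warnings;

# Remove a leading UTF-8 BOM (the octets EF BB BF) from a chunk of bytes.
sub strip_utf8_bom {
    my ($octets) = @_;
    $octets =~ s/\A\xEF\xBB\xBF//;
    return $octets;
}

# A cat-like filter would apply this to the first line of each file only,
# e.g.:  $_ = strip_utf8_bom($_) if $. == 1;
print strip_utf8_bom("\xEF\xBB\xBFhello\n");   # prints "hello"
```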

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
Somewhat obsolete (last update in 1996), but still useful.  Also try

  Is there any rule against mentioning a book in print as opposed
to online docs :-) ?  Why don't you also refer to the successor to
CJK.inf, CJKV Information Processing, with its very comprehensive coverage
of character sets and encodings?

   No.  I was just too lazy to browse for ISBN number and such (I know it 

   In case you haven't done that, here's the bibliography for the book:

   Ken Lunde, CJKV Information Processing, O'Reilly & Associates, 1999.
     ISBN: 1-56592-224-7

    Jungshik