perl-unicode

Re: Long name rocks! But how about *.ecm?

2002-03-25 08:41:59

On Mon, 25 Mar 2002 21:56:08 +0900
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> wrote:

On Monday, March 25, 2002, at 09:37 , Nick Ing-Simmons wrote:

 in trouble?  Or is perl on such systems smart enough to load
UNIVERSA.pm (I guess this is the case).

They load UNIVERSAL.pm and the OS truncates it and finds UNIVERSA.pm.

  Size reduction was a byproduct of */Makefile.PL linting.
  As for "Encode::Supports", there is another concern in perldoc;  is
perldoc smart enough to 8.3-ize filenames?

Same logic as above works - name passed to OS is still the long one.

   Okay, I am convinced that we should stick with the original, long, 
user-friendly names but how about ucm-transitions?
   As of Encode-0.98, there are so many duped tables under Encode/ and I 
want to tidy it up if possible.  Well, for this I will wait for what 
Sadahiro-san has to say....

hmm.... I'm not in opposition to it.

IMO, a more significant point might be
which encodings are worth implementing in the core distribution.
In other words, it's better to assess each encoding
that is supported only by Encode::Tcl.

AFAIK, such encodings include ISO-2022-JP-2 and ISO-2022-CN.
(defined by 2022-jp2.enc and 2022-cn.enc, respectively)

But it may seem weird to encode to them,
since their definitions contain many, many duplicates.

Say, here is an example of ISO-2022-CN cited from RFC 1922.

      Example: the hex sequence

         1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f

      represents the Chinese word for "Interchange" (jiao huan) twice;

where, <3d 3b 3b 3b> is "jiao huan" in GB (GB 2312-80),
   and <47 28 5f 50> is "jiao huan" in CNS (CNS 11643 plane-1).

Decoding it then gives "\x{4ea4}\x{6362}\x{4ea4}\x{63db}".
"jiao" maps to the same Unicode code point from both GB and CNS!

Encoding "\x{4ea4}\x{6362}\x{4ea4}\x{63db}" back to ISO-2022-CN
will give the following hex sequence:

   1b 24 29 41 0e 3d 3b 3b 3b 3d 3b 1b 24 29 47 5f 50 0f

where, <3d 3b 3b 3b 3d 3b> is "jiao huan jiao" in GB,
   and <5f 50> is "huan" in CNS.

How about it?
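
The round trip above can be sketched in Python (used here instead of Perl
only because its standard library ships a gb2312 codec). This is a minimal
sketch and not how Encode::Tcl works: decode_2022cn_sketch and CNS1_SAMPLE
are hypothetical names, the CNS table holds only the two pairs quoted
above, and only the two G1 designations from the example are recognized.

```python
ESC_GB = b"\x1b$)A"    # designate GB 2312-80 to G1
ESC_CNS1 = b"\x1b$)G"  # designate CNS 11643 plane 1 to G1
SO, SI = 0x0E, 0x0F    # shift out (use G1) / shift in (back to ASCII)

# Assumption: only the two CNS pairs quoted in this message,
# not a real CNS 11643 table.
CNS1_SAMPLE = {b"\x47\x28": "\u4ea4", b"\x5f\x50": "\u63db"}


def decode_2022cn_sketch(data: bytes) -> str:
    """Decode the small ISO-2022-CN subset used in the RFC 1922 example."""
    out = []
    g1 = None        # which charset is currently designated to G1
    shifted = False  # True between SO and SI
    i = 0
    while i < len(data):
        if data.startswith(ESC_GB, i):
            g1, i = "gb", i + 4
        elif data.startswith(ESC_CNS1, i):
            g1, i = "cns1", i + 4
        elif data[i] == SO:
            shifted, i = True, i + 1
        elif data[i] == SI:
            shifted, i = False, i + 1
        elif shifted:
            pair = data[i:i + 2]
            if g1 == "gb":
                # 7-bit GB pair -> EUC-CN by setting the high bit
                out.append(bytes(b | 0x80 for b in pair).decode("gb2312"))
            else:
                out.append(CNS1_SAMPLE[pair])
            i += 2
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)


rfc_bytes = bytes.fromhex("1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f")
reencoded = bytes.fromhex("1b 24 29 41 0e 3d 3b 3b 3b 3d 3b 1b 24 29 47 5f 50 0f")

# Two different byte sequences, one decoded string: the round trip
# cannot restore which charset "jiao" originally came from.
same_text = decode_2022cn_sketch(rfc_bytes) == decode_2022cn_sketch(reencoded)
```

Here same_text is True while rfc_bytes and reencoded differ, which is the
loss of information described above.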

More confusing is ISO-2022-JP-2, as it has JIS/GB/KS characters.
Many kanji/hanzi/hanja are *triplicated*!
(Of course triplicates include hiragana, katakana, Greek, etc.)

A solution to distinguish the languages may be tagging,
but is it truly useful?

NOTE
  In Encode::Tcl::Escape::encode(), each character
  is looked up in the order listed in the .enc file.

  Say, according to 2022-jp2.enc,
  jis0212 is preferred over gb2312,
  and gb2312 over ksc5601.

E
name            iso2022-jp2
init            {}
final           {}
ascii           \x1b(B
ascii           \x1b(J
jis0208         \x1b$B
jis0208         \x1b$@
jis0212         \x1b$(D
gb2312          \x1b$A
ksc5601         \x1b$(C
7bit-latin1     \x1b.A
7bit-greek      \x1b.F
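
The "first charset listed wins" behaviour in the NOTE can be sketched
roughly as follows, again in Python. This is an illustration under
assumptions, not the Encode::Tcl::Escape code: PREFERENCE, charsets_for
and first_charset are hypothetical names, the Python codecs shift_jis,
gb2312 and euc_kr stand in for the jis0208, gb2312 and ksc5601 tables,
and jis0212 is left out because the sketch has no standalone stand-in
for it.

```python
# (label as in the .enc file, Python codec used as a stand-in)
PREFERENCE = [
    ("jis0208", "shift_jis"),
    ("gb2312", "gb2312"),
    ("ksc5601", "euc_kr"),
]


def charsets_for(ch, order=PREFERENCE):
    """Return every charset label, in listed order, that can encode ch."""
    found = []
    for label, codec in order:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue
        found.append(label)
    return found


def first_charset(ch, order=PREFERENCE):
    """Return the first listed charset that can encode ch, or None."""
    found = charsets_for(ch, order)
    return found[0] if found else None


# Hiragana A (U+3042) is available in all three tables, so the
# listing order alone decides which escape sequence would be emitted.
triplicated = charsets_for("\u3042")
chosen = first_charset("\u3042")
```

Reversing the order, e.g. first_charset("\u3042", list(reversed(PREFERENCE))),
picks ksc5601 instead, so the same text encodes differently depending
only on how the .enc file is ordered.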


   At least euc-jp must be in *.ucm because it contains triple-bytes (JIS 
X 0212), which Encode::Tcl used to handle via Encode::Tcl::Extended but 
now ::Extended is gone....

Well, I've agreed with it.
http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2002-03/msg00076.html

Dan the Encode Maintainer

Regards,
SADAHIRO Tomoyuki