On Mon, 25 Mar 2002 21:56:08 +0900
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> wrote:
On Monday, March 25, 2002, at 09:37 , Nick Ing-Simmons wrote:
in trouble? Or is perl on such systems smart enough to load
UNIVERSA.pm (I guess this is the case).
They load UNIVERSAL.pm and the OS truncates it and finds UNIVERSA.pm.
Size reduction was a byproduct of */Makefile.PL linting.
As for "Encode::Supports", there is another concern in perldoc; is
perldoc smart enough to 8.3-ize filenames?
Same logic as above works - name passed to OS is still the long one.
Okay, I am convinced that we should stick with the original, long,
user-friendly names but how about ucm-transitions?
As of Encode-0.98, there are many duplicated tables under Encode/ and I
want to tidy them up if possible. Well, for this I will wait for what
Sadahiro-san has to say....
Hmm.... I'm not opposed to it.
IMO, a more significant point might be
which encodings are worth implementing in the core distribution.
In other words, it's better to assess each encoding
that is supported only by Encode::Tcl.
AFAIK, such encodings include ISO-2022-JP-2 and ISO-2022-CN
(defined by 2022-jp2.enc and 2022-cn.enc, respectively).
But it may seem weird to encode to them,
since their definitions contain a great many duplicated characters.
Say, here is an example of ISO-2022-CN cited from RFC 1922.
Example: the hex sequence
1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f
represents the Chinese word for "Interchange" (jiao huan) twice;
where, <3d 3b 3b 3b> is "jiao huan" in GB (GB 2312-80),
and <47 28 5f 50> is "jiao huan" in CNS (CNS 11643 plane-1).
Decoding it gives "\x{4ea4}\x{6362}\x{4ea4}\x{63db}".
Both occurrences of "jiao" map to the same code point in Unicode!
Encoding "\x{4ea4}\x{6362}\x{4ea4}\x{63db}" to ISO-2022-CN
gives the following hex sequence:
1b 24 29 41 0e 3d 3b 3b 3b 3d 3b 1b 24 29 47 5f 50 0f
where <3d 3b 3b 3b 3d 3b> is "jiao huan jiao" in GB,
and <5f 50> is "huan" in CNS.
How about it?
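
To make the difference between the two hex sequences easier to see, here is a
minimal Python sketch that splits an ISO-2022-CN byte stream into runs per
designated character set. It is not Encode::Tcl code; the function names
(segments, gb_text) and the labels "GB2312" and "CNS-1" are made up for this
illustration, and only the two SO designations used in the RFC 1922 example
are recognized.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Only the two SO designations that appear in the example above.
# The labels are illustrative names, not official charset identifiers.
SO_DESIGNATIONS = {b"\x1b$)A": "GB2312", b"\x1b$)G": "CNS-1"}

ORIGINAL = "1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f"
REENCODED = "1b 24 29 41 0e 3d 3b 3b 3b 3d 3b 1b 24 29 47 5f 50 0f"


def segments(hex_text):
    """Return a list of (charset label, hex of character bytes) runs."""
    data = bytes.fromhex(hex_text)
    runs = []
    designated = None   # charset currently designated for SO
    shifted = False     # True between SO and SI
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            designated = SO_DESIGNATIONS[data[i:i + 4]]
            i += 4
        elif byte == SO:
            shifted = True
            i += 1
        elif byte == SI:
            shifted = False
            i += 1
        else:
            label = designated if shifted else "ASCII"
            width = 2 if shifted else 1
            chunk = data[i:i + width].hex()
            if runs and runs[-1][0] == label:
                runs[-1] = (label, runs[-1][1] + chunk)
            else:
                runs.append((label, chunk))
            i += width
    return runs


def gb_text(run_hex):
    """Decode a 7-bit GB 2312 run by setting the high bits (EUC-CN form)."""
    return bytes(b | 0x80 for b in bytes.fromhex(run_hex)).decode("gb2312")


if __name__ == "__main__":
    print(segments(ORIGINAL))
    print(segments(REENCODED))
```

In the original sequence the GB run holds two characters and the CNS run two;
after the round trip through Unicode the GB run holds three and the CNS run
only one, so the byte stream is not preserved.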
ISO-2022-JP-2 is even more confusing, as it has JIS/GB/KS characters.
Many kanji/hanzi/hanja are *triplicated*!
(Of course the triplicates also include hiragana, katakana, Greek, etc.)
A solution to distinguish the languages may be tagging,
but is it truly useful?
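
The triplication can be checked with Python's built-in iso2022_jp_2 codec
(Python's codec, not Encode::Tcl, so the behaviour of the two may differ).
The sketch below builds "jiao" (U+4EA4) three times, once each via JIS X 0208,
GB 2312 and KS C 5601, and decodes all three. The helper as_iso2022_jp2 and
the DESIGNATIONS table are made up for this illustration; the 7-bit codes are
derived from the EUC codecs by clearing the high bit.

```python
ESC_TO_ASCII = b"\x1b(B"

# charset label -> (ISO-2022-JP-2 escape sequence, Python EUC codec)
DESIGNATIONS = {
    "jis0208": (b"\x1b$B", "euc_jp"),
    "gb2312": (b"\x1b$A", "gb2312"),
    "ksc5601": (b"\x1b$(C", "euc_kr"),
}


def as_iso2022_jp2(ch, charset):
    """Encode one character through an explicitly chosen charset."""
    escape, euc_codec = DESIGNATIONS[charset]
    seven_bit = bytes(b & 0x7F for b in ch.encode(euc_codec))
    return escape + seven_bit + ESC_TO_ASCII


jiao = "\u4ea4"
encoded = {name: as_iso2022_jp2(jiao, name) for name in DESIGNATIONS}
decoded = {name: raw.decode("iso2022_jp_2") for name, raw in encoded.items()}

# Encoding back from Unicode can only pick one of the three forms.
round_trip = jiao.encode("iso2022_jp_2")

if __name__ == "__main__":
    for name, raw in encoded.items():
        print(name, raw.hex(" "), "->", ascii(decoded[name]))
    print("round trip:", round_trip.hex(" "))
```

Three different byte sequences decode to the same code point, and the
round trip keeps only one of them, which is why any language distinction
would have to be carried outside the character data (e.g. by tagging).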
NOTE
In Encode::Tcl::Escape::encode(), each character
is looked up in the order listed in the .enc file.
Say, according to 2022-jp2.enc,
jis0212 is preferred over gb2312,
and gb2312 over ksc5601.
E
name iso2022-jp2
init {}
final {}
ascii \x1b(B
ascii \x1b(J
jis0208 \x1b$B
jis0208 \x1b$@
jis0212 \x1b$(D
gb2312 \x1b$A
ksc5601 \x1b$(C
7bit-latin1 \x1b.A
7bit-greek \x1b.F
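
As a rough illustration of "first table in the listing wins", here is a small
Python sketch. It is not the Encode::Tcl::Escape implementation: the names
PREFERENCE and pick_charset are hypothetical, Python's EUC codecs stand in for
the Tcl tables, duplicate entries in the listing are collapsed, and
7bit-latin1 and 7bit-greek are left out for brevity.

```python
def _euc(ch, codec):
    """Return the EUC bytes for ch, or None if the codec cannot encode it."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


def _in_jis0208(ch):
    # In EUC-JP, JIS X 0208 characters are two bytes, both >= 0xA1.
    raw = _euc(ch, "euc_jp")
    return raw is not None and len(raw) == 2 and raw[0] >= 0xA1


def _in_jis0212(ch):
    # In EUC-JP, JIS X 0212 characters are three bytes led by 0x8F.
    raw = _euc(ch, "euc_jp")
    return raw is not None and len(raw) == 3 and raw[0] == 0x8F


def _in_two_byte(codec):
    def check(ch):
        raw = _euc(ch, codec)
        return raw is not None and len(raw) == 2
    return check


# Same order as the listing above (assumed to be the lookup order).
PREFERENCE = [
    ("ascii", str.isascii),
    ("jis0208", _in_jis0208),
    ("jis0212", _in_jis0212),
    ("gb2312", _in_two_byte("gb2312")),
    ("ksc5601", _in_two_byte("euc_kr")),
]


def pick_charset(ch):
    """Return the first charset in PREFERENCE that covers ch, else None."""
    for name, covers in PREFERENCE:
        if covers(ch):
            return name
    return None


if __name__ == "__main__":
    for ch in ("A", "\u4ea4", "\ud55c"):
        print(ascii(ch), pick_charset(ch))
```

With this ordering a character shared by JIS, GB and KS always comes out as
JIS, and gb2312 or ksc5601 is chosen only for characters the earlier tables
lack.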
At least euc-jp must be in *.ucm because it contains triple-byte
characters (JIS X 0212), which Encode::Tcl used to handle via
Encode::Tcl::Extended, but now ::Extended is gone....
Well, I agree with that.
http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2002-03/msg00076.html
Dan the Encode Maintainer
Regards,
SADAHIRO Tomoyuki