perl-unicode

Re: 5.8 roadmap and Encode

2002-03-02 12:16:24
Autrijus Tang <autrijus(_at_)autrijus(_dot_)org> writes:
On Sat, Mar 02, 2002 at 11:12:42AM +0000, Nick Ing-Simmons wrote:
This and euc-tw use 1, 2 or 4-byte encoding. Any pointers on how to use
that functionality for Encode.pm?
The .ucm format can cope:

Thanks! I'm done with conversion and tested against libiconv. Patch follows;
files are available at <http://autrijus.org/ucm.tar.gz>.

Libiconv's GB18030 table elicited some warnings from compile:

   Unicode character 0xfdXX is illegal at ../compile line 81, <E> line 39659.

There are some other warnings when running compile without -Q,
e.g. the attached.

It seems that some of these encodings are not round-trip safe.
One reason for preferring .ucm is that, by declaring one of several
mappings for a character a fallback, one can get the "right" thing:
e.g. for <U00F3>, is that 2B2E or 282E?
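
To make the fallback idea concrete, here is a minimal sketch in Python (not part of Encode). It assumes the usual .ucm convention where a trailing |0 marks a round-trip mapping and |3 marks a decode-only mapping (bytes to Unicode). The function name mark_fallbacks and the "first listed wins" policy are hypothetical; which byte sequence is actually the "right" one for <U00F3> is exactly the open question here.

```python
def mark_fallbacks(pairs):
    """Turn (code_point, hex_bytes) pairs into .ucm-style lines.

    Assumed policy (illustrative only): the first byte sequence seen for a
    code point is kept as round-trip (|0); any later byte sequence for the
    same code point is emitted as decode-only (|3).
    """
    seen = set()
    lines = []
    for cp, hexbytes in pairs:
        flag = 3 if cp in seen else 0
        seen.add(cp)
        # "282E" -> "\x28\x2E"
        escaped = "".join(
            "\\x" + hexbytes[i:i + 2] for i in range(0, len(hexbytes), 2)
        )
        lines.append(f"<U{cp:04X}> {escaped} |{flag}")
    return lines


ucm_lines = mark_fallbacks([(0x00F3, "282E"), (0x00F3, "2B2E")])
for line in ucm_lines:
    print(line)
```

With such flags in place, encoding U+00F3 would always produce one byte sequence, while both byte sequences would still decode to U+00F3.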

The range in question is fdxx and ffxx. Is that anything to worry about?
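
One plausible reading, which the thread does not confirm, is that the warnings concern Unicode noncharacters: U+FDD0..U+FDEF and the last two code points of each plane (U+FFFE, U+FFFF and so on). A small Python sketch for checking a table against those ranges, with the hypothetical helper name is_noncharacter:

```python
def is_noncharacter(cp):
    """Return True if cp is a Unicode noncharacter.

    Noncharacters are U+FDD0..U+FDEF plus the code points ending in
    FFFE or FFFF in every plane (U+FFFE, U+FFFF, U+1FFFE, ...).
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Code points taken from a mapping table could be filtered like this:
suspects = [cp for cp in (0xFDCF, 0xFDD0, 0xFF47, 0xFFFE, 0xFFFF)
            if is_noncharacter(cp)]
```

Whether such mappings should be dropped from the table or kept with the warning is a separate decision.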

Also, the resulting file size is quite hefty:

-rw-r--r--  1 root  512  1688107 Mar  2 19:51 euc-tw.ucm
-rw-r--r--  1 root  512  1543333 Mar  2 19:51 gb18030.ucm

And they add ~600k to the compressed perl distribution. Is that acceptable?


The good news is there won't be anything else that big coming from the Chinese
front; aside from HZ, perl's support could be considered complete.


Test case?

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/

/home/perl5/perlio/perl -I../../lib compile -So /tmp/Encode/iso-ir-165.ucm Encode/iso-ir-165.enc
D encoded iso-ir-165
U03B1 is 283B and 2641
UFF47 is 2840 and 2367
U1FB1 is 2B21 and 2821
U03AC is 2B22 and 2822
U1FB0 is 2B23 and 2823
U1F70 is 2B24 and 2824
U0113 is 2B25 and 2825
U00E9 is 2B26 and 2826
U011B is 2B27 and 2827
U00E8 is 2B28 and 2828
U012B is 2B29 and 2829
U00ED is 2B2A and 282A
U01D0 is 2B2B and 282B
U00EC is 2B2C and 282C
U014D is 2B2D and 282D
U00F3 is 2B2E and 282E
U01D2 is 2B2F and 282F
U00F2 is 2B30 and 2830
U016B is 2B31 and 2831
U00FA is 2B32 and 2832
U01D4 is 2B33 and 2833
U00F9 is 2B34 and 2834
U01D6 is 2B35 and 2835
U01D8 is 2B36 and 2836
U01DA is 2B37 and 2837
U01DC is 2B38 and 2838
U00FC is 2B39 and 2839
U00EA is 2B3A and 283A
U03B1 is 2B3B and 2641
U1E3F is 2B3C and 283C
U0144 is 2B3D and 283D
U0148 is 2B3E and 283E
U01F9 is 2B3F and 283F
UFF47 is 2B40 and 2367
34 mapping conflicts
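
A report like the one above can be summarised per character. The following Python sketch assumes only the "Uxxxx is AAAA and BBBB" line shape seen in this output; the name group_conflicts is hypothetical and this is not part of the compile script.

```python
import re
from collections import defaultdict

# Matches lines such as "U00F3 is 2B2E and 282E".
CONFLICT = re.compile(r"^U([0-9A-F]{4,6}) is ([0-9A-F]+) and ([0-9A-F]+)$")


def group_conflicts(report):
    """Map each conflicting code point to the sorted byte sequences seen."""
    groups = defaultdict(set)
    for line in report.splitlines():
        m = CONFLICT.match(line.strip())
        if m:
            groups[int(m.group(1), 16)].update(m.group(2, 3))
    return {cp: sorted(seqs) for cp, seqs in groups.items()}


sample = """U03B1 is 283B and 2641
U00F3 is 2B2E and 282E
U03B1 is 2B3B and 2641"""
grouped = group_conflicts(sample)
for cp, seqs in grouped.items():
    print(f"U+{cp:04X}: {', '.join(seqs)}")
```

Grouping shows, for instance, that U03B1 is reported twice and has three competing byte sequences, so the number of affected characters is smaller than the number of reported conflicts.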
