perl-unicode

Re: Warning messages for ill-formed data

2003-03-25 06:30:05

Well, is the following patch right?

I'm not sure about the status of the single-byte range
in Big5, though.

diff -urN ucm~/big5-eten.ucm ucm/big5-eten.ucm
--- ucm~/big5-eten.ucm  Thu Jan 23 23:21:00 2003
+++ ucm/big5-eten.ucm   Tue Mar 25 21:43:00 2003
@@ -137,38 +137,6 @@
 <U007E> \x7E |0 # TILDE
 <U007F> \x7F |0 # DELETE
 <U0080> \x80 |0 # <control>
-<U0081> \x81 |0 # <control>
-<U0082> \x82 |0 # BREAK PERMITTED HERE
-<U0083> \x83 |0 # NO BREAK HERE
-<U0084> \x84 |0 # <control>
-<U0085> \x85 |0 # NEXT LINE
-<U0086> \x86 |0 # START OF SELECTED AREA
-<U0087> \x87 |0 # END OF SELECTED AREA
-<U0088> \x88 |0 # CHARACTER TABULATION SET
-<U0089> \x89 |0 # CHARACTER TABULATION WITH JUSTIFICATION
-<U008A> \x8A |0 # LINE TABULATION SET
-<U008B> \x8B |0 # PARTIAL LINE DOWN
-<U008C> \x8C |0 # PARTIAL LINE UP
-<U008D> \x8D |0 # REVERSE LINE FEED
-<U008E> \x8E |0 # SINGLE SHIFT TWO
-<U008F> \x8F |0 # SINGLE SHIFT THREE
-<U0090> \x90 |0 # DEVICE CONTROL STRING
-<U0091> \x91 |0 # PRIVATE USE ONE
-<U0092> \x92 |0 # PRIVATE USE TWO
-<U0093> \x93 |0 # SET TRANSMIT STATE
-<U0094> \x94 |0 # CANCEL CHARACTER
-<U0095> \x95 |0 # MESSAGE WAITING
-<U0096> \x96 |0 # START OF GUARDED AREA
-<U0097> \x97 |0 # END OF GUARDED AREA
-<U0098> \x98 |0 # START OF STRING
-<U0099> \x99 |0 # <control>
-<U009A> \x9A |0 # SINGLE CHARACTER INTRODUCER
-<U009B> \x9B |0 # CONTROL SEQUENCE INTRODUCER
-<U009C> \x9C |0 # STRING TERMINATOR
-<U009D> \x9D |0 # OPERATING SYSTEM COMMAND
-<U009E> \x9E |0 # PRIVACY MESSAGE
-<U009F> \x9F |0 # APPLICATION PROGRAM COMMAND
-<U00A0> \xA0 |0 # NO-BREAK SPACE
 <U00A7> \xA1\xB1 |0
 <U00A8> \xC6\xD8 |0
 <U00AF> \xA1\xC2 |0
@@ -178,11 +146,6 @@
 <U00D7> \xA1\xD1 |0
 <U00F7> \xA1\xD2 |0
 <U00F8> \xC8\xFB |0
-<U00FA> \xFA |0 # LATIN SMALL LETTER U WITH ACUTE
-<U00FB> \xFC |0 # LATIN SMALL LETTER U WITH CIRCUMFLEX
-<U00FD> \xFD |0 # LATIN SMALL LETTER Y WITH ACUTE
-<U00FE> \xFE |0 # LATIN SMALL LETTER THORN
-<U00FF> \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS
 <U014B> \xC8\xFC |0
 <U0153> \xC8\xFA |0
 <U0250> \xC8\xF6 |0
diff -urN ucm~/big5-hkscs.ucm ucm/big5-hkscs.ucm
--- ucm~/big5-hkscs.ucm Thu Jan 23 23:21:02 2003
+++ ucm/big5-hkscs.ucm  Tue Mar 25 21:37:10 2003
@@ -136,13 +136,6 @@
 <U007E> \x7E |0 # TILDE
 <U007F> \x7F |0 # DELETE
 <U0080> \x80 |0 # <control>
-<U0081> \x81 |0 # <control>
-<U0082> \x82 |0 # BREAK PERMITTED HERE
-<U0083> \x83 |0 # NO BREAK HERE
-<U0084> \x84 |0 # <control>
-<U0085> \x85 |0 # NEXT LINE
-<U0086> \x86 |0 # START OF SELECTED AREA
-<U0087> \x87 |0 # END OF SELECTED AREA
 <U00A7> \xA1\xB1 |0
 <U00A8> \xC6\xD8 |0
 <U00AF> \xA1\xC2 |0
@@ -171,7 +164,6 @@
 <U00F9> \x88\x7B |0
 <U00FA> \x88\x79 |0
 <U00FC> \x88\xA2 |0
-<U00FF> \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS
 <U0100> \x88\x56 |0
 <U0101> \x88\x67 |0
 <U0112> \x88\x5A |0

Regards,
SADAHIRO Tomoyuki
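
One way to see what a given Encode installation does with such bytes is to decode them with a strict CHECK value. The following is only a minimal sketch: strict_decode is a hypothetical helper, not part of Encode, and whether the lone \x81 byte is rejected depends on the mapping tables the installed Encode was built from, so no particular result is claimed for it. The \xA1\xB1 input is taken from the context lines of the patch above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): decode strictly and return
# (text, undef) on success or (undef, error message) on failure.
sub strict_decode {
    my ($encoding, $bytes) = @_;
    my $copy = $bytes;    # keep the caller's data untouched
    my $text = eval {
        Encode::decode($encoding, $copy,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? ($text, undef) : (undef, $@ || 'decode failed');
}

# \xA1\xB1 is listed in the table above as U+00A7.
my ($ok) = strict_decode('big5-eten', "\xA1\xB1");

# A lone \x81 is one of the single-byte mappings the patch removes.
# Whether it is rejected depends on the installed tables (assumption).
my (undef, $err) = strict_decode('big5-eten', "\x81");
print defined $err
    ? "lone \\x81 rejected: $err"
    : "lone \\x81 accepted by this Encode build\n";
```

Using Encode::FB_WARN() in place of Encode::FB_CROAK() should produce a warning rather than an exception, but only for bytes that have no mapping in the table.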

I often encounter lower-ASCII codes mixed in with Big5 text, which is fine
and straightforward to handle.  However, a problem arises when upper-ASCII
bytes occasionally occur outside of the Big5 range.  Such a byte is
probably an error or part of a user-defined character.
However, it appears that Encode DOES NOT display warnings for these but
rather maps individual upper-ASCII bytes to conventional characters such as
Roman letters with diacritics commonly found in European languages.
(It appears that Encode displays warnings for characters that are within
the Big5 range but have no mapping to Unicode, perhaps because
these code points are not used in Big5 itself.)

Is there a way to make Encode display warnings for upper-ASCII bytes
outside of the Big5 range when converting from Big5 to Unicode?  If not,
could the developers consider this for a future fix?

Mark
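
Independently of what the Encode tables map, stray high bytes can be located with a pre-scan before decoding. The sketch below is not Encode's behaviour and find_stray_high_bytes is a hypothetical name. It assumes single bytes run from 0x00 to 0x80, lead bytes from 0xA1 to 0xF9 (what the big5-eten patch in this thread leaves in place), and trail bytes from 0x40 to 0x7E or 0xA1 to 0xFE. The lead range can be overridden, for example 0x88 to 0xFE for big5-hkscs.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offsets of bytes above 0x80 that do not
# start a well-formed double-byte sequence.
# Assumed ranges: single 0x00-0x80, lead 0xA1-0xF9 by default,
# trail 0x40-0x7E or 0xA1-0xFE.
sub find_stray_high_bytes {
    my ($bytes, $lead_lo, $lead_hi) = @_;
    $lead_lo = 0xA1 unless defined $lead_lo;
    $lead_hi = 0xF9 unless defined $lead_hi;

    my @stray;
    my $len = length $bytes;
    my $i   = 0;
    while ($i < $len) {
        my $byte = ord substr($bytes, $i, 1);
        if ($byte <= 0x80) {    # single-byte code, nothing to check
            $i++;
            next;
        }
        if ($byte >= $lead_lo && $byte <= $lead_hi && $i + 1 < $len) {
            my $trail = ord substr($bytes, $i + 1, 1);
            if (   ($trail >= 0x40 && $trail <= 0x7E)
                || ($trail >= 0xA1 && $trail <= 0xFE))
            {
                $i += 2;        # well-formed pair, skip both bytes
                next;
            }
        }
        push @stray, $i;        # high byte that cannot start a pair
        $i++;
    }
    return @stray;
}

my @offsets = find_stray_high_bytes("A\xA1\xB1\xFFB");
printf "stray byte at offset %d\n", $_ for @offsets;
```

The scan checks structure only. A pair that is well-formed but has no Unicode mapping is not reported here, and that case is the one Encode already warns about.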

