perl-unicode

Re: Warning messages for ill-formed data

2003-03-25 07:30:07
Autrijus (and Porters),

I think you are following this thread, but in case you are not: Sadahiro-san proposes removing some extraneous (and presumably unneeded) control characters at \x80-\xA0 from the big5-eten map to solve problems that arise in certain circumstances. Since these control characters merely duplicate those at \x00-\x20, I think it is a good idea to go ahead (and do the same to big5-hkscs.ucm). But I am not as sure of Big5 as you are, so please check whether the proposal is right.
  If you affirm the idea, I'll $Encode::VERSION++.

Dan the Encode Maintainer

On Tuesday, Mar 25, 2003, at 21:53 Asia/Tokyo, SADAHIRO Tomoyuki wrote:
Well, is it right?

I'm not sure of the status and the single byte-range
for Big-5, though.

diff -urN ucm~/big5-eten.ucm ucm/big5-eten.ucm
--- ucm~/big5-eten.ucm  Thu Jan 23 23:21:00 2003
+++ ucm/big5-eten.ucm   Tue Mar 25 21:43:00 2003
@@ -137,38 +137,6 @@
 <U007E> \x7E |0 # TILDE
 <U007F> \x7F |0 # DELETE
 <U0080> \x80 |0 # <control>
-<U0081> \x81 |0 # <control>
-<U0082> \x82 |0 # BREAK PERMITTED HERE
-<U0083> \x83 |0 # NO BREAK HERE
-<U0084> \x84 |0 # <control>
-<U0085> \x85 |0 # NEXT LINE
-<U0086> \x86 |0 # START OF SELECTED AREA
-<U0087> \x87 |0 # END OF SELECTED AREA
-<U0088> \x88 |0 # CHARACTER TABULATION SET
-<U0089> \x89 |0 # CHARACTER TABULATION WITH JUSTIFICATION
-<U008A> \x8A |0 # LINE TABULATION SET
-<U008B> \x8B |0 # PARTIAL LINE DOWN
-<U008C> \x8C |0 # PARTIAL LINE UP
-<U008D> \x8D |0 # REVERSE LINE FEED
-<U008E> \x8E |0 # SINGLE SHIFT TWO
-<U008F> \x8F |0 # SINGLE SHIFT THREE
-<U0090> \x90 |0 # DEVICE CONTROL STRING
-<U0091> \x91 |0 # PRIVATE USE ONE
-<U0092> \x92 |0 # PRIVATE USE TWO
-<U0093> \x93 |0 # SET TRANSMIT STATE
-<U0094> \x94 |0 # CANCEL CHARACTER
-<U0095> \x95 |0 # MESSAGE WAITING
-<U0096> \x96 |0 # START OF GUARDED AREA
-<U0097> \x97 |0 # END OF GUARDED AREA
-<U0098> \x98 |0 # START OF STRING
-<U0099> \x99 |0 # <control>
-<U009A> \x9A |0 # SINGLE CHARACTER INTRODUCER
-<U009B> \x9B |0 # CONTROL SEQUENCE INTRODUCER
-<U009C> \x9C |0 # STRING TERMINATOR
-<U009D> \x9D |0 # OPERATING SYSTEM COMMAND
-<U009E> \x9E |0 # PRIVACY MESSAGE
-<U009F> \x9F |0 # APPLICATION PROGRAM COMMAND
-<U00A0> \xA0 |0 # NO-BREAK SPACE
 <U00A7> \xA1\xB1 |0
 <U00A8> \xC6\xD8 |0
 <U00AF> \xA1\xC2 |0
@@ -178,11 +146,6 @@
 <U00D7> \xA1\xD1 |0
 <U00F7> \xA1\xD2 |0
 <U00F8> \xC8\xFB |0
-<U00FA> \xFA |0 # LATIN SMALL LETTER U WITH ACUTE
-<U00FB> \xFC |0 # LATIN SMALL LETTER U WITH CIRCUMFLEX
-<U00FD> \xFD |0 # LATIN SMALL LETTER Y WITH ACUTE
-<U00FE> \xFE |0 # LATIN SMALL LETTER THORN
-<U00FF> \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS
 <U014B> \xC8\xFC |0
 <U0153> \xC8\xFA |0
 <U0250> \xC8\xF6 |0
diff -urN ucm~/big5-hkscs.ucm ucm/big5-hkscs.ucm
--- ucm~/big5-hkscs.ucm Thu Jan 23 23:21:02 2003
+++ ucm/big5-hkscs.ucm  Tue Mar 25 21:37:10 2003
@@ -136,13 +136,6 @@
 <U007E> \x7E |0 # TILDE
 <U007F> \x7F |0 # DELETE
 <U0080> \x80 |0 # <control>
-<U0081> \x81 |0 # <control>
-<U0082> \x82 |0 # BREAK PERMITTED HERE
-<U0083> \x83 |0 # NO BREAK HERE
-<U0084> \x84 |0 # <control>
-<U0085> \x85 |0 # NEXT LINE
-<U0086> \x86 |0 # START OF SELECTED AREA
-<U0087> \x87 |0 # END OF SELECTED AREA
 <U00A7> \xA1\xB1 |0
 <U00A8> \xC6\xD8 |0
 <U00AF> \xA1\xC2 |0
@@ -171,7 +164,6 @@
 <U00F9> \x88\x7B |0
 <U00FA> \x88\x79 |0
 <U00FC> \x88\xA2 |0
-<U00FF> \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS
 <U0100> \x88\x56 |0
 <U0101> \x88\x67 |0
 <U0112> \x88\x5A |0

Regards,
SADAHIRO Tomoyuki
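
For anyone wanting to double-check a .ucm file after applying the patch,
here is a minimal Perl sketch (not part of Encode) that lists the mapping
lines assigning a single byte at \x81 or above. The subroutine name
single_high_byte_mappings is hypothetical, and the line layout it expects
is assumed from the diff above. It starts at \x81 because the patch leaves
the \x80 entry in place.

```perl
use strict;
use warnings;

# Return a description of every mapping line that assigns exactly one
# byte in the range \x81-\xFF.
# Assumed line layout (taken from the diff): <UXXXX> \xNN |0 # comment
sub single_high_byte_mappings {
    my @lines = @_;
    my @hits;
    for my $line (@lines) {
        # One <U....> code point, exactly one \xNN byte, then the |flag.
        # Double-byte entries such as \xA1\xB1 fail the \s+ after the byte.
        next unless $line =~ /^<U([0-9A-Fa-f]{4,6})>\s+\\x([0-9A-Fa-f]{2})\s+\|\d/;
        my ($ucs, $byte) = (hex $1, hex $2);
        push @hits, sprintf('U+%04X <- 0x%02X', $ucs, $byte) if $byte >= 0x81;
    }
    return @hits;
}

# Sample lines copied from the diff; single quotes keep \x literal.
my @sample = (
    '<U007F> \x7F |0 # DELETE',
    '<U0080> \x80 |0 # <control>',
    '<U0081> \x81 |0 # <control>',
    '<U00A7> \xA1\xB1 |0',
    '<U00FF> \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS',
);
print "$_\n" for single_high_byte_mappings(@sample);
```

On the sample lines this reports only the \x81 and \xFF entries. Run over
a patched file, an empty result would suggest that no single high byte is
still mapped.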

I often encounter lower-ASCII codes mixed in with Big5 text, which is
fine and straightforward to handle.  However, a problem arises when
upper-ASCII bytes occasionally occur outside the Big5 range.  Such a
byte is probably an error or part of a user-defined character.
However, it appears that Encode DOES NOT display warnings for these,
but rather maps individual upper-ASCII bytes to conventional characters
such as the Roman letters with diacritics commonly found in European
languages.  (It appears that Encode displays warnings for characters
that are within the Big5 range but have no mapping to Unicode, perhaps
because these code points are not used in Big5 itself.)

Is there a way to cause Encode to display warnings for upper ASCII
outside of the Big5 range when converting from Big5 to Unicode?  If
not, could the developers consider this for a future fix?

Mark
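
One way to flag such bytes independently of the mapping tables is to scan
the octets before decoding. The sketch below is an illustration only, not
Encode's own behavior: the subroutine name stray_high_bytes is
hypothetical, and it assumes that a Big5 double-byte sequence is a lead
byte in \x81-\xFE followed by a trail byte in \x40-\x7E or \xA1-\xFE.
Adjust the ranges if your Big5 variant differs.

```perl
use strict;
use warnings;

# Return [offset, byte value] pairs for every byte that is neither ASCII
# nor part of a well-formed two-byte sequence.
# ASSUMPTION: lead byte \x81-\xFE, trail byte \x40-\x7E or \xA1-\xFE.
sub stray_high_bytes {
    my ($octets) = @_;
    my @stray;
    pos($octets) = 0;
    while (pos($octets) < length $octets) {
        # Skip a run of ASCII bytes.
        next if $octets =~ /\G[\x00-\x7F]+/gc;
        # Skip one well-formed double-byte sequence.
        next if $octets =~ /\G[\x81-\xFE][\x40-\x7E\xA1-\xFE]/gc;
        # Otherwise this byte has no valid partner: record it and move on.
        $octets =~ /\G(.)/gcs;
        push @stray, [ pos($octets) - 1, ord $1 ];
    }
    return @stray;
}

# "A", one double-byte sequence, a lone \xFF, "B", a lone \x80.
my $input = "A\xA4\x40\xFFB\x80";
for my $hit (stray_high_bytes($input)) {
    printf "stray byte 0x%02X at offset %d\n", $hit->[1], $hit->[0];
}
# prints:
#   stray byte 0xFF at offset 3
#   stray byte 0x80 at offset 5
```

The scan checks only the byte structure, so a well-formed pair that has no
Unicode mapping still passes here and is left for Encode to report.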


