perl-unicode

RE: Warning messages for ill-formed data

2003-03-24 22:30:06
I often encounter lower-ASCII codes mixed in with Big5 text, which is
fine and straightforward to handle.  However, a problem arises when
upper-ASCII bytes (those with the high bit set) occasionally occur
outside of the Big5 range.  When such a byte occurs, it is probably an
error or part of a user-defined character.  However, it appears that
Encode DOES NOT display warnings for these, but rather maps each
individual upper-ASCII byte to a conventional character, such as the
Roman letters with diacritics commonly found in European languages.
(It appears that Encode does display warnings for characters that are
within the Big5 range but have no mapping to Unicode, perhaps because
those code points are not used in Big5 itself.)

Is there a way to cause Encode to display warnings for upper-ASCII
bytes outside of the Big5 range when converting from Big5 to Unicode?
If not, could the developers consider this for a future fix?

Mark

 
P.S. Another problem: how can it be determined whether such a
user-defined character (UDC hereafter) is single-byte or double-byte?

The file big5-eten.ucm does not say how to determine the length in
bytes of an unmapped UDC.


As I understand it, the "parsing" rules for big5 involve stepping
through the character stream one byte at a time, and:

 - if the byte just taken is 7-bit ASCII (hi-bit clear), you have one
 complete character (*); otherwise:

 - when the byte just taken is in the range [\xA1-\xFE], you have the
 first half of a 16-bit big5 character, and you need to get the next
 byte as well; if that next byte is in the range [\x40-\x7E\xA1-\xFE],
 then you now have a complete big5 code point

 - an initial byte in the range [\x80-\xA0\xFF] is presumably some form
 of noise, and should be discarded; likewise, when expecting the second
 byte of a big5 character, a byte in the range [\x00-\x3F\x7F-\xA0\xFF]
 is also noise, and presumably both this byte and the one preceding it
 should be discarded. (**)
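
The rules above can be written as a small byte scanner.  What follows
is only a sketch of my reading of them, in Python rather than Perl to
keep it short; the function name split_big5 and the "ascii" / "big5" /
"noise" labels are made up for illustration and are not part of Encode.
Instead of silently discarding noise, it reports it, so a caller could
emit a warning.  One case the rules do not cover is a lead byte at the
very end of the input; treating it as one byte of noise is an
assumption.

```python
def split_big5(data: bytes):
    """Split bytes into (kind, chunk) tokens following the rules above.

    kind is "ascii", "big5" or "noise" (hypothetical labels).
    """
    tokens = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            # hi-bit clear: one complete single-byte character
            tokens.append(("ascii", data[i:i + 1]))
            i += 1
        elif 0xA1 <= lead <= 0xFE:
            if i + 1 >= n:
                # assumption: a lead byte with nothing after it is noise
                tokens.append(("noise", data[i:i + 1]))
                i += 1
                continue
            trail = data[i + 1]
            if 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE:
                # well-formed two-byte big5 code point
                tokens.append(("big5", data[i:i + 2]))
            else:
                # bad second byte: both bytes count as noise
                tokens.append(("noise", data[i:i + 2]))
            i += 2
        else:
            # initial byte in [\x80-\xA0\xFF]: a single byte of noise
            tokens.append(("noise", data[i:i + 1]))
            i += 1
    return tokens


if __name__ == "__main__":
    for kind, chunk in split_big5(b"A\xa4\xa4\x80\xa4\x20B"):
        print(kind, chunk.hex())
```

Under these rules, the length question is settled by the first byte
alone: anything in [\xA1-\xFE] starts a two-byte sequence whether or
not the pair has a Unicode mapping.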


footnotes:

(*) If reading a plain text file, you would of course expect (hope)
that the ASCII codes are limited to just white-space and [\x21-\x7E]
(and maybe \x07 "bell") -- i.e. no nulls, deletes, backspaces, EOT,
etc; still, if these occur, they should behave as ASCII for purposes of
parsing the characters.

(**) I'm really just guessing about what sort of action should be taken
when a stream violates the rules; discarding one or two bytes at a time
when they happen to be out of bounds should be the "safest" approach.

There is still the issue that those rules map out a very large range of
potential code points, many of which are not in fact used or defined in
Chinese.  Also, there must be some number of big5 code points that are
used/defined (at least by some big5 applications), but are not mapped
to Unicode.  How Perl "decode()" handles these cases may be a problem
where developers still have some work to do to fix things...
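
To make the distinction concrete, here is a rough sketch that sorts a
two-byte sequence into "ill-formed" (breaks the byte-range rules),
"unmapped" (well-formed, but the conversion table has no entry) or
"mapped".  It uses Python's built-in "big5" codec purely as a stand-in
for a mapping table; that table is not the same as Encode's
big5-eten.ucm, so results for individual code points may differ.  The
function name classify_pair and the three labels are hypothetical.

```python
def classify_pair(pair: bytes, codec: str = "big5") -> str:
    """Classify a two-byte sequence as ill-formed, unmapped or mapped."""
    if len(pair) != 2:
        raise ValueError("expected exactly two bytes")
    lead, trail = pair[0], pair[1]
    well_formed = (0xA1 <= lead <= 0xFE) and (
        0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE
    )
    if not well_formed:
        return "ill-formed"
    try:
        # strict decoding raises if the table has no entry for this pair
        pair.decode(codec)
    except UnicodeDecodeError:
        return "unmapped"
    return "mapped"


if __name__ == "__main__":
    for sample in (b"\xa4\xa4", b"\xa4\x20", b"\xfe\xfe"):
        print(sample.hex(), classify_pair(sample))
```

A converter that kept these two failure cases apart could warn
differently for each, which seems to be what the original question is
asking for.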