perl-unicode

Re: Warning messages for ill-formed data

2003-03-21 15:30:08

SADAHIRO Tomoyuki <bqw10602(_at_)nifty(_dot_)com> said:

P.S. Another problem. How can it be determined whether that
user-defined character (UDC hereafter) is single-byte or double-byte? 

The file big5-eten.ucm does not say how to determine the character
length in bytes for an unmapped UDC.

As I understand it, the "parsing" rules for big5 involve stepping 
through the character stream one byte at a time, and:

 - if the byte just taken is 7-bit ASCII (hi-bit clear), you have one 
 complete character (*); otherwise:

 - if the byte just taken is in the range [\xA1-\xFE], you have the 
 first half of a 16-bit big5 character, and you need to get the next 
 byte as well; if that next byte is in the range [\x40-\x7E\xA1-\xFE], 
 then you now have a complete big5 code point;

 - an initial byte in the range [\x80-\xA0\xFF] is presumably some form
 of noise, and should be discarded; likewise, when expecting the second
 byte of a big5 character, a byte in the range [\x00-\x3F\x7F-\xA0\xFF]
 is also noise, and presumably both this byte and the one preceding it 
 should be discarded. (**)

footnotes:

(*) If reading a plain text file, you would of course expect (hope) that
the ASCII codes are limited to just white-space and [\x21-\x7E] (and 
maybe \x07 "bell") -- i.e. no nulls, deletes, backspaces, EOT, etc; 
still, if these occur, they should behave as ASCII for purposes of 
parsing the characters.

(**) I'm really just guessing about what sort of action should be taken
when a stream violates the rules; discarding one or two bytes at a time
when they happen to be out of bounds should be the "safest" approach.
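
For illustration, the rules above can be sketched in a few lines of
code. This is only a sketch in Python, not what Perl's Encode does:
the names split_big5 and is_trail are made up for this example, and
the handling of a lead byte cut off at the very end of the input is
my own assumption, since the rules above do not cover it.

```python
def is_trail(b: int) -> bool:
    """True if b is a valid second byte: [\\x40-\\x7E\\xA1-\\xFE]."""
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


def split_big5(data: bytes):
    """Split raw bytes into (units, noise) using the rules sketched above.

    units: complete characters, each 1 byte (ASCII) or 2 bytes (big5)
    noise: byte runs discarded because they violate the rules
    """
    units, noise = [], []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # Hi-bit clear: one complete single-byte character.
            units.append(data[i:i + 1])
            i += 1
        elif 0xA1 <= b <= 0xFE:
            # First half of a two-byte character; look at the next byte.
            if i + 1 >= n:
                # Assumption: a lead byte truncated at end of input is noise.
                noise.append(data[i:i + 1])
                i += 1
            elif is_trail(data[i + 1]):
                units.append(data[i:i + 2])
                i += 2
            else:
                # Bad second byte: discard it and the byte preceding it.
                noise.append(data[i:i + 2])
                i += 2
        else:
            # Initial byte in [\x80-\xA0\xFF]: discard as noise.
            noise.append(data[i:i + 1])
            i += 1
    return units, noise
```

Note that under these rules the length of a character depends only on
its first byte, so an unmapped UDC whose first byte is in [\xA1-\xFE]
would be taken as double-byte whether or not the .ucm file maps it.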

There is still the issue that those rules map out a very large range of
potential code points, many of which are not in fact used or defined in
Chinese.  Also, there must be some number of big5 code points that are
used/defined (at least by some big5 applications), but are not mapped to
Unicode.  How Perl's decode() handles these cases may be an area where
the developers still have some work to do.

        Dave Graff