SADAHIRO Tomoyuki <bqw10602(_at_)nifty(_dot_)com> said:
P.S. Another problem. How can it be determined whether that
user-defined character (UDC hereafter) is single-byte or double-byte?
The file big5-eten.ucm does not say how to determine the character
length in bytes for an unmapped UDC.
As I understand it, the "parsing" rules for big5 involve stepping
through the character stream one byte at a time, and:
- if the byte just taken is 7-bit ASCII (hi-bit clear), you have one
complete character (*); otherwise:
- if the byte just taken is in the range [\xA1-\xFE], you have the
first half of a 16-bit big5 character, and you need to get the next
byte as well; if that next byte is in the range [\x40-\x7E\xA1-\xFE],
then you now have a complete big5 code point;
- an initial byte in the range [\x80-\xA0\xFF] is presumably some form
of noise, and should be discarded; likewise, when expecting the second
byte of a big5 character, a byte in the range [\x00-\x3F\x7F-\xA0\xFF]
is also noise, and presumably both this byte and the one preceding it
should be discarded. (**)
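
The rules above can be sketched in a few lines. This is only an illustration, written in Python rather than Perl, and the function names (is_trail, split_big5) are made up for this example. The handling of a lone lead byte at the very end of the stream is my own assumption, since the rules above do not cover that case.

```python
def is_trail(b: int) -> bool:
    """True if b is a legal second byte: [\\x40-\\x7E\\xA1-\\xFE]."""
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


def split_big5(data: bytes):
    """Split raw bytes into characters following the rules above.

    Returns (chars, discarded): two lists of bytes objects. Nothing is
    looked up in a mapping table; this only finds character boundaries.
    """
    chars, discarded = [], []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Hi-bit clear: one complete single-byte character.
            chars.append(data[i:i + 1])
            i += 1
        elif 0xA1 <= b <= 0xFE:
            if i + 1 >= len(data):
                # Assumption: a lead byte with nothing after it is noise.
                discarded.append(data[i:i + 1])
                i += 1
            elif is_trail(data[i + 1]):
                # Lead byte plus legal trail byte: one 16-bit code point.
                chars.append(data[i:i + 2])
                i += 2
            else:
                # Bad trail byte: drop it and the lead byte before it.
                discarded.append(data[i:i + 2])
                i += 2
        else:
            # [\x80-\xA0\xFF] as an initial byte: noise, drop one byte.
            discarded.append(data[i:i + 1])
            i += 1
    return chars, discarded
```

Note that under these rules the byte length of a character, mapped or not, depends only on its first byte, so an unmapped UDC whose first byte is in [\xA1-\xFE] would be treated as double-byte.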
footnotes:
(*) If reading a plain text file, you would of course expect (hope) that
the ASCII codes are limited to just white-space and [\x21-\x7E] (and
maybe \x07 "bell") -- i.e. no nulls, deletes, backspaces, EOT, etc;
still, if these occur, they should behave as ASCII for purposes of
parsing the characters.
(**) I'm really just guessing about what sort of action should be taken
when a stream violates the rules; discarding one or two bytes at a time
when they happen to be out of bounds seems to be the "safest" approach.
There is still the issue that those rules map out a very large range of
potential code points, many of which are not in fact used or defined in
Chinese. Also, there must be some number of big5 code points that are
used/defined (at least by some big5 applications), but are not mapped to
Unicode. How Perl's decode() handles these cases may be an area where
the developers still have some work to do.
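
To put a number on "very large": the rules allow 94 lead bytes and 157 trail bytes. The sketch below (Python again, with made-up names) enumerates those pairs and counts how many of them a given codec can actually decode. It uses Python's built-in "big5" codec purely as a stand-in; its table is not the same as big5-eten.ucm, so the mapped count it reports says nothing definite about Perl's Encode.

```python
LEAD = range(0xA1, 0xFF)                                     # [\xA1-\xFE]
TRAIL = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))    # [\x40-\x7E\xA1-\xFE]


def structural_pairs():
    """Every two-byte sequence that the parsing rules accept."""
    return [bytes((hi, lo)) for hi in LEAD for lo in TRAIL]


def count_mapped(codec: str = "big5") -> int:
    """Count the structurally valid pairs that `codec` maps to Unicode."""
    mapped = 0
    for pair in structural_pairs():
        try:
            pair.decode(codec)
        except UnicodeDecodeError:
            continue  # well-formed by the rules, but not in the table
        mapped += 1
    return mapped


# 94 lead bytes times 157 trail bytes gives 14758 potential code points.
total = len(structural_pairs())
```

The difference between total and count_mapped() is the set of pairs that parse cleanly but have no Unicode mapping in that particular table, which is where any UDCs would fall.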
Dave Graff