perl-unicode

Re: Warning messages for ill-formed data

2003-03-21 18:30:04

SADAHIRO Tomoyuki <bqw10602(_at_)nifty(_dot_)com> said:

P.S. Another problem. How can it be determined whether that
user-defined character (UDC hereafter) is single-byte or double-byte? 

The file big5-eten.ucm does not say how to determine the character
length in bytes for an unmapped UDC.

As I understand it, the "parsing" rules for big5 involve stepping 
through the character stream one byte at a time, and:

 - if the byte just taken is 7-bit ASCII (hi-bit clear), you have one 
 complete character (*); otherwise:

 - when the byte just taken is in the range [\xA1-\xFE], you have the 
 first half of a 16-bit big5 character, and you need to get the next 
 byte as well; if that next byte is in the range [\x40-\x7E\xA1-\xFE], 
 then you now have a complete big5 code point

 - an initial byte in the range [\x80-\xA0\xFF] is presumably some form
 of noise, and should be discarded; likewise, when expecting the second
 byte of a big5 character, a byte in the range [\x00-\x3F\x7F-\xA0\xFF]
 is also noise, and presumably both this byte and the one preceding it 
 should be discarded. (**)
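
The rules quoted above can be sketched in Perl roughly as follows. This
only illustrates the quoted description; it is not how Encode parses big5.
The name big5_tokens and the 'ascii'/'big5'/'noise' labels are made up for
this sketch, and dropping both bytes of a bad pair simply follows the
"presumably" in the last rule.

```perl
use strict;
use warnings;

# Split a byte string into [kind, bytes] pairs following the quoted rules.
# kind is 'ascii', 'big5' or 'noise' (labels are hypothetical).
sub big5_tokens {
    my ($bytes) = @_;
    my @tokens;
    while ($bytes =~ /\G(?: ([\x00-\x7F])                       # 7-bit ASCII
                          | ([\xA1-\xFE][\x40-\x7E\xA1-\xFE])   # complete big5 pair
                          | ([\xA1-\xFE].?)                     # lead byte, bad or missing trail
                          | (.)                                 # stray byte \x80-\xA0 or \xFF
                        )/gsx) {
        if    (defined $1) { push @tokens, [ascii => $1] }
        elsif (defined $2) { push @tokens, [big5  => $2] }
        else               { push @tokens, [noise => defined $3 ? $3 : $4] }
    }
    return @tokens;
}

# Print each token with its bytes in hex.
for my $t (big5_tokens("A\xA4\x40\x80\xA4\x20B")) {
    printf "%-5s %s\n", $t->[0], unpack('H*', $t->[1]);
}
```

The alternatives are tried in order, so the third one is reached only when
the byte after a lead byte is not a valid trail byte.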

Right, but such noise may be due to confusion
with CP-950 or BIG-5 HKSCS (or others?).
They map some characters in the area of leading bytes \x81-\xA0.
We can use decode 'cp950' or decode 'big5-hkscs' for those, though.
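
As a rough sketch of that idea, one can try each variant in turn and see
which ones accept the bytes. The helper name decodable_as is made up here;
the encoding names are the ones Encode::TW registers, and Encode::FB_CROAK
makes decode() die on data the variant cannot map.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Encoding names as registered by Encode::TW (shipped with Encode).
my @VARIANTS = qw(big5-eten cp950 big5-hkscs);

# Return the names of the variants that decode $bytes without error.
# (decodable_as is a hypothetical helper, not part of Encode.)
sub decodable_as {
    my ($bytes) = @_;
    my @ok;
    for my $name (@VARIANTS) {
        my $copy = $bytes;   # decode() with a CHECK value may modify its argument
        my $text = eval { decode($name, $copy, Encode::FB_CROAK()) };
        push @ok, $name if defined $text;
    }
    return @ok;
}

print join(', ', decodable_as("\xA4\x40")), "\n";
```

Bytes that only some variants accept (for example a leading byte in
\x81-\xA0) would then show up as a shorter list.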

Well, the problem is possibly that "big-5" has many, many variants.
  (cf. http://i18n.linux.org.tw/openi18n/big5/index_en.html )

footnotes:
(snip)

There is still the issue that those rules map out a very large range of
potential code points, many of which are not in fact used or defined in
Chinese.  Also, there must be some number of big5 code points that are
used/defined (at least by some big5 applications), but are not mapped to
Unicode.  How Perl "decode()" handles these cases may be a problem where
developers still have some work to do to fix things...

      Dave Graff

For example, Microsoft defines a mapping
of extended UDC (EUDC) to the Private Use Area (PUA) in Unicode.
This mapping can be computed algorithmically, as follows.

sub eudc2pua { # E000..F848
    my $cp = shift;

    if ($cp =~ /^([\x81-\x8D])([\x40-\x7E\xA1-\xFE])/) { # EEB8..F6B0
        my $le = ord($1);
        my $tr = ord($2);
        return 0xeeb8 +
            ($le - 0x81) * 0x9D + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
    }
    if ($cp =~ /^([\x8E-\xA0])([\x40-\x7E\xA1-\xFE])/) { # E311..EEB7
        my $le = ord($1);
        my $tr = ord($2);
        return 0xe311 +
            ($le - 0x8e) * 0x9D + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
    }
    if ($cp =~ /^\xC6([\xA1-\xFE])/) { # F6B1..F70E
        my $tr = ord($1);
        return 0xf6b1 + $tr - 0xA1;
    }
    if ($cp =~ /^([\xC7\xC8])([\x40-\x7E\xA1-\xFE])/) { # F70F..F848
        my $le = ord($1);
        my $tr = ord($2);
        return 0xf70f +
            ($le - 0xc7) * 0x9D + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
    }
    if ($cp =~ /^([\xFA-\xFE])([\x40-\x7E\xA1-\xFE])/) { # E000..E310
        my $le = ord($1);
        my $tr = ord($2);
        return 0xe000 +
            ($le - 0xfa) * 0x9d + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
    }
    return;
}


sub pua2eudc { # inverse of eudc2pua: E000..F848 to two EUDC bytes
    my $uv = shift;
    if (0xe000 <= $uv && $uv <= 0xe310) {
        $uv -= 0xe000;
        my $tr = $uv % 0x9D + 0x40;
        return pack 'CC', int($uv/0x9D) + 0xFA,
            $tr + ($tr > 0x7E ? 0x22 : 0);
    }
    if (0xe311 <= $uv && $uv <= 0xeeb7) {
        $uv -= 0xe311;
        my $tr = $uv % 0x9D + 0x40;
        return pack 'CC', int($uv/0x9D) + 0x8E,
            $tr + ($tr > 0x7E ? 0x22 : 0);
    }
    if (0xeeb8 <= $uv && $uv <= 0xf6b0) {
        $uv -= 0xeeb8;
        my $tr = $uv % 0x9D + 0x40;
        return pack 'CC', int($uv/0x9D) + 0x81,
            $tr + ($tr > 0x7E ? 0x22 : 0);
    }
    if (0xf6b1 <= $uv && $uv <= 0xf70e) {
        $uv -= 0xf6b1;
        return pack 'CC', 0xC6, $uv + 0xA1;
    }
    if (0xf70f <= $uv && $uv <= 0xf848) {
        $uv -= 0xf70f;
        my $tr = $uv % 0x9D + 0x40;
        return pack 'CC', int($uv/0x9D) + 0xC7,
            $tr + ($tr > 0x7E ? 0x22 : 0);
    }
    return;
}
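
The range comments in eudc2pua can be cross-checked with a little
arithmetic. The sketch below is only that check, not a replacement for the
two subs; the table @EUDC_BLOCKS and the name eudc_block_ranges are made up
for the illustration, with the values copied from the code above.

```perl
use strict;
use warnings;

# Trail bytes \x40-\x7E (63 values) plus \xA1-\xFE (94 values) give
# 157 (0x9D) positions per lead byte; the \xC6 block uses only the upper 94.
# Each row: [first lead, last lead, positions per lead, first PUA code point].
my @EUDC_BLOCKS = (
    [0xFA, 0xFE, 157, 0xE000],
    [0x8E, 0xA0, 157, 0xE311],
    [0x81, 0x8D, 157, 0xEEB8],
    [0xC6, 0xC6,  94, 0xF6B1],
    [0xC7, 0xC8, 157, 0xF70F],
);

# Return "FIRST..LAST" (hex) for every block, in table order.
sub eudc_block_ranges {
    return map {
        my ($lo, $hi, $per_lead, $base) = @$_;
        my $count = ($hi - $lo + 1) * $per_lead;
        sprintf '%04X..%04X', $base, $base + $count - 1;
    } @EUDC_BLOCKS;
}

print "$_\n" for eudc_block_ranges();
```

Each block ends exactly one code point before the next one begins, so the
five blocks tile E000..F848 without gaps or overlaps.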

P.S. This EUDC mapping *was* available from Microsoft typography
 ( http://www.microsoft.com/typography/default.asp ),
but that file has been deleted.  I don't know the reason, but
I guess it may be because the mapping was an older version
than the one now distributed under www.unicode.org/Public/MAPPINGS.

However, the fact that the leading byte range
for CP-950 is \x81-\xfe is shown in
  http://www.microsoft.com/globaldev/reference/dbcs/950.htm
 (additional leadbytes are identified by a darker gray background)
and in
  http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT

SADAHIRO Tomoyuki