perl-unicode

Re: Is \p{EastAsianFullwidth} worth implementing?

2002-09-18 20:30:08
On Thursday, Sep 19, 2002, at 11:39 Asia/Tokyo, Autrijus Tang wrote:
Hi there. Recently I needed to do some hacking based on the EastAsianWidth
property (cf. http://www.unicode.org/unicode/reports/tr11/) of Unicode
characters. Naturally, I tried the regular expression \p{} and \P{} syntax,
to no avail.

Speaking of EastAsianWidth, I needed that property when I wrote unidump (under Encode/bin, not installed by default). The relevant part looks as follows:

    # Generated from lib/unicore/EastAsianWidth.txt.
    # The ranges are joined into one string before compiling, because /x
    # does not ignore whitespace inside a bracketed character class.
    #
    our $IsFullWidth = do {
        my $ranges = join '', qw(
             \x{1100}-\x{1159}
             \x{115F}-\x{115F}
             \x{2329}-\x{232A}
             \x{2E80}-\x{2E99}
             \x{2E9B}-\x{2EF3}
             \x{2F00}-\x{2FD5}
             \x{2FF0}-\x{2FFB}
             \x{3000}-\x{303E}
             \x{3041}-\x{3096}
             \x{3099}-\x{30FF}
             \x{3105}-\x{312C}
             \x{3131}-\x{318E}
             \x{3190}-\x{31B7}
             \x{31F0}-\x{321C}
             \x{3220}-\x{3243}
             \x{3251}-\x{327B}
             \x{327F}-\x{32CB}
             \x{32D0}-\x{32FE}
             \x{3300}-\x{3376}
             \x{337B}-\x{33DD}
             \x{3400}-\x{4DB5}
             \x{4E00}-\x{9FA5}
             \x{33E0}-\x{33FE}
             \x{A000}-\x{A48C}
             \x{AC00}-\x{D7A3}
             \x{A490}-\x{A4C6}
             \x{F900}-\x{FA2D}
             \x{FA30}-\x{FA6A}
             \x{FE30}-\x{FE46}
             \x{FE49}-\x{FE52}
             \x{FE54}-\x{FE66}
             \x{FE68}-\x{FE6B}
             \x{FF01}-\x{FF60}
             \x{FFE0}-\x{FFE6}
             \x{20000}-\x{2A6D6}
        );
        qr/^[$ranges]$/;    # matches exactly one wide or fullwidth character
    };

Naturally, I can hack up a local patch to unicore/{Canonical,Exact}.pl
and parse the yet-unused unicore/EastAsianWidth.txt to add the desired
properties in, namely (better names welcome):

        \p{En}          \p{EastAsianNeutral}
        \p{Ea}          \p{EastAsianAmbiguous}
        \p{Eh}          \p{EastAsianHalfwidth}
        \p{Ew}          \p{EastAsianWide}
        \p{Ef}          \p{EastAsianFullwidth}
        \p{Ena}         \p{EastAsianNarrow}
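
As a rough illustration of the parsing step, here is a minimal sketch that groups code point ranges by width class. The sample lines only imitate the "code;class" and "first..last;class" layout of EastAsianWidth.txt and are not copied from the real file; the helper name ranges_for is hypothetical.

```perl
use strict;
use warnings;

# Illustrative sample only; a real run would read unicore/EastAsianWidth.txt.
my @sample = (
    '0020;Na # SPACE',
    '1100..1159;W # Hangul Jamo',
    '3000;F # IDEOGRAPHIC SPACE',
    'FF01..FF60;F # fullwidth forms',
    'FF61..FFBE;H # halfwidth forms',
);

my %ranges;    # width class => list of [first, last] code point pairs
for my $line (@sample) {
    $line =~ s/\s*#.*//;    # drop the trailing comment
    next unless $line =~ /^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)/;
    my ($first, $last, $class) = (hex($1), hex(defined $2 ? $2 : $1), $3);
    push @{ $ranges{$class} }, [ $first, $last ];
}

# Hypothetical helper: one "HEX<tab>HEX" line per range of the given class.
sub ranges_for {
    my ($class) = @_;
    return join '', map { sprintf "%04X\t%04X\n", @$_ } @{ $ranges{$class} || [] };
}

print ranges_for('F');
```

The tab-separated output is chosen because it is the shape Perl expects from a user-defined character property sub, so each class could be turned into a property without touching the parser.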

But as it overrides the core modules' behaviour, I hesitate to release it
as a CPAN module (Unicode::EastAsianWidth) and would rather suggest that it
be included in core perl.

Are there any hidden drawbacks or other problems with this idea?

Ideally, full/half width was never supposed to be part of character encoding, but we all know we need it in practice, especially when you need to render those characters neatly in fixed-width fonts (that's why I came up with the quick and dirty hack above; it's a Unicode-savvy hexdump). So I second the idea of adding East Asian Width properties SOMEHOW.
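
To make the fixed-width rendering point concrete, here is a minimal sketch of counting terminal columns. It is not taken from unidump: the helper name display_columns is hypothetical, only a handful of wide ranges are listed for brevity, and it assumes every wide character takes two columns while everything else (including combining marks) takes one.

```perl
use strict;
use warnings;

# Assumption: a short subset of wide/fullwidth ranges, for illustration only.
my $wide_class = join '', qw(
    \x{1100}-\x{1159}
    \x{3000}-\x{303E}
    \x{3041}-\x{3096}
    \x{4E00}-\x{9FA5}
    \x{AC00}-\x{D7A3}
    \x{FF01}-\x{FF60}
);
my $wide = qr/[$wide_class]/;

# Hypothetical helper: columns occupied by $text in a fixed-width font,
# counting 2 for each wide character and 1 for every other character.
sub display_columns {
    my ($text) = @_;
    my $wide_count = () = $text =~ /$wide/g;
    return length($text) + $wide_count;
}

print display_columns("abc"), "\n";                       # 3
print display_columns("\x{65E5}\x{672C}\x{8A9E}"), "\n";  # 6
```

A hexdump-style tool could use such a count to pad each cell to a fixed number of columns.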

I said somehow because I am not so sure that it requires tweaking the core. I think we can reach the goal in the same manner as my humble Encode::InCharset, a module I declined to add to Encode.
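
One way this could work without touching the core, assuming a perl that supports user-defined character properties (see perlunicode), is a plain sub whose name starts with "In" or "Is" and which returns the ranges as text. The sketch below is only an illustration of that mechanism: the name InFullwidthSketch is hypothetical, and it covers only the Fullwidth (F) class.

```perl
use strict;
use warnings;

# Hypothetical property: perl resolves \p{InFullwidthSketch} by calling a sub
# of that name in the current package. Each returned line is a range given
# as two hexadecimal code points separated by a tab.
sub InFullwidthSketch {
    return join '',
        "3000\t3000\n",    # IDEOGRAPHIC SPACE
        "FF01\tFF60\n",    # fullwidth ASCII variants and brackets
        "FFE0\tFFE6\n";    # fullwidth currency and other signs
}

print "U+FF21 is fullwidth\n"     if "\x{FF21}" =~ /^\p{InFullwidthSketch}$/;
print "U+0041 is not fullwidth\n" if "A"        =~ /^\P{InFullwidthSketch}$/;
```

A module could export one such sub per width class, so that callers get \p{...} and \P{...} support without any patch to unicore.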

Dan the Man with Too Many Character Properties to Remember, Too Few to Feel Practical