perl-unicode

Re: Inverse of /\p{script}/

2003-08-29 03:30:05
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:

But that is not good enough for the cases below, because...

 (Hiragana | Katakana | Han) => 'jisx0208.1990-0'

This is very wrong because jisx0208.1990-0 only contains the \p{Han}
characters that appear in Japanese (JIS X 0208, to be exact).  On the
other hand, jisx0208.1990-0 does contain the Greek and Cyrillic alphabets.
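
To illustrate why the script property alone is the wrong test, here is a
minimal sketch that asks Encode whether a character is actually encodable.
It assumes Encode's 'shiftjis' encoding as a stand-in for the JIS X 0208
repertoire (Shift_JIS also carries JIS X 0201, so it is a slight superset),
and the helper name encodable_in is made up for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if every character in $text can be encoded
# in $encoding. 'shiftjis' below is only an approximation of JIS X 0208.
sub encodable_in {
    my ($encoding, $text) = @_;
    my $copy = $text;    # encode() may modify its argument when CHECK is set
    return eval { Encode::encode($encoding, $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# A kanji, Greek capital Sigma, Cyrillic capital Zhe, and a Thai letter.
for my $cp (0x65E5, 0x03A3, 0x0416, 0x0E01) {
    printf "U+%04X %s\n", $cp,
        encodable_in('shiftjis', chr $cp) ? 'encodable' : 'not encodable';
}
```

The point is that membership in the charset is decided per character, not
per script: Greek and Cyrillic pass, while plenty of \p{Han} would not.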

But the Cyrillic glyphs are likely to be double width :-(
This is one of the reasons I want to do _something_ in this area.
I don't even want to try to read a big 16-bit Japanese font
just to get Cyrillic (for a spammer's name) or a Greek Sigma (for math).

The other thing that needs fixing is that Tk currently ignores
any locale information that might be available. So for "unified" ideographs
it will use any font that has the character, regardless of which "style"
it is in. So for Japanese text it is quite likely to find a Simplified
Chinese style font and use that for Han, then when it hits Katakana it
will find an 8-bit (JIS201?) font and use that, then when it finds
Hiragana it will find a JIS 208 font. The result looks a mess even
to my occidental eyes.

What I am hoping to do for Tk804 is put in some kind of callback-to-perl
hook, so that when Tk wants a font for a particular character it
can call perl and perl will give it a strong push in a particular direction.
Thus for someone expecting Japanese, if asked for a Han character
it will suggest a JIS font, while for someone expecting Chinese it
will suggest a Big5 or gb2312 font as appropriate.
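
A rough sketch of what the perl side of such a hook might look like. This
is not an existing Tk API: the function name suggest_charset, its
arguments, and the preference table are all hypothetical, and the charset
names are just X11-style examples.

```perl
use strict;
use warnings;

# Hypothetical preference table: charset to suggest for Han, per locale.
my %HAN_CHARSET = (
    ja    => 'jisx0208.1990-0',
    zh_CN => 'gb2312.1980-0',
    zh_TW => 'big5-0',
);

# Hypothetical hook: given one character and a locale name such as
# "ja_JP.eucJP", return a suggested charset, or nothing for "no opinion".
sub suggest_charset {
    my ($char, $locale) = @_;
    $locale = '' unless defined $locale;
    my ($lang, $terr) = $locale =~ /^([a-z]{2})(?:_([A-Z]{2}))?/;
    $lang = '' unless defined $lang;

    # Prefer a language_TERRITORY entry (zh_TW) over a bare language (ja).
    my $key = defined $terr && exists $HAN_CHARSET{"${lang}_$terr"}
            ? "${lang}_$terr"
            : $lang;

    if ($char =~ /\p{Han}/) {
        return $HAN_CHARSET{$key};
    }
    if ($char =~ /\p{Hiragana}|\p{Katakana}/) {
        return $lang eq 'ja' ? $HAN_CHARSET{ja} : undef;
    }
    # Steer alphabetic scripts to 8-bit fonts rather than the wide CJK ones.
    return 'iso8859-7' if $char =~ /\p{Greek}/;
    return 'iso8859-5' if $char =~ /\p{Cyrillic}/;
    return;
}

print suggest_charset("\x{65e5}", 'ja_JP.eucJP'), "\n";
print suggest_charset("\x{65e5}", 'zh_TW.Big5'),  "\n";
```

The same Han character gets a different suggestion depending on the
locale, which is exactly the information the script property cannot give.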

What gets really painful is the Unicode fonts: one has to look at
which characters a font has to decide if it is
Japanese/Simplified Chinese/Traditional Chinese/Korean or just a grab-bag
of glyphs the font designer had to hand.
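
One possible heuristic for that guess, sketched under the assumption that
the list of characters a font covers has already been obtained somehow
(getting that list out of a font is not shown). It scores the font's Han
repertoire against a few legacy encodings known to Encode; the function
names coverage and rank_encodings are invented for this example.

```perl
use strict;
use warnings;
use Encode ();

# Fraction of the given characters that $encoding can represent.
sub coverage {
    my ($encoding, @chars) = @_;
    return 0 unless @chars;
    my $hits = 0;
    for my $ch (@chars) {
        my $copy = $ch;    # encode() may modify its argument when CHECK is set
        $hits++ if eval { Encode::encode($encoding, $copy, Encode::FB_CROAK()); 1 };
    }
    return $hits / @chars;
}

# Rank candidate encodings by how much of the font's Han repertoire
# each one covers. Returns a list of [encoding, score], best first.
sub rank_encodings {
    my (@chars) = @_;
    my @han   = grep { /\p{Han}/ } @chars;
    my %score = map { ($_ => coverage($_, @han)) }
                qw(shiftjis euc-cn big5-eten euc-kr);
    return map  { [ $_, $score{$_} ] }
           sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
}
```

A font that scores near 1.0 for exactly one encoding is probably meant for
that language; one that scores middling for all of them is likely the
grab-bag case.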


One of so many reasons why Han Unification was a bad idea.  When it 
comes to Han Ideographs, Unicode's sense of charscript is almost 
useless.

\x{5c0f}\x{98fc} \x{5f3e}
