perl-unicode

InLanguage properties? [Was Re: Encode-InCharset-0.01 Released]

2002-05-03 01:52:46
On Friday, May 3, 2002, at 04:33 , Roman Vasicek wrote:
On Friday, May 3, 2002, at 02:41 , Dan Kogai wrote:

I have just released Encode-InCharset-0.01, available as

 http://www.dan.co.jp/~dankogai/Encode-InCharset-0.01.tar.gz and CPAN.

I have developed this module primarily to implement ISO-2022-JP-3 and ISO-2022-CN in the future. To implement encode() for these, you have to know which character set a given character belongs to. But this module can also be used to check whether a string can safely be encoded
(though fallback is much faster).
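
For comparison, the fallback route mentioned above might look like the following minimal sketch. It assumes only the core Encode module; the helper name can_encode is made up for this illustration and is not part of Encode or Encode-InCharset.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if every character of $string maps into $encoding.
sub can_encode {
    my ($encoding, $string) = @_;
    # FB_CROAK makes encode() die on the first unmappable character;
    # LEAVE_SRC keeps encode() from modifying $string in place.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { Encode::encode($encoding, $string, $check); 1 } ? 1 : 0;
}

print can_encode('iso-8859-1', "na\x{EF}ve"), "\n";   # Latin-1 text
print can_encode('iso-8859-1', "\x{3042}"),   "\n";   # HIRAGANA LETTER A
```

This only answers "does the whole string fit?"; it does not say which character set an individual character came from, which is the part the module is meant to supply.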

Great! Good work.

I have one, maybe off-topic, question. Is there a module which provides the
same functionality for languages? I mean something like IsGerman, IsCzech,
etc.

Be our guest ;) To my knowledge there is none, but it won't be too hard to implement for Roman script languages. You just start with the ISO-8859 variants and subtract the ones you don't need.
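
One reading of "start and subtract" can be sketched as follows. This is an illustration only: it starts from the printable ISO-8859-1 repertoire, removes the accented letters German does not use, and builds a character class from what is left. The names $IsGerman and looks_german, and the choice of which letters to keep, are assumptions made for this example; the result tests a character repertoire and cannot tell German from any other language written with the same letters.

```perl
use strict;
use warnings;

# Start from the printable ISO-8859-1 repertoire.
my %keep = map { $_ => 1 } 0x20 .. 0x7E, 0xA0 .. 0xFF;

# Accented letters assumed to be needed for German: A/O/U umlauts and sharp s.
my %german = map { $_ => 1 } 0xC4, 0xD6, 0xDC, 0xDF, 0xE4, 0xF6, 0xFC;

# Subtract every other accented letter in U+00C0..U+00FF.
for my $cp (0xC0 .. 0xFF) {
    next if $cp == 0xD7 || $cp == 0xF7;    # multiplication and division signs
    delete $keep{$cp} unless $german{$cp};
}

# Turn the surviving code points into a bracketed character class.
my $class    = join '', map { sprintf '\\x{%04X}', $_ } sort { $a <=> $b } keys %keep;
my $IsGerman = qr/[$class]/;

# True if the whole string stays inside the reduced repertoire.
sub looks_german {
    my ($string) = @_;
    return $string =~ /\A$IsGerman*\z/ ? 1 : 0;
}

print looks_german("Gr\x{FC}\x{DF}e"), "\n";   # u-umlaut and sharp s are kept
print looks_german("caf\x{E9}"),       "\n";   # e-acute was subtracted
```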

I consider this to be one of the problems of Unicode (as of now). When you aggregate anything, the origin of each part is usually lost, just as you can't recover 1+1 from 2 (it could have been 0+2 or -1+3 or anything). To overcome this shortcoming Unicode does have character properties, and you can use them to find out which I<script> a character belongs to. Unfortunately there is no such property for the character repertoire a character originally came from (so I made one, Encode-InCharset, because I needed it). Nor is there one for languages. Maybe Encode-InCharset-0.01 can help implement InLanguage, especially for complex CJK cases. Here is a crude (and possibly incorrect) definition of InNihongo:

$InNihongo = qr/(?=                     # must be in one of these charsets
                                \p{InJISX0213_1} |
                                \p{InJISX0213_2} |
                                \p{InASCII}
                                )
                           (?:                  # and in one of these scripts
                                \p{Hiragana} |
                                \p{Katakana} |
                                \p{Han} |
                                \p{InBasicLatin}  # contemporary!
                                )/x;

Notice that the alternation is preceded by a lookahead for InJISX0213_1 and InJISX0213_2. Otherwise all Han ideographs that are not used in Japanese would also be considered Nihongo.
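
The same lookahead trick can be tried without Encode-InCharset installed. The sketch below is only an analogy using two built-in properties: the lookahead requires the Latin script, and the match itself requires an uppercase letter, so only characters in both sets match. The names $latin_upper and is_latin_upper are invented for this example.

```perl
use strict;
use warnings;

# A lookahead followed by a one-character match accepts only characters
# that satisfy both properties, i.e. the intersection of the two sets.
my $latin_upper = qr/(?=\p{Latin})\p{Lu}/;

sub is_latin_upper {
    my ($char) = @_;
    return $char =~ /\A$latin_upper\z/ ? 1 : 0;
}

# "A" is Latin and uppercase; "a" is Latin but lowercase;
# U+042F is uppercase but Cyrillic.
for my $char ("A", "a", "\x{42F}") {
    printf "U+%04X %s\n", ord($char),
        is_latin_upper($char) ? "match" : "no match";
}
```

Dropping the lookahead would let U+042F through, in the same way that dropping the charset lookahead from InNihongo would let every Han ideograph through.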


Dan the Encode Maintainer
