InLanguage properties? [Was Re: Encode-InCharset-0.01 Released]

On Friday, May 3, 2002, at 04:33 , Roman Vasicek wrote:

On Friday, May 3, 2002, at 02:41 , Dan Kogai wrote:

I have just released Encode-InCharset-0.01, available as

 http://www.dan.co.jp/~dankogai/Encode-InCharset-0.01.tar.gz and CPAN.
I have developed this module primarily to implement ISO-2022-JP-3 and
ISO-2022-CN in the future. To implement encode() in these, you have to
know which character set a given character belongs to. But this module
can also be used to check whether a string can safely be encoded
(though fallback is much faster).
Great! Good work.

I have one, maybe off-topic, question. Is there a module which provides
the same functionality for languages? I mean something like IsGerman,
IsCzech, etc.

Be our guest ;) To my knowledge there is none, but it won't be too
hard to implement -- for Roman script languages. You just start with
ISO-8859 variants and subtract the ones you don't need.
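
The "start with an ISO-8859 variant and subtract" idea could be sketched roughly as below. This is only an illustration under stated assumptions: the choice of ISO-8859-2 as the starting point, the names $IsCzech and is_czech, and the list of letters to subtract are all hypothetical, and the subtraction list reflects one reading of the Czech alphabet rather than an authoritative definition. Only Encode's decode() and Perl's built-in \p{Latin} property are relied upon.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Step 1: start with every Latin-script letter that ISO-8859-2 can represent.
my %letters;
for my $byte (0x00 .. 0xFF) {
    my $char = decode('iso-8859-2', chr $byte);
    $letters{$char} = 1 if $char =~ /\p{Latin}/;
}

# Step 2: subtract the letters assumed NOT to be needed for Czech
# (letters used by Polish, Slovak, Hungarian, Romanian, German, ...).
# This list is an assumption made for illustration.
my @not_czech = map { chr hex $_ } qw(
    00C2 00C4 00C7 00CB 00CE 00D4 00D6 00DC 00DF
    00E2 00E4 00E7 00EB 00EE 00F4 00F6 00FC
    0102 0103 0104 0105 0106 0107 0110 0111 0118 0119
    0139 013A 013D 013E 0141 0142 0143 0144
    0150 0151 0154 0155 015A 015B 015E 015F
    0162 0163 0170 0171 0179 017A 017B 017C
);
delete @letters{@not_czech};

# Step 3: turn what is left into a character class.
my $class   = join '', map { sprintf '\x{%04X}', ord $_ } sort keys %letters;
my $IsCzech = qr/\A[$class]+\z/;

# Hypothetical helper: true if the word uses only the remaining letters.
sub is_czech { return $_[0] =~ $IsCzech ? 1 : 0 }

print is_czech("P\x{159}\x{ED}li\x{161}") ? "yes\n" : "no\n";   # a Czech word
print is_czech("Stra\x{DF}e")             ? "yes\n" : "no\n";   # a German word
```

Note that this only tests the repertoire of a single word (no spaces or punctuation), so a word written purely in ASCII passes for any language defined this way.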

I consider this to be one of the problems of Unicode (as of now). When
you aggregate anything, the source of origin is usually lost. It is
just the same as how you can't retrieve 1+1 back from 2 (it could be
0+2 or -1+3 or anything).

To overcome this shortcoming, Unicode does have character properties,
and you can find out which I<script> a character belongs to using them.
But unfortunately that is not the case for the origins of the character
repertoire (so I made one (Encode-InCharset) because I needed it).
Neither is it the case for languages.

Maybe Encode-InCharset-0.01 can help implement InLanguage, especially
for complex CJK cases. Here is a crude (and possibly incorrect)
definition of InNihongo:


$InNihongo = qr/(?=                      # must be in the Japanese repertoire
                    \p{InJISX0213_1} |
                    \p{InJISX0213_2} |
                    \p{InASCII}
                )
                (?:                      # and belong to one of these scripts
                    \p{Hiragana} |
                    \p{Katakana} |
                    \p{Han} |
                    \p{InBasicLatin}     # contemporary!
                )/x;

Notice that the pattern starts with a lookahead for InJISX0213_1 and
InJISX0213_2. Otherwise all Han ideographs that are not used in
Japanese would also be considered Nihongo.
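
For trying out the same two-step check without Encode-InCharset installed, a rough approximation might look like the sketch below. It assumes that the "fallback" mentioned in the announcement means calling encode() with a CHECK argument, and it swaps the JIS X 0213 properties for a strict encode to Encode's 'shiftjis' (to my understanding based on JIS X 0201 and JIS X 0208, so a smaller repertoire than JIS X 0213). The helper name looks_nihongo is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not part of Encode or Encode-InCharset.
sub looks_nihongo {
    my ($str) = @_;

    # Script check, as in the second half of the InNihongo pattern.
    return 0
        unless $str =~ /\A(?:\p{Hiragana}|\p{Katakana}|\p{Han}|\p{InBasicLatin})+\z/;

    # Repertoire check (stand-in for the InJISX0213 lookahead, an assumption):
    # with FB_CROAK, encode() dies on the first character it cannot map.
    # Work on a copy, because encode() with a CHECK value may modify its input.
    my $copy = $str;
    return eval { encode('shiftjis', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# U+65E5 U+672C U+8A9E: the word "nihongo" in kanji
print looks_nihongo("\x{65E5}\x{672C}\x{8A9E}") ? "yes\n" : "no\n";

# U+8BF4: a simplified Chinese form, Han by script but outside JIS X 0208
print looks_nihongo("\x{8BF4}") ? "yes\n" : "no\n";
```

The second call is the case described above: \p{Han} alone accepts the character, and only the repertoire check rejects it.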



Dan the Encode Maintainer