perl-unicode

Re: splitting devanagari characters

2000-04-04 07:47:26

Shigeki Moro <moro(_at_)ya(_dot_)sakura(_dot_)ne(_dot_)jp> writes:

Dear subscribers,

I wrote a report in Japanese on handling Devanagari (one of the Indic
scripts) characters in Perl 5.6.

http://www.ya.sakura.ne.jp/~moro/resources/indic_on_perl5.6/index.html

For example, with "use utf8", splitting the Devanagari word 'vij~naana'
under character semantics yields 'va + (i) + ja + (viraama) + ~na + (aa) +
na'.

It seems to me that Perl divides a combined character into the base
character and the combining character(s), and doesn't regard a combined
character as one character.

Yes. After all, in some cases, you do want to manipulate base and
combining chars separately, which would be impossible if they
were treated as a single character.

To split into (base char + combining chars) sequences,

 split /(?=\PM)/, $string;

should work.

[
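 \pM matches combining chars (any character with the Unicode Mark property)
 \PM matches non-combining (base) characters

 So this says: split the string at every position where a base
 character begins. The lookahead (?=...) consumes nothing, so no
 characters are lost.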
]
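
As a minimal sketch of the above, here is the split applied to
'vij~naana', spelled out as code points. The variable names are only
illustrative, and the string is built from \x{...} escapes so that the
example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;

# 'vij~naana' as code points: va, i, ja, viraama, ~na, aa, na
my $word = "\x{0935}\x{093F}\x{091C}\x{094D}\x{091E}\x{093E}\x{0928}";

# Split at every position where a non-mark (base) character begins.
my @clusters = split /(?=\PM)/, $word;

# Print each piece as its list of code points (ASCII-only output).
for my $cluster (@clusters) {
    print join(' ', map { sprintf 'U+%04X', ord } split //, $cluster), "\n";
}
```

This should print four pieces: va + i, ja + viraama, ~na + aa, and na.
Note that the viraama stays attached to the preceding ja, so the conjunct
j~na is not kept together as one unit by this rule.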

Regards,
                                        Owen

 
