perl-unicode

Re: Combining characters in front of base characters after normalization

2002-02-28 08:50:50

On Thu, 28 Feb 2002 14:29:44 +0000
Markus Kuhn <Markus(_dot_)Kuhn(_at_)cl(_dot_)cam(_dot_)ac(_dot_)uk> wrote:

I understand that Perl 5.8 will contain modules for normalizing Unicode
strings, for example into Normalization Form C.

If this is the case, here is a hopefully simple-to-implement and, I think,
very useful suggestion:

Would it be possible to add to the normalization function that turns
everything into combining characters a "reverse order" option that
causes the combining characters to precede the base character instead
of following it?

In Unicode, combining characters follow the base characters. In many
other environments, it is the other way round (TeX: \" + a -> ä). So if
I want to write a Unicode to TeX converter, I will first bring Unicode
to Normalization Form C, then I have to move the combining characters in
front of the base character in reverse order, and finally I can
substitute them with the appropriate TeX sequence.

It would certainly be far more convenient and efficient if the "move
the combining characters in front of the base character" step could
already be done in the normalization routine, because it has already
split up the string in memory appropriately.

I think regular expressions can help you here.
They are powerful and unrestricted.
(We may have a sequence of Hangul-Jamo-Initial + Hangul-Jamo-Medial +
Combining Character. In this case, it might be better to move
the combining character before the Hangul-Jamo-Initial,
not before the Hangul-Jamo-Medial.
But sorry, I don't know which is useful for TeX.)
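
To make the two Hangul choices concrete, here is a minimal sketch that
compares them. The sub names are only illustrative, and the jamo ranges
(U+1100 to U+115F initials, U+1160 to U+11A7 medials, U+11A8 to U+11FF
finals) are my assumption about what should count as one base:

```perl
use strict;
use warnings;

# Illustrative helper: move each run of combining marks in front of the
# single character that precedes it (here that is the medial jamo).
sub marks_before_char {
    my ($s) = @_;
    $s =~ s/(\PM)(\pM+)/$2$1/g;
    return $s;
}

# Illustrative helper: treat a whole conjoining jamo sequence
# (initials, medials, optional finals) as one base, so the marks
# end up in front of the initial jamo.
sub marks_before_syllable {
    my ($s) = @_;
    my $jamo = qr/[\x{1100}-\x{115F}]+[\x{1160}-\x{11A7}]+[\x{11A8}-\x{11FF}]*/;
    $s =~ s/($jamo|\PM)(\pM+)/$2$1/g;
    return $s;
}

# HANGUL CHOSEONG HIEUH + HANGUL JUNGSEONG A + COMBINING ACUTE ACCENT
my $in = "\x{1112}\x{1161}\x{0301}";

for my $out (marks_before_char($in), marks_before_syllable($in)) {
    print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $out), "\n";
}
```

The first line printed has the mark between the two jamo, the second has
it in front of the initial jamo.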

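#!perl
use strict;
use warnings;
use charnames qw(:full);
use Unicode::UCD qw(charinfo);

# base characters, each followed by its combining mark
$_ = "A\N{COMBINING GRAVE ACCENT} la fe\N{COMBINING CIRCUMFLEX ACCENT}te";

# swap each non-mark character (\PM) with the one mark (\pM) after it;
# this handles a single mark per base character only
s/(\PM)(\pM)/$2$1/g;

# print the name of every character in the result
print charinfo($_)->{name},"\n" for map ord, split //, $_;
__END__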

COMBINING GRAVE ACCENT
LATIN CAPITAL LETTER A
SPACE
LATIN SMALL LETTER L
LATIN SMALL LETTER A
SPACE
LATIN SMALL LETTER F
COMBINING CIRCUMFLEX ACCENT
LATIN SMALL LETTER E
LATIN SMALL LETTER T
LATIN SMALL LETTER E
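
The request was for the marks in reverse order, and a base character may
carry more than one mark, so a variant that captures the whole run of
marks and reverses it may be closer to what is wanted. The sketch below
also shows the last step, substituting TeX sequences, for a handful of
accents. The sub names and the small %tex table are assumptions made for
illustration; they are not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: move every run of combining marks in front of its
# base character, reversing the order of the marks within the run.
sub marks_first {
    my ($s) = @_;
    $s =~ s/(\PM)(\pM+)/scalar(reverse $2) . $1/ge;
    return $s;
}

# Illustrative table only: a few combining marks and their TeX accents.
my %tex = (
    "\x{0300}" => '\\`',   # COMBINING GRAVE ACCENT
    "\x{0301}" => "\\'",   # COMBINING ACUTE ACCENT
    "\x{0302}" => '\\^',   # COMBINING CIRCUMFLEX ACCENT
    "\x{0308}" => '\\"',   # COMBINING DIAERESIS
);

# Hypothetical helper: replace each known mark with its TeX accent;
# marks missing from the table are left unchanged.
sub marks_to_tex {
    my ($s) = @_;
    $s =~ s/(\pM)/exists $tex{$1} ? $tex{$1} : $1/ge;
    return $s;
}

# Same sample string as before, written with code points.
print marks_to_tex(marks_first("A\x{0300} la fe\x{0302}te")), "\n";
```

For the sample string this prints \`A la f\^ete. With more than one mark
on a base character the TeX output would probably need braces, which this
sketch does not attempt.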

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

Regards,
SADAHIRO Tomoyuki
