Re: Combining characters in front of base characters after normalization


On Thu, 28 Feb 2002 14:29:44 +0000
Markus Kuhn <Markus(_dot_)Kuhn(_at_)cl(_dot_)cam(_dot_)ac(_dot_)uk> wrote:

I understand that Perl 5.8 will contain modules for normalizing Unicode
strings, for example into Normalization Form C.

If this is the case, here is a hopefully simple-to-implement and, I think,
very useful suggestion:

Would it be possible to add to the normalization function that turns
everything into combining characters a "reverse order" option that
causes the combining characters to precede the base character instead
of following it?

In Unicode, combining characters follow the base characters. In many
other environments, it is the other way round (TeX: \" + a -> ä). So if
I want to write a Unicode to TeX converter, I will first bring Unicode
to Normalization Form C, then I have to move the combining characters in
front of the base character in reverse order, and finally I can
substitute them with the appropriate TeX sequence.

It would certainly be far more convenient and efficient, if the "move
the combining characters in front of the base character" step could
already be done in the normalization routine, because it has already
split up the string in memory appropriately.


I think regular expressions should help you here.
They are powerful and flexible enough for this.
(We may have a sequence of Hangul-Jamo-Initial + Hangul-Jamo-Medial +
Combining Character. In this case, it might be better to move
the combining character before Hangul-Jamo-Initial,
not before Hangul-Jamo-Medial.
But sorry, I don't know which is useful for TeX.)

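#!perl
use strict;
use warnings;
use charnames qw(:full);
use Unicode::UCD qw(charinfo);

$_ = "A\N{COMBINING GRAVE ACCENT} la fe\N{COMBINING CIRCUMFLEX ACCENT}te";

# Swap each non-mark character (\PM) with the single mark (\pM) after it.
s/(\PM)(\pM)/$2$1/g;

# Print the Unicode name of every character in the result.
print charinfo($_)->{name},"\n" for map ord, split //, $_;
__END__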

COMBINING GRAVE ACCENT
LATIN CAPITAL LETTER A
SPACE
LATIN SMALL LETTER L
LATIN SMALL LETTER A
SPACE
LATIN SMALL LETTER F
COMBINING CIRCUMFLEX ACCENT
LATIN SMALL LETTER E
LATIN SMALL LETTER T
LATIN SMALL LETTER E
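
The substitution above swaps one mark with its base character. If a base
character carries several marks and they should come out in reverse order,
as the original request describes, something like the following might do.
It is only a sketch: marks_first and $jamo are names made up for this
example, not part of any module, and treating a whole conjoining Hangul
syllable as the base is just one of the two choices mentioned above.

```perl
#!perl
use strict;
use warnings;
use charnames qw(:full);

# Assumption: a conjoining Hangul syllable is initial(s) + medial(s) +
# optional final(s), and any marks should move in front of the whole syllable.
my $jamo = qr/[\x{1100}-\x{115F}]+[\x{1160}-\x{11A7}]+[\x{11A8}-\x{11FF}]*/;

# Hypothetical helper: move each run of combining marks in front of its
# base (a jamo syllable or a single non-mark character), in reverse order.
sub marks_first {
    my ($str) = @_;
    $str =~ s/($jamo|\PM)(\pM+)/scalar(reverse $2) . $1/ge;
    return $str;
}

# Print code points so the result is readable on any terminal.
sub show {
    my ($str) = @_;
    print join(" ", map { sprintf "U+%04X", ord } split //, $str), "\n";
}

show(marks_first("a\N{COMBINING DIAERESIS}\N{COMBINING MACRON}b"));
show(marks_first("\x{1100}\x{1161}\N{COMBINING ACUTE ACCENT}"));
__END__

U+0304 U+0308 U+0061 U+0062
U+0301 U+1100 U+1161
```

Dropping the $jamo alternative from the pattern gives the other behaviour,
where the mark lands just before Hangul-Jamo-Medial.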

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


Regards,
SADAHIRO Tomoyuki
