perl-unicode

[FYI] string manipulation based on graphemes

2002-11-15 18:30:05
Hello.

Here is a regular expression for Unicode-aware Perl
(i.e. Perl 5.8.0 or later)
that matches a single Default Grapheme Cluster
as specified by Draft Unicode Technical Report #29, Version 3
(see $Grapheme below).

cf. Default Grapheme Cluster Boundaries

   http://www.unicode.org/reports/tr29/tr29-3.html#Regular_Expressions

#!Perl

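$Any     = qr/./s;                   # any one character, newline included
$CRLF    = qr/(?:\cM\cJ)/;           # CR followed by LF
# line and paragraph separators, control and format characters
$Control = qr/[\p{Zl}\p{Zp}\p{Cc}\p{Cf}]/;
# nonspacing and enclosing marks, plus Other_Grapheme_Extend characters
$Extend  = qr/[\p{Mn}\p{Me}\p{OtherGraphemeExtend}]/;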

$HangL   = qr/[\x{1100}-\x{115F}]/;   # Hangul Jamo Leading Consonant
$HangV   = qr/[\x{1160}-\x{11A2}]/;   # Hangul Jamo Vowel
$HangT   = qr/[\x{11A8}-\x{11F9}]/;   # Hangul Jamo Trailing Consonant
$HangS   = qr/[\x{AC00}-\x{D7A3}]/;   # Hangul Syllable
# LV syllables: every 28th code point from U+AC00 (19 leading consonants x 21 vowels)
$cHangLV = join '', map sprintf("\\x{%04X}", 0xAC00 + 28*$_), 0..19*21-1;
$HangLV  = qr/[$cHangLV]/;            # Hangul Syllable LV
$HangLVT = qr/(?:(?!$HangLV)$HangS)/; # Hangul Syllable LVT
$Hangul  = qr/(?:$HangL*(?:$HangLV$HangV*|$HangV+|$HangLVT)$HangT*
        | $HangL+ | $HangT+ )/x;

$Grapheme = qr/(?:$CRLF|$Control|(?:$Hangul|$Any)$Extend*)/;

=pod

My humble String::Multibyte, originally developed
for multiple-byte characters with an old, byte-oriented Perl,
now copes with multiple-character graphemes,
powered by the newest Unicode support in Perl.

=cut

use 5.8.0;
use String::Multibyte;

$gop = String::Multibyte->new({
        charset => 'Grapheme-Oriented Perl',
        regexp  => $Grapheme,  # as above
    });

print "\x{AC00}\x{11A8}:\cM\cJ:\x{3042}:A\x{300}\x{301}:\cM:\0:\x{300}"
    eq join(':' =>
       $gop->strsplit("",
           "\x{AC00}\x{11A8}\cM\cJ\x{3042}A\x{300}\x{301}\cM\0\x{300}"))
    ? "ok" : "not ok", " 1\n";

print "\x{300}\0\cMA\x{300}\x{301}\x{3042}\cM\cJ\x{AC00}\x{11A8}"
    eq $gop->strrev(
        "\x{AC00}\x{11A8}\cM\cJ\x{3042}A\x{300}\x{301}\cM\0\x{300}")
    ? "ok" : "not ok", " 2\n";

__END__

NOTE:

    "\x{AC00}\x{11A8}" is a Hangul syllable cluster.
    "\cM\cJ" is CRLF, which must be treated as a single grapheme.
    "A\x{300}\x{301}" is a combining character sequence.

the newest String::Multibyte
    http://search.cpan.org/author/SADAHIRO/String-Multibyte-1.01/

strsplit() works like split(), but its separator is a literal string, not a pattern.
    e.g. strsplit('*', 'a*bc**xyz') returns the list ('a', 'bc', '', 'xyz').
    So $gop->strsplit("", $string) splits a string into graphemes.
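
For comparison, here is a minimal sketch of the same split done with core
Perl only. It uses the built-in \X escape rather than the $Grapheme regex
or String::Multibyte, so it is a different technique, not the method of this
post. It assumes a Perl recent enough that \X follows Unicode's extended
grapheme clusters (roughly 5.12 or later; on 5.8 \X is defined differently
and does not keep CRLF together).

```perl
use strict;
use warnings;

# Same test string as in the script above.
my $string = "\x{AC00}\x{11A8}\cM\cJ\x{3042}A\x{300}\x{301}\cM\0\x{300}";

# \X matches one grapheme cluster; with /g in list context
# the match returns every cluster in order.
my @graphemes = $string =~ /\X/g;

# Show each cluster as its code points, one cluster per line.
for my $g (@graphemes) {
    print join(' ', map { sprintf 'U+%04X', ord } split //, $g), "\n";
}
```

Joining @graphemes with '' gives back the original string, since the
clusters only partition it.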

$gop->strrev() works like scalar(reverse()),
    but reverses a string grapheme-wise.
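
A grapheme-wise reverse can be sketched the same way with core Perl's \X.
Again this is not String::Multibyte's implementation, only an illustration
under the assumption that \X matches extended grapheme clusters (roughly
Perl 5.12 or later). The expected string is the one used in test 2 above.

```perl
use strict;
use warnings;

my $string   = "\x{AC00}\x{11A8}\cM\cJ\x{3042}A\x{300}\x{301}\cM\0\x{300}";
my $expected = "\x{300}\0\cMA\x{300}\x{301}\x{3042}\cM\cJ\x{AC00}\x{11A8}";

# Split into clusters, reverse the list of clusters, and glue them back.
# Characters inside each cluster keep their original order.
my $reversed = join '', reverse($string =~ /\X/g);

print $reversed eq $expected ? "ok" : "not ok", "\n";
```

By contrast, scalar(reverse($string)) reverses code point by code point,
which would turn "\cM\cJ" into "\cJ\cM" and detach the combining marks from
their base character.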

SADAHIRO Tomoyuki
