Hello.
Here is a regular expression for Unicode-aware Perl
(i.e. Perl 5.8.0 or later)
that matches a single Default Grapheme Cluster
as specified by Draft Unicode Technical Report #29, Version 3
(see $Grapheme below).
cf. Default Grapheme Cluster Boundaries
http://www.unicode.org/reports/tr29/tr29-3.html#Regular_Expressions
#!Perl
$Any = qr/./s;
$CRLF = qr/(?:\cM\cJ)/;
$Control = qr/[\p{Zl}\p{Zp}\p{Cc}\p{Cf}]/;
$Extend = qr/[\p{Mn}\p{Me}\p{OtherGraphemeExtend}]/;
$HangL = qr/[\x{1100}-\x{115F}]/; # Hangul Jamo Leading Consonant
$HangV = qr/[\x{1160}-\x{11A2}]/; # Hangul Jamo Vowel
$HangT = qr/[\x{11A8}-\x{11F9}]/; # Hangul Jamo Trailing Consonant
$HangS = qr/[\x{AC00}-\x{D7A3}]/; # Hangul Syllable
$cHangLV = join '', map sprintf("\\x{%04X}", 0xAC00 + 28*$_), 0..19*21-1;
$HangLV = qr/[$cHangLV]/; # Hangul Syllable LV
$HangLVT = qr/(?:(?!$HangLV)$HangS)/; # Hangul Syllable LVT
$Hangul = qr/(?:$HangL*(?:$HangLV$HangV*|$HangV+|$HangLVT)$HangT*
| $HangL+ | $HangT+ )/x;
$Grapheme = qr/(?:$CRLF|$Control|(?:$Hangul|$Any)$Extend*)/;
=pod
My humble String::Multibyte, originally developed
for multiple-byte characters on an old, byte-oriented Perl,
now copes with multiple-character graphemes,
thanks to the newest Unicode support in Perl.
=cut
use 5.8.0;
use String::Multibyte;
$gop = String::Multibyte->new({
charset => 'Grapheme-Oriented Perl',
regexp => $Grapheme, # as above
});
print "\x{AC00}\x{11A8}:\cM\cJ:\x{3042}:A\x{300}\x{301}:\cM:\0:\x{300}"
eq join(':' =>
$gop->strsplit("",
"\x{AC00}\x{11A8}\cM\cJ\x{3042}A\x{300}\x{301}\cM\0\x{300}"))
? "ok" : "not ok", " 1\n";
print "\x{300}\0\cMA\x{300}\x{301}\x{3042}\cM\cJ\x{AC00}\x{11A8}"
eq $gop->strrev(
"\x{AC00}\x{11A8}\cM\cJ\x{3042}A\x{300}\x{301}\cM\0\x{300}")
? "ok" : "not ok", " 2\n";
__END__
NOTE:
"\x{AC00}\x{11A8}" is a Hangul syllable cluster.
"\cM\cJ" is CRLF, which must be treated as a single grapheme.
"A\x{300}\x{301}" is a combining character sequence.
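As a rough cross-check of the three cases above, here is a minimal sketch that does not use the pattern from this post. It relies instead on Perl's built-in \X, which in Perl 5.12 and later matches an extended grapheme cluster. That definition comes from later versions of the Unicode standard and may differ in detail from the Draft TR #29 Version 3 pattern, so treat it as an approximation, not an equivalent.

```perl
use strict;
use warnings;

# Assumption: Perl 5.12 or later, where \X matches an extended grapheme cluster.
# This is a swapped-in technique, not the $Grapheme pattern from the post.
my $string = "\x{AC00}\x{11A8}\cM\cJ\x{3042}A\x{300}\x{301}";

# A list-context /g match returns every cluster in order.
my @graphemes = $string =~ /\X/g;

# Print each cluster as space-separated code points,
# which avoids wide-character output on STDOUT.
for my $g (@graphemes) {
    print join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $g)), "\n";
}
```

Each output line is one cluster: the Hangul syllable with its trailing consonant, CRLF, the single Hiragana letter, and "A" with its two combining marks.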
The newest String::Multibyte:
http://search.cpan.org/author/SADAHIRO/String-Multibyte-1.01/
strsplit() works like split(), but treats its separator as a literal
string rather than a pattern.
e.g. strsplit('*', 'a*bc**xyz') returns the list ('a', 'bc', '', 'xyz').
So $gop->strsplit("", $string) splits a string into graphemes.
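To illustrate what "literal separator" means, here is a small sketch in plain Perl. The helper split_literal is hypothetical and is not part of String::Multibyte; it only mimics the '*' example with the core split() and \Q...\E quoting, and it knows nothing about grapheme or character boundaries.

```perl
use strict;
use warnings;

# Hypothetical helper (not the String::Multibyte API):
# split $str on $sep taken literally, not as a regular expression.
sub split_literal {
    my ($sep, $str) = @_;
    # \Q...\E quotes regex metacharacters, so '*' means a literal asterisk.
    return split /\Q$sep\E/, $str;
}

my @fields = split_literal('*', 'a*bc**xyz');
print join('|', @fields), "\n";    # prints a|bc||xyz
```

Without the quoting, a bare '*' would not even be a valid pattern, which is the kind of surprise a literal-separator interface avoids.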
$gop->strrev() works like scalar(reverse()),
but reverses a string grapheme-wise.
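The difference from scalar(reverse()) can be sketched as follows. The helper reverse_graphemes is hypothetical; it is not how strrev() is implemented, and it assumes Perl 5.12 or later so that the built-in \X matches an extended grapheme cluster.

```perl
use strict;
use warnings;

# Hypothetical helper (not String::Multibyte's strrev):
# reverse a string cluster by cluster, using the built-in \X.
sub reverse_graphemes {
    my ($str) = @_;
    return join '', reverse($str =~ /\X/g);
}

my $text = "A\x{300}B";    # "A" with a combining grave accent, then "B"

# Code-point-wise reversal moves the accent onto "B".
my $by_codepoint = scalar reverse $text;

# Grapheme-wise reversal keeps the accent attached to "A".
my $by_grapheme = reverse_graphemes($text);

for my $s ($by_codepoint, $by_grapheme) {
    print join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $s)), "\n";
}
```

The first line printed is U+0042 U+0300 U+0041 and the second is U+0042 U+0041 U+0300, so only the grapheme-wise version preserves the combining sequence.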
SADAHIRO Tomoyuki