perl-unicode

Re: [ANNOUNCE] Unicode::Normalize 0.21 and ::Collate 0.24 released

2003-04-07 09:30:06

Could you add additional normalization for Korean Hangul Jamos
as outlined at http://jshin.net/i18n/korean/jamocomp.html ?
There are six groups in the page, but only three of them are relevant
to Unicode normalization ('LC Clusters Basic', 'VO Clusters Basic',
and 'TC Clusters Basic'). The rest are for OpenType fonts that
include additional glyphs for Jamo clusters not directly encoded
in Unicode ('NPF' stands for 'No Presentation Form encoded').

Between Unicode 2.0 and 3.0, the UTC removed the
composition/decomposition of complex (cluster) Jamos out of/into
sequences of basic (simple) Jamos at the request of the South
Korean standards body (one of the stupidest acts). As a result,
they are not part of Unicode normalization and never will be,
because normalization is frozen for existing characters.
However, the UTC is now considering the introduction of tailored
normalization, and if it does, this list would be among the
first to be added as 'tailored normalization' (for Korean).

In summary, could you make your normalization package
offer a way to specify 'tailoring' (or some kind of
optional normalization)?

I looked into Jamo cluster compositions/decompositions a bit,
but they do not seem to conform to the algorithm of UAX #15
( http://www.unicode.org/unicode/reports/tr15/ ).
This is a knotty problem.

For example, consider the following mappings from that page
(written in the composition direction):

EO (U+1165) + I (U+1175) ==> E (U+1166)
O (U+1169) + EO (U+1165) ==> O-EO (U+117F)
O (U+1169) + E (U+1166)  ==> O-E (U+1180)

According to the algorithm of UAX #15,
the full decomposition of

O-E (U+1180) must be O (U+1169) + EO (U+1165) + I (U+1175),

and the decomposition mapping of

O-E (U+1180) must be O-EO (U+117F) + I (U+1175).

That is, O-E => O + E => O + (EO + I) => (O + EO) + I => O-EO + I,
which differs from the O + E mapping listed above.
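
The derivation above can be checked with a small script. This is
only a sketch: the table holds just the three mappings quoted
above, the names %decomp, full_decomposition and compose_pairwise
are made up for illustration, and the recomposition step is a
simplified left-to-right pairing that ignores combining classes
(all the characters involved here are starters).

```perl
use strict;
use warnings;

# Hypothetical table built only from the three mappings quoted above.
# Key: composed jamo; value: its two-element decomposition.
my %decomp = (
    0x1166 => [0x1165, 0x1175],   # E    => EO + I
    0x117F => [0x1169, 0x1165],   # O-EO => O + EO
    0x1180 => [0x1169, 0x1166],   # O-E  => O + E
);

# Recursively expand a code point until no mapping applies.
sub full_decomposition {
    my ($cp) = @_;
    my $map = $decomp{$cp};
    return ($cp) unless $map;
    return map { full_decomposition($_) } @$map;
}

# Pairwise composition table obtained by inverting %decomp.
my %comp;
for my $cp (keys %decomp) {
    my ($first, $second) = @{ $decomp{$cp} };
    $comp{"$first,$second"} = $cp;
}

# Simplified recomposition: try to merge each code point with the
# last one already emitted (no combining-class handling).
sub compose_pairwise {
    my @cps = @_;
    my @out = (shift @cps);
    for my $cp (@cps) {
        my $pair = $comp{"$out[-1],$cp"};
        if (defined $pair) { $out[-1] = $pair }
        else               { push @out, $cp }
    }
    return @out;
}

my @full = full_decomposition(0x1180);
print "full:       ", join(' ', map { sprintf 'U+%04X', $_ } @full), "\n";
my @back = compose_pairwise(@full);
print "recomposed: ", join(' ', map { sprintf 'U+%04X', $_ } @back), "\n";
```

With this table the script prints U+1169 U+1165 U+1175 for the
full decomposition and U+117F U+1175 after recomposition, so
O-E (U+1180) does not round-trip.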


PS. IMO, for any function to be integrated into
Unicode::Normalize, its behavior should be
specified, mentioned, or suggested in UAX #15.


PS2. Jamo composition (&composeJamo) could be implemented easily
with something like the following snippet.

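# User-defined character properties (see perlunicode): each sub
# returns hex code point ranges, one per line, tab-separated.
sub InJamoL { "1100\t1159\n115F\n" }  # leading consonants + filler
sub InJamoV { "1160\t11A2\n" }        # vowels (incl. filler U+1160)
sub InJamoT { "11A8\t11F9\n" }        # trailing consonants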

sub composeJamo {
  my $str = shift;
  $str =~ s/(\p{InJamoL}{2,})/getCompositionOfJamoL($1)/eg;
  $str =~ s/(\p{InJamoV}{2,})/getCompositionOfJamoV($1)/eg;
  $str =~ s/(\p{InJamoT}{2,})/getCompositionOfJamoT($1)/eg;
  return $str;
}

sub getCompositionOfJamoL {
  my $key = shift;
  exists $HashOfCompositionOfJamoL{$key}
       ? $HashOfCompositionOfJamoL{$key} : $key;
}

%HashOfCompositionOfJamoL = (
  "\x{1107}\x{1107}\x{110B}" => "\x{112C}",
  "\x{1100}\x{1100}" => "\x{1101}",
  # etc.
);
# add similar functions for JamoV and JamoT, and define their hashes.
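
Note that composeJamo above looks up each run of jamos as a
whole, so a run that is not itself a key (say, three U+1100 in a
row) comes back unchanged. One possible variation, shown here
only as a sketch for leading consonants, tries the longest prefix
of the run first and falls back to shorter ones. The names
%CompositionOfJamoL, compose_run and compose_jamo_l are made up
for illustration, and the table contains only the two entries
given above.

```perl
use strict;
use warnings;

# Same user-defined property as in the snippet above.
sub InJamoL { "1100\t1159\n115F\n" }

# Only the two entries shown above; a real table would be much larger.
my %CompositionOfJamoL = (
    "\x{1107}\x{1107}\x{110B}" => "\x{112C}",
    "\x{1100}\x{1100}"         => "\x{1101}",
);

# Length (in characters) of the longest key in the table.
my ($max_len) = sort { $b <=> $a } map { length($_) } keys %CompositionOfJamoL;

# Compose one run of jamos, always preferring the longest known prefix.
sub compose_run {
    my ($run) = @_;
    my $out = '';
    while (length $run) {
        my $limit = length($run) < $max_len ? length($run) : $max_len;
        my $matched = 0;
        for my $len (reverse 2 .. $limit) {
            my $prefix = substr($run, 0, $len);
            next unless exists $CompositionOfJamoL{$prefix};
            $out .= $CompositionOfJamoL{$prefix};
            substr($run, 0, $len) = '';
            $matched = 1;
            last;
        }
        # No known cluster starts here: copy one jamo through as-is.
        $out .= substr($run, 0, 1, '') unless $matched;
    }
    return $out;
}

sub compose_jamo_l {
    my ($str) = @_;
    $str =~ s/(\p{InJamoL}{2,})/compose_run($1)/eg;
    return $str;
}
```

With this variation, three U+1100 in a row become
U+1101 followed by U+1100. Whether that is the desired result
for such a run depends on the tailoring, which is not specified
anywhere yet.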

SADAHIRO Tomoyuki