perl-unicode

[proposal] utility module for Hangul Syllables

2001-08-02 08:59:55
Hello, everyone.

This is a proposal for a new module.

Name : ??? 

SYNOPSIS

This module provides the following functions
to handle Hangul Syllables (and Jamo) in Unicode.

  decomposeHangul
  composeHangul
  getHangulName
  parseHangulName

These functions should be useful for implementing
many things concerning Unicode, including charnames.pm,
UnicodeCD.pm, ..., and future Normalization and Collation modules.

DESCRIPTION

$decomposed_string = decomposeHangul($u_integer);
@u_integers = decomposeHangul($u_integer);

 ex.)
   decomposeHangul(0xAC00) # a CV syllable
      returns "\x{1100}\x{1161}"
            or (0x1100, 0x1161);
   decomposeHangul(0xAE00) # a CVC syllable
      returns "\x{1100}\x{1173}\x{11AF}"
            or (0x1100, 0x1173, 0x11AF);
   decomposeHangul(0x0041) # outside of Hangul Syllables
      returns empty string or empty list.
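
  A minimal sketch of how decomposeHangul might look, assuming
  the arithmetic constants given in the Hangul section of UTR #15
  (SBase, LBase, VBase, TBase and the counts). The context
  switch via wantarray is only one possible way to offer both
  return forms; it is not a settled design.

```perl
use strict;
use warnings;

# Constants as given in the Hangul section of UTR #15.
use constant {
    SBase  => 0xAC00, LBase  => 0x1100,
    VBase  => 0x1161, TBase  => 0x11A7,
    LCount => 19,     VCount => 21,     TCount => 28,
};
use constant NCount => VCount * TCount;    # 588
use constant SCount => LCount * NCount;    # 11172

sub decomposeHangul {
    my ($code) = @_;
    my $sindex = $code - SBase;
    my @jamo;    # stays empty for anything outside Hangul Syllables
    if ($sindex >= 0 && $sindex < SCount) {
        my $tindex = $sindex % TCount;
        @jamo = (
            LBase + int($sindex / NCount),               # initial
            VBase + int(($sindex % NCount) / TCount),    # medial
        );
        push @jamo, TBase + $tindex if $tindex;          # final, if any
    }
    # List context: code points.  Scalar context: a string.
    return wantarray ? @jamo : join('', map { chr($_) } @jamo);
}
```

  In list context decomposeHangul(0xAE00) yields the three code
  points; in scalar context it yields the three-character string.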

$hangul_composed_string = composeHangul($src_string);

  ex.)
   composeHangul("Hangul \x{1100}\x{1161}\x{1100}\x{1173}\x{11AF}")
    returns "Hangul \x{AC00}\x{AE00}";

   Any characters other than Hangul Jamo and Hangul Syllables
   are unaffected.
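
   A rough sketch of composeHangul under the same assumptions
   (constants from UTR #15). It walks the string once and merges
   an initial + medial pair into an LV syllable, then an LV
   syllable + final into an LVT syllable. The bounds check on the
   final index (0 < TIndex < TCount) is my reading of the
   algorithm, not a quotation of it.

```perl
use strict;
use warnings;

# Constants as given in the Hangul section of UTR #15.
use constant {
    SBase  => 0xAC00, LBase  => 0x1100,
    VBase  => 0x1161, TBase  => 0x11A7,
    LCount => 19,     VCount => 21,     TCount => 28,
};
use constant NCount => VCount * TCount;
use constant SCount => LCount * NCount;

sub composeHangul {
    my ($src) = @_;
    my @out;
    for my $ch (map { ord($_) } split //, $src) {
        if (@out) {
            my $last = $out[-1];

            # 1. initial jamo followed by medial jamo => LV syllable
            my $lindex = $last - LBase;
            my $vindex = $ch - VBase;
            if (   $lindex >= 0 && $lindex < LCount
                && $vindex >= 0 && $vindex < VCount) {
                $out[-1] = SBase + ($lindex * VCount + $vindex) * TCount;
                next;
            }

            # 2. LV syllable followed by final jamo => LVT syllable
            my $sindex = $last - SBase;
            my $tindex = $ch - TBase;
            if (   $sindex >= 0 && $sindex < SCount
                && $sindex % TCount == 0
                && $tindex > 0 && $tindex < TCount) {
                $out[-1] = $last + $tindex;
                next;
            }
        }
        push @out, $ch;    # everything else passes through unchanged
    }
    return join '', map { chr($_) } @out;
}
```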

$name = getHangulName($u_integer);

  ex.)
   getHangulName(0xAC00) # a CV syllable
      returns "HANGUL SYLLABLE GA";
   getHangulName(0xAE00) # a CVC syllable
      returns "HANGUL SYLLABLE GEUL";
   getHangulName(0x0041) # outside of Hangul Syllables
      returns undef.
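
   A sketch of getHangulName, assuming the jamo short-name
   tables as I understand them from the Unicode data (the arrays
   @JamoL, @JamoV, @JamoT are names chosen here for illustration).
   The name is the fixed prefix followed by the short names of
   the initial, medial, and final jamo.

```perl
use strict;
use warnings;

use constant { SBase => 0xAC00, LCount => 19, VCount => 21, TCount => 28 };
use constant NCount => VCount * TCount;
use constant SCount => LCount * NCount;

# Jamo short names, indexed by LIndex, VIndex, TIndex.
# The silent initial (IEUNG) and "no final" are empty strings.
my @JamoL = ('G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S',
             'SS', '', 'J', 'JJ', 'C', 'K', 'T', 'P', 'H');
my @JamoV = qw(A AE YA YAE EO E YEO YE O WA WAE OE YO U
               WEO WE WI YU EU YI I);
my @JamoT = ('', qw(G GG GS N NJ NH D L LG LM LB LS LT LP LH
                    M B BS S SS NG J C K T P H));

sub getHangulName {
    my ($code) = @_;
    my $sindex = $code - SBase;
    return undef if $sindex < 0 || $sindex >= SCount;
    return 'HANGUL SYLLABLE '
         . $JamoL[ int($sindex / NCount) ]
         . $JamoV[ int(($sindex % NCount) / TCount) ]
         . $JamoT[ $sindex % TCount ];
}
```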

$u_integer = parseHangulName($name);

  ex.)
   parseHangulName("HANGUL SYLLABLE GA")
   or parseHangulName("GA")   returns 0xAC00;

   parseHangulName("HANGUL SYLLABLE GEUL")
   or parseHangulName("GEUL") returns 0xAE00;

   parseHangulName("LATIN SMALL LETTER A") returns undef.

  Caveat:

   parseHangulName("A") returns 0xC544
    as parseHangulName("HANGUL SYLLABLE A") does.
 
   but parseHangulName("G") returns undef
    because of the absence of "HANGUL SYLLABLE G".

IMPLEMENTATION

  cf. Annex 10: Hangul,
      in Unicode Normalization Forms (UTR #15)
      http://www.unicode.org/unicode/reports/tr15

  Algorithms for decomposeHangul, composeHangul,
  and getHangulName are given in UTR #15.

  The algorithm for parseHangulName is simple.

  The regex

   /^
     (?:HANGUL\ SYLLABLE\ )?
     ([^AEIOUWY]*)([AEIOUWY]+)([^AEIOUWY]*)
    $/x

  splits a syllable name into the corresponding
  short jamo names in the order of initial, medial, final.

  (NB: initial and final jamo names may be zero-length,
    cf. "HANGUL SYLLABLE WA")

  Then, if *all* the short jamo names are legal,
    the syllable name is legal. 
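
  The steps above could be sketched as follows. The lookup
  hashes %IndexL, %IndexV, %IndexT are hypothetical helpers built
  from the jamo short-name tables (as I understand them from the
  Unicode data); only the regex is taken from this proposal.

```perl
use strict;
use warnings;

use constant { SBase => 0xAC00, VCount => 21, TCount => 28 };

my @JamoL = ('G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S',
             'SS', '', 'J', 'JJ', 'C', 'K', 'T', 'P', 'H');
my @JamoV = qw(A AE YA YAE EO E YEO YE O WA WAE OE YO U
               WEO WE WI YU EU YI I);
my @JamoT = ('', qw(G GG GS N NJ NH D L LG LM LB LS LT LP LH
                    M B BS S SS NG J C K T P H));

# Reverse tables: short jamo name => index.
my (%IndexL, %IndexV, %IndexT);
@IndexL{@JamoL} = (0 .. $#JamoL);
@IndexV{@JamoV} = (0 .. $#JamoV);
@IndexT{@JamoT} = (0 .. $#JamoT);

sub parseHangulName {
    my ($name) = @_;
    return undef unless defined($name);
    return undef unless $name =~
        /^
          (?:HANGUL\ SYLLABLE\ )?
          ([^AEIOUWY]*)([AEIOUWY]+)([^AEIOUWY]*)
         $/x;
    my ($l, $v, $t) = ($1, $2, $3);

    # The name is legal only if all three short names are legal.
    return undef
        unless exists $IndexL{$l}
            && exists $IndexV{$v}
            && exists $IndexT{$t};

    return SBase + ($IndexL{$l} * VCount + $IndexV{$v}) * TCount
                 + $IndexT{$t};
}
```

  The empty-string keys in %IndexL and %IndexT are what make
  names such as "WA" or "A" resolve, while "G" fails at the
  regex because it contains no vowel letter.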

CAVEAT

  This module does not try to cover everything about Hangul.
  It handles only what is not included in Unicode.txt,
  NamesList.txt, etc. and must instead be derived
  from the argument by algorithm.

  I think passing in a character outside the Hangul Syllables
  should not cause a carp or croak; the design assumes that
  the caller *always* checks the return value.

regards,
SADAHIRO Tomoyuki
E-mail: bqw10602(_at_)nifty(_dot_)com
