perl-unicode

[Encode] Compound Unicode Character Support in UCM

2002-04-01 06:04:21
On Monday, April 1, 2002, at 09:08 , Dan Kogai wrote:
On Monday, April 1, 2002, at 08:40 , Nick Ing-Simmons wrote:
I have recently found this undocumented feature but dared not use it.

I was not aware it was actually implemented ;-)

Well, half of it. The regex that catches multiple <U...> was there, but only the first one was used, and multiple occurrences of <U...> croaked with a "Bad line:" message. But this error was good enough for me to find where to fix.

And here is the quick fix to enc2xs that allows multiple occurrences of <U...>. It's slightly faster too because there is no backtracking.

--- bin/enc2xs  2002/03/31 21:00:50     1.10
+++ bin/enc2xs  2002/04/01 12:55:37
@@ -381,16 +381,15 @@
    s/#.*$//;
    last if /^\s*END\s+CHARMAP\s*$/i;
    next if /^\s*$/;
-   my ($u,@byte);
-   my $fb = '';
-   $u = $1 if (/^<U([0-9a-f]+)>\s+/igc);
-   push(@byte,$1) while /\G\\x([0-9a-f]+)/igc;
-   $fb = $1 if /\G\s*(\|[0-3])/gc;
-   # warn "$_: $u @byte | $fb\n";
-   die "Bad line:$_" unless /\G\s*(#.*)?$/gc;
-   if (defined($u))
+   my (@uni, @byte) = ();
+   my ($uni, $byte, $fb) = m/^(\S+)\s+(\S+)\s+(\S+)\s+/o
+       or die "Bad line: $_";
+   push @uni, $1  while ($uni =~  m/\G<U([0-9a-fA-F]+)>/g);
+   # warn join(",", @uni);
+   push @byte, $1 while ($byte =~ m/\G\\x([0-9a-fA-F]+)/g);
+   if (@uni)
     {
-     my $uch = encode_U(hex($u));
+     my $uch =  join('', map { encode_U(hex($_)) } @uni );
      my $ech = join('',map(chr(hex($_)),@byte));
      my $el  = length($ech);
      $max_el = $el if (!defined($max_el) || $el > $max_el);

A quick test against a freshly brewed macJapan.ucm (freshly created out of JAPANESE.txt at unicode.org) has shown it is working.
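For readers who want to try the parsing idea outside enc2xs, here is a minimal standalone sketch of it. It is not the enc2xs code itself: the helper name parse_ucm_line and the sample mapping are made up for illustration, and it returns numeric code points rather than calling encode_U. Like the patch, it assumes every line carries three whitespace-separated fields.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of enc2xs): split one CHARMAP line into
# its Unicode code points, its bytes, and its fallback flag.
sub parse_ucm_line {
    my ($line) = @_;
    $line =~ s/#.*$//;                  # strip a trailing comment
    return if $line =~ /^\s*$/;         # nothing left: blank or comment-only

    # Three whitespace-separated fields, as in the patch above.
    my ($uni, $byte, $fb) = $line =~ /^(\S+)\s+(\S+)\s+(\S+)\s*$/
        or die "Bad line: $line";

    # The first field must consist only of <U...> groups.
    die "Bad line: $line" unless $uni =~ /^(?:<U[0-9a-fA-F]+>)+$/;

    my @uni  = map { hex } $uni  =~ /<U([0-9a-fA-F]+)>/g;
    my @byte = map { hex } $byte =~ /\\x([0-9a-fA-F]+)/g;
    return { uni => \@uni, byte => \@byte, fb => $fb };
}

# Made-up sample line with two <U...> groups on the Unicode side.
my $entry = parse_ucm_line('<U3042><U309A> \x88\x6A |0 # illustrative only');
print join(' ', map { sprintf 'U+%04X', $_ } @{ $entry->{uni} }), "\n";
```

The point of splitting into fields first is that each field can then be scanned on its own, which is what lets the Unicode side hold any number of <U...> groups.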


I think it would look better if it were written as

<UNNNN+UMMMM> \xYY\xYY ....

I don't like the <UNNNN+UMMMM> part; it will make the parsing messier.

The \xYY\xYY is of course what I meant ;-)

Not that much.  It's just a regex after all.  Let's TIMTOWTDI it.

<U...><U...> is already working.  <U...+U...> is soon to come.
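As a rough illustration of why accepting both notations is "just a regex", here is one possible sketch. The <U...+U...> form is only a proposal in this thread, so the exact syntax accepted below is an assumption, and ucm_codepoints is a hypothetical name, not an enc2xs function.

```perl
use strict;
use warnings;

# Hypothetical helper: extract code points from a Unicode field written
# either as <U...><U...> or as <U...+U...> (the latter is assumed syntax).
sub ucm_codepoints {
    my ($field) = @_;
    my @cp;
    # Each <...> group holds one or more U-prefixed hex numbers joined by '+'.
    while ($field =~ /\G<(U[0-9a-fA-F]+(?:\+U[0-9a-fA-F]+)*)>/gc) {
        push @cp, map { hex(substr($_, 1)) } split /\+/, $1;
    }
    # Reject the field unless the groups cover it completely.
    die "Bad Unicode field: $field"
        unless @cp && (pos($field) // 0) == length $field;
    return @cp;
}

# Both spellings yield the same list of code points.
for my $field ('<U3042><U309A>', '<U3042+U309A>') {
    print join(' ', map { sprintf 'U+%04X', $_ } ucm_codepoints($field)), "\n";
}
```

Anchoring each match with \G and checking the final position means stray text between or after the groups is reported instead of silently skipped.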

Dan the Encode Maintainer
