On Monday, April 1, 2002, at 09:08 , Dan Kogai wrote:
On Monday, April 1, 2002, at 08:40 , Nick Ing-Simmons wrote:
I have recently found this undocumented feature but dared not use
it.
I was not aware it was actually implemented ;-)
Well, half of it. The regex that catches multiple <U...> was there, but
only the first one was used, and multiple occurrences of <U...> croak
with a "Bad line:" message. But this error was good enough for me to
find where to fix it.
And here is the quick fix to enc2xs that allows multiple occurrences of
<U...>. It's slightly faster too because there is no backtracking.
--- bin/enc2xs 2002/03/31 21:00:50 1.10
+++ bin/enc2xs 2002/04/01 12:55:37
@@ -381,16 +381,15 @@
s/#.*$//;
last if /^\s*END\s+CHARMAP\s*$/i;
next if /^\s*$/;
- my ($u,@byte);
- my $fb = '';
- $u = $1 if (/^<U([0-9a-f]+)>\s+/igc);
- push(@byte,$1) while /\G\\x([0-9a-f]+)/igc;
- $fb = $1 if /\G\s*(\|[0-3])/gc;
- # warn "$_: $u @byte | $fb\n";
- die "Bad line:$_" unless /\G\s*(#.*)?$/gc;
- if (defined($u))
+ my (@uni, @byte) = ();
+ my ($uni, $byte, $fb) = m/^(\S+)\s+(\S+)\s+(\S+)\s+/o
+ or die "Bad line: $_";
+ push @uni, $1 while ($uni =~ m/\G<U([0-9a-fA-F]+)>/g);
+ # warn join(",", @uni);
+ push @byte, $1 while ($byte =~ m/\G\\x([0-9a-fA-F]+)/g);
+ if (@uni)
{
- my $uch = encode_U(hex($u));
+ my $uch = join('', map { encode_U(hex($_)) } @uni );
my $ech = join('',map(chr(hex($_)),@byte));
my $el = length($ech);
$max_el = $el if (!defined($max_el) || $el > $max_el);
A quick test against a freshly brewed macJapan.ucm (freshly created out
of JAPANESE.txt at unicode.org) has shown it is working.
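
For anyone who wants to poke at the parsing outside of enc2xs, here is a
minimal standalone sketch of the same idea: split the line into
whitespace-separated fields first, then pull the hex digits out of each
field. The sub name parse_ucm_line and the sample line are made up for
illustration and are not part of enc2xs or any real .ucm file. Unlike the
patch, this version validates each field as a whole instead of using \G.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of enc2xs: parse one CHARMAP line of the
# form "<Uxxxx>[<Uyyyy>...] \xNN[\xNN...] |F" into a hash reference.
# Returns nothing if the line does not have that shape.
sub parse_ucm_line {
    my ($line) = @_;
    $line =~ s/#.*$//;    # strip a trailing comment, as enc2xs does

    # Three whitespace-separated fields: code points, bytes, fallback.
    my ($uni, $byte, $fb) = $line =~ m/^\s*(\S+)\s+(\S+)\s+(\S+)/
        or return;

    # Check each field as a whole so stray characters are rejected.
    return unless $uni  =~ m/^(?:<U[0-9a-fA-F]+>)+$/;
    return unless $byte =~ m/^(?:\\x[0-9a-fA-F]+)+$/;

    # A /g match in list context returns every captured group.
    my @uni  = map { hex($_) } $uni  =~ m/<U([0-9a-fA-F]+)>/g;
    my @byte = map { hex($_) } $byte =~ m/\\x([0-9a-fA-F]+)/g;

    my %rec = (uni => \@uni, bytes => \@byte, fb => $fb);
    return \%rec;
}

# Invented sample line: two code points mapped to a two-byte sequence.
my $sample = parse_ucm_line('<U0041><U0301> \x81\x40 |0 # made-up entry');
printf "%d code point(s), %d byte(s)\n",
    scalar @{ $sample->{uni} }, scalar @{ $sample->{bytes} };
```

Splitting into fields first keeps each regex trivial, at the cost of
walking the line more than once.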
I think it would look better if it were written as
<UNNNN+UMMMM> \xYY\xYY ....
I don't like the <UNNNN+UMMMM> part; it will make the parsing messier.
The \xYY\xYY is of course what I meant ;-)
Not that much. It's just a regex after all. Let's TIMTOWTDI it.
<U...><U...> has already been working. <U...+U...> soon to come.
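
To show how little extra regex the second spelling might cost, here is a
rough sketch that accepts both <U...><U...> and <U...+U...> in one
token. This is only a guess at what such a parser could look like, not
the code that will go into enc2xs, and the sub name code_points is
hypothetical.

```perl
use strict;
use warnings;

# Hypothetical: return the code points in a token written either as
# "<Uxxxx><Uyyyy>" or as "<Uxxxx+Uyyyy>". Returns an empty list when the
# token matches neither spelling.
sub code_points {
    my ($tok) = @_;

    # One or more <...> groups, each holding U-prefixed hex runs
    # joined by '+'.
    return unless $tok =~ m/^(?:<U[0-9a-fA-F]+(?:\+U[0-9a-fA-F]+)*>)+$/;

    # After validation, every "U" followed by hex digits is a code point.
    return map { hex($_) } $tok =~ m/U([0-9a-fA-F]+)/g;
}

# Both spellings should give the same list of code points.
for my $tok ('<U0041><U0301>', '<U0041+U0301>') {
    printf "%s => %s\n", $tok,
        join(' ', map { sprintf 'U+%04X', $_ } code_points($tok));
}
```

Validating first and extracting second means the extraction regex does
not need to care which spelling was used.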
Dan the Encode Maintainer