perl-unicode

[not-yet-a-PATCH] compress Encode better

2002-11-03 17:31:14
I've been experimenting with how enc2xs builds the C tables that turn into the
shared objects. enc2xs is building tables (arrays of struct encpage_t) which
in turn have pointers to blocks of bytes.

The way Nick I-S originally set it up, these blocks of bytes are named after
the part of the encoding transformation they represent. His output routine
had two
levels of space optimisation. It always looked to see if the block of bytes it
was about to output was an exact copy of a block it had already output, and
if so it simply aliased the name of the second block to that of the first.
Additionally one can specify an -O flag to enc2xs, which turns on a brute
force substring search. This looks to see whether anything about to be output
is a substring of an existing block of bytes, and if so outputs an alias to
that offset.
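
The two levels can be sketched as follows (an illustrative Python rendering,
not the Perl of enc2xs; the function and symbol names here are made up for
the example):

```python
# Sketch of the original output routine's two levels of aliasing:
# exact-duplicate blocks alias an earlier name, and with -O a block that
# is a substring of an earlier one aliases "name +0xNN" (offset form).
def emit_block(name, data, emitted, optimise=False):
    """Return a symbol (plus optional offset) referring to `data`."""
    # Level 1: exact duplicate of an already-output block -> alias its name.
    if data in emitted:
        return emitted[data]
    # Level 2 (-O): brute-force search for `data` inside an existing block.
    if optimise:
        for prior, sym in emitted.items():
            i = prior.find(data)
            if i == 0:
                emitted[data] = sym          # alias to the block's start
                return emitted[data]
            if i > 0:
                emitted[data] = f"{sym} +0x{i:02x}"   # alias to an offset
                return emitted[data]
    # No reuse possible: this block gets output under its own name.
    emitted[data] = name
    return name

emitted = {}
emit_block("page_00_02", b"\x01\x02\x03", emitted)             # new array
emit_block("page_10_12", b"\x01\x02\x03", emitted)             # exact alias
emit_block("page_01_02", b"\x02\x03", emitted, optimise=True)  # offset alias
```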

However, the upshot of all this is that enc2xs generates C files with a large
number of moderate to small unsigned char arrays holding runs of bytes.
Instead, I wondered what the effect would be of concatenating all the needed
byte sequences into one long string, outputting that as a single C array,
and then making all encpage_t references offsets into this array.
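
A minimal sketch of the idea (Python for brevity; the real code is Perl, and
`enctable` is the array name the patch below uses):

```python
# Pool every byte block into one long string; each block becomes an
# offset into the pool, so the C output is one big array plus
# "enctable + offset" references instead of many small named arrays.
def build_pool(blocks):
    pool = b""
    offsets = {}
    for block in blocks:
        i = pool.find(block)      # reuse any existing occurrence
        if i < 0:                 # not found: append at the end
            i = len(pool)
            pool += block
        offsets[block] = i
    return pool, offsets

pool, offsets = build_pool([b"\x01\x02\x03", b"\x02\x03", b"\x09"])
# b"\x02\x03" already occurs at offset 1, so the pool is only 4 bytes,
# and that block would be emitted as "enctable + 1".
```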

The results seem very promising. On x86 FreeBSD, I find the sum of the
sizes of the shared object files drops by 46%. If I export AGGREGATE_TABLES=1
so that Makefile.PL has enc2xs compile files in aggregate mode, the saving
rises to 50%. I've not looked at where this saving comes from, but I presume
that gcc aligns the start of each character array in some way, so splitting
the same number of bytes into lots of arrays means more wasted space. The
change also gets some genuine byte savings over the existing system, as the
single contiguous string lets me do Nick's -O substring search at no extra
cost for all encodings, whereas the current Makefile.PL doesn't enable it
for the non-European encodings.

Currently the appended patch passes all regression tests on bleadperl on
FreeBSD. However, having experimented, I know that the new -O function it
provides is buggy in some way, as running -O on the Chinese encodings gives
regression test failures. (So don't apply it yet.) I've not looked at what
the Encode regression tests actually do, so I don't know how thoroughly they
check whether the transformations are actually correct. Done correctly, this
approach *will* generate the same transformation tables as before, and
although I *think* I'm doing it correctly (without -O; patches welcome),
I'm not certain of this.

I presume that finding the shortest string that has a list of other strings
as substrings is a hard problem (for some formal definition of "hard" -- it
is the shortest common superstring problem, which is NP-hard). Currently I
simply sort all the strings into size order and build up the long string
starting with the longest substring I need. The default method is to see
whether the new string is already present somewhere; if so, note where; if
not, append it at the end. The (currently buggy) -O optimiser method also
tries to avoid appending the entire string by looking for overlap at its
start or end. Clearly I've not got that bit right yet, but I've run out of
time tonight. Is there a better approximate algorithm that could find more
space savings for more [or less :-)] CPU? I guess it is analogous to building
a word-search table, but in 1D rather than 2D. (I'm hoping Hugo has
experience of this sort of thing.)
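
The greedy approach can be sketched like this (illustrative Python, not the
patch's Perl; one subtlety worth noting is that when prepending, every
existing offset has to shift by the number of bytes actually added to the
front, i.e. the length of the new string minus the overlap):

```python
# Greedy approximation to the shortest-common-superstring build:
# longest strings first; a substring hit is free, otherwise try to
# overlap the new string with either end of the accumulator, and only
# append the whole string as a last resort.
def greedy_pool(strings):
    pool = b""
    offsets = {}
    for s in sorted(strings, key=lambda t: (-len(t), t)):
        i = pool.find(s)
        if i >= 0:                       # already present somewhere
            offsets[s] = i
            continue
        for k in range(len(s) - 1, 0, -1):
            if pool.endswith(s[:k]):     # head of s overlaps tail of pool
                offsets[s] = len(pool) - k
                pool += s[k:]
                break
            if pool.startswith(s[-k:]):  # tail of s overlaps head of pool
                grow = len(s) - k        # bytes actually prepended
                offsets = {t: o + grow for t, o in offsets.items()}
                offsets[s] = 0
                pool = s[:-k] + pool
                break
        else:                            # no overlap: append whole string
            offsets[s] = len(pool)
            pool += s
    return pool, offsets
```

Every recorded offset satisfies pool[off:off+len(s)] == s, which is the
invariant the generated encpage_t references rely on.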

Meanwhile, here are hard numbers. enc2xs from Encode 1.80:

  222352 18075-32/lib/auto/Encode/Byte/Byte.so
 2059045 18075-32/lib/auto/Encode/CN/CN.so
   28532 18075-32/lib/auto/Encode/EBCDIC/EBCDIC.so
 2687896 18075-32/lib/auto/Encode/JP/JP.so
 2314555 18075-32/lib/auto/Encode/KR/KR.so
   37425 18075-32/lib/auto/Encode/Symbol/Symbol.so
 2024682 18075-32/lib/auto/Encode/TW/TW.so
   12039 18075-32/lib/auto/Encode/Unicode/Unicode.so
 9386526 total

Improved enc2xs:

  190853 18075-Encode-O0/lib/auto/Encode/Byte/Byte.so
 1119692 18075-Encode-O0/lib/auto/Encode/CN/CN.so
   23003 18075-Encode-O0/lib/auto/Encode/EBCDIC/EBCDIC.so
 1351823 18075-Encode-O0/lib/auto/Encode/JP/JP.so
 1252329 18075-Encode-O0/lib/auto/Encode/KR/KR.so
   31947 18075-Encode-O0/lib/auto/Encode/Symbol/Symbol.so
 1102351 18075-Encode-O0/lib/auto/Encode/TW/TW.so
   12039 18075-Encode-O0/lib/auto/Encode/Unicode/Unicode.so
 5084037 total

Improved enc2xs with AGGREGATE_TABLES:

  190853 18075-Encode-O0-Agg/lib/auto/Encode/Byte/Byte.so
 1050477 18075-Encode-O0-Agg/lib/auto/Encode/CN/CN.so
   23003 18075-Encode-O0-Agg/lib/auto/Encode/EBCDIC/EBCDIC.so
 1281004 18075-Encode-O0-Agg/lib/auto/Encode/JP/JP.so
 1179594 18075-Encode-O0-Agg/lib/auto/Encode/KR/KR.so
   31947 18075-Encode-O0-Agg/lib/auto/Encode/Symbol/Symbol.so
  937328 18075-Encode-O0-Agg/lib/auto/Encode/TW/TW.so
   12039 18075-Encode-O0-Agg/lib/auto/Encode/Unicode/Unicode.so
 4706245 total
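
(As a quick sanity check, the totals above reproduce the quoted savings:)

```python
# Cross-check the percentages claimed earlier against the totals listed.
original  = 9386526   # Encode 1.80
improved  = 5084037   # improved enc2xs
aggregate = 4706245   # improved enc2xs with AGGREGATE_TABLES

print(round(100 * (1 - improved / original)))   # ~46% smaller
print(round(100 * (1 - aggregate / original)))  # ~50% smaller
```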

Nicholas Clark
-- 
XSLT better than perl?  http://www.perl.org/advocacy/spoofathon/

--- ext/Encode/bin/enc2xs.orig  Sat Jun  1 19:33:03 2002
+++ ext/Encode/bin/enc2xs       Sun Nov  3 20:29:20 2002
@@ -6,6 +6,7 @@ BEGIN {
     require Config; import Config;
 }
 use strict;
+use warnings;
 use Getopt::Std;
 my @orig_ARGV = @ARGV;
 our $VERSION  = do { my @r = (q$Revision: 1.30 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };
@@ -186,7 +187,7 @@ END
     print C "#include <XSUB.h>\n";
     print C "#define U8 U8\n";
    }
-  print C "#include \"encode.h\"\n";
+  print C "#include \"encode.h\"\n\n";
 
  }
 elsif ($cname =~ /\.enc$/)
@@ -204,6 +205,9 @@ elsif ($cname =~ /\.pet$/)
 
 my %encoding;
 my %strings;
+my $string_acc;
+my %strings_in_acc;
+
 my $saved = 0;
 my $subsave = 0;
 my $strings = 0;
@@ -250,8 +254,19 @@ if ($doC)
   foreach my $name (sort cmp_name keys %encoding)
    {
     my ($e2u,$u2e,$erep,$min_el,$max_el) = @{$encoding{$name}};
-    output(\*C,$name.'_utf8',$e2u);
-    output(\*C,'utf8_'.$name,$u2e);
+    process($name.'_utf8',$e2u);
+    addstrings(\*C,$e2u);
+
+    process('utf8_'.$name,$u2e);
+    addstrings(\*C,$u2e);
+   }
+  outbigstring(\*C,"enctable");
+  foreach my $name (sort cmp_name keys %encoding)
+   {
+    my ($e2u,$u2e,$erep,$min_el,$max_el) = @{$encoding{$name}};
+    outtable(\*C,$e2u, "enctable");
+    outtable(\*C,$u2e, "enctable");
+
     # push(@{$encoding{$name}},outstring(\*C,$e2u->{Cname}.'_def',$erep));
    }
   foreach my $enc (sort cmp_name keys %encoding)
@@ -596,43 +611,6 @@ sub enter_fb0 {
   }
 }
 
-
-sub outstring
-{
- my ($fh,$name,$s) = @_;
- my $sym = $strings{$s};
- if ($sym)
-  {
-   $saved += length($s);
-  }
- else
-  {
-   if ($opt{'O'}) {
-       foreach my $o (keys %strings)
-        {
-         next unless (my $i = index($o,$s)) >= 0;
-         $sym = $strings{$o};
-         # gcc things that 0x0e+0x10 (anything with e+) starts to look like
-         # a hexadecimal floating point constant. Silly gcc. Only p
-         # introduces a floating point constant. Put the space in to stop it
-         # getting confused.
-         $sym .= sprintf(" +0x%02x",$i) if ($i);
-         $subsave += length($s);
-         return $strings{$s} = $sym;
-       }
-   }
-   $strings{$s} = $sym = $name;
-   $strings += length($s);
-   my $definition = sprintf "static const U8 %s[%d] = { ",$name,length($s);
-   # Maybe we should assert that these are all <256.
-   $definition .= join(',',unpack "C*",$s);
-   # We have a single long line. Split it at convenient commas.
-   $definition =~ s/(.{74,77},)/$1\n/g;
-   print $fh "$definition };\n\n";
-  }
- return $sym;
-}
-
 sub process
 {
   my ($name,$a) = @_;
@@ -693,7 +671,8 @@ sub process
   $a->{'Entries'} = \@ent;
 }
 
-sub outtable
+
+sub addstrings
 {
  my ($fh,$a) = @_;
  my $name = $a->{'Cname'};
@@ -701,20 +680,98 @@ sub outtable
  foreach my $b (@{$a->{'Entries'}})
   {
    next unless $b->[AGG_OUT_LEN];
-   my $s = $b->[AGG_MIN_IN];
-   my $e = $b->[AGG_MAX_IN];
-   outstring($fh,sprintf("%s__%02x_%02x",$name,$s,$e),$b->[AGG_OUT_BYTES]);
+   $strings{$b->[AGG_OUT_BYTES]} = undef;
   }
  if ($a->{'Forward'})
   {
    my $var = $^O eq 'MacOS' ? 'extern' : 'static';
-   print $fh "\n$var encpage_t $name\[",scalar(@{$a->{'Entries'}}),"];\n";
+   print $fh "$var encpage_t $name\[",scalar(@{$a->{'Entries'}}),"];\n";
   }
+ $a->{'DoneStrings'} = 1;
+ foreach my $b (@{$a->{'Entries'}})
+  {
+   my ($s,$e,$out,$t,$end,$l) = @$b;
+   addstrings($fh,$t) unless $t->{'DoneStrings'};
+  }
+}
+
+sub outbigstring
+{
+  my ($fh,$name) = @_;
+
+  $string_acc = '';
+
+  # Make the big string in the string accumulator. Longest first, in the hope
+  # that this makes it more likely that we find the short strings later on.
+  # Not sure whether sorting strings of the same length lexically helps.
+  foreach my $s (sort {length $b <=> length $a || $a cmp $b} keys %strings) {
+    my $index = index $string_acc, $s;
+    if ($index >= 0) {
+      $saved += length($s);
+      $strings_in_acc{$s} = $index;
+    } else {
+    OPTIMISER: {
+       if ($opt{'O'}) {
+         my $sublength = length $s;
+         while (--$sublength > 0) {
+           # progressively lop characters off the end, to see if the start of
+           # the new string overlaps the end of the accumulator.
+           if (substr ($string_acc, -$sublength)
+               eq substr ($s, 0, $sublength)) {
+             $subsave += $sublength;
+             $strings_in_acc{$s} = length ($string_acc) - $sublength;
+             # append the last bit on the end.
+             $string_acc .= substr ($s, $sublength);
+             last OPTIMISER;
+           }
+           # or if the end of the new string overlaps the start of the
+           # accumulator
+           next unless substr ($string_acc, 0, $sublength)
+             eq substr ($s, -$sublength);
+           # well, the first $sublength characters of the accumulator match
+           # the end of the new string; as we're prepending to the accumulator,
+           # shift all existing offsets forwards by the number of bytes added
+           $_ += length ($s) - $sublength foreach values %strings_in_acc;
+           $subsave += $sublength;
+           $strings_in_acc{$s} = 0;
+           # prepend the first part at the start.
+           $string_acc = substr ($s, 0, -$sublength) . $string_acc;
+           last OPTIMISER;
+         }
+       }
+       # Optimiser (if it ran) found nothing, so we just have to tack the
+       # whole thing on the end.
+       $strings_in_acc{$s} = length $string_acc;
+       $string_acc .= $s;
+      };
+    }
+  }
+
+  $strings = length $string_acc;
+  my $definition = "\nstatic const U8 $name\[$strings] = { " .
+    join(',',unpack "C*",$string_acc);
+  # We have a single long line. Split it at convenient commas.
+  print $fh $1, "\n" while $definition =~ /\G(.{74,77},)/gcs;
+  print $fh substr ($definition, pos $definition), " };\n";
+}
+
+sub findstring {
+  my ($name,$s) = @_;
+  my $offset = $strings_in_acc{$s};
+  die "Can't find string " . join (',',unpack "C*",$s) . " in accumulator"
+    unless defined $offset;
+  "$name + $offset";
+}
+
+sub outtable
+{
+ my ($fh,$a,$bigname) = @_;
+ my $name = $a->{'Cname'};
  $a->{'Done'} = 1;
  foreach my $b (@{$a->{'Entries'}})
   {
    my ($s,$e,$out,$t,$end,$l) = @$b;
-   outtable($fh,$t) unless $t->{'Done'};
+   outtable($fh,$t,$bigname) unless $t->{'Done'};
   }
  print $fh "\nstatic encpage_t $name\[",scalar(@{$a->{'Entries'}}),"] = {\n";
  foreach my $b (@{$a->{'Entries'}})
@@ -724,7 +781,7 @@ sub outtable
    print  $fh "{";
    if ($l)
     {
-     printf $fh outstring($fh,'',$out);
+     printf $fh findstring($bigname,$out);
     }
    else
     {
@@ -736,14 +793,6 @@ sub outtable
  print $fh "};\n";
 }
 
-sub output
-{
- my ($fh,$name,$a) = @_;
- process($name,$a);
- # Sub-tables
- outtable($fh,$a);
-}
-
 sub output_enc
 {
  my ($fh,$name,$a) = @_;
@@ -857,7 +906,7 @@ use vars qw(
 );
 
 sub find_e2x{
-    eval { require File::Find };
+    eval { require File::Find; };
     my (@inc, %e2x_dir);
     for my $inc (@INC){
        push @inc, $inc unless $inc eq '.'; #skip current dir
@@ -869,6 +918,7 @@ sub find_e2x{
                     = lstat($_) or return;
                 -f _ or return;
                 if (/^.*\.e2x$/o){
+                    no warnings 'once';
                     $e2x_dir{$File::Find::dir} ||= $mtime;
                 }
                 return;
@@ -927,6 +977,7 @@ sub make_configlocal_pm
            eval { require "Encode/$f"; };
           $@ and die "Can't require Encode/$f: $@\n";
            for my $enc (Encode->encodings()){
+               no warnings 'once';
                $in_core{$enc} and next;
                $Encode::Config::ExtModule{$enc} and next;
                my $mod = "Encode/$f";