perl-unicode

Re: [not-yet-a-PATCH] compress Encode better

2002-12-20 18:30:05
On Mon, Nov 04, 2002 at 03:26:16AM +0000, hv(_at_)crypt(_dot_)org wrote:
Nicholas Clark <nick(_at_)unfortu(_dot_)net> wrote:


:The default method is to see if my substring is already present somewhere,
:if so note where, if not append at the end. The (currently buggy) -O optimiser
:method also tries to see whether it can avoid appending the entire string to
:the end by looking for overlap at the start or the end. Clearly, I've not got
:that bit right yet, but I've run out of time tonight. Is there a better
:approximate algorithm that could find more space savings for more [or less :-)]
:CPU? I guess it is analogous to trying to build a word-search table, but in 1D
:rather than 2D. (I'm hoping Hugo has experience of this sort of thing)

Not directly, no; I believe it is a question that has been studied,
however, and it feels like the sort of problem that would yield some
tricks at least for finding better local minima. Have you got any
details on how many substrings there are, and what sort of profile
of lengths they comprise?

I've hacked enc2xs (patch at end, if anyone wants to look themselves) to
calculate stats on the distribution of lengths.

:Meanwhile, here are hard numbers. enc2xs from Encode 1.80:
: 9386526 total
:
:Improved enc2xs:
: 5084037 total
:
:Improved enc2xs with AGGREGATE_TABLES
: 4706245 total

It might be worth seeing if we can easily get any useful estimates
for a theoretical lower bound, to give us an idea how much blood
can still be squeezed out.

Well, the absolute lower bound has to be for the unrealistic case of all
strings being substrings of the longest. However, for a more practical idea
of what's attainable, I suspect that we ought to be able to find an order
that gets a string block about the same length as the current block
compressed with Zlib.
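As a rough model of that estimate (a Python sketch, not part of enc2xs; the sample data is invented for illustration), one can compare the raw length of a string block with its zlib-compressed length and treat the saving as an upper bound on the redundancy an overlap-packing scheme could still exploit:

```python
import zlib

# Invented sample data: a block with heavy repetition plus a unique tail,
# standing in for the concatenated string tables.
block = b"\x01\x02abc" * 100 + b"unique tail"

packed = zlib.compress(block, 9)

# Percentage saved by zlib: a crude practical bound on what a perfect
# ordering/overlap strategy might recover from this block.
saving = 100 * (len(block) - len(packed)) / len(block)
```

The numbers in the runs below were produced the same way, via Compress::Zlib in the patch at the end.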

$ perl5.8.0 ../JP/enc2xs-experiment -Q -O -o experiment.c -f sh_06_t.fnm 
Reading shiftjis (shiftjis)
Writing compiled form
32900 bytes in string tables
1724 bytes (95%) saved spotting duplicates
26 bytes (99.9%) saved using substrings
There were 4259 strings, total length 34650
Length  How many this long      How many bytes
2       2726    64.01%  64.01%  5452    15.73%  15.73%
3       2       0.05%   64.05%  6       0.02%   15.75%
4       886     20.80%  84.86%  3544    10.23%  25.98%
6       324     7.61%   92.46%  1944    5.61%   31.59%
8       109     2.56%   95.02%  872     2.52%   34.11%
10      59      1.39%   96.41%  590     1.70%   35.81%
12      24      0.56%   96.97%  288     0.83%   36.64%
14      10      0.23%   97.21%  140     0.40%   37.04%
15      2       0.05%   97.25%  30      0.09%   37.13%
16      10      0.23%   97.49%  160     0.46%   37.59%
20      3       0.07%   97.56%  60      0.17%   37.77%
21      1       0.02%   97.58%  21      0.06%   37.83%
24      2       0.05%   97.63%  48      0.14%   37.97%
30      4       0.09%   97.72%  120     0.35%   38.31%
31      1       0.02%   97.75%  31      0.09%   38.40%
32      2       0.05%   97.79%  64      0.18%   38.59%
34      2       0.05%   97.84%  68      0.20%   38.78%
36      2       0.05%   97.89%  72      0.21%   38.99%
40      1       0.02%   97.91%  40      0.12%   39.11%
45      1       0.02%   97.93%  45      0.13%   39.24%
48      2       0.05%   97.98%  96      0.28%   39.51%
60      2       0.05%   98.03%  120     0.35%   39.86%
62      1       0.02%   98.05%  62      0.18%   40.04%
66      1       0.02%   98.07%  66      0.19%   40.23%
69      1       0.02%   98.10%  69      0.20%   40.43%
78      2       0.05%   98.15%  156     0.45%   40.88%
96      2       0.05%   98.19%  192     0.55%   41.43%
100     1       0.02%   98.22%  100     0.29%   41.72%
110     1       0.02%   98.24%  110     0.32%   42.04%
111     1       0.02%   98.26%  111     0.32%   42.36%
126     1       0.02%   98.29%  126     0.36%   42.72%
128     1       0.02%   98.31%  128     0.37%   43.09%
138     1       0.02%   98.33%  138     0.40%   43.49%
153     1       0.02%   98.36%  153     0.44%   43.93%
189     35      0.82%   99.18%  6615    19.09%  63.02%
249     1       0.02%   99.20%  249     0.72%   63.74%
282     2       0.05%   99.25%  564     1.63%   65.37%
375     32      0.75%   100.00% 12000   34.63%  100.00%
Raw buffer is 34650 bytes, compressed is 28962 (16%)
Real buffer is 32900 bytes, compressed is 27637 (16%)

If it's not obvious, the second percentage column for each is cumulative.
Interestingly, for my straw poll of Japanese encodings, there is a big
cumulative hike at the longest length:

perl5.8.0 ../JP/enc2xs-experiment -Q -O -o experiment.c -f eu_01_t.fnm 
Reading euc-jp (euc-jp)
Writing compiled form
78434 bytes in string tables
5446 bytes (93.5%) saved spotting duplicates
34 bytes (100%) saved using substrings
There were 8933 strings, total length 83914
...
276     1       0.01%   98.54%  276     0.33%   56.32%
279     1       0.01%   98.56%  279     0.33%   56.65%
282     129     1.44%   100.00% 36378   43.35%  100.00%
Raw buffer is 83914 bytes, compressed is 63253 (25%)
Real buffer is 78434 bytes, compressed is 58551 (25%)

The encodings in Byte (all 8 bit) show a smoother cumulative frequency
distribution, and better compression:

perl5.8.0 ../JP/enc2xs-experiment -Q -O -o experiment.c -f byte_t.fnm 
12764 bytes in string tables
4658 bytes (73.3%) saved spotting duplicates
3 bytes (100%) saved using substrings
There were 1263 strings, total length 17425
Length  How many this long      How many bytes
1       157     12.43%  12.43%  157     0.90%   0.90%
2       292     23.12%  35.55%  584     3.35%   4.25%
3       141     11.16%  46.71%  423     2.43%   6.68%
4       144     11.40%  58.12%  576     3.31%   9.99%
5       28      2.22%   60.33%  140     0.80%   10.79%
6       98      7.76%   68.09%  588     3.37%   14.16%
7       10      0.79%   68.88%  70      0.40%   14.57%
8       55      4.35%   73.24%  440     2.53%   17.09%
9       20      1.58%   74.82%  180     1.03%   18.12%
10      20      1.58%   76.41%  200     1.15%   19.27%
...
196     1       0.08%   99.45%  196     1.12%   90.28%
200     1       0.08%   99.52%  200     1.15%   91.43%
224     1       0.08%   99.60%  224     1.29%   92.72%
234     1       0.08%   99.68%  234     1.34%   94.06%
256     3       0.24%   99.92%  768     4.41%   98.47%
267     1       0.08%   100.00% 267     1.53%   100.00%
Raw buffer is 17425 bytes, compressed is 8163 (53%)
Real buffer is 12764 bytes, compressed is 7005 (45%)


This problem seems similar to something NP-hard, and I believe it shares a
useful property with NP-hard problems: it's easy to verify whether a given
solution is correct. So what we could do is do the hard work to find the
optimal solution once, and package that optimal solution in the
distribution. Then all the build system needs to do is verify that the
shipped solution is correct, and if it is not, either (a) barf or (b) fall
back to the current system of building the string.

Currently my algorithm sorts the substrings by length, longest first, and
then tries to "place" each in turn, in this order:

0: is the substring already present? (by good luck)
1: can we partially overlap the end?
2: can we partially overlap the start?
3: Damn. Append the whole substring to the end.
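The four steps above can be sketched as follows (a Python model of the greedy placement, not the enc2xs code itself; the sample strings are made up):

```python
def place(acc: str, s: str) -> str:
    """Greedily place substring s into the accumulated block acc."""
    # Step 0: already present somewhere, by good luck?
    if s in acc:
        return acc
    # Step 1: longest proper prefix of s that the end of acc overlaps?
    for k in range(len(s) - 1, 0, -1):
        if acc.endswith(s[:k]):
            return acc + s[k:]
    # Step 2: longest proper suffix of s that the start of acc overlaps?
    for k in range(len(s) - 1, 0, -1):
        if acc.startswith(s[-k:]):
            return s[:-k] + acc
    # Step 3: Damn. Append the whole substring to the end.
    return acc + s

# Longest-first, as in the current algorithm.
acc = ""
for s in sorted(["abcd", "cdef", "efab", "cd"], key=len, reverse=True):
    acc = place(acc, s)
```

Here the four strings (total length 14) pack into 8 bytes, because "cdef" and "efab" each overlap the block by two bytes and "cd" is already present.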

I think a brute force search for the optimal solution basically has to try
putting the strings in all possible orders. I think it can find all
permutations using only steps 0, 1 and 3. It can shortcut by keeping track
of the best solution found so far, and if an intermediate solution gets too
long it can give up on that one and start again from "" with the next
"first" substring.
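That branch-and-bound idea can be sketched like so (again a Python model of the search, not anything shipped; the inputs are illustrative):

```python
from itertools import permutations

def place(acc: str, s: str) -> str:
    # Steps 0, 1 and 3 only: already present, overlap at the end, or append.
    if s in acc:
        return acc
    for k in range(len(s) - 1, 0, -1):
        if acc.endswith(s[:k]):
            return acc + s[k:]
    return acc + s

def best_block(strings):
    """Try every order, abandoning a partial block as soon as it is
    already no shorter than the best complete block found so far."""
    best = None
    for order in permutations(strings):
        acc = ""
        pruned = False
        for s in order:
            acc = place(acc, s)
            if best is not None and len(acc) >= len(best):
                pruned = True   # too long already; try the next order
                break
        if not pruned and (best is None or len(acc) < len(best)):
            best = acc
    return best
```

This is factorial in the number of strings, so it is only plausible as a one-off offline computation, which fits the verify-the-shipped-solution scheme above.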

Nicholas Clark

--- ../bin/enc2xs       Mon Nov 25 03:32:08 2002
+++ enc2xs-experiment   Sat Dec 21 00:01:55 2002
@@ -8,6 +8,7 @@ BEGIN {
 use strict;
 use warnings;
 use Getopt::Std;
+use Compress::Zlib;
 my @orig_ARGV = @ARGV;
 our $VERSION  = do { my @r = (q$Revision: 1.31 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };
 
@@ -211,6 +212,7 @@ my %strings_in_acc;
 my $saved = 0;
 my $subsave = 0;
 my $strings = 0;
+my (@strings, $raw);
 
 sub cmp_name
 {
@@ -338,7 +340,37 @@ END
     $saved, $perc_saved              if $saved;
   printf STDERR "%d bytes (%.3g%%) saved using substrings\n",
     $subsave, $perc_subsaved         if $subsave;
- }
+
+  my $how_many;
+  my $length;
+  for (my $i = 0; $i < @strings; ++$i) {
+    $strings[$i] ||= 0;
+    $how_many += $strings[$i];
+    $length += $strings[$i] * $i;
+  }
+
+  print STDERR "There were $how_many strings, total length $length\n";
+  print STDERR "Length\tHow many this long\tHow many bytes\n";
+  my ($cumulative_number, $cumulative_total);
+  for (my $i = 1; $i < @strings; ++$i) {
+    next unless $strings[$i];
+    my $number = $strings[$i];
+    my $total = $number * $i;
+    $cumulative_number += $number;
+    $cumulative_total += $total;
+    printf STDERR "$i\t$number\t%.02f%%\t%.02f%%\t$total\t%.02f%%\t%.02f%%\n",
+      100 * $number / $how_many, 100 * $cumulative_number / $how_many,
+       100 * $total / $length, 100 * $cumulative_total / $length;
+  }
+  my $raw_zl = length compress ($raw);
+  my $acc_zl = length compress ($string_acc);
+
+  my $raw_l = length $raw;
+  printf STDERR "Raw buffer is $raw_l bytes, compressed is $raw_zl (%.0f%%)\n",
+    100 * ($raw_l - $raw_zl) / $raw_l;
+  printf STDERR "Real buffer is $strings bytes, compressed is $acc_zl (%.0f%%)\n",
+    100 * ($strings - $acc_zl) / $strings;
+}
 elsif ($doEnc)
  {
   foreach my $name (sort cmp_name keys %encoding)
@@ -705,6 +737,7 @@ sub outbigstring
   # that this makes it more likely that we find the short strings later on.
   # Not sure if it helps sorting strings of the same length lexcically.
   foreach my $s (sort {length $b <=> length $a || $a cmp $b} keys %strings) {
+    $raw .= $s; $strings[length $s]++;
     my $index = index $string_acc, $s;
     if ($index >= 0) {
       $saved += length($s);
