perl-unicode

Re: [not-yet-a-PATCH] compress Encode better

2002-11-03 23:31:08
Nicholas Clark <nick(_at_)unfortu(_dot_)net> wrote:
:I've been experimenting with how enc2xs builds the C tables that turn into the
:shared objects. enc2xs is building tables (arrays of struct encpage_t) which
:in turn have pointers to blocks of bytes.

Great, you seem to be getting some excellent results.

I have also wondered whether the .ucm files are needed once these
tables have been built; if not, we should consider shipping only the
optimised table data with perl, if that would give us a space saving
in the distribution. It would also cut build time significantly, and
would let us consider algorithms that spend much longer on the table
optimisation, since they would only need to be run once, when we
integrate updated .ucm files.

Hmm, I wonder how distributable an optimal algorithm could be, and
how many SETI-hours it would take to run? :)

:The default method is to see if my substring is already present somewhere,
:if so note where, if not append at the end. The (currently buggy) -O optimiser
:method also tries to see whether it can avoid appending the entire string to
:the end by looking for overlap at the start or the end. Clearly, I've not got
:that bit right yet, but I've run out of time tonight. Is there a better
:approximate algorithm that could find more space savings for more [or less :-)]
:CPU? I guess it is analogous to trying to build a word-search table, but in 1D
:rather than 2D. (I'm hoping Hugo has experience of this sort of thing)

Not directly, no; I believe it is a question that has been studied,
however, and it feels like the sort of problem that would yield at
least a few tricks for finding better local minima. Have you got any
details on how many substrings there are, and what the profile of
their lengths looks like?
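
For concreteness, the packing you describe amounts to something like
the following greedy merge - a minimal sketch only, with made-up names
(pack_blocks, length_profile, @blocks), not the real enc2xs code, and
it only tries to overlap the head of a new block with the tail of the
pool:

use strict;
use warnings;

# Greedy packing: reuse a block verbatim if it already appears in the
# pool, otherwise append it, overlapping its head with the pool's tail
# where possible.  Returns the pool and the offset of each block in it.
# (Hypothetical helper for illustration, not taken from enc2xs.)
sub pack_blocks {
    my @blocks = @_;

    # Longest first: short blocks are more likely to be substrings of
    # what is already in the pool.
    @blocks = sort { length($b) <=> length($a) } @blocks;

    my $pool = '';
    my %offset;
    for my $block (@blocks) {
        my $pos = index($pool, $block);
        if ($pos < 0) {
            # Find the longest suffix of the pool that is a prefix of
            # the block, so we only need to append the remainder.
            my $max = length($block) - 1;
            $max = length($pool) if length($pool) < $max;
            my $overlap = 0;
            for (my $n = $max; $n > 0; $n--) {
                if (substr($pool, -$n) eq substr($block, 0, $n)) {
                    $overlap = $n;
                    last;
                }
            }
            $pos = length($pool) - $overlap;
            $pool .= substr($block, $overlap);
        }
        $offset{$block} = $pos;
    }
    return ($pool, \%offset);
}

# How many blocks there are of each length.
sub length_profile {
    my %hist;
    $hist{ length $_ }++ for @_;
    printf "%6d blocks of length %d\n", $hist{$_}, $_
        for sort { $a <=> $b } keys %hist;
}

Feeding the raw blocks through length_profile() would be a quick way
to answer the length question above.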

:Meanwhile, here are hard numbers. enc2xs from Encode 1.80:
: 9386526 total
:
:Improved enc2xs:
: 5084037 total
:
:Improved enc2xs with AGGREGATE_TABLES:
: 4706245 total

It might be worth seeing whether we can easily get a useful estimate
of the theoretical lower bound, to give us an idea of how much blood
can still be squeezed out.
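
One cheap estimate along those lines: any pool that contains every
block must contain every distinct k-byte window that occurs in any
block, and an L-byte pool has only L - k + 1 windows, so no pool can
be smaller than (number of distinct k-windows) + k - 1 bytes, for any
k. A minimal sketch, again with made-up names and assuming the raw
blocks are to hand as Perl byte strings:

use strict;
use warnings;

# Crude lower bound on the size of the shared byte pool, by counting
# distinct k-byte windows for a few small k.  (Hypothetical helper for
# illustration, not taken from enc2xs.)
sub pool_lower_bound {
    my @blocks = @_;
    my $best = 0;
    for my $k (1 .. 8) {
        my %seen;
        for my $block (@blocks) {
            next if length($block) < $k;
            $seen{ substr($block, $_, $k) } = 1
                for 0 .. length($block) - $k;
        }
        next unless %seen;
        my $bound = keys(%seen) + $k - 1;
        $best = $bound if $bound > $best;
    }
    return $best;    # the pool cannot be smaller than this many bytes
}

It is only a counting bound, so the real optimum is likely to sit a
good deal higher, but it costs almost nothing to compute.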

Hugo