perl-unicode

Re: should \d match *all* the digits? faster with woyka

1999-08-11 10:45:41
: On Wed, Aug 11, 1999 at 08:52:46AM -0500, A D wrote:
: > Hello Larry
: > 
: > Please read this woyka; it can speed Perl up hundreds of times
: > and revolutionize the Perl engine and Unicode as well.
: > Please let me know in due time what you think.

It won't speed up anything 100s of times.  It might speed up some
applications 2 or 3 times.  You have to realize that traditional
orthography is already somewhat Huffman encoded--the common words are
already shorter.  The word "the" is three letters.  The word "a" is
already one letter.  By the time you figure out how to encode spaces
and punctuation, the encoding itself is only going to give 50-60%
compression at the most.

The other aspect is that you don't have to spend any time finding word
boundaries.  For an application that is interested in word boundaries,
that's a win, but for applications that aren't, it's not.  Indeed, in
the general case, any application that is interested in individual
characters will be slowed down, because you'd have to decode the word
to characters internally.
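
To make that trade-off concrete, here is a minimal sketch. It assumes a
made-up word table that maps each word to one private-use code point; the
table, the sub names, and the choice of U+E000 are all hypothetical and
are not the actual woyka scheme. Word-level questions can be answered on
the encoded string directly, while a character-level question forces a
decode first.

```perl
use strict;
use warnings;

# Hypothetical word table: each known word gets one private-use code point.
# None of this is the real woyka encoding; it only illustrates the idea.
my @words = qw(the quick brown fox jumps over lazy dog);
my (%word_to_tok, %tok_to_word);
for my $i (0 .. $#words) {
    my $tok = chr(0xE000 + $i);
    $word_to_tok{ $words[$i] } = $tok;
    $tok_to_word{$tok}         = $words[$i];
}

# Encode whitespace-separated known words as one token character each.
sub encode_words {
    my ($text) = @_;
    my $out = '';
    for my $w (split ' ', $text) {
        die "unknown word: $w\n" unless exists $word_to_tok{$w};
        $out .= $word_to_tok{$w};
    }
    return $out;
}

# Decode a token string back to space-separated words.
sub decode_words {
    my ($enc) = @_;
    return join ' ', map { $tok_to_word{$_} } split //, $enc;
}

my $enc = encode_words("the quick brown fox jumps over the lazy dog");

# Word-level questions: no boundary search needed, one character per word.
my $word_count = length $enc;
my $the_count  = () = $enc =~ /\Q$word_to_tok{the}\E/g;

# Character-level question: the words have to be decoded first.
my $o_count = () = decode_words($enc) =~ /o/g;

print "words: $word_count, 'the': $the_count, letter o: $o_count\n";
```

The word count and the search for "the" never look inside a word, which is
the win. Counting the letter "o" pays for a full decode before it can start,
which is the loss.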

Still, for some applications, this would be a reasonable optimization.

Tim Bunce writes:
: Doesn't seem particularly revolutionary. Many people, including myself,
: have already spoken of using UTF8 'characters' to represent arbitrary
: encodings and using regular expressions to search and manipulate them.

Yes.

: I do agree that it's a powerful concept that could have wide applications.
: Someone just needs to do the leg work and create a module to make it
: easy to use.

The question is how far you have to go with this.  Since Unicode is compatible
with ASCII, you can still say

    use utf8;
    print "foo\n";

But a utf8 encoding applies to all the strings in its scope.  What should

    use utf8 'woyka_english';
    print "foo\n";

do?  Encode "foo\n" into woykan, probably, and reverse translate on print.

Or not...
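
For illustration only, here is roughly what "encode, then reverse translate
on print" could look like, written as ordinary subs rather than as a pragma.
The table standing in for 'woyka_english' and the names woyka_encode and
woyka_decode are invented for this sketch; nothing like them exists in the
utf8 pragma.

```perl
use strict;
use warnings;

# Invented stand-in for a 'woyka_english' table: a few words mapped to
# private-use code points. The real scheme is not specified in this thread.
my %tok_for  = (foo => chr(0xE000), bar => chr(0xE001), the => chr(0xE002));
my %word_for = reverse %tok_for;

# What the pragma might do to a string literal: swap known words for
# tokens and leave everything else (spaces, newlines, unknown words) alone.
sub woyka_encode {
    my ($s) = @_;
    $s =~ s/\b([A-Za-z]+)\b/exists $tok_for{$1} ? $tok_for{$1} : $1/ge;
    return $s;
}

# What print would then have to do: translate tokens back into words.
sub woyka_decode {
    my ($s) = @_;
    $s =~ s/([\x{E000}-\x{F8FF}])/exists $word_for{$1} ? $word_for{$1} : $1/ge;
    return $s;
}

my $literal = "foo\n";
my $stored  = woyka_encode($literal);   # one token character plus the newline
my $printed = woyka_decode($stored);    # back to the original text
print $printed;
```

Because the newline and any unknown word pass through untouched, plain ASCII
text survives the round trip, which is the same compatibility property that
lets the plain "use utf8" example keep working.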

Larry