perl-unicode

Re: should \d match *all* the digits? faster with woyka

1999-08-11 07:21:14
Hello Larry

Please read this Woyka article, it could speed Perl up 100s of times
and revolutionize the Perl engine and the Unicode support then.
Please let me know in due time what you think

Thanks
Alan


----Original Message Follows----
From: Larry Wall <larry(_at_)wall(_dot_)org>
To: jhi(_at_)iki(_dot_)fi
CC: perl-unicode(_at_)perl(_dot_)org, dmulholl(_at_)cs(_dot_)indiana(_dot_)edu, 
larry(_at_)wall(_dot_)org
Subject: Re: should \d match *all* the digits?
Date: Tue, 10 Aug 1999 12:51:21 -0700 (PDT)

jhi(_at_)iki(_dot_)fi writes:
: Hi, all you Unicoders.
:
: Daniel Yacob who has submitted a couple of patches for the Perl Unicode
: support (especially patches related to syllabaries like Ethiopic and
: several Amer-Indian languages), expressed wonderment over the fact
: that currently the \d a.k.a. \p{IsDigit} a.k.a. [[:digit:]] does not
: match *all* the digits, that is, digits not only the 0-9 beckoning
: to us from the ASCII world, but also things like
:
: 00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;2;2;2;N;SUPERSCRIPT DIGIT TWO;;;;
: 00B3;SUPERSCRIPT THREE;No;0;EN;<super> 0033;3;3;3;N;SUPERSCRIPT DIGIT THREE;;;;
: 00B9;SUPERSCRIPT ONE;No;0;EN;<super> 0031;1;1;1;N;SUPERSCRIPT DIGIT ONE;;;;
: ...
: 0966;DEVANAGARI DIGIT ZERO;Nd;0;L;;0;0;0;N;;;;;
: 0967;DEVANAGARI DIGIT ONE;Nd;0;L;;1;1;1;N;;;;;
: 0968;DEVANAGARI DIGIT TWO;Nd;0;L;;2;2;2;N;;;;;
: 0969;DEVANAGARI DIGIT THREE;Nd;0;L;;3;3;3;N;;;;;
: ...
: 1369;ETHIOPIC DIGIT ONE;Nd;0;L;;1;1;1;N;;;;;
: 136A;ETHIOPIC DIGIT TWO;Nd;0;L;;2;2;2;N;;;;;
: 136B;ETHIOPIC DIGIT THREE;Nd;0;L;;3;3;3;N;;;;;
: 136C;ETHIOPIC DIGIT FOUR;Nd;0;L;;4;4;4;N;;;;;
: ...
:
: What say you?

The intent was that \d match all decimal digits (Nd), but not other
numbers (No), such as superscripts.  Basically, can you do a tr///
on \d+ and feed it to atoi() meaningfully?
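
The test described above can be sketched in a few lines. This is only an illustration in Python, not Perl's implementation: it assumes Python's `re` module, whose `\d` on Unicode strings matches general category Nd, and `unicodedata.decimal()` standing in for the tr/// step. The helper name `to_int` is made up for this example.

```python
import re
import unicodedata

def to_int(text):
    """Return the integer value of text if it is made only of Nd digits, else None."""
    if not re.fullmatch(r"\d+", text):
        return None
    # Map every decimal digit to its ASCII counterpart (the tr/// step),
    # then hand the result to the ordinary string-to-integer conversion.
    ascii_digits = "".join(str(unicodedata.decimal(ch)) for ch in text)
    return int(ascii_digits)

devanagari = "\u0967\u0968\u0969"   # DEVANAGARI DIGIT ONE, TWO, THREE (Nd)
superscript = "\u00b2"              # SUPERSCRIPT TWO (No)

print(unicodedata.category(devanagari[0]), to_int(devanagari))
print(unicodedata.category(superscript), to_int(superscript))
```

The Devanagari string converts cleanly, while the superscript is rejected before conversion because it is a number (No) rather than a decimal digit (Nd).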

See section 4.6 in the Unicode Standard 2.0 for more details.  I don't
know if this has been modified for 3.0.

Larry



WORDS BEAT THE NUMBERS
Dr. Woyka

Language studies inspired a new look at the binary system of storing words in computer memory – JANE BIRD reports

Greek drama may not sound a likely starting point for a radical breakthrough in computer design, but that is where 65-year-old Graham Woyka got the idea for his “information engine”. While investigating the use of metaphor by such writers as Aeschylus and Aristophanes, he became fascinated by the power and flexibility of human language.

So Woyka, a mathematician and chemical engineer by training, designed a computer that can store massive amounts of text in a highly compressed form and read it at lightning speed, performing exhaustive searches for words or phrases and spotting linguistic features far beyond the capabilities of any conventional text storage system. Its unique design has potential far beyond text storage as a key to building highly secure computer systems that would need virtually no maintenance by software engineers.

“While writing my thesis on the logic of prediction and randomness I gradually came to realize the limitations of mathematics,” he says. “I decided that the potential of human language was much more far-reaching, because while there is a limit to the number of words, there is no limit to the number of potential ideas given different combinations of those words.”

The move into business came in the late 1970s. He was on a fellowship at Edinburgh University researching the origins of metaphor in Greek drama.

One day, travelling to London by train, he fell into conversation with a fellow passenger, Fred Heath, professor of computing at Heriot-Watt University. Woyka described an idea he had for storing the entire works of classical Greek on a computer. The method is so simple and elegant that it is remarkable it had never been done before.

It was based on the fact that computers, designed as number-crunching machines, have traditionally used a very clumsy approach to handling text. In conventional computers each letter of the alphabet occupies one “byte” of computer storage – a sequence, or pattern, of eight 0s and 1s. Woyka spotted the fact that this was extremely wasteful and that, since there were 128 possible permutations, it was possible to use each permutation to specify one of 128 words.

By making the permutations correspond to the 128 most common words in the language (“and” and “the”, for example), he was able to store roughly half the volume of any text at the rate of just one byte per word – a dramatic reduction on the conventional one-byte-per-letter technique.

The next problem was to work out a way to encode the remaining words. Woyka planned simply to extend the technique to use a second byte of storage. This gave a total of sixteen 0s or 1s, or 16,000 possible combinations. Every time the system came to a word not on its first list of 128 words, it would merely allocate the next number in the sixteen-digit binary sequence to the new word. By this method it would be able to store a vocabulary of up to 16,000 words in no more than two bytes per word.

Where very large vocabularies were needed the system might need to use three bytes to get enough permutations to have one for every word in the text. But this would still be far less than the old method which would take for instance 28 bytes to define one long word such as antidisestablishmentarianism.
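
The coding scheme described above can be sketched as follows. This is a guess at the general idea, not Memex's actual format: the article's figures of 128 and roughly 16,000 happen to match a code that reserves one bit per byte as a "more bytes follow" flag (2^7 = 128, 2^14 = 16,384), so that is what is assumed here. The function names are hypothetical, and punctuation and spacing are ignored.

```python
from collections import Counter

def build_vocab(words, common=128):
    """Give the most frequent words the numbers 0..common-1, then number
    the remaining words in order of first appearance."""
    counts = Counter(words)
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common(common))}
    for w in words:
        vocab.setdefault(w, len(vocab))   # next free number for a new word
    return vocab

def encode_id(n):
    """Assumed code: 7 payload bits per byte, high bit set when more bytes follow."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)

def encode(words, vocab):
    return b"".join(encode_id(vocab[w]) for w in words)

def decode(data, vocab):
    by_id = {i: w for w, i in vocab.items()}
    words, n, shift = [], 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7            # flag set: the number continues in the next byte
        else:
            words.append(by_id[n])
            n, shift = 0, 0
    return words

text = "the cat and the dog and the bird".split()
vocab = build_vocab(text)
packed = encode(text, vocab)
print(len(" ".join(text)), "bytes as letters,", len(packed), "bytes as word codes")
```

Under this assumption, numbers below 128 take one byte, numbers below 16,384 take two, and anything larger takes three, which lines up with the one-, two- and three-byte cases in the article.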

The technique should work for any language. Professor Heath agreed to collaborate with Woyka to build a prototype for reading English. The Science and Engineering Research Council refused them funding. What they were proposing was likely to take an experienced research team 10 years, said the SERC, and Heath and Woyka had no experience. But within a week Woyka and Heath had built a rudimentary laboratory lash-up that worked at a speed equivalent to reading the entire Bible in three seconds.

“Heath wanted us to sell the idea to IBM, but I said we should form our own company,” says Woyka, so they founded Memex. It sold exclusive marketing rights to Gould, the huge American electronics company, for $1.5m in an agreement lasting until February of this year. Gould has installed around 100 Memex machines for customers including the US government. The system has already been adapted to read Arabic.

As well as compressing far more information into far less memory space, Woyka’s system has another big advantage: it needs no index.

Conventional text handling systems carry an enormous overhead. Like Woyka’s system they store all the text, but because it is not so well-compressed they cannot use a “brute force” search each time the user wants to locate a specific piece of information. Instead they have an index, which usually occupies as much computer memory as the text itself.

Users are entirely dependent on the quality of the index. Items can be searched for only if the person who created the index thought they were significant.

“Instead of using an index, we do a needle-in-a-haystack type of search using brute force to go from the beginning of the text to the end each time,” says Woyka. “It is rather like having a battery of computers running in parallel as if they were express trains whizzing past stations which each had a grabber looking out for one particular word. The traditional approach is like having one train that can only deliver one mailbag at a time and has to keep shunting back to the terminus – the computer where the index is held – to find out where to go for the next delivery.”

Because the Memex machine stores all words, including those commonly thought of as unimportant, such as “the”, “and” or “to”, it is also able to do more specific searches than indexed systems. For instance, it can pick out “the solution to” as separate from “chemical solution”. It can also spot punctuation marks to find, for instance, that the Bible has 206 exclamation marks.
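
The kind of index-free search described above can be illustrated in software. This is only a sketch of a linear scan over a token stream that keeps every word and punctuation mark. It says nothing about how the Memex hardware works, and the names `tokenize` and `find_phrase` and the sample sentence are invented for the example.

```python
import re

def tokenize(text):
    """Split into lowercase words and individual punctuation marks,
    keeping common words such as "the" and "to"."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def find_phrase(tokens, phrase):
    """Scan from start to end and return every position where the phrase begins."""
    target = tokenize(phrase)
    n = len(target)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]

text = "The solution to the puzzle was a chemical solution! What a surprise!"
tokens = tokenize(text)

print(find_phrase(tokens, "the solution to"))    # phrase with its small words
print(find_phrase(tokens, "chemical solution"))  # the other sense of "solution"
print(tokens.count("!"))                         # punctuation is searchable too
```

Because nothing is dropped from the token stream, the two uses of "solution" can be told apart by their neighbours, and counting a punctuation mark is just another scan.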

But the technology has uses far beyond scanning text. Nothing can be hidden or disguised. So it is ideal for producing audit trails and spotting the intervention of computer hackers.

Now he has another technological idea based on his studies of metaphor – a chemical analysis engine. The idea is that you can define something extremely accurately by relatively crude observations if you have several different ones.

“If you say ‘that man is a wolf’ you don’t literally mean that he is like a wolf, you mean that you wouldn’t like your daughter to accept a lift home in his car,” explains Woyka. If you could make a few other, equally specific comparisons about the man, you would end up with a very clear picture.

Similarly it is possible to make an accurate chemical analysis by using a series of simple tests such as bouncing a number of different types of energy signal – infrared or ultraviolet – off a sample.

“Even if each individual test is only accurate to within 20%, providing that you can devise five totally independent tests you can get an accuracy to within 0.032% using the metaphor principle,” Woyka says.
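
The article does not say how the 0.032% figure is derived. One reading, assumed here, is that the five 20% margins are simply multiplied together. Whether real measurement errors combine this way is a separate question that the article does not address.

```python
per_test_error = 0.20      # each test is accurate to within 20%
independent_tests = 5

# Assumption: the margins of independent tests multiply.
combined = per_test_error ** independent_tests
print(f"{combined:.5f} = {combined * 100:.3f}%")   # 0.00032 = 0.032%
```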

The Sunday Times 25th October, 1987
 

Dr. Woyka   Tel 0620825329
Standingstone,
Haddington,
Scotland  EH41 4LF

Memex Information Systems
16 Albion Way
Kelvin Industrial Estate
East Kilbride
Scotland