procmail

RE: Garbage vs Valid

2003-02-01 16:00:00
dman(_at_)nomotek(_dot_)com [mailto:dman(_at_)nomotek(_dot_)com] wrote:

PSE-L(_at_)mail(_dot_)professional(_dot_)org (Professional Software 
Engineering) wrote:

At 13:11 2003-02-01 -0500, fleet(_at_)teachout(_dot_)org did say:

I suppose it's "life experience" (or something); but "tatanka"
appears to be "ok" (with minor reservations) whereas "pxhkieb" and
"muhirhw" are (to me) immediately suspect.  How does one describe
this in code?

You could probably use character classes to say that runs of more
than two consonants are suspect.  Look at the distribution of vowels
and consonants within English words and you might note that pattern.
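
A minimal sketch of that suggestion, assuming "y" counts as a consonant (as the egrep patterns further down in this thread do) and that grep supports -E. The function name is_suspect is made up for illustration:

```shell
# Hypothetical helper: exit status 0 if the word contains a run of
# three or more consonants ("more than two"), non-zero otherwise.
# "y" is treated as a consonant here; that is an assumption.
is_suspect() {
    printf '%s\n' "$1" | grep -Eiq '[bcdfghjklmnpqrstvwxyz]{3,}'
}

# Try it on the three example strings from the original question.
for w in tatanka pxhkieb muhirhw; do
    if is_suspect "$w"; then
        echo "$w: suspect"
    else
        echo "$w: ok"
    fi
done
```

With this threshold "tatanka" passes and the other two are flagged, but so are ordinary words such as "strength", so expect false positives.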

Yes, exactly.  Coincidentally, I tried yesterday looking for four-plus
consonants in a row, but there were lots of false positives.  However,
I'm not through with the experiment.

Btw, can you believe there are English words with 10 consonants in a
row?!  I wouldn't have guessed that.

Two hundred thirty-four thousand, nine hundred sixty-four words in 
/usr/share/dict/words.

 11:34pm [~/Mail] 758[0]> wc -l /usr/share/dict/words
  234964 /usr/share/dict/words

Let's check for those with nine consonants in a row:

 11:34pm [~/Mail] 759[0]> egrep '[bcdfghjklmnpqrstvwxyz]{9}' /usr/share/dict/words
Amblyrhynchus
glycyphyllin
Oxyrrhyncha
oxyrrhynchid
pachyrhynchous


Ten in a row?!

 11:34pm [~/Mail] 760[0]> egrep '[bcdfghjklmnpqrstvwxyz]{10}' /usr/share/dict/words
Amblyrhynchus
glycyphyllin

Eleven?!

 11:35pm [~/Mail] 761[0]> egrep '[bcdfghjklmnpqrstvwxyz]{11}' /usr/share/dict/words

No, none of those (thankfully).
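
Rather than bumping the count by hand, one could have awk report the longest run per word. This is only a sketch: the function name longest_runs is made up, and unlike the egrep patterns above it lowercases each word first, so a capitalized leading consonant counts toward the run.

```shell
# Hypothetical helper: read words on stdin, print "<longest run> <word>"
# for each, sorted with the longest consonant run first.
# "y" is counted as a consonant, matching the egrep character class.
longest_runs() {
    awk '{
        w = tolower($0); best = 0; run = 0
        for (i = 1; i <= length(w); i++) {
            if (index("bcdfghjklmnpqrstvwxyz", substr(w, i, 1)) > 0) {
                run++
                if (run > best) best = run
            } else {
                run = 0   # a vowel or non-letter ends the current run
            }
        }
        print best, $0
    }' | sort -k1,1nr
}
```

Something like "longest_runs < /usr/share/dict/words | head" would then show the worst offenders in one pass.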


With a little work, though, one could use this database to compile a
list of impossible or highly unlikely combos in English and most
Western languages.  For example, the "PH" combo is present in one of
the two words with ten abutting consonants (glycyphyllin), and a "Y"
is in both.  Take those out.  Consonants that are often doubled should
also be low on the improbability list, such as the doubled L in the
second "ten"-word, double T, etc.
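
One rough way to start on such a list, as a sketch only: collect every two-consonant pair that actually occurs in the word list and print the pairs that never do. The function name unseen_pairs is made up, "y" is treated as a vowel here (per "take those out"), and pairs like "ph" or doubled letters drop off the list automatically as soon as any word contains them.

```shell
# Hypothetical helper: read words on stdin and print every pair of
# adjacent consonants that never occurs in the input.
# Assumption: "y" is left out of the consonant set on purpose.
unseen_pairs() {
    awk -v cons="bcdfghjklmnpqrstvwxz" '
    {
        w = tolower($0)
        for (i = 1; i < length(w); i++) {
            a = substr(w, i, 1); b = substr(w, i + 1, 1)
            # record the pair only if both letters are consonants
            if (index(cons, a) && index(cons, b)) seen[a b] = 1
        }
    }
    END {
        n = length(cons)
        for (i = 1; i <= n; i++)
            for (j = 1; j <= n; j++) {
                p = substr(cons, i, 1) substr(cons, j, 1)
                if (!(p in seen)) print p
            }
    }'
}
```

Run as "unseen_pairs < /usr/share/dict/words", the output is a candidate list of never-seen pairs; a pair that occurs once in some obscure word is treated the same as a common one, so counting frequencies would be the obvious next refinement.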

-- 
        "Weltbedenkend, ortlich lenkend!"
                -- Original von W. Dallman Ross


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
