procmail

Re: Garbage vs Valid

2003-02-02 13:26:05
In addition to all these examples of "legitimate" words and
consonant/vowel combinations, we are faced with a plethora of acronyms and
abbreviations: smtp, nato, unicef, usmc, etc.  Even internet-specific
"expressions" such as ymmv and rotflmao convey "intelligent meaning."

Let us not forget the international aspect of the internet, where we find
letter combinations like shch, ts, sch, cz and others that are not common
in English.

In addition, when scanning the headers, I run into things like terry,
benny, robert, felicia, peggysue, etc., which have no dictionary meaning
but are immediately recognizable (by English speakers anyway) as valid
names.

I begin to suspect that a dictionary would be the only real answer to what
constitutes "garbage" - and that a dictionary of "garbage" would be
immensely smaller than a dictionary of "non-garbage."

                                - fleet -

On Sun, 2 Feb 2003, Professional Software Engineering wrote:

At 12:30 2003-02-02 +0100, Ruud H.G. van Tol did say:
Dallman Ross skribis:

Let's check for those with nine consonants in a row:

 11:34pm [~/Mail] 759[0]> egrep '[bcdfghjklmnpqrstvwxyz]{9}' /usr/share/dict/words
Amblyrhynchus
glycyphyllin
Oxyrrhyncha
oxyrrhynchid
pachyrhynchous

All those words are more vowelly than you assume.

As was pointed out right from my first post, "AEIOU, and sometimes Y"
should mean that y is always treated as a vowel for the purposes of a
consonant-run test.  I recall that the post you're quoting Dallman from
also specifically mentioned excluding y, and "ph" as well.
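
A rough sketch of the y-as-vowel variant, for illustration only: the same
egrep-style test with y dropped from the consonant class, run against the
five words listed above instead of the whole dictionary.  The function name
count_runs and the words variable are made up for this example.

```shell
# Hypothetical helper: read words on stdin and print how many contain a run
# of $1 or more consonants.  y is treated as a vowel, so it is left out of
# the character class.  "|| true" keeps a zero count from looking like an error.
count_runs() {
  grep -Eic "[bcdfghjklmnpqrstvwxz]{$1}" || true
}

words='Amblyrhynchus
glycyphyllin
Oxyrrhyncha
oxyrrhynchid
pachyrhynchous'

printf '%s\n' "$words" | count_runs 9   # prints 0
printf '%s\n' "$words" | count_runs 3   # prints 4
```

With y out of the class, none of the five reaches a run of nine, but four of
them (all except glycyphyllin) still contain a run of three, such as "nch".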

Including "ph" in an expression would be easier than excluding it
(although it is not a vowel, I'm including it in the following variable to
demonstrate the syntax of the expression):

VOWEL=([aeiouy]|ph)

Thus, it would be easy enough to check for vowels.  However, checking for
the consonants as a character class not including ph is a bit more
complicated.  I welcome seeing someone else scribe that one at the moment,
as I'm a bit tied up in making sure I'm available to someone for a server
relocation project.

Also, concatenated words pose a peculiar problem, and are exceedingly
common in computer and internet use (the citation of "earthlink" is an
example).  They are a particular hurdle for my original suggestion of
possibly looking for runs of three or more consonants.  I suspect that "th"
and "rh" should also probably be added to the "exclude me as a consonant"
tests.
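
One possible way to sidestep writing a consonant class that excludes "ph",
"th" and "rh" is to neutralise those digraphs first and only then test for
runs.  This is a sketch outside of procmail, in plain shell, and not a
tested recipe; the function name has_consonant_run and the use of "_" as a
placeholder are assumptions made for this example.

```shell
# Hypothetical helper: succeed if word $1 contains a run of $2 or more
# consonants.  The word is lowercased, then ph, th and rh are replaced by
# "_" so they break a run instead of extending it.  y counts as a vowel.
has_consonant_run() {
  printf '%s\n' "$1" |
    tr '[:upper:]' '[:lower:]' |
    sed -e 's/ph/_/g' -e 's/th/_/g' -e 's/rh/_/g' |
    grep -Eq "[bcdfghjklmnpqrstvwxz]{$2}"
}

for w in earthlink xkqzvbn; do
  if has_consonant_run "$w" 3; then
    echo "$w: consonant run"
  else
    echo "$w: ok"
  fi
done
```

Here "earthlink" passes because the "th" splits what would otherwise be the
four-consonant run "rthl", while "xkqzvbn" is still flagged.  Words such as
Amblyrhynchus are still caught at a threshold of three (by "mbl" and "nch").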

---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
  Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail


