At 17:43 2004-03-09 -0600, Jay Moore wrote:
I guess that its effect on your classifier would depend upon several
variables, but unless you have a way to exclude them from being
classified they will affect the performance of your classifier. The
effect may be subtle if your classifier already has a large corpus.
Beyond that - why do you think someone would bother to send these
things?
Some simply use random runs of English words.
Note that "Hungarian notation", a variable-naming scheme used widely in
Windows-based software development (though certainly not restricted to it),
can easily trip a consonant-weighing filter.
Code in general can pose problems for filters which assume an _English_
distribution of characters, or proper spelling for that matter.
Let me give you an example of a C function prototype for Windows:
LRESULT CALLBACK MySubClasser (HWND hWnd,
                               UINT msg,
                               WPARAM wParam,
                               LPARAM lParam );
This, I took right out of a book - flipped it open, grabbed the first code
construct on the page.
Now, just above that, there's code for getting a window process address:
FARPROC pfnOldProc;
pfnOldProc = (FARPROC) ::GetWindowLong(hWnd, GWL_WNDPROC);
Some other lines of code:
CCntnrCntrItem* pItem = NULL;
virtual DWORD OnLog( CHttpFilterContext* fpc, PHTTP_FILTER_LOG pLog);
Code has a nasty habit of breaking English text rules. None of the
above are EXTREME examples of code, but they should certainly demonstrate
how you can end up with some impressive runs of consonants.
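To put a number on those runs, here is a rough sketch in Python (not
anything from the threads mentioned below, and not a procmail recipe).
The function name longest_consonant_run and the decision to count "y" as
a consonant are my own assumptions for illustration; a real filter may
well define things differently.

```python
import re

# Assumption: plain ASCII letters only, and 'y' is treated as a consonant.
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")


def longest_consonant_run(token):
    """Return the length of the longest run of consecutive consonants."""
    longest = 0
    current = 0
    for ch in token.lower():
        if ch in CONSONANTS:
            current += 1
            longest = max(longest, current)
        else:
            # Vowels, digits, punctuation and whitespace all end a run.
            current = 0
    return longest


# Identifiers taken from the code fragments quoted in this message.
SAMPLE = """
LRESULT CALLBACK MySubClasser (HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam );
pfnOldProc = (FARPROC) ::GetWindowLong(hWnd, GWL_WNDPROC);
CCntnrCntrItem* pItem = NULL;
"""

if __name__ == "__main__":
    for token in re.findall(r"[A-Za-z]+", SAMPLE):
        print(token, longest_consonant_run(token))
```

With these rules, HWND scores 4 and CCntnrCntrItem scores 10, which is
the sort of thing a filter tuned on English prose would likely flag.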
Of course, if you don't discuss code in email, then these issues might not
present themselves, but I humbly submit that anyone writing code to weigh
consonant:vowel ratios should definitely run it against source code, in
various programming languages, to see how it will react.
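As a starting point for that kind of test, a minimal sketch (again in
Python, with the hypothetical name consonant_vowel_ratio and the same
assumption that "y" counts as a consonant) might compare a line of
ordinary prose against a line of code. Any threshold you pick from the
results is yours to tune; none is suggested here.

```python
# Assumption: plain ASCII letters only, and 'y' is treated as a consonant.
VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")


def consonant_vowel_ratio(text):
    """Return consonants divided by vowels for the letters in text.

    Returns 0.0 when there are no letters at all, and infinity when
    there are consonants but no vowels.
    """
    lowered = text.lower()
    vowels = sum(1 for ch in lowered if ch in VOWELS)
    consonants = sum(1 for ch in lowered if ch in CONSONANTS)
    if vowels == 0:
        return float("inf") if consonants else 0.0
    return consonants / vowels


if __name__ == "__main__":
    # Both samples are lifted from this message.
    prose = "Code has a nasty habit of breaking English text rules."
    code = "virtual DWORD OnLog( CHttpFilterContext* fpc, PHTTP_FILTER_LOG pLog);"
    print("prose:", round(consonant_vowel_ratio(prose), 2))
    print("code: ", round(consonant_vowel_ratio(code), 2))
```

Feed it whole files rather than single lines, and in more than one
programming language, before trusting any cutoff.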
Search the archives for "Garbage vs. Valid" from February 2002, and also
"Regexp problem" from December 2003. The first discusses using a consonant
weighting for funky hostnames in received headers (but could be
adapted). The second thread omits some stuff which was exchanged offlist
with someone who deliberately circumvented my Reply-To, and thus is incomplete.
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.