procmail

Re: Keep getting stubborn spam with random words

2004-03-09 21:45:42
At 17:43 2004-03-09 -0600, Jay Moore wrote:
I guess that its effect on your classifier would depend upon several
variables, but unless you have a way to exclude them from being
classified they will affect the performance of your classifier. The
effect may be subtle if your classifier already has a large corpus.

Beyond that - why do you think someone would bother to send these
things?

Some simply use random runs of English words.

Note that "Hungarian notation", a variable-naming scheme used widely in Windows-based software development (though certainly not restricted to it), can easily trip a consonant-weighting filter.

Code in general can pose problems for anything that assumes an _English_ distribution of characters, or proper spelling for that matter.

Let me give you an example of a C function prototype for Windows:

LRESULT CALLBACK MySubClasser (HWND    hWnd,
                               UINT    msg,
                               WPARAM  wParam,
                               LPARAM  lParam );

This, I took right out of a book - flipped it open, grabbed the first code construct on the page.

Now, just above that, there's code for getting a window procedure address:

FARPROC pfnOldProc;
pfnOldProc = (FARPROC) ::GetWindowLong(hWnd, GWL_WNDPROC);

Some other lines of code:

CCntnrCntrItem* pItem = NULL;

virtual DWORD OnLog( CHttpFilterContext* fpc, PHTTP_FILTER_LOG pLog);

Code has a nasty habit of breaking English text rules. Take the snippets above as examples: none of them is an EXTREME case, but they certainly demonstrate how code can contain some impressive runs of consonants.
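
To put a number on those runs, here is a minimal Python sketch (not anything from the original thread) that finds the longest unbroken run of consonant letters in a line. The function name longest_consonant_run is made up for illustration, and treating "y" as a consonant is an assumption that a real filter might not share.

```python
import re

# Assumption: "y" counts as a consonant; digits, underscores and
# punctuation break a run.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxyz]+", re.IGNORECASE)


def longest_consonant_run(text):
    """Return the longest unbroken run of consonant letters in text."""
    runs = CONSONANT_RUN.findall(text)
    return max(runs, key=len) if runs else ""


samples = [
    "LRESULT CALLBACK MySubClasser (HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam );",
    "pfnOldProc = (FARPROC) ::GetWindowLong(hWnd, GWL_WNDPROC);",
    "CCntnrCntrItem* pItem = NULL;",
    "Code has a nasty habit of breaking English text rules.",
]

for line in samples:
    run = longest_consonant_run(line)
    print(f"{len(run):2d}  {run:12s}  {line}")
```

The class name CCntnrCntrItem alone contributes a ten-letter run, while the plain English sentence never gets past three.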

Of course, if you don't discuss code in email, then these issues might not present themselves, but I humbly submit that anyone writing code to weigh consonant:vowel ratios should definitely run it against source code, in various programming languages, to see how it will react.
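
As a rough idea of what such a test run might look like, the Python sketch below computes a consonant:vowel ratio for a line of prose and for a few code fragments. It is only an illustration under stated assumptions: the function name consonant_vowel_ratio and the THRESHOLD value are hypothetical, only ASCII letters are counted, and "y" is counted as a consonant.

```python
def consonant_vowel_ratio(text):
    """Return consonants divided by vowels, counting ASCII letters only."""
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    if not letters:
        return 0.0
    vowels = sum(ch in "aeiou" for ch in letters)
    consonants = len(letters) - vowels
    # Vowel-free input such as "HWND" has no finite ratio.
    return float("inf") if vowels == 0 else consonants / vowels


THRESHOLD = 3.0  # hypothetical cutoff, chosen only for illustration

samples = {
    "prose": "Code has a nasty habit of breaking English text rules.",
    "typedef": "HWND hWnd",
    "class": "CCntnrCntrItem",
    "call": "pfnOldProc = (FARPROC) ::GetWindowLong(hWnd, GWL_WNDPROC);",
}

for label, text in samples.items():
    ratio = consonant_vowel_ratio(text)
    verdict = "FLAG" if ratio > THRESHOLD else "ok"
    print(f"{label:8s} {ratio:6.2f}  {verdict}")
```

Feeding real source files in several languages through the same function, instead of these few fragments, would show how often legitimate mail about code ends up on the wrong side of whatever cutoff is chosen.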


Search the archives for "Garbage vs. Valid" from February 2002, and also "Regexp problem" from December 2003. The first discusses using a consonant weighting for funky hostnames in Received headers (but could be adapted). The second thread omits some stuff which was exchanged offlist with someone who deliberately circumvented my Reply-To, and thus is incomplete.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail