nmh-workers

Re: [nmh-workers] detecting enclosed msg as spam - unicode regex help needed, I think; spam unicode chars in header

2019-05-04 12:01:35
On Sat, 04 May 2019 06:16:27 -0500, nmh@trodman.com said:
> This is OT/not a nmh issue.  It does concern spam unicode chars in the mail
> header though, so maybe you could direct me a bit.
>
> I use procmail, so I should be able to filter out msgs like the above,
> but I could use some tips on a general strategy.

Note that the presence of =?utf-8? in the headers is *not* always proof of
spam (see headers of this message), so be prepared to deal with false positives
appropriately (but see below).

Also, note that while procmail does support onboard regular expressions, they're
not a full PCRE set of expressions.  In particular, there are no {n} repetition
counts.  So, for instance, you can't look for utf-8 strings of more than a
certain length by searching for
'=\?utf-8\?q\?(=[0-9a-fA-F][0-9a-fA-F]){5}',
nor can you check for more than 10 occurrences via '(=\?utf-8\?.*){10}'.
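
If you end up calling an external script from procmail for this, the two checks
above are easy in a regex engine that does have repetition counts.  A minimal
sketch in Python, not something I run myself: the function name
utf8_header_stats is made up for illustration, and it only looks at utf-8
encoded-words in whatever header text you hand it.

```python
import re

# One RFC 2047 encoded-word with a utf-8 charset: =?utf-8?q?...?= or =?utf-8?b?...?=
ENCODED_WORD = re.compile(r"=\?utf-8\?[bq]\?[^?\s]*\?=", re.IGNORECASE)

# The first pattern from the text: five =XX hex escapes in a row right after =?utf-8?q?
LONG_HEX_RUN = re.compile(r"=\?utf-8\?q\?(?:=[0-9a-f]{2}){5}", re.IGNORECASE)


def utf8_header_stats(header_text):
    """Return (number of utf-8 encoded-words, True if a run of 5 =XX escapes exists).

    Hypothetical helper for illustration; header_text is the header block as a str.
    """
    count = len(ENCODED_WORD.findall(header_text))
    has_long_run = LONG_HEX_RUN.search(header_text) is not None
    return count, has_long_run
```

A caller would then pick its own thresholds, e.g. treat "count > 10" as
suspicious, keeping in mind the false-positive caveat above.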

You're probably better served by installing SpamAssassin and calling that from
procmail (as it will help with things other than utf-8 as well).

There are 90 presumed-spam messages in my spam folder at the moment.  Of those,
12 have one bodypart and specify charset=utf-8 in the rfc822 headers, while 44
specify multipart and thus the charset=, if any, is buried in the body.  10 have
raw utf-8 in the From: line, and 17 have raw utf-8 in the Subject: line (but see
below).
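
For anyone who wants to run the same kind of census on their own folder, here's
a rough sketch of how such a tally could be done in Python.  This is not how I
got the numbers above; the function names (raw_8bit_headers, classify, tally)
and the trait labels are invented for the example, and "raw utf-8" is
approximated as "any byte above 0x7F in the header value".

```python
import email
import re
from collections import Counter


def raw_8bit_headers(raw_bytes):
    """Lowercased names of headers whose value contains bytes above 0x7F."""
    # Header block is everything before the first blank line.
    head = re.split(rb"\r?\n\r?\n", raw_bytes, maxsplit=1)[0]
    # Unfold continuation lines so each header is on one line.
    unfolded = re.sub(rb"\r?\n[ \t]+", b" ", head)
    names = set()
    for line in unfolded.splitlines():
        name, sep, value = line.partition(b":")
        if sep and any(b > 0x7F for b in value):
            names.add(name.strip().lower().decode("ascii", "replace"))
    return names


def classify(raw_bytes):
    """Return the set of traits (hypothetical labels) found in one raw message."""
    msg = email.message_from_bytes(raw_bytes)
    traits = set()
    if msg.get_content_maintype() == "multipart":
        traits.add("multipart")
    elif (msg.get_content_charset() or "").lower() == "utf-8":
        traits.add("single-part utf-8")
    raw = raw_8bit_headers(raw_bytes)
    if "from" in raw:
        traits.add("raw From")
    if "subject" in raw:
        traits.add("raw Subject")
    return traits


def tally(messages):
    """Count how many messages (an iterable of bytes) show each trait."""
    counts = Counter()
    for raw_bytes in messages:
        counts.update(classify(raw_bytes))
    return counts
```

Feeding it the contents of each file in a spam folder (read in binary mode)
gives per-trait counts; a message can land in more than one bucket.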

And something in the e-mail ecosphere is filtering and converting explicit
=?utf-8? encoding in rfc822 headers.  I was going to blame mhfixmsg, but it's
happening before procmail gets hold of it.  I send mail to myself, 'send'
tosses it to Google, Google hands it back to me via fetchmail/imap thence to
sendmail and procmail, and the =?utf-8? has already been decoded.  I invoke
mhfixmsg as '| tee $tmpfile | mhfixmsg -noverbose -file - -outfile -', and the
version in $tmpfile is already converted.  Meanwhile, some other mail
arrives with raw chars, while some *does* arrive with =?utf-8? still intact.

Weird.


-- 
nmh-workers
https://lists.nongnu.org/mailman/listinfo/nmh-workers