Re: matching empty text/plain

On Mon, Jun 16, 2003 at 09:40:51PM -0700, Bart Schaefer wrote:

  WSPC=" ^I" # you-know
  NL="
"


A variable containing a newline is nearly always used for constructing
e.g. LOG assignments.  I don't think it works properly to use it as a
member of a character class as you suggest.  You're much better off using
($) to match a newline.
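
As a rough sketch of that distinction (the log text and the W message
below are invented for illustration, not taken from a real recipe):

    # A newline variable is handy for terminating a LOG line.
    # The closing quote must start the next line.
    NL="
"
    LOG="multipart header but no parts in body$NL"

    # In a condition, ($) matches the line break itself.  Here: a
    # boundary line followed directly by a Content-Type line.
    :0 B
    * ^--.+($)Content-Type:
    { W="boundary line followed directly by Content-Type" }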


Wup -- yeah.  I wrote that part in the email editor rather than copying
it from a real recipe.  Think-o....

  :0 B
  * 1^1   ^Content-Type:
  { PARTS=$= }


Do you need to know the number of parts?  If not,


Yep:

    :0 A
    * PARTS ?? ^^0^^
    { W="multipart header but no parts in body" }

    :0 E
    * PARTS ?? ^^1^^
    * ^Content-Type: text/html
    { W="multipart header but single html part body" }

However, it's
entirely possible to send a content-type: multipart containing only a
single part, so I'm not sure why you're bothering.


I don't think I've ever seen a Content-Type: multipart header with a
single part that was not spam.  I'm pretty sure that most MUAs, if
configured to send HTML-only email, won't bother making it multipart.
I'm willing to be corrected on that....

 You also apparently
don't care whether there's a multipart inside a multipart with an empty
part in the inner multipart?


Ooh, hadn't thought of that.  I guess you could just handle that with:

    * ! B ?? Content-Type: multipart

(You find that easier to read than mimepart.txt was?)


In that I can read it without scrolling....  ;)

The semicolon after text/plain is only optional when no charset is present,
but you're requiring a charset to be present.


Ah, right.

 There's also no point in
looking explicitly for C-T-E, because a mime part can have all sorts of
headers (Content-Disposition, Content-Id, etc.) so really you only care
about either (a) Content-Type or (b) that there are no headers at all.
You just need to skip the other ones.


Sounds good, and your method is nicely implemented.  Thanks.  I included
C-T-E because in all the spam examples I found, the only header besides
Content-Type was "Content-Transfer-Encoding: quoted-printable".

You should use $\MATCH to prevent things like dots in the boundary string
from being treated as pattern characters, and you should be prepared for
the closing boundary to be the end of the entire multipart.
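
A minimal sketch of the $\ quoting, assuming the boundary is quoted in
the header (BOUNDARY is a made-up variable name and the W text is only
illustrative):

    :0
    * ^Content-Type:.*boundary="\/[^"]+
    {
      # MATCH holds whatever followed \/ in the condition above.
      BOUNDARY=$MATCH

      # The $ flag turns on variable expansion in the condition;
      # $\BOUNDARY expands with regex metacharacters such as "."
      # escaped.  (--)? also accepts the closing boundary.
      :0 B
      * $ ^--$\BOUNDARY(--)?($)
      { W="body contains a boundary line" }
    }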


Ah, right.  And the text/plain part doesn't *always* come before the
text/html part.

So this might be more like it:

WSPC="        "
NL="($)"


Why put it in brackets?  Does it change the variable interpretation in
the recipe, or is it to avoid the possibility of erroneous variable
expansion as you set $NL?

:0
* $ ^Content-Type:[$WSPC]*multipart/.*boundary="\/[^"]+
* $ B ?? ^--$\MATCH$NL(\
       (.+$NL)*\
       Content-Type:[$WSPC]+text/plain.*$NL\
       (.+$NL)*\
      )?$NL\
       ([$WSPC]*$NL)*\
      ^--$\MATCH(--)?$NL
{ W="empty text/plain block" }


That's great.  The last thing I'd suggest would be to make the final
WSPC atom 1-or-more rather than 0-or-more, since lack of a blank line
would cause the next boundary (and its block) to be tacked on to the
bottom of the previous block.  I think.  ;)

Thanks very much for this.  It will make an excellent addition both to
my own spam filters AND the list archive....

-- 
  Paul Chvostek                                      <paul(_at_)it(_dot_)ca>
  Operations / Abuse / Whatever
  it.canada, hosting and development                   http://www.it.ca/


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail