RE: new features for procmail

-----Original Message-----
From: Ruud H.G. van Tol
Sent: Monday, November 14, 2005 1:34 AM
[...]
Maybe a 4-flag on the recipe, after (but maybe unconnected to) the :0.
That will certainly look strange enough. The '4' maybe implies the 'p'
we talked about earlier.

   :0 4   # using procmail-4 features
   * ^^From \S+ (.*)(?{ FROM_TIME = mktime($^N) })


Ugh.  I was working on the assumption that the new syntax
eventually becomes the widely used syntax, so would hate to
decorate it with 4's. My tendency is to prefer
another leading character for the new pattern matching.
Obviously, it can't be "/" or "|" or anything else that
can begin an action.  But maybe "+" is okay:

:0:
+ ... new pattern match with other new semantics like search/replace
+ ...
... action ...

More radical is a new syntax altogether.  The new syntax would
be upward compatible in that it still accepts the ":" recipe
syntax, but it might be more easily recognized: flag letters
would also have short names, and perhaps locking would be on
by default.  An example:

:FILTER filter_name MATCH_BODY_ONLY FILTER_BODY_ONLY DISTINGUISH_CASE NO_LOCK
*  ... new matching rules ...
*  ....
<action (as before)>
:END FILTER filter_name

Introducing begin/end sequences with names is meant to improve
readability, offer better error diagnostics, and ensure that
the recipes are well-delimited.  I can see where this might not
be a popular suggestion ... just noting that it is possible to
improve readability and reliability and to extend the semantics
without doing too much harm to the syntax.

Above, it would be easy to drop the begin/end and name stuff and
just introduce the new syntax with :FILTER.
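
To make the "better error diagnostics" point concrete, here is a
rough Python sketch of how a reader for the named begin/end form
might catch a mismatched or missing :END.  Nothing like this
exists in procmail; parse_filters() and its rules (one header
line, "*" lines are conditions, everything else is the action)
are assumptions made up for illustration.

```python
def parse_filters(text):
    """Parse hypothetical :FILTER ... :END FILTER blocks.

    Returns a list of dicts; raises ValueError with a line number
    when a block is not properly delimited.
    """
    filters = []
    current = None  # the block being read, if any
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        if words[0] == ":FILTER":
            if current is not None:
                raise ValueError(
                    f"line {lineno}: filter {current['name']!r} was never closed")
            if len(words) < 2:
                raise ValueError(f"line {lineno}: :FILTER needs a name")
            current = {"name": words[1], "flags": words[2:],
                       "conditions": [], "action": []}
        elif words[:2] == [":END", "FILTER"]:
            if current is None:
                raise ValueError(f"line {lineno}: :END without :FILTER")
            given = " ".join(words[2:])
            if given != current["name"]:
                raise ValueError(
                    f"line {lineno}: :END FILTER {given!r} does not close "
                    f"{current['name']!r}")
            filters.append(current)
            current = None
        elif current is None:
            raise ValueError(f"line {lineno}: text outside of a filter")
        elif line.startswith("*"):
            current["conditions"].append(line[1:].strip())
        else:
            current["action"].append(line)
    if current is not None:
        raise ValueError(f"filter {current['name']!r} was never closed")
    return filters
```

With names on both ends, a typo in either one is reported with
the line where the mismatch was noticed, instead of silently
merging two recipes.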

Maybe another name than mktime() is better, as long as it returns the
seconds from the start of Jan 1, 1970 (UTC).
(64 bit!)
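
For reference, a small Python sketch of what such a function
would have to do with the mbox "From " line.  The name
from_time() is made up here, and the sketch assumes the
timestamp is in UTC, since the From_ line carries no zone
information; a real implementation would have to decide how to
handle local time.

```python
import calendar
import re
import time

# Envelope sender, then the ctime-style date, e.g.
# "From ruud@example.org Mon Nov 14 01:34:00 2005"
FROM_LINE = re.compile(r"^From (\S+) +(.*)$")


def from_time(from_line):
    """Return seconds since 1970-01-01 00:00:00 UTC for a From_ line.

    Assumption: the date is treated as UTC.
    """
    m = FROM_LINE.match(from_line)
    if m is None:
        raise ValueError("not a From_ line")
    parsed = time.strptime(m.group(2).strip(), "%a %b %d %H:%M:%S %Y")
    # timegm() is the UTC counterpart of mktime(); Python integers
    # are not limited to 32 bits, so there is no 2038 problem here.
    return calendar.timegm(parsed)


print(from_time("From ruud@example.org Mon Nov 14 01:34:00 2005"))
```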

- add 'expr' functionality for evaluating simple expressions


Maybe implemented as a calc() function.


Yes.  I think the "how" part enters the picture here.  How
should some of this new functionality be expressed?
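
As one possible answer to the "how", here is a Python sketch of
a calc() that evaluates simple integer expressions and nothing
else.  The name calc() comes from the suggestion above; the
supported operators and the way variables are passed in are
assumptions for illustration only.

```python
import ast
import operator

# Only these binary operators are accepted.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}


def calc(expression, variables=None):
    """Evaluate an integer expression such as "SIZE // 1024 + 1".

    Variables are looked up in the given mapping and converted with
    int(), since procmail variables are strings.
    """
    variables = variables or {}

    def ev(node):
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.Name):
            return int(variables[node.id])
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](ev(node.left), ev(node.right))
        # Function calls, attribute access, etc. are rejected.
        raise ValueError("unsupported expression")

    return ev(ast.parse(expression, mode="eval").body)
```

The point of the whitelist is that an expression in an rc file
can do arithmetic but cannot call anything.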

- mail address parsing support.  Reliably and correctly parse the
  addresses in TO: FROM: and any other headers.  Once the address
  has been broken out, support splitting it into name and address
  and host name part.


Read `perldoc -q address`, which points to
http://www.cpan.org/authors/Tom_Christiansen/scripts/ckaddr.gz
but also warns: there are deliverable addresses that aren't
RFC-822 (the mail header standard) compliant, and addresses
that aren't deliverable which are compliant.


Whatever. <g>  I'd opt for whatever is generally acceptable.
Clearly some well-defined, reliable, implementation would be
better than the many attempts made by procmail users today.
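
As an illustration of what a built-in could hand back, here is a
Python sketch that leans on the standard library's
email.utils.getaddresses() and then splits each address into
local part and host.  split_addresses() and the field names are
made up for this example; the caveat above about deliverable
versus compliant addresses still applies.

```python
from email.utils import getaddresses


def split_addresses(header_value):
    """Break a To:/From:/Cc: header value into its addresses.

    Each result has the display name, the full address, the local
    part and the host name.
    """
    result = []
    for name, addr in getaddresses([header_value]):
        if "@" in addr:
            # Split on the last "@", the local part may be quoted.
            local, _, host = addr.rpartition("@")
        else:
            local, host = addr, ""
        result.append({"name": name, "address": addr,
                       "local": local, "host": host})
    return result


for entry in split_addresses("Ruud van Tol <rvtol@example.org>, list@example.com"):
    print(entry)
```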

- provide support for parsing the TLD from a hostname


   :0 4
   * SOMEHOST ?? ([^.]+)\.([^.]+)$(?# to do: 'co.uk' etc.)
   {
      DOMAIN = $1
      TLD    = $2
   }


I'm not against this if the functionality can somehow be
wrapped in an interface that is easy to use and understand,
such as the function syntax you allude to above:

DOMAIN = domain("$TO")
TLD = tld("$TO")

where $TO is one of the TO_ addresses.

or even

   # the following match() returns a list of 2 elements
   (DOMAIN, TLD) = match( $SOMEHOST, /([^.]+)\.([^.]+)$/ )

   FROM_TIME = mktime( match( /^^From \S+ (.*)/ ) )
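
For comparison, the list-returning match() is easy to sketch in
Python.  This only illustrates the semantics being discussed (a
tuple of captured groups, empty when nothing matches); it is not
a proposal for the implementation, and the 'co.uk' problem noted
earlier is still unsolved.

```python
import re


def match(text, pattern):
    """Return the captured groups as a tuple, or () if no match."""
    m = re.search(pattern, text)
    return m.groups() if m else ()


# The last two dot-separated labels of a host name.
HOST_TAIL = r"([^.]+)\.([^.]+)$"

print(match("mail.example.org", HOST_TAIL))
# Still to do: here the registered domain is really "bbc.co.uk".
print(match("bbc.co.uk", HOST_TAIL))
```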



Maybe a variable name should always have a $-prefix:


That seems like a bit too much of a departure from procmail's
syntax.  It happens to be one of the more irritating aspects
of programming in Perl, IMO.

   :0 4
   * $SOMEHOST =~ ([^.]+)\.([^.]+)$(?# to do: 'co.uk' etc.)
   {
      $DOMAIN = $1
      $TLD    = $2
   }

   ($DOMAIN, $TLD) = match( $SOMEHOST, /([^.]+)\.([^.]+)$/ )


Too Perly.  Though, I like the idea of a match() function.

I think we all know we could embed Perl into the new recipe
definition, but this really would make the new program large
and slow, and it would require that procmail users become
Perl cognizant.  Still, having Perl running about would
definitely give you all the generality you'd ever require.

- ability to extract URL's and/or e-mail addresses from message bodies
  (arguably could be implemented with the new PCRE support, but this
  extraction would be builtin, faster, and syntax added for looping
  through the addresses/URL's)


These URLs often need quite some decoding, because they contain all
kinds of tricks to not get recognized, including Unicode-lookalikes of
familiar characters.


Agreed.  But it is exactly because processing URLs is
tricky to get right that it should be directly
supported.
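
A minimal Python sketch of the extraction-and-loop idea, just to
show the shape of it.  extract_urls() is a made-up name, the
regular expression is deliberately crude, and the only decoding
done is lowercasing and percent-decoding of the host; the
Unicode lookalike problem mentioned above is not handled.

```python
import re
from urllib.parse import unquote, urlsplit

# Crude on purpose: anything from the scheme up to whitespace,
# quotes or angle brackets.
URL_RE = re.compile(r"""https?://[^\s<>"']+""", re.IGNORECASE)


def extract_urls(body):
    """Return (url, host) pairs for each URL found in the body."""
    found = []
    for m in URL_RE.finditer(body):
        # Drop punctuation that usually belongs to the sentence.
        url = m.group(0).rstrip(".,;:!?)")
        host = urlsplit(url).hostname or ""
        found.append((url, unquote(host)))
    return found


body = "See http://example.com/a%20b, and <https://Example.ORG/x>."
for url, host in extract_urls(body):
    print(host, url)
```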

- In addition to PCRE matching, implement "approximate matching",
  ala String::Approx (on CPAN)


http://search.cpan.org/~jhi/String-Approx/Approx.pm
I think it is hard to make that practical, since it basically works on
short strings like single words. It would not work very well against the
variant spellings of 'viagra', because the 'Levenshtein edit distance'
can be made arbitrarily big.


I tend to disagree.  Not on the technical point per se, except that for
many viagra spellings the edit distance is pretty small.  Other
approximate matching algorithms exist, and those can be evaluated as
well.  But you're right about the intent ... to make it easy to catch
mis-spellings, mainly for the purposes of spotting spam phrases.
But not always (think of the mis-spellings of "subscribe" and other
words that come up in practice).
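
To put numbers on both sides of this, here is the textbook
Levenshtein distance in Python, plus a made-up helper,
approx_words(), that picks out the words of a text within a
given distance of a target.  It is a sketch for experimenting,
not what String::Approx does internally.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def approx_words(text, target, max_distance):
    """Words of text whose distance to target is within the limit."""
    return [w for w in text.lower().split()
            if edit_distance(w, target) <= max_distance]


print(edit_distance("viagra", "v1agra"))       # small distance
print(edit_distance("viagra", "v-i-a-g-r-a"))  # padding makes it grow
print(approx_words("please subscrbe me", "subscribe", 1))
```

The second call shows the objection: padding the word with
filler characters pushes the distance up as far as the sender
likes, while the common one-character variants stay close.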

- Improve procmail's performance by having it statically compile
  scripts, where possible (recursive includes and includes that can't
  be statically evaluated throw this out), and use the statically
  compiled .pc-procmailrc as long as it is newer than .procmailrc,
  for example.


Many recursive includes can be statically evaluated.


That'll be the implementor's problem, not mine. <g>
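
The "newer than" test itself is the easy part.  A Python sketch,
assuming the compiled file sits next to the rc file and that
modification times are trustworthy; compiled_is_usable() is a
made-up name, and included files would each need the same check.

```python
import os


def compiled_is_usable(source_path, compiled_path):
    """True if the compiled file exists and is newer than the source."""
    try:
        return os.path.getmtime(compiled_path) > os.path.getmtime(source_path)
    except OSError:
        # Either file is missing or unreadable: fall back to the source.
        return False
```

A caller would check compiled_is_usable(".procmailrc",
".pc-procmailrc") and recompile whenever it returns False.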


Something else: the AND/OR issues of conditions.

Current way:

  EITHER  = '9876543210^0'
  OR      = $EITHER
  OR_EVEN = $EITHER


If your suggestion is to offer better syntax for AND-ing
and OR-ing conditions, I definitely agree that would
be good.  In my view, procmail's scoring is great,
but it should be reserved for situations where
the user really wants scoring.
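
To spell out the difference, here is a Python sketch that puts
explicit AND/OR next to a much simplified model of the scoring
trick (every matching condition adds a large weight, and the
recipe matches if the total is positive).  The function names
are made up, and the scoring model leaves out most of what
procmail's scoring really does.

```python
import re


def matches_all(text, patterns):
    """AND: every condition must match (procmail's default)."""
    return all(re.search(p, text, re.MULTILINE) for p in patterns)


def matches_any(text, patterns):
    """OR, said directly."""
    return any(re.search(p, text, re.MULTILINE) for p in patterns)


def matches_any_by_scoring(text, patterns, weight=9876543210):
    """OR, emulated the current way: add a big weight per match."""
    score = sum(weight for p in patterns
                if re.search(p, text, re.MULTILINE))
    return score > 0


header = "From: someone@example.org\nSubject: hello"
conditions = [r"^Subject:.*hello", r"^From:.*example\.com"]
print(matches_all(header, conditions))
print(matches_any(header, conditions))
print(matches_any_by_scoring(header, conditions))
```

Both OR variants give the same answer; the first one just says
what it means.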


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail