procmail

Re: Robust date code (year extraction)

2004-07-28 01:50:29
On Sat, Jul 24, 2004 at 01:27:12PM -0600, Justin Gombos wrote:
> With dman's help, I've written the following code to extract the year
> from a message.  Here's what I have so far:
>
>   ARCHIVING=true
>
>   # Extract the year from the date that the author
>   # *claims* to have composed the message.
>   #
>   # Note: this recipe will result in a null value
>   # for dates formed as XX-XX-XX.
>   #
>   :0
>   * ^Date:.*\/(19|20)?[0-9][0-9][^a-z]+:
>   * MATCH ?? ^^\/[^ 	]+
>   * 19^0 MATCH ?? ^^19..^^
>   * 20^0 MATCH ?? ^^20..^^
>   * 19^0 MATCH ?? ^^[^0].^^
>   * 20^0 MATCH ?? ^^0.^^
>   * MATCH ?? ^^.*(19|20)?\/[0-9][0-9]^^
>   { STATED_YEAR = $=$MATCH }

Well, that's kind of screwy, frankly, and misapprehends some of the
implicit value of the approach I initially suggested to you.

I had suggested this (my typo in first condition now corrected):

  # (The brackets in the second condition hold a space and a tab.)
  :0
  * ^Date:.*\/(19|20)?[0-9][0-9][^a-z]+:
  * MATCH ?? ^^\/[^ 	]+
  { YEAR = $MATCH }

  :0
  * YEAR ?? ^^[^0].^^
  { YEAR = 19$YEAR }

  :0 E
  * YEAR ?? ^^..^^
  { YEAR = 20$YEAR }

That gets you the year in four-char format.  The algorithm's implicit
point was that once we match what should be the year in the Date header,
we don't need to do anything more if it's already more than two chars
long.  Only if it's two chars long do we need to prepend a 19 or 20.
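
As a rough trace of that logic (the sample headers are invented for
illustration, and the values are what I'd expect given that procmail
matches stingily to the left of \/ and greedily to the right, not
captured output):

  Date header (hypothetical)          recipe 1    recipe 2     recipe 3
  Sat, 24 Jul 2004 13:27:12 -0600     YEAR=2004   no match     no match
  Sat, 24 Jul 98 13:27:12 -0600       YEAR=98     YEAR=1998    skipped (E)
  Sat, 24 Jul 04 13:27:12 -0600       YEAR=04     no match     YEAR=2004

In each case the first condition of recipe 1 should leave MATCH holding
the year plus the time up to its last colon (e.g. "04 13:27:"), and the
second condition trims that down to the leading token.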

That said, we've already established that the Date header is generally
untrustworthy, so the string we take for the year may well be neither
two nor four chars long.  So I'd simply add one more recipe, below
the third recipe above, for that contingency:

  :0
  * ! YEAR ?? ^^(19|20)[0-9][0-9]^^
  { YEAR = unknown }

It might be better all around to take the year from the top-most Received
header -- less likely to be corrupted and less likely to be wrong.
If there is only one Received header (or none), though, then we're
back to no trust, since spammers forge those, and a foreign server
connecting directly to your machine could be misconfigured.

Anyway, to find the year in the Received header, capture it with
the \/ match token and pick the best anchors, as in the Date recipes
above.
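
A rough, untested sketch of what that could look like.  It assumes the
usual Received layout, where the timestamp follows a semicolon, and it
relies on procmail unfolding continued header lines before matching.
RCVD_YEAR is a name made up for illustration.  I'd expect the first
Received header in the message to be the one matched, but check that
against your own mail before trusting it.

  # Assumption: the timestamp follows a semicolon.  Capture from a
  # four-digit year through the last colon of the time, then keep
  # only the four digits.  RCVD_YEAR is a hypothetical variable name.
  :0
  * ^Received:.*;.*\/(19|20)[0-9][0-9][^a-z]+:
  * MATCH ?? ^^\/(19|20)[0-9][0-9]
  { RCVD_YEAR = $MATCH }

  # Nothing usable found: fall back the same way as for Date.
  :0 E
  { RCVD_YEAR = unknown }

Requiring the leading 19 or 20 here means two-digit years are not
handled at all; they simply end up as unknown, which seems acceptable
for a header that your own server normally writes.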

-- 
dman

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
