procmail

Re: (Parsing dates and) persistent lock files that never go away

2004-07-23 01:30:19
On Thu, Jul 22, 2004 at 10:37:55PM -0600, Justin Gombos wrote:
* Dallman Ross <dman(_at_)nomotek(_dot_)com> [2004-07-22 14:54]:

  YEAR=`formail -x "Date: " | sed -e 's/.* \([12]\{,1\}[90]\{,1\}[0-9][0-9]\) .*/\1/' \
                                  -e 's/^[^0][0-9]$/19&/g' \
                                  -e 's/^[0][0-9]$/20&/g'`

I don't even know what you are trying to do there.  But it doesn't work,
at least on my system:

  sed: 1: "s/.* \([12]\{,1\}[90]\{ ...": RE error: invalid repetition count(s)

In English, I'm trying to extract the Date: field using formail, then
the first sed instruction parses out the year only, which can be of
the form 19xx, 20xx, or xx.  If it is of the form xx, then the next
two sed instructions try to prepend a "19" or "20", whichever makes
the most sense.

It's interesting that you get an error on repetition counts.  My
version of sed has no issue with using \{,1\} to mean the
subexpression ahead of it may occur zero or one times.

Well, my sed has no issue with it either, in general.
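
One possible explanation for the differing results, offered as an
assumption rather than a diagnosis: POSIX basic regular expressions
define the interval forms \{m\}, \{m,\} and \{m,n\}.  The \{,n\} form
with the lower bound left out is an extension that GNU sed accepts and
some other sed implementations reject.  A minimal sketch of the same
three expressions with the bound spelled as \{0,1\} (the function name
extract_year is made up for illustration):

```shell
# extract_year (hypothetical helper): reads the value of a Date: header on
# stdin, e.g. the output of formail -x Date:, and prints a four-digit year.
extract_year() {
    # 1st expression: keep only the space-delimited year (19xx, 20xx or xx).
    # 2nd and 3rd: expand a two-digit year to 19xx or 20xx.
    # \{0,1\} replaces \{,1\} so the lower bound is explicit.
    sed -e 's/.* \([12]\{0,1\}[90]\{0,1\}[0-9][0-9]\) .*/\1/' \
        -e 's/^[^0][0-9]$/19&/' \
        -e 's/^0[0-9]$/20&/'
}

echo " Tue, 20 Jul 2004 11:47:42 +0000" | extract_year
```

The trailing g flags are dropped because each anchored expression can
match at most once per line.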


Whatever you are trying to parse, though, you surely wouldn't need
formail piped through such a hairy sed thing to come up with
some component of the year.  Maybe you will enlighten us.

I cannot think of any other way of parsing out the year, but I always
appreciate being shown a better or different way of doing things.

Does anyone know of a different way to extract the year?

Yes.  You do not need any pipes to extract the year.  Your 19/20
thing would add some complexity, but even so it could be done all
in procmail.  But first, I have to ask: why are you using the
string from the Date: headers, which in general is untrustworthy,
rather than the one from From_ (top line), which your server
(with or without procmail's help) generates and which is as
trustworthy as the clock on your server?  The point does seem
to be to archive the mail according to received-time, yes?
Even if you are backtesting old mail, the From_ header would
normally still be the old date.
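
To show where the year sits in a From_ line, here is a rough shell
sketch.  It assumes the traditional mbox separator layout, "From",
the sender, then an asctime-style date with the four-digit year as the
last field.  The function name and the sample address are hypothetical,
and a From_ line with anything after the year produces no output.
procmail itself can match the same line without any pipe; this is only
an illustration of the layout.

```shell
# year_from_from_line (hypothetical helper): reads a message on stdin and
# prints the year from its first line if that line is an mbox "From " line.
year_from_from_line() {
    # -n plus the p flag: print only when line 1 matched and was rewritten.
    sed -n '1s/^From .* \([0-9][0-9][0-9][0-9]\)[[:space:]]*$/\1/p'
}

printf 'From sender@example.com Fri Jul 23 01:30:19 2004\nSubject: test\n\nbody\n' |
    year_from_from_line
```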


I was actually planning to complicate the 'year extractor' even more,
because sometimes the Date: field is empty, non-existent, or invalid,
in which case I would want the YEAR variable to then try to contain
the year located in the date that trails at the end of the From_
field.

Yes, so why not use that to start?  You have just demonstrated
part of the untrustworthiness of the Date: header.  (It could be
said, but maybe this will wear off with my second cup of coffee
[not yet begun], that the Date header is about as trustworthy
as a bonded employee who has not yet submitted to drug testing.) :-)

That recipe results in delivery to the following mbox FILES (for
example):

  in/drug_testing_2003
  in/drug_testing_2004
  in/drug_testing_1997
  in/drug_testing_0000

Approaches to finding $YEAR to consider: (1) Use From_, if possible,
in procmail; (2) Use the top-most Received header (was written by
your machine when mail was received); (3) use the time-stamp on the
message, if it hasn't changed.  GNU-date can set the date based
on a filestamp.
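
For approach (3), a minimal sketch, assuming GNU date and assuming each
message is stored as its own file.  With GNU date, -r FILE prints the
file's last modification time instead of the current time; BSD date
gives -r a different meaning (a number of seconds), so this would not
carry over as written.  The function name is hypothetical.

```shell
# year_from_mtime (hypothetical helper): print the year of a file's
# last-modification time.  Requires GNU date for the -r FILE behaviour.
#   usage: YEAR=$(year_from_mtime "$message_file")
year_from_mtime() {
    date -r "$1" +%Y
}
```

This only helps if nothing has rewritten the file since delivery, which
is the "if it hasn't changed" caveat above.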

If all those are rejected and you still want to use the Date: header's
asserted year, you can still easily do it in procmail only.
Here's a sample Date: header (from some spam I have lying around,
but heh):

   Date: Tue, 20 Jul 2004 11:47:42 +0000

Lots of ways to approach this, but the time should always be
there, so I'd use that since the colons are a nice anchor
or search object:

  :0
  * ^Date:.*\/(19|20)?[0-9][0-9][^a-z]+:
  * MATCH ?? ^^\/[^     ]+
  { YEAR = $MATCH }

Since spam is a good testbed for likely problems with presumptions
about headers (the spammers tend to do lots of things badly),
I ran my whole current batch of spam through it (104 messages -- two
came in since my last vetting, which always pares it down to 100
in the cache).  They all come up clean -- which doesn't say
much about what corruptions are possible, only that these
particular last 100 spam messages have relatively clean Date:
headers:

 9:54am [~/Mail] 225[0]> harness myspam | grep YEAR | distrib

 104 procmail: Assigning "YEAR=2004"

Okay, what about deprecated formats with only two digits?
Given that I accept a Y2K+10 problem, which I do here, :-)
this is fine:


  :0
  * YEAR ?? ^^[^0].^^
  { YEAR = 19$YEAR }

  :0 E
  * YEAR ?? ^^..^^
  { YEAR = 20$YEAR }
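
The same two-step expansion, restated as a shell case statement for
anyone who wants to try the logic outside procmail.  This is a sketch
only, and expand_year is a made-up name.  Like the recipes, it maps any
two-character value not starting with 0 to 19xx, so a two-digit year
of 10 comes out as 1910, which is the Y2K+10 limitation accepted above.

```shell
# expand_year (hypothetical helper): expand a two-digit year to four digits.
expand_year() {
    case "$1" in
        [!0]?) printf '19%s\n' "$1" ;;   # two characters, first is not 0
        ??)    printf '20%s\n' "$1" ;;   # any other two-character value
        *)     printf '%s\n' "$1" ;;     # anything else passes through unchanged
    esac
}

expand_year 97
expand_year 04
```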

  

By the way, if someone in a TZ not yours sends a message near
midnight at the end of the year, which year are you putting it in?
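
To make the question concrete, a small sketch, assuming GNU date (whose
-d option parses a date string in Date: header format) and POSIX-style
TZ values such as UTC0 and PST8.  The function name and the sample
date are invented for illustration.

```shell
# year_in_zone (hypothetical helper): print the year of a date as seen
# from a given time zone.  Requires GNU date for -d.
#   $1 = TZ value, $2 = date string in Date: header format
year_in_zone() {
    TZ="$1" date -d "$2" +%Y
}

# One instant, half an hour before midnight on New Year's Eve at UTC-8,
# viewed first from that zone and then from UTC:
year_in_zone PST8 "Wed, 31 Dec 2003 23:30:00 -0800"
year_in_zone UTC0 "Wed, 31 Dec 2003 23:30:00 -0800"
```

With GNU date the first call prints 2003 and the second 2004, so the
folder a message lands in depends on whose clock is consulted.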



I have not seen this lock file problem repeat since I increased my
LINEBUF size, so I'll have to do some more extensive testing, and come
back to this thread if the problem still exists.


Well, as I said previously, with all those long assignment vars,
it's not at all surprising to me that you exceeded LINEBUF.

Dallman

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail