Match operator and massive regex

2001-12-04 02:13:38
NOTE: My presentation is somewhat rough; I found it hard to give a
brief, understandable summary of this problem.

I'm having a problem understanding what is happening with the regex
below as it is used in a MATCH operation.  (NOTE: The full actual regex
is the last thing in this message.)

Query summary: The regexp below matches in some unforeseen ways. I want
to understand how this occurs, and how the matching actually works
internally, so I can adjust it to get the results I need.
  
Details: 
The regexp is built from a list of newsgroups, more or less verbatim,
with no attempt to shorten it or make it more efficient. As the
examples show, it consists of newsgroup names with regex
metacharacters added.

It seems to largely work as planned.  Let me describe what that plan
is in brief:

This is a setup that reads from an NNTP server feed and writes to mbox
spool files.  My aim was to grab a match from the `Newsgroups:' header
present on every message, and create a target DELIVERY file with
that name.

The formail and procmail command that feeds this is:

nntpserver data stream => \
|formail -m5 -d -e -s procmail -m ${HOME}/projects/proc/.proc_nntp_split

Incoming message has:

   Newsgroups: comp.lang.awk

It gets written to mbox file:  

   1x.comp.lang.awk.in

The prefix '1x.' and suffix '.in' are added to facilitate other parts
of the setup.

Complications arise from crossposted messages that have several
newsgroup names in the `Newsgroups:' header.  I want my match operator
to find the first one that matches the regex containing my newsrc list.

I'm not sure how this match actually occurs; I mean the technical,
internal regex-engine part: whether the alternatives are tried in
alphabetic order, first to last, or some other way.
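To pin down the intended behavior, here is a minimal Python sketch of the selection I am after. It is only an illustration of the goal, not a description of how procmail evaluates the condition; the function names `first_listed_group` and `delivery_file` are made up for this example, and the list is the trimmed one from the recipe.

```python
# Hypothetical sketch of the INTENDED selection, not procmail's algorithm.
# NEWSRC is the trimmed list used in the example recipe.
NEWSRC = ["comp.editors", "comp.emacs", "comp.lang.awk"]


def first_listed_group(header_value, newsrc=NEWSRC):
    """Return the first group in the header, reading left to right,
    that is on the newsrc list, or None if no group is listed."""
    wanted = set(newsrc)
    for group in header_value.split(","):
        group = group.strip()
        if group in wanted:
            return group
    return None


def delivery_file(group):
    """Build the mbox file name: prefix '1x.' plus group plus suffix '.in'."""
    return f"1x.{group}.in"


if __name__ == "__main__":
    value = "gnu.gcc.help,gnu.g++.help,comp.emacs,gnu.emacs.help"
    print(delivery_file(first_listed_group(value)))
```

Under this reading, a crossposted message goes to the file for the first header entry that is on my list, regardless of where in the header it appears.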

So the actual (partial) procmailrc looks like this (trimmed for brevity -
the actual regex is at the end of this message).

========================================
## Match Newsgroups names to one of newsrc list to form DELIVERY file name
  :0fh
* Newsgroups:[     ]*\/(comp\.editors|comp\.emacs|comp\.emacs|comp\.lang\.awk)
{
   DELIVERY=$MATCH
}

## If we have a bona fide `Path:' header then write it to DELIVERY
 :0
* ^Newsgroups:
* ^Path: 
 1x.${DELIVERY}.in
 
========================================

I only showed 3 groups in the regex to give the idea. There are
actually 21, but in the same format as shown.

The idea here is to create files with names that match my newsrc list.
As you may imagine a message may come off the server from one of my
desired groups, but several names in the `Newsgroups' header are
groups that are not on my list.  I don't want to generate files for
those names.

This seems to be working, but I find one oddly named file being
produced:
     1x..in (verbatim)
It catches quite a few messages.  I can't see where it is coming from,
or how anything produces that file name.  Below are a few random
Newsgroups: headers from messages that fell into that file:

Newsgroups: gnu.gcc.help,gnu.g++.help,comp.emacs,gnu.emacs.help

Newsgroups: uk.comp.os.linux,comp.os.linux.misc,comp.unix.shell

Newsgroups: comp.lang.c,comp.unix.solaris

There were more but in every case it looks as if the newsgroup that
should have matched my regex is the last component in the header.

Can someone see what is generating this `1x..in' file?  It consists of
the prefix and suffix only, so maybe it means there was no match.  But
the headers show group names that seem like they should have matched
the regex.  The regex is below:

 * Newsgroups:[     ]*\/(alt\.test\.yer\.posts|comp\.editors|comp\.emacs|comp\.emacs|comp\.lang\.awk|comp\.lang\.perl\.moderated|comp\.os\.linux\.security|comp\.security\.ssh|comp\.unix\.questions|comp\.unix\.shell|comp\.unix\.solaris|gnu\.cvs\.help|gnu\.cvs\.help|gnu\.emacs\.gnus|gnu\.emacs\.help|mailing\.freebsd\.net|mailing\.freebsd\.security|newsguy\.general|newsguy\.test|news\.software\.nn)
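One rough way to probe the condition outside procmail is to model it in Python. This is only a sketch under stated assumptions: Python's `re` module stands in for procmail's own regex engine, the `\/` operator is modeled with a capture group, the bracket is assumed to hold a space and a tab, `re.IGNORECASE` mirrors procmail's default case-insensitive matching, and an unset DELIVERY is assumed to expand to an empty string. The helper name `delivery_for` is made up for the illustration.

```python
import re

# Group names copied from the full condition above (duplicates dropped).
GROUPS = [
    "alt.test.yer.posts", "comp.editors", "comp.emacs", "comp.lang.awk",
    "comp.lang.perl.moderated", "comp.os.linux.security", "comp.security.ssh",
    "comp.unix.questions", "comp.unix.shell", "comp.unix.solaris",
    "gnu.cvs.help", "gnu.emacs.gnus", "gnu.emacs.help",
    "mailing.freebsd.net", "mailing.freebsd.security",
    "newsguy.general", "newsguy.test", "news.software.nn",
]

# Model of: Newsgroups:[ <tab>]*\/(group1|group2|...)
# The capture group plays the role of the text after \/ (procmail's MATCH).
ALTERNATION = "|".join(re.escape(g) for g in GROUPS)
PATTERN = re.compile(r"Newsgroups:[ \t]*(" + ALTERNATION + r")", re.IGNORECASE)

SAMPLE_HEADERS = [
    "Newsgroups: comp.lang.awk",
    "Newsgroups: gnu.gcc.help,gnu.g++.help,comp.emacs,gnu.emacs.help",
    "Newsgroups: uk.comp.os.linux,comp.os.linux.misc,comp.unix.shell",
    "Newsgroups: comp.lang.c,comp.unix.solaris",
]


def delivery_for(header_line):
    """Return the file name the two recipes would produce in this model."""
    m = PATTERN.search(header_line)
    # Assumption: when the condition fails, DELIVERY stays unset and
    # expands to an empty string in "1x.${DELIVERY}.in".
    delivery = m.group(1) if m else ""
    return f"1x.{delivery}.in"


if __name__ == "__main__":
    for header in SAMPLE_HEADERS:
        print(header, "->", delivery_for(header))
```

In this model only the single-group header yields `1x.comp.lang.awk.in`; the three crossposted samples yield `1x..in`, because the pattern requires a listed name directly after the colon and whitespace, while in those headers the listed name appears later. Whether procmail behaves the same way is the thing to confirm.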