mhonarc-users

Re: The fundamentals of MHonArc...

2000-07-07 14:42:22
On July 6, 2000 at 17:22, ERIC PRETORIOUS wrote:

What I'm struggling with most in
attempting to employ MHonArc on our intranet is understanding the
fundamental strategy: How to get messages from wherever-they-come-from
to the archive. Also, I really would like to know how to create
maillists for each month's messages but don't quite understand what that
requires either. I've almost read the entire manual (including the
QuickStart Guide) but - like I said - I just don't have the
experience/background to know what to do with all of this.

The MHonArc docs do not cover the management aspects of how to feed
mail to it.  Since there are numerous ways, I have not bothered to
document such things except for some examples in the FAQ.

Looking at the server that I've inherited, I understand that the
previous webmaster employed sendmail to pipe messages to MHonArc using
the /etc/mail/aliases file. (...so I guess that there aren't any MH
folders to convert or process?) Is this a good approach? In his July 29
message, Bosco Tsang mentioned running MHonArc (on a Windows machine)
daily by using Scheduler. Are there advantages to this approach over the
aliases/pipe approach?

Under Unix-based systems, you will more than likely deal with sendmail.
At what level depends on how your mailing lists are set up.  For simple
lists, sendmail aliases can be used.  However, using sendmail aliases
for defining lists is pretty limited.  Hence, many use mailing list
software like Majordomo.

For interfacing to MHonArc, I favor the approach of having a special
user account that subscribes to the list(s) you want archived.  This
account will then be responsible for generating the web archive.  The
main advantage to this method is that it works independent of the
mailing list software.  Therefore, if list software changes, you do not
need to change how the web archives get generated.  Also, it allows
independent maintenance of the web archives from the mailing list.  A
good, big real-world example of this technique is
<http://www.mail-archive.com/>.

As for how the special mail user account should generate the
archives, I favor breaking the tasks down into as many independent
units as possible.  I will use my recent setup of some internal
archives at work as an example.

The first thing I did was concentrate on getting the mail from the
spool area (normally /var/spool/mail/$LOGNAME) to a location where
messages are sorted by destination mailing list and stored in monthly
mailbox files.  Notice that I am not even considering the use of
MHonArc yet.  The strategy is to get the mail into a storage format
conducive to creating web archives.  Example:

    archive/mbox/
        list-1/
            2000-01
            2000-02
            2000-03
            ...
        list-2/
            ...
        ...

Once I verified this was working, I could then figure out what to
do on the MHonArc side.  The storage model is friendly to creating
monthly archives of the mailing lists, but for now I decided to
have the web archives just show the last 1000 messages of each list.
However, since I keep the original message data, I can switch to a
monthly archive scheme later.
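
A minimal sketch of how the monthly mailbox layout could map onto
monthly web archives.  It only prints the mhonarc commands (a dry run)
rather than running them.  The function name and both root paths are
hypothetical, and each output directory would have to be created
before a command is actually run:

```shell
# Dry run: print one mhonarc command per monthly mailbox file.
# plan_monthly_archives is a hypothetical helper, not part of MHonArc.
plan_monthly_archives() {
    mbox_root=$1      # e.g. $HOME/archive/mbox (assumed layout from above)
    html_root=$2      # e.g. $HOME/archive/html (assumed)
    for mbox in "$mbox_root"/*/[0-9][0-9][0-9][0-9]-[0-9][0-9]; do
        [ -f "$mbox" ] || continue     # glob matched nothing, or not a file
        month=${mbox##*/}              # last path component, e.g. 2000-01
        listdir=${mbox%/*}
        list=${listdir##*/}            # parent directory name, e.g. list-1
        printf 'mhonarc -add -outdir %s/%s/%s %s\n' \
            "$html_root" "$list" "$month" "$mbox"
    done
}
```

Calling plan_monthly_archives "$HOME/archive/mbox" "$HOME/archive/html"
lists one command per list and month, so each month ends up in its own
output directory.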

As for the choice of tools, I use Procmail for storing the mail from
the spool file.  Note, if dealing with a POP server, you could have a
simple Perl program download the mail to a temporary mailbox file and
have Procmail process that.  In one archive setup, I actually had
the Perl program download from the POP server and do the filtering
into the separate monthly mailbox files.

I use Procmail since the account is subscribed to multiple mailing
lists.  Hence, I use it to determine which list a message belongs to
and to deliver it safely to the monthly mailbox file for that list.

The whole process is invoked via cron.  This way, I can control
how frequently updates occur.  If traffic is heavy, you can decrease
the update frequency to allow enough time for new mail to be archived.

What follows are the relevant files for what I did:

read-mail cron script:

    #!/bin/sh
    cd $HOME/archive
    ./filter-spool && ./web-archive

The crontab entry I use is:

    23 * * * * /spare/users/mhonarc/archive/read-mail

In short, it runs once an hour, at 23 minutes past the hour.

filter-spool: Script that grabs mail from the spool area and stores
        it in monthly folders:

    #!/bin/sh

    PATH=/usr/local/bin:/usr/bin:/bin; export PATH

    umask 022
    ORGMAIL=/var/spool/mail/$LOGNAME
    #PROCMAILVARS="VERBOSE=yes LOGABSTRACT=yes"

    if cd $HOME/archive &&
     test -s $ORGMAIL &&
     lockfile -r0 -l1024 .newmail.lock 2>/dev/null
    then
      trap "rm -f .newmail.lock" 1 2 3 13 15
      lockfile -l1024 -ml
      /bin/cat $ORGMAIL >>.newmail &&
          /bin/cat /dev/null >$ORGMAIL
      lockfile -mu
      formail -s procmail $HOME/archive/.procmailrc $PROCMAILVARS <.newmail && \
          /bin/rm -f .newmail
      /bin/rm -f .newmail.lock
      exit 0
    else
      exit 1
    fi

The filter-spool script was snagged straight from the procmail manpages
with some minor modifications to suit my configuration.  One change
is that the script returns a 0 exit status only if there was new mail.
This way I can call the web-archive script only if new mail was
present.
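
To illustrate the exit-status chaining, here is a small sketch with
stand-in functions.  The names filter_spool_stub, web_archive_stub and
run_once are hypothetical; the real scripts do much more, and only the
"&&" behavior is being shown:

```shell
# Stand-in for filter-spool: succeed only if the given spool file
# exists and is non-empty (the same "test -s" check the script uses).
filter_spool_stub() {
    test -s "$1"
}

# Stand-in for web-archive: just report that it ran.
web_archive_stub() {
    echo "archiving"
}

# Same shape as the read-mail script: the second command runs only
# when the first one returns a 0 exit status.
run_once() {
    filter_spool_stub "$1" && web_archive_stub
}
```

With an empty spool file, run_once prints nothing and returns a
non-zero status; with new mail present, the archive step runs.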

Here is the procmail resource file I use.  I am fairly new to using
procmail, so I do not know how optimal it is.  Note, mailing list
names have been modified from the actual names used.  Also, the
regexes used to look for list names do not bother with domain portions
of addresses since all the lists are internal to the company.  The
lists checked in the resource files are: listname, listname-dev,
listname2, listname2-dev:

    ########################################################################
    ##      Procmail resource file for MHonArc archives
    ########################################################################
    ##      This recipe is only responsible for storing messages within
    ##      mail folders.  A separate process will be used to actually
    ##      generate HTML archives.
    ########################################################################

    SHELL=/bin/sh
    UMASK=133
    PATH=/usr/local/bin:/usr/local/lib:/bin:/usr/bin
    BASEDIR=$HOME/archive

    LOGFILE=$BASEDIR/procmail.log

    ## Do a lot of logging?
    #VERBOSE=yes

    ## Should deliveries be logged?
    LOGABSTRACT=yes

    ## Root path to mail folders
    MBOXROOT=$BASEDIR/mbox

    ## Current month: used as filename to store messages
    MONTHFOLDER=`date +"%Y-%m"`

    ## Pathname to MH inbox for non-list messages
    MH_INBOX=`mhpath +inbox`

    ## Flag if a list was matched
    HAVEMATCH=no

    ########################################################################
    ##      CVS Check-Ins
    ########################################################################
    ##      We only try to capture actual cvs check-ins.  Replies should
    ##      be archived with regular messages.
    ########################################################################

    :0
    * ^TO_listname-dev@
    * ^Subject: CVS commit
    {
      :0 Wic
      * ? test ! -d $MBOXROOT/listname-dev.CVS
      | mkdir -m 755 -p $MBOXROOT/listname-dev.CVS

      :0:
      $MBOXROOT/listname-dev.CVS/$MONTHFOLDER
    }

    :0
    * ^TO_listname2-dev@
    * ^Subject: CVS commit
    {
      :0 Wic
      * ? test ! -d $MBOXROOT/listname2-dev.CVS
      | mkdir -m 755 -p $MBOXROOT/listname2-dev.CVS

      :0:
      $MBOXROOT/listname2-dev.CVS/$MONTHFOLDER
    }

    ########################################################################
    ##      Discussion Lists
    ########################################################################
    ##      Since multiple lists may be specified for a given message
    ##      we must check for all list addresses instead of terminating
    ##      on the first match.
    ##      Messages are stored by month.
    ########################################################################

    ## listname
    :0
    * ^TO_listname@
    {
      :0 Wic
      HAVEMATCH=|echo yes

      :0 Wic
      * ? test ! -d $MBOXROOT/listname
      | mkdir -m 755 -p $MBOXROOT/listname

      :0 c:
      $MBOXROOT/listname/$MONTHFOLDER
    }

    ## listname2
    :0
    * ^TO_listname2@
    {
      :0 Wic
      HAVEMATCH=|echo yes

      :0 Wic
      * ? test ! -d $MBOXROOT/listname2
      | mkdir -m 755 -p $MBOXROOT/listname2

      :0 c:
      $MBOXROOT/listname2/$MONTHFOLDER
    }

    ## listname Dev
    :0
    * ^TO_listname-dev@
    {
      :0 Wic
      HAVEMATCH=|echo yes

      :0 Wic
      * ? test ! -d $MBOXROOT/listname-dev
      | mkdir -m 755 -p $MBOXROOT/listname-dev

      :0 c:
      $MBOXROOT/listname-dev/$MONTHFOLDER
    }

    ## listname2 Dev
    :0
    * ^TO_listname2-dev@
    {
      :0 Wic
      HAVEMATCH=|echo yes

      :0 Wic
      * ? test ! -d $MBOXROOT/listname2-dev
      | mkdir -m 755 -p $MBOXROOT/listname2-dev

      :0 c:
      $MBOXROOT/listname2-dev/$MONTHFOLDER
    }

    ########################################################################
    ##      Deliver to inbox (MH) if no matches 
    ########################################################################
    :0 :$MH_INBOX/$LOCKEXT
    * HAVEMATCH ?? no
    | rcvstore +$MH_INBOX

    ########################################################################
    ##      Fallback (should not get here)
    ########################################################################
    :0
    /dev/null


The final script is web-archive, which actually creates the MHonArc
archives.  The script is written in Perl.  It checks the monthly mailbox
storage area to see which folders have been updated recently.  It uses
the directory layout described above to determine list names and
which mailbox files need to be processed:

    #!/usr/local/bin/perl

    use lib '/spare/lib/perl5/site_perl/5.005';

    require 'mhamain.pl';

    my $HOME                = $ENV{'HOME'} || '/spare/users/mhonarc';
    my $HTML_DIR            = "$HOME/archive/html";
    my $MBOX_DIR            = "$HOME/archive/mbox";
    my $MHA_RC              = "$HOME/archive/common.mrc";
    my $MHA_MAXSIZE         = $ENV{'WA_MAXSIZE'} || 100;
    my $MTIME_THRESH        = $ENV{'WA_MTIME_THRESH'} || 86400;     # one day

    MAIN: {
      my $rebuild = $ENV{'WA_REBUILD'};
      my $debug = $ENV{'WA_DEBUG'};
      my $time = time;

      if ($debug) {
        print "HTML_DIR=$HTML_DIR\n",
              "MBOX_DIR=$MBOX_DIR\n",
              "MHA_RC=$MHA_RC\n",
              "MHA_MAXSIZE=$MHA_MAXSIZE\n",
              "MTIME_THRESH=$MTIME_THRESH\n";
        print "rebuild=$rebuild\n",
              "time=$time\n";
      }

      mhonarc::initialize();
      print "MHonArc initialized.\n"  if $debug;

      local(*DIR);

      print "Reading $MBOX_DIR.\n"  if $debug;
      opendir(DIR, $MBOX_DIR) || die qq/Unable to open "$MBOX_DIR": $!/;
      my @dirs = grep { (-d "$MBOX_DIR/$_") &&
                        ($_ ne '.') &&
                        ($_ ne '..')
                      } readdir(DIR);
      closedir(DIR);

      my(@months, @folders);
      my($dir, $list, $mon, $mondir, $htmldir, $cvs, $title, $mtime);

      print "Lists: ", join(', ', @dirs), "\n"  if $debug;
      foreach $list (@dirs) {
        print "Processing $list ...\n"  if $debug;

        $cvs = 0;

        $dir = join('/', $MBOX_DIR, $list);
        if (!opendir(DIR, $dir)) {
          warn qq/Unable to open "$dir": $!/;
          next;
        }
        @months = grep { /^\d+-\d+$/ } readdir(DIR);
        closedir(DIR);
        print "Months: ", join(', ', @months), "\n"  if $debug;

        @folders = ();
        foreach $mon (@months) {
          $mondir = join('/', $dir, $mon);
          if ($rebuild) {
            push(@folders, $mondir);
            next;
          }
          $mtime = (stat($mondir))[9];
          print "$mondir mtime: $mtime\n"  if $debug;
          if (($time - $mtime) < $MTIME_THRESH) {
            push(@folders, $mondir);
          }
        }

        next  if (!@folders);
        print "Folders: ", join(', ', @folders), "\n"  if $debug;

        $htmldir = join('/', $HTML_DIR, $list);
        if ($rebuild) {
          print "Removing $htmldir\n"  if $debug;
          system('/bin/rm', '-r', $htmldir);
        }
        mkdir($htmldir, 0777);

        $cvs = $list =~ /\.CVS/;
        ($title) = $list =~ /([^.]+)/;

        @mhaargs = (
          '-maxsize', $MHA_MAXSIZE,
          '-rcfile', $MHA_RC,
          '-outdir' , $htmldir,
          '-title', "$title (date)",
          '-ttitle', "$title (thread)",
        );
        if ($cvs) {
          push(@mhaargs, '-nothread');
        } else {
          push(@mhaargs, '-thread');
        }
        if (!$debug && !$rebuild) {
          push(@mhaargs, '-quiet');
        }
        if (!$rebuild) {
          push(@mhaargs, '-add');
        }
        print "MHonArc Options: ", join(' ', @mhaargs), "\n"  if $debug;

        mhonarc::process_input(@mhaargs, @folders);
      }

    }
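
The "updated recently" check in the script compares each mailbox
file's modification time against a one-day threshold.  A rough shell
equivalent, as a sketch only: the function name is hypothetical, and
find's "-mtime -1" test (modified less than 24 hours ago) stands in
for the Perl comparison against $MTIME_THRESH:

```shell
# List the monthly mailbox files of one list that changed in the
# last 24 hours.  recent_mailboxes is a hypothetical helper.
recent_mailboxes() {
    # $1 = directory holding one list's monthly mailbox files.
    # The -name glob is looser than the Perl regex /^\d+-\d+$/; it
    # only requires a digit, a dash, and a digit in that order.
    find "$1" -type f -name '[0-9]*-[0-9]*' -mtime -1 | sort
}
```

Files older than a day are skipped, which is why a full rebuild needs
a separate switch such as the WA_REBUILD variable in the script.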


You will notice I use the simple MHonArc API to process the archives.
This avoids forking off a shell process to invoke mhonarc.
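
For comparison, a sketch of the command line that would correspond to
the API call in the normal incremental case (no rebuild, no debug).
The function name is hypothetical, the -rcfile option is left out for
brevity, and the command is only printed, not run:

```shell
# Print the mhonarc command for one list, mirroring the option
# choices made in the Perl script.  mhonarc_command is hypothetical.
mhonarc_command() {
    # $1 = list directory name, $2 = output directory,
    # remaining arguments = mailbox files to add
    list=$1
    outdir=$2
    shift 2
    title=${list%%.*}          # text before the first dot, like /([^.]+)/
    case $list in
        *.CVS*) threading=-nothread ;;   # CVS check-in archives: no threads
        *)      threading=-thread ;;
    esac
    echo mhonarc -maxsize 100 -outdir "$outdir" \
        -title "$title (date)" -ttitle "$title (thread)" \
        "$threading" -quiet -add "$@"
}
```

Running the printed command from a script would cost one extra process
per list on every update, which is the overhead the API call avoids.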

Some notes:

    .   The method I chose to use is independent of how the mailing
        lists themselves are managed.  In this case, lists are managed
        by Majordomo and administered by someone else.

    .   I chose not to filter mail into the monthly mailbox files as
        it comes in, since that would require knowledge of the MTA.  All
        I care about is where the mail is initially delivered; from there
        I transfer it to a work area and then process it.


The example I provided may or may not be helpful to you.  However,
seeing examples does help you figure out what is possible and what
may work for you.

        --ewh