mharc-users

Re: [approved] Re: [approved] NEWBE: Immediate rebuilt of archive

2003-05-09 09:41:45
On May 9, 2003 at 13:47, Steffen Kaiser wrote:

Ah, I suspected this behaviour is just _one_ way.
I didn't actually like it, because it requires me to maintain two list
definitions: one in mailman and one for filtering using mharc.

I do not know the specifics of mailman, but it may be possible to
have a script that auto-generates a lists.def from the mailman definitions.

I should note that others have replaced the pipermail piece of mailman
with mharc, like at <http://mail.gnu.org/archive/html/>.  The
GNU folks use mailman (or some form of it) to help manage project
lists at savannah, but replaced the pipermail archives with mharc.
I do not know the specifics of how they set things up.

Of course, I understand that using a subscribed user makes you independent
of the mailing list software and, furthermore, you needn't run the archiving on
the same host as the list processor.

This was a primary goal.  It also allows you to run archives for
lists where each list uses different list management software.  Also,
you can combine multiple lists into one archive.  I do this for
my own private archives.

Now, it is possible to support alternative "input" methods into mharc.
The ORGMAIL variable in <mharc-root>/lib/config.sh allows you to specify any
"incoming" mailbox file.  Also, there are the mh-month-pack and
mbox-month-pack scripts that can be used.  If using these techniques,
you would not use the read-mail and filter-spool script components, but
just the web-archive script.

OK. I used the web-archive script to build the initial HTML archive (along
with the nice search interface etc.) from the mailboxes created by
mailman.

You could technically keep this model.  The only potential gotcha would
be synchronization between mailman and web-archive.  I.e., you do not
want mailman to be updating a mailbox file when web-archive is scanning
it.  You have a race condition where it is possible that web-archive
may archive a partial message that mailman has not finished writing.
You will need to verify how mailman updates the mailbox files to
see if there is an actual race condition and if it is possible to
include some synchronization.

For your case, you could have a script that mailman invokes for
each message it receives to append a copy to a mailbox file of your
choosing (you can use procmail to ensure safe delivery).  Then, set
the ORGMAIL config.sh variable to that spool file (also make sure to
set IS_MAIL_SPOOL to the proper value).  The cron scripts will do
the rest as long as you define lists.def properly.
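As a rough sketch of the "append a copy to a mailbox file" step, the following Python function adds one message to an mbox file while holding a lock.  The function name append_to_spool and the spool path are hypothetical and not part of mharc or mailman.  The lock only helps against other programs that honor the same locking, which would need to be verified for your setup.

```python
import mailbox


def append_to_spool(raw_message: bytes, spool_path: str) -> None:
    """Append one raw message to an mbox file while holding a lock.

    Hypothetical helper for illustration; spool_path would be the file
    that the ORGMAIL variable in config.sh points at.
    """
    box = mailbox.mbox(spool_path, create=True)
    box.lock()  # dot lock plus flock()/lockf() where available
    try:
        box.add(raw_message)  # adds a "From " separator line if missing
        box.flush()           # write the message out before unlocking
    finally:
        box.unlock()
        box.close()
```

A wrapper invoked once per message could read the message from standard input (sys.stdin.buffer.read()) and pass it to append_to_spool together with the spool path.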

Do I understand the process correctly, that I have to:

a) mimic filter-spool by filing (aka appending) the new message into
~mharc/mbox/listname/$( ~mharc/bin/extract-mesg-date -fmt '%Y-%m' )

in order to create the "raw" mailboxes.
(mharc way: read-mail -> filter-spool -> procmail ->
~mharc/procmailrc.mharc)

Perhaps I can even use the "arriving" (aka current system) time bypassing
extract-mesg-date?

That is completely up to you.  extract-mesg-date was added to mharc
to deal with cases when the system may be down or there are other
administration items that may delay mail processing.  extract-mesg-date
ensures that the message is archived according to the date/time
it was received vs when it was processed.

It also allows you to synchronize the dates used in filtering as
they are used by mhonarc.  Remember, mhonarc uses the date/time
information within the message itself and not the current system time.
extract-mesg-date mirrors this behavior.

This is important for the boundary case scenarios where you have
mail that is received near the end of one period and the beginning of
the next.  If you use the "current system time" when raw filtering,
you could place a message that belongs to the end of one period in
the next period.  Then in the HTML archives, you will have a message
(or messages) listed with a date of the previous period.

For example, say a message has a date of Jan 31, but the message is
"raw" filtered on Feb 1.  The message will be placed in the Feb
period and not the Jan period.  When mhonarc processes the Feb
period, you will have a Jan 31 date show up in the date index for Feb.

Note, the MSG_DATE_FIELDS config.sh variable needs to be in sync
with MHonArc's DATEFIELDS resource to ensure that the same date/time
is used by both components.
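To illustrate the boundary case, here is a small Python sketch that derives the period from the message's own date headers and, for comparison, from the clock.  It is not how extract-mesg-date is implemented.  The function names are made up for this example, and the date_fields argument only stands in for the idea behind MSG_DATE_FIELDS.  The period is taken from the date as written in the header, without any timezone conversion, which is an assumption.

```python
from datetime import datetime
from email import message_from_string
from email.utils import parsedate_to_datetime


def period_from_message(raw_message, date_fields=("Date",), fmt="%Y-%m"):
    """Return the period named by the first usable date header, or None.

    date_fields is tried in order (hypothetical stand-in for a list of
    header names such as the one MSG_DATE_FIELDS would hold).
    """
    msg = message_from_string(raw_message)
    for name in date_fields:
        for value in msg.get_all(name, []):
            if name.lower() == "received":
                # In a Received header the date follows the last ';'.
                value = value.rsplit(";", 1)[-1]
            try:
                return parsedate_to_datetime(value.strip()).strftime(fmt)
            except (TypeError, ValueError):
                continue  # unparsable date, try the next candidate
    return None


def period_from_clock(now, fmt="%Y-%m"):
    """Return the period for the processing time instead."""
    return now.strftime(fmt)
```

For a message carrying "Date: Fri, 31 Jan 2003 23:50:00 +0000" that is filtered shortly after midnight on Feb 1, period_from_message gives 2003-01 while period_from_clock gives 2003-02, which is exactly the mismatch described above.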

b) and finally run ~mharc/bin/web-archive, e.g. once an hour, to propagate
the changes made to the raw mailboxes into the HTML archive?

(mharc way: via cron: read-mail -> web-archive )

Technically, if you are using your own raw filtering component, you
do not even use read-mail and filter-spool.  You'd just use the
web-archive piece.  The "once an hour" is only a convention.  You
can change it depending on your system configuration and requirements.

mharc maintains the raw mailbox files to facilitate HTML archive
recovery and rebuilds.  The crontab is set up to gzip compress mailbox
files that have not been touched in a long time.

There are some remarks in the man page of some commands about not using
compressed archives, because they won't support it. The idea is then to
hope that no further mail arrives in this "compressed" period, right?

This applies only to the special "import" scripts mh-month-pack and
mbox-month-pack.  These scripts are not invoked automatically by
default.  If you customize things to invoke them automatically, you
are correct that you hope there is not a message that would be
targeted for a compressed period.  Possible solutions: Never compress
older mail, or try to update the scripts to detect a compressed period
and uncompress it automatically.
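The second option could look roughly like the following Python sketch, which expands a compressed period before anything is appended to it.  It assumes the compressed file sits next to the plain one under the same name plus ".gz" (the usual gzip naming).  The function name is hypothetical, and the import scripts themselves do nothing like this today.

```python
import gzip
import os
import shutil


def ensure_uncompressed(period_path):
    """Make sure period_path exists as a plain mailbox file.

    If period_path + ".gz" exists (assumed naming), it is expanded.  Any
    plain file already present is kept by appending it after the older,
    decompressed mail.  Returns period_path.
    """
    gz_path = period_path + ".gz"
    if os.path.exists(gz_path):
        tmp_path = period_path + ".tmp"
        with open(tmp_path, "wb") as dst:
            with gzip.open(gz_path, "rb") as src:
                shutil.copyfileobj(src, dst)
            if os.path.exists(period_path):
                with open(period_path, "rb") as newer:
                    shutil.copyfileobj(newer, dst)
        os.replace(tmp_path, period_path)  # swap in the combined file
        os.remove(gz_path)
    return period_path
```

Calling this on the target period file before filing a message would let late mail land in an old period, at the cost of that period staying uncompressed until the next cron compression pass.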

You are correct that if you do not want the old data, you can remove
the unix mailbox files that you no longer want.

Here you are referring to the broken-up raw mailboxes
~mharc/mbox/listname/period, right?

Yes.  Note, the HTML archives of the periods deleted will still be
present.  They would only vanish if you did a full rebuild of the
archives or manually delete the HTML periods also.  Deleting the
HTML periods would require an update to the search index.

--ewh

---------------------------------------------------------------------
To sign-off this list, send email to majordomo(_at_)mhonarc(_dot_)org with the
message text UNSUBSCRIBE MHARC-USERS
