Earl Hood wrote:
On December 18, 2005 at 00:11, Ken Bass wrote:
2) Add a resource that contains the 'input' filename. Basically, I would
like there to be a mapping between the generated data and the input
file. I am thinking MHonarc should add something like:
<!--X-Input-Filename: /var/archive/mbox/file1.txt -->
This has been discussed in the past. The input filename is
not always known, and it varies based on the style of input: mbox
or mh. What does exist are callbacks (see the API appendix and
$mhonarc::CBMessageConverted) that provide hooks. The callbacks
were added due to a request by a user.
When I saw that API callback the other day, I was initially excited.
But when I looked in detail, it did not seem like I had access to the
message header or body. It would have been useful if the API passed in
some type of hash/assoc array so user-defined fields/comments could be
passed back into the message being converted. I had to abandon this approach.
Another option I started to implement was to add an '$INPUTFILE$'
resource variable (kind of like $MSG$ but for the input). This would
allow the flexibility so the feature is optional and allow the name to
be used in many ways - meta tags, comments, URLs, etc. The user could
simply add their own tags however they want in their output.
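To make the idea concrete, here is a minimal sketch of how a $NAME$-style variable could be expanded in a layout string. It is written in Python purely as an illustration and is not MHonArc's actual substitution code; the function name expand_vars and the pattern used for variable names are assumptions, and $INPUTFILE$ is the proposed (not existing) variable.

```python
import re

# Hypothetical illustration only: this is not MHonArc's substitution code.
# It expands $NAME$ tokens in a layout string from a dictionary and leaves
# unknown tokens untouched.
VAR_PATTERN = re.compile(r"\$([A-Z][A-Z0-9_-]*)\$")


def expand_vars(template, values):
    """Replace each $NAME$ in template with values[NAME], if present."""
    def substitute(match):
        name = match.group(1)
        # Keep the original token when no value is known for it.
        return values.get(name, match.group(0))
    return VAR_PATTERN.sub(substitute, template)


layout = "<!--X-InputFile: $INPUTFILE$ -->"
print(expand_vars(layout, {"INPUTFILE": "/var/archive/mbox/file1.txt"}))
```

Because unknown variables are left in place, a layout that uses $INPUTFILE$ would still render when no input filename is available (for example, when reading from stdin).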
I also looked at annotation but that didn't pan out either unless msgs
were added one at a time.
When converting large archives, sometimes there are errors in the
processing and at least by examining the HTML source you could see
which input file causes it. The message ID is not always useful,
especially when MHonArc generates its own ID.
Agree on the last part. If you are processing news spools, why
are there no message-ids?
That is my dilemma. My archive is from 1996 to present. For certain
years the messages were from a mailing list and other years a newsgroup.
I recently reorganized/expanded my archive and upgraded to the latest
version. In the process, I added hundreds of thousands of messages. When
I viewed the chronological view, there were some entries that had empty
bodies with a subject of '[no subject]', author 'Unknown', and today's
date. Without a way to map them, I have no way to trace to the input and
see what is wrong. For the cases of no message IDs, I found some 'temp
files' among these messages and some 0 length messages. Those files were
processed by mhonarc and resulted in some of the mystery entries.
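Files like these can be caught before they reach the archive. The following is a rough Python sketch, not part of MHonArc, that walks an input directory and reports files that are empty or have no Message-ID header. It assumes one message per file (MH-style or news-spool layout); the function name find_suspect_files is made up for the example.

```python
import os
from email.parser import BytesHeaderParser


def find_suspect_files(root):
    """Yield (path, reason) for files that are empty or lack a Message-ID.

    Assumes each file under root holds a single message.
    """
    parser = BytesHeaderParser()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                yield path, "zero length"
                continue
            with open(path, "rb") as handle:
                # Only the header block is parsed; the body is ignored.
                headers = parser.parse(handle)
            if headers["Message-ID"] is None:
                yield path, "no Message-ID header"


if __name__ == "__main__":
    import sys
    for path, reason in find_suspect_files(sys.argv[1]):
        print(f"{path}: {reason}")
```

Run against the top of the Year/Mon tree, this lists candidates to inspect or exclude before conversion. It would not work as-is for mbox input, where one file holds many messages.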
Some of the other 'input problems' I encountered during this archive
Warning: Unrecognized character set: x-user-defined
Warning: Unrecognized time zone, ","
Warning: Could not parse date for message
Warning: Unrecognized character set: utf-7
Warning: Unrecognized character set: ibm850
Warning: Unrecognized character set: ibm864
Warning: Unrecognized time zone, "-5:00"
Warning: Bad year (1956) using current
Warning: No end boundary delimiter found in message body
Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/base64.pl
line 91, <GEN70164> line 18.
Even with a message id available, grepping through hundreds of thousands
of messages for each warning takes a while and really slowed down the
For everyday additions from a news spool, which is the normal case,
I agree that there should be a message ID. In my case I was
processing older messages which led me down this path.
A problem with tracking the filename is it increases the amount of
data stored in the dbfile. The callback API could be used to track
the info for those interested. Alternatively, mhonarc could be
modified to have diagnostic data in message pages that are preserved
during edits so the info is not lost (which would require a new
delimiting token to preserve such info).
I modified my mhonarc and added a '%InputFile' hash which stores the
filename. I set it after read_mail_header() call when the input is a
directory. During output_mail(), if it is defined (which it won't be for
single adds or adds from stdin) I add a '<!--X-InputFile:
/var/archive/mbox/file1.txt -->'. I did not add it to the database which
I guess means it could not be recreated? This could probably be used
with a $INPUTFILE$ resource variable, but I could not understand the
code wrt mapping the index to the key during variable substitution.
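With the comment in place, mapping a generated page back to its input file is a simple text search. A small Python sketch follows, assuming the comment format shown above; that format comes from the local modification described here and is not emitted by stock MHonArc. The function name input_file_for_page is hypothetical.

```python
import re

# Matches the comment format described above. The format is a local
# modification, not something stock MHonArc writes.
INPUT_FILE_COMMENT = re.compile(r"<!--X-InputFile:\s*(.*?)\s*-->")


def input_file_for_page(html_text):
    """Return the recorded input path, or None if the page has no such comment."""
    match = INPUT_FILE_COMMENT.search(html_text)
    return match.group(1) if match else None


page = "<html>\n<!--X-InputFile: /var/archive/mbox/file1.txt -->\n</html>\n"
print(input_file_for_page(page))
```

Pages added one at a time or from stdin would have no comment, so callers need to handle the None result.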
Of course, if such changes were made, the feature would be optional
since revealing such information could be a security concern for
I thought about that too but in my specific case it did not bother me.
The filenames are just numbers organized in Year/Mon directories (though
it does expose the name of a user account). Others might be concerned
with this of course. With this mapping, when I see a problem message in
the archive, I simply 'view page source' and can see immediately what
file caused it. Due to the size of the archive, I'm considering putting
a 'report this message' link in the TOPLINKS of each msg so that users
can report odd stuff (or illegal content/porn/etc). Being able to map
from the page the user visits to the original file would be helpful in
this case also.