mhonarc-users

Re: Feature request

2005-12-20 14:27:52
Earl Hood wrote:
On December 18, 2005 at 00:11, Ken Bass wrote:

2) Add a resource that contains the 'input' filename. Basically, I would like there to be a mapping between the generated data and the input file. I am thinking MHonarc should add something like:
<!--X-Input-Filename: /var/archive/mbox/file1.txt -->


This has been discussed in the past.  The input filename is
not always known, and it varies based on the style of input: mbox
or mh.  What does exist are callbacks (see the API appendix and
$mhonarc::CBMessageConverted) that provide hooks.  The callbacks
were added due to a request by a user.

When I saw that API callback the other day, I was initially excited. But when I looked in detail, it did not seem like I had access to the message header or body. It would have been useful if the API passed in some type of hash/assoc array so user-defined fields/comments could be passed back into the message being converted. I had to abandon this route.

Another option I started to implement was to add an '$INPUTFILE$' resource variable (kind of like $MSG$ but for the input). This would keep the feature optional and allow the name to be used in many ways: meta tags, comments, URLs, etc. Users could simply add their own tags however they want in their output.

I also looked at annotation but that didn't pan out either unless msgs were added one at a time.

When converting large archives, sometimes there are errors in the processing, and at least by examining the HTML source you could see which input file caused them. The message ID is not always useful, especially when MHonArc generates its own ID.


Agree on the last part.  If you are processing news spools, why
are there no message-ids?

That is my dilemma. My archive is from 1996 to present. For certain years the messages were from a mailing list and other years a newsgroup. I recently reorganized/expanded my archive and upgraded to the latest version. In the process, I added hundreds of thousands of messages. When I viewed the chronological view, there were some entries that had empty bodies with a subject of '[no subject]', author 'Unknown', and today's date. Without a way to map them, I have no way to trace back to the input and see what is wrong. For the cases with no message IDs, I found some 'temp files' among these messages and some 0-length messages. Those files were processed by mhonarc and resulted in some of the mystery entries.
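
A check along these lines can be run on the input before conversion. The following is only a minimal Python sketch, not part of MHonArc: it assumes one message per file (as in an MH folder or news spool), and the function name `find_suspect_files` is made up for illustration. It walks a directory tree and reports files that are empty or have no Message-ID header, which are the two cases described above.

```python
import os
from email.parser import BytesHeaderParser


def find_suspect_files(root):
    """Yield (path, reason) for files that are empty or lack a Message-ID.

    Assumes each file under `root` holds a single message; mbox files
    would need to be split first (for example with the mailbox module).
    """
    parser = BytesHeaderParser()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                yield path, "zero length"
                continue
            # Parse only the header block; the body is not needed here.
            with open(path, "rb") as fh:
                headers = parser.parse(fh)
            if headers["Message-ID"] is None:
                yield path, "no Message-ID header"
```

Looping over `find_suspect_files("/var/archive")` (the path is just an example) and printing each pair gives a list of candidate files to inspect or exclude before they turn into '[no subject]' entries.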

Some of the other 'input problems' I encountered during this archive rebuild were:

Warning: Unrecognized character set: x-user-defined
Warning: Unrecognized time zone, ","
Warning: Could not parse date for message
Warning: Unrecognized character set: utf-7
Warning: Unrecognized character set: ibm850
Warning: Unrecognized character set: ibm864
Warning: Unrecognized time zone, "-5:00"
Warning: Bad year (1956) using current
Warning: No end boundary delimiter found in message body
Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/base64.pl line 91, <GEN70164> line 18.

Even with a message id available, grepping through hundreds of thousands of messages for each warning takes a while and really slowed down the process.

For normal everyday additions from a news spool which is what happens normally, I agree that there should be a message id. In my case I was processing older messages which led me down this path.

A problem with tracking the filename is it increases the amount of
data stored in the dbfile.  The callback API could be used to track
the info for those interested.  Alternatively, mhonarc could be
modified to have diagnostic data in message pages that are preserved
during edits so the info is not lost (which would require a new
delimiting token to preserve such info).

I modified my mhonarc and added a '%InputFile' hash which stores the filename. I set it after the read_mail_header() call when the input is a directory. During output_mail(), if it is defined (which it won't be for single adds or adds from stdin), I add a '<!--X-InputFile: /var/archive/mbox/file1.txt -->'. I did not add it to the database, which I guess means it could not be recreated? This could probably be used with a $INPUTFILE$ resource variable, but I could not understand the code wrt mapping the index to the key during variable substitution.
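
Once pages carry such a comment, mapping a page back to its input can be scripted. This is a rough Python sketch that assumes the comment format shown above; the function name `find_input_file` is hypothetical, and nothing here is an MHonArc API.

```python
import re
from typing import Optional

# Matches the comment format quoted above:
#   <!--X-InputFile: /var/archive/mbox/file1.txt -->
_INPUT_FILE_RE = re.compile(r"<!--\s*X-InputFile:\s*(.*?)\s*-->")


def find_input_file(html: str) -> Optional[str]:
    """Return the path recorded in the X-InputFile comment, or None."""
    match = _INPUT_FILE_RE.search(html)
    return match.group(1) if match else None


page = "<html><!--X-InputFile: /var/archive/mbox/file1.txt --><body></body></html>"
print(find_input_file(page))
```

Pages without the comment (single adds or adds from stdin, per the description above) simply return None.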

Of course, if such changes were made, the feature would be optional
since revealing such information could be a security concern for
users.

I thought about that too, but in my specific case it did not bother me. The filenames are just numbers organized in Year/Mon directories (though it does expose the name of a user account). Others might be concerned with this, of course. With this mapping, when I see a problem message in the archive, I simply 'view page source' and can see immediately what file caused it. Due to the size of the archive, I'm considering putting a 'report this message' link in the TOPLINKS of each msg so that users can report odd stuff (or illegal content/porn/etc). Being able to map from the page the user visits to the original file would be helpful in this case also.
