Re: Feature request

Earl Hood wrote:

On December 18, 2005 at 00:11, Ken Bass wrote:
2) Add a resource that contains the 'input' filename. Basically, I would like there to be a mapping between the generated data and the input file. I am thinking MHonArc should add something like:

This has been discussed in the past.  The input filename is
not always known, and it varies based on the style of input: mbox
or mh.  What does exist are callbacks (see the API appendix and
$mhonarc::CBMessageConverted) that provide hooks.  The callbacks
were added due to a request by a user.

When I saw that API callback the other day, I was initially excited. But when I looked in detail, it did not seem like I had access to the message header or body. It would have been useful if the API passed in some type of hash/assoc array so user-defined fields/comments could be passed back into the message being converted. I had to abandon this route.

Another option I started to implement was to add an '$INPUTFILE$' resource variable (kind of like $MSG$ but for the input). This would keep the feature optional and allow the name to be used in many ways: meta tags, comments, URLs, etc. The user could simply add their own tags however they want in their output.

I also looked at annotation, but that didn't pan out either unless msgs were added one at a time.

When converting large archives, sometimes there are errors in the processing, and at least by examining the HTML source you could see which input file causes it. The message ID is not always useful, especially when MHonArc generates its own ID.
Agree on the last part.  If you are processing news spools, why
are there no message-ids?

That is my dilemma. My archive is from 1996 to present. For certain years the messages were from a mailing list and other years a newsgroup. I recently reorganized/expanded my archive and upgraded to the latest version. In the process, I added hundreds of thousands of messages. When I viewed the chronological view, there were some entries that had empty bodies with a subject of '[no subject]', author 'Unknown', and today's date. Without a way to map them, I have no way to trace back to the input and see what is wrong. For the cases of no message IDs, I found some 'temp files' among these messages and some 0-length messages. Those files were processed by mhonarc and resulted in some of the mystery entries.
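The kind of pre-scan that would have caught those files before conversion can be sketched as follows. This is only an illustration, not part of MHonArc: it assumes an MH-style tree with one message per file, and the function name find_suspect_files is made up for the example.

```python
import os
from email.parser import BytesHeaderParser


def find_suspect_files(root):
    """Yield (path, reason) for files likely to become empty archive entries.

    Hypothetical helper: assumes one message per file under `root`.
    """
    parser = BytesHeaderParser()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Zero-length files cannot contain a message at all.
            if os.path.getsize(path) == 0:
                yield path, "zero-length file"
                continue
            with open(path, "rb") as fh:
                headers = parser.parse(fh)
            if not headers.keys():
                # Stray temp files usually have no header block.
                yield path, "no parsable headers"
            elif headers["Message-ID"] is None:
                yield path, "no Message-ID header"
```

Running it over the input directory before the rebuild and printing each (path, reason) pair gives a list of files to inspect or exclude.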

Some of the other 'input problems' I encountered during this archive rebuild were:


Warning: Unrecognized character set: x-user-defined
Warning: Unrecognized time zone, ","
Warning: Could not parse date for message
Warning: Unrecognized character set: utf-7
Warning: Unrecognized character set: ibm850
Warning: Unrecognized character set: ibm864
Warning: Unrecognized time zone, "-5:00"
Warning: Bad year (1956) using current
Warning: No end boundary delimiter found in message body

Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/base64.pl line 91, <GEN70164> line 18.

Even with a message ID available, grepping through hundreds of thousands of messages for each warning takes a while and really slowed down the process.
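One way around the repeated grepping is to read every file once and keep a Message-ID to filename index, so each later lookup is a dictionary access. The sketch below assumes one message per file (MH-style); build_message_id_index is a hypothetical name, and nothing here is MHonArc code.

```python
import os
from email.parser import BytesHeaderParser


def build_message_id_index(root):
    """Map each Message-ID found under `root` to the files carrying it.

    Hypothetical helper: a list is kept per ID because duplicates can occur.
    """
    index = {}
    parser = BytesHeaderParser()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                msg_id = parser.parse(fh)["Message-ID"]
            if msg_id is not None:
                index.setdefault(str(msg_id).strip(), []).append(path)
    return index
```

With the index built once, index.get("<some-id@example.org>") returns the matching files, or None when the ID is not in the tree. Messages without a Message-ID are simply absent from the index, so this does not help for the entries where mhonarc generated its own ID.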

For normal everyday additions from a news spool, which is the usual case, I agree that there should be a message ID. In my case I was processing older messages, which led me down this path.

A problem with tracking the filename is it increases the amount of
data stored in the dbfile.  The callback API could be used to track
the info for those interested.  Alternatively, mhonarc could be
modified to have diagnostic data in message pages that are preserved
during edits so the info is not lost (which would require a new
delimiting token to preserve such info).

I modified my mhonarc and added a '%InputFile' hash which stores the filename. I set it after the read_mail_header() call when the input is a directory. During output_mail(), if it is defined (which it won't be for single adds or adds from stdin) I add a ''. I did not add it to the database, which I guess means it could not be recreated? This could probably be used with a $INPUTFILE$ resource variable, but I could not understand the code wrt mapping the index to the key during variable substitution.
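The shape of that change can be illustrated outside MHonArc with a small Python sketch. Everything named here is an assumption for illustration: the dictionary stands in for the '%InputFile' hash, and the 'X-Input-File' comment format is hypothetical, not the marker actually used in the patch. The filename is percent-encoded so it can never contain the '-->' sequence that would end the comment early.

```python
import re
from urllib.parse import quote, unquote

# Stand-in for the '%InputFile' hash: message key -> input filename.
input_file = {}

# Hypothetical marker format; the real patch may use something different.
COMMENT_RE = re.compile(r"<!--X-Input-File: (\S*) -->")


def record_input_file(key, filename):
    """Remember which input file a message key was read from."""
    input_file[key] = filename


def input_file_comment(key):
    """Return the HTML comment for `key`, or '' when no file is known
    (for example, a single add or input from stdin)."""
    filename = input_file.get(key)
    if filename is None:
        return ""
    return "<!--X-Input-File: %s -->" % quote(filename)


def extract_input_file(html):
    """Recover the input filename from a generated page, or None."""
    match = COMMENT_RE.search(html)
    return unquote(match.group(1)) if match else None
```

Because the mapping lives only in the generated page and not in the database, a page regenerated later from the database alone would lose the comment, which matches the concern raised above.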

Of course, if such changes were made, the feature would be optional
since revealing such information could be a security concern for
users.

I thought about that too, but in my specific case it did not bother me. The filenames are just numbers organized in Year/Mon directories (though it does expose the name of a user account). Others might be concerned with this, of course. With this mapping, when I see a problem message in the archive, I simply 'view page source' and can see immediately what file caused it. Due to the size of the archive, I'm considering putting a 'report this message' link in the TOPLINKS of each msg so that users can report odd stuff (or illegal content/porn/etc). Being able to map from the page the user visits to the original file would be helpful in this case also.