Re: Feature request

On December 20, 2005 at 15:27, Ken Bass wrote:

  When I saw that API callback the other day, I was initially excited. 
But when I looked in detail, it did not seem like I had access to the 
message header or body. It would have been useful if the API passed in 
some type of hash/assoc array so user defined fields/comments could be 
passed back into the message being converted. I had to abandon this 
route.


Which API calls did you look at? $mhonarc::CBMessageConverted
provides header info along with the filename info.  I'm guessing that
CBMessageConverted may be too late for you?  It appears you want the
filename info in one of the header-read-based callbacks.  Correct?

Agree on the last part.  If you are processing news spools, why
are there no message-ids?


  That is my dilemma. My archive is from 1996 to present. For certain 
years the messages were from a mailing list and other years a newsgroup. 
I recently reorganized/expanded my archive and upgraded to the latest 
version. In the process, I added hundreds of thousands of messages. When 
I viewed the chronological view, there were some entries that had empty 
bodies with subject of '[no subject]', author 'Unknown', with today's 
date. Without a way to map them, I have no way to trace to the input and 
see what is wrong. For the cases of no message id's, I found some 'temp 
files' among these messages and some 0 length messages. Those files were 
processed by mhonarc and resulted in some of the mystery entries.


One option is to add a message-id to each message before passing
the data to mhonarc.  I.e., do some pre-processing on the data to
clean things up before passing to mhonarc.  The pre-processing could
include deleting 0 byte files.
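
A rough sketch of that kind of pre-processing, in Python.  It assumes
one message per file (no mbox "From " separator lines), and the
function names and the "archive.invalid" domain are made up for
illustration; none of this is part of mhonarc.  The synthetic id is a
hash of the file contents, so two byte-identical files get the same id.

```python
import hashlib
import os
import re

# Matches a Message-ID header at the start of any header line.
MSGID_RE = re.compile(rb"^message-id:", re.IGNORECASE | re.MULTILINE)


def header_block(data):
    """Return everything before the first blank line (the header block)."""
    return re.split(rb"\r?\n\r?\n", data, maxsplit=1)[0]


def synthetic_msgid(data):
    """Build a made-up message-id from a hash of the raw message bytes."""
    return "<%s@archive.invalid>" % hashlib.sha1(data).hexdigest()


def preprocess_file(path):
    """Delete empty files; prepend a Message-ID header where none exists.

    Returns (action, msgid) where action is 'deleted', 'tagged' or
    'unchanged', and msgid is the id that was added (or None).
    """
    if os.path.getsize(path) == 0:
        os.remove(path)
        return ("deleted", None)
    with open(path, "rb") as fh:
        data = fh.read()
    if MSGID_RE.search(header_block(data)):
        return ("unchanged", None)
    msgid = synthetic_msgid(data)
    with open(path, "wb") as fh:
        fh.write(b"Message-ID: " + msgid.encode("ascii") + b"\n" + data)
    return ("tagged", msgid)


def preprocess_tree(root):
    """Run preprocess_file over every file under root."""
    results = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            action, msgid = preprocess_file(path)
            results.append((path, action, msgid))
    return results
```

Saving the (path, action, msgid) tuples to a log gives a way to trace
a mystery entry in the archive back to its input file.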

Some of the other 'input problems' I encountered during this archive 
rebuild were:

Warning: Unrecognized character set: x-user-defined


See the CHARSETALIASES and CHARSETCONVERTERS resources.  If a charset
is not recognized, mhonarc falls back to the default charset, us-ascii
(which can be changed via a resource).

Warning: Unrecognized time zone, ","


I'm guessing a strange date format.

Warning: Unrecognized time zone, "-5:00"


Numeric time offsets are not supposed to have a ':'.
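
If the data is being pre-processed anyway, offsets like that could be
rewritten before mhonarc sees them.  A small Python sketch, assuming
the offset is preceded by whitespace in the Date value; the function
name is made up for illustration:

```python
import re

# A sign, one or two hour digits, a colon, two minute digits.  The sign
# must follow whitespace so the time of day (e.g. 15:27:00) is not touched.
_COLON_OFFSET = re.compile(r"(?<=\s)([+-])(\d{1,2}):(\d{2})(?![\d:])")


def normalize_offset(date_value):
    """Rewrite offsets such as '-5:00' or '+05:30' as '-0500' / '+0530'."""
    def fix(match):
        sign, hours, minutes = match.groups()
        return "%s%02d%s" % (sign, int(hours), minutes)
    return _COLON_OFFSET.sub(fix, date_value)
```

For example, normalize_offset("Tue, 20 Dec 2005 15:27:00 -5:00")
returns "Tue, 20 Dec 2005 15:27:00 -0500".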

Warning: No end boundary delimiter found in message body


A MIME multipart with no end boundary.  The code is pretty good
at dealing with this, so usually the warning can be ignored.

Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/base64.pl 
line 91, <GEN70164> line 18.


Base64 encoded data is badly formatted.
(Side note: You should upgrade from Perl 5.8.0.  I think 5.8.0 is
kind of buggy).
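
To find the offending messages ahead of time, something like the
following Python sketch could flag parts whose base64 data does not
decode cleanly.  It uses Python's own strict decoder, which is only an
approximation of what base64.pl complains about, and the function name
is made up for illustration:

```python
import base64
import binascii
from email import message_from_bytes


def bad_base64_parts(raw):
    """Return the indexes (in walk order) of parts with undecodable base64."""
    msg = message_from_bytes(raw)
    bad = []
    for index, part in enumerate(msg.walk()):
        if part.is_multipart():
            continue
        encoding = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
        if encoding != "base64":
            continue
        # Drop line breaks and other whitespace before strict decoding.
        payload = "".join(part.get_payload().split())
        try:
            base64.b64decode(payload, validate=True)
        except (binascii.Error, ValueError):
            bad.append(index)
    return bad
```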

Even with a message id available, grepping through hundreds of thousands 
of messages for each warning takes a while and really slowed down the 
process.


Yep.
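
If a warning can be tied to a message-id, a one-time index of
message-id to input file avoids a full grep per warning.  A Python
sketch, assuming one message per file and an unfolded Message-ID
header; the function name is made up for illustration:

```python
import os
import re

# Captures the id on the same line as the header name (unfolded only).
_MSGID_LINE = re.compile(rb"^message-id:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)


def build_msgid_index(root):
    """Map each message-id found under root to the list of files carrying it."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                head = fh.read(65536)  # headers are assumed to fit in 64 KB
            header = re.split(rb"\r?\n\r?\n", head, maxsplit=1)[0]
            match = _MSGID_LINE.search(header)
            if match:
                msgid = match.group(1).decode("ascii", "replace")
                index.setdefault(msgid, []).append(path)
    return index
```

The index is built in a single pass over the spool, after which each
lookup is a dictionary access.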

I agree that a more verbose operating mode will be useful, like a -debug
that prints out much more detail about what is going on.  I've had
a few cases myself where it would have been handy.

Of course, such a mode is not handy in cases where a problem is discovered
later on and you want to know which input message maps to a given
HTML file.

--ewh