Script for Converting Hypermail archives to MHonArc archives

Hello,

    Others on this list have asked for a way to convert their old
Hypermail archives to MHonArc-style archives when the original mailbox
source is no longer available. Attached is a Perl script which converts
HTML files generated by Hypermail into a "mailbox" format. This is not
a true BSD-style mailbox, but MHonArc (V2.1.0 and V2.2.0) parses the
new "mailbox" without complaint.

 See http://www.albany.net/~anthonyw/archivedemo/ for an example which
uses some recent posts to the mhonarc mailing list.

 When dealing with archives which span a number of months, one may modify
the script to write to more than one file. The month can be taken from
the $sentdate variable, which holds the sent date of the current message,
and used to decide which file the output should be written to.
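
 As a rough sketch of that idea (not part of the attached script), the
helper below turns a sent date into a per-month mailbox name. The name
mailbox_for_date, the "mailbox-YYYY-MM.txt" naming scheme and the assumed
date layout (an English month abbreviation followed later by a four-digit
year) are all illustrative assumptions; check them against the dates in
your own archive.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map a sent date such as
# "Tue, 14 Apr 1998 10:22:31 -0400" to a per-month mailbox file name.
my %month_number = (
    Jan => '01', Feb => '02', Mar => '03', Apr => '04',
    May => '05', Jun => '06', Jul => '07', Aug => '08',
    Sep => '09', Oct => '10', Nov => '11', Dec => '12',
);

sub mailbox_for_date {
    my ($sentdate) = @_;

    # Assumption: the month name appears before the four-digit year.
    if ($sentdate =~ /\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\b.*?\b(\d{4})\b/) {
        return "mailbox-$2-$month_number{$1}.txt";
    }

    # The date did not match the assumed layout; keep such messages together.
    return "mailbox-unknown.txt";
}

print mailbox_for_date("Tue, 14 Apr 1998 10:22:31 -0400"), "\n";
# prints mailbox-1998-04.txt
```

 In the script, one could call such a helper at the point where the
message buffer is printed, and open the returned file name in append
mode for each message instead of opening newmailbox.txt once at the top.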

 The script is provided AS IS; YMMV.

Regards,

AnthonyW

#!/usr/local/bin/perl 

##############################################################################
#
# h2mbx.pl A script to convert hypermail html archives into "mailbox"
# format.
# 
# As is, no warranty
# 
# Usage:  
#     ./scriptname hypermail-html-filenames*.html 
#     cat hypermail-html-filename*.html | ./scriptname
#
##############################################################################
#
# General concept:
# 
#   This is an exercise in parsing a file that has logical sections.
# Find out if one is in a particular logical section and act accordingly.
# 
# This script was written against a hypermail generated html file
# which has the following structure:
#
#       <!-- received="date-time-stamp" -->
#       .... information to extract ...
#       <!-- body="start" -->
#       .... information to extract ...
#       <the first blank line>
#       .... message text ...
#       <!-- body="end" -->
#       .... information to ignore ...
#
# If your hypermail pages have slightly different structure, modify the script
# according to the structure you have in place.
#

       
$filebegin = "false" ; 

# Open our output file in append mode; stop if it cannot be opened
open (OUTFILE, ">>newmailbox.txt") or die "Cannot open newmailbox.txt: $!\n";

while (<>)
{ 
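    # Decode entities here only outside the message body. Body lines are
    # decoded further down, after the HTML tags have been stripped, so
    # that escaped text such as &lt;b&gt; is not mistaken for markup.
    if ($isinbody ne "true")
    {
               s/\&gt\;/>/g;        # decode > 
               s/\&lt\;/</g;        # decode <
               s/\&amp\;/&/g;       # decode & (last, so &amp;lt; yields &lt;)
    }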
    #
    # Find out if we are entering a new Start section
    #
 
    if (/\<\!--\ received\=\"/)
    { 
       print OUTFILE @body;         # Print the current message buffer 

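       # reset our flags

       $isinheaders = "false";
       $isinbody = "false";
       $filebegin = "true" ;   
       $isintail = "false"; 
       @body = ();            

       # forget the previous message's fields so that a message lacking
       # one of them does not inherit a stale value
       $sentdate = $personalname = $from = $subject = "";
       $messageid = $inreplyto = "";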
       next;
    }

    if ($filebegin eq "true")
    {
            chomp;    # strip the trailing newline, if there is one

            if (/\<\!--\ sent\=\"/)
            {
               # Collect the sent date 
               s/.*\=\"//g;
               s/\"\ -->.*//g;
               $sentdate = $_;
               next;
            }
            if (/\<\!--\ name\=\"/)
            {
               # Collect the RFC 822 Phrase (Personal name)
               s/.*\=\"//g;
               s/\"\ -->.*//g;
               $personalname = $_;
               next;
            }
            if (/\<\!--\ email\=\"/)
            {
               # Collect the RFC 822 email address
               s/.*\=\"//g;
               s/\"\ -->.*//g;
               $from = $_;
               next;
            }
            if (/\<\!--\ subject\=\"/)
            {
               # Collect the subject
               s/.*\=\"//g;
               s/\"\ -->.*//g;
               $subject = $_;
               next;
            }
            if (/\<\!--\ id\=\"/)
            {
               # Collect the Message Id 
               s/.*\=\"//g;
               s/\"\ -->.*//g;
               $messageid = $_;
               next;
            }
            if (/\<\!--\ inreplyto\=\"/)
            {
               # Collect the inreplyto field 
               s/.*\=\"//g;
               s/\"\ -->.*//g;
               $inreplyto = $_;
               next;
            }
            if (/\<\!--\ body\=\"start/)
            {
               $isinheaders = "true";
               $filebegin = "false";
               next;
            }
     }
     if ( $isinheaders =~ /true/ )
     {
            chomp;    # strip the trailing newline, if there is one

            if (/^$/) # look out for that first blank line.
            {
               $isinheaders = "false";
               $isinbody = "true";
               next;
            }
            if (/^To:/)
            {
               # Collect the to addressees
               s/^To://;
               $to_addr = $_;
 
               #
               # Now that we have the to_addr and since it is the last 
               # item in the headers this means that we can now build
               # the mbox style headers.
               #

               push (@body, "\nFrom $from $sentdate\n" ) ;

               if ($messageid ne "") {
                 push (@body, "Message-id: <$messageid>\n" ) ;
               }

               push (@body, "Date: $sentdate\n");

               if ($personalname  =~ /\@/) {
                 push (@body, "From: $from\n");
               } else {
                 push (@body, "From: $personalname <$from>\n");
               }

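               push (@body, "To:$to_addr\n" );
               push (@body, "Subject: $subject\n" );
               if ($inreplyto ne "") {
                 push (@body, "In-Reply-to: <$inreplyto>\n" ) ;
               }
               # A blank line must always end the headers, whether or not
               # an In-Reply-to header was written.
               push (@body, "\n" ) ;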
               next;
            }
     }
     if ($isinbody =~ /true/ )
     {

               if (/\<\!--\ body\=\"end\"\ --\>/)
               {
                  $isintail = "true" ; 
                  next; 
               }
               next if (/\<h1\>\<center\>/);   
               next if (/\<\/center\>/);   
               next if ( $isintail =~ /true/) ;

               # Replace each link with its link text
               s/\<a\ href\=\"[^"]*\"\>(.*?)\<\/a\>/$1/g;

               s/\<pre>//g;         # remove pre
               s/\<\/pre>//g;       
               s/\<i>//g;           # remove italics
               s/\<\/i>//g;
               s/\<br\>//g;         # remove linebreaks
               s/\<b>//g;           # remove bolds
               s/\<\/b>//g;
               s/\<hr[^>]*>//g;     # hr's
               s/\<p\>//g;          # remove paragraph tags
               s/\&gt\;/>/g;        # decode > 
               s/\&lt\;/</g;        # decode <
               s/\&amp\;/&/g;       # decode & (last, so &amp;lt; yields &lt;)
               
               s/^From\ />From\ /g; # Watch out for forwarded or quoted mail.

               # Collect the current line

               push (@body, $_ ) ;
     }
}
       
print OUTFILE @body;                # Print the last message buffer
close (OUTFILE) or die "Cannot write newmailbox.txt: $!\n";

print "Processing complete\n";

exit;