Hello,
Others on this list have asked for a way to convert their old
Hypermail archives to MHonArc-style archives when the original mailbox
source is no longer available. Attached is a Perl script that converts
HTML files generated by Hypermail into a "mailbox" format. The result is
not a true BSD-style mailbox, but MHonArc (V2.1.0 and V2.2.0) parses it
without complaint.
See http://www.albany.net/~anthonyw/archivedemo/ for an example which
uses some recent posts to the mhonarc mailing list.
When dealing with archives that span a number of months, one may modify
the script to write to more than one file. The month can be taken from
the $sentdate variable and used to decide which file each message should
be written to.
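A minimal sketch of that idea follows. The helper name mailbox_for_date
and the "mailbox-YYYY-MM.txt" naming scheme are hypothetical and not part
of the attached script. The sketch also assumes the sent date contains an
English month abbreviation followed by a four-digit year (for example
"Tue, 10 Mar 1998 09:15:00 -0500"); other date layouts fall through to a
catch-all file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map month abbreviations to two-digit month numbers.
my %month_num = (
    Jan => '01', Feb => '02', Mar => '03', Apr => '04',
    May => '05', Jun => '06', Jul => '07', Aug => '08',
    Sep => '09', Oct => '10', Nov => '11', Dec => '12',
);

# Hypothetical helper: pick an output file name from a sent date.
# Assumption: the date looks like "Tue, 10 Mar 1998 09:15:00 -0500",
# i.e. a month abbreviation followed by a four-digit year.
sub mailbox_for_date {
    my ($sentdate) = @_;
    if ($sentdate =~ /\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*\s+(\d{4})\b/) {
        return "mailbox-$2-$month_num{$1}.txt";
    }
    return "mailbox-unknown.txt";    # date could not be parsed
}

# Example: prints "mailbox-1998-03.txt"
print mailbox_for_date("Tue, 10 Mar 1998 09:15:00 -0500"), "\n";
```

In the script, one could then open the file returned for $sentdate in
append mode each time a message buffer is flushed, instead of always
printing to the single OUTFILE handle.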
The script is provided AS IS; YMMV.
Regards,
AnthonyW
#!/usr/local/bin/perl
##############################################################################
#
# h2mbx.pl A script to convert hypermail html archives into "mailbox"
# format.
#
# As is, no warranty
#
# Usage:
# ./scriptname hypermail-html-filenames*.html
# cat hypermail-html-filename*.html |./scriptname
#
##############################################################################
#
# General concept:
#
# This is an exercise in parsing a file that has logical sections.
# Find out if one is in a particular logical section and act accordingly.
#
# This script was written against a hypermail generated html file
# which has the following structure:
#
# <!-- received="date-time-stamp" -->
# .... information to extract ...
# <!-- body="start" -->
# .... information to extract ...
# <the first blank line>
# .... message text ...
# <!-- body="end" -->
# .... information to ignore ...
#
# If your hypermail pages have slightly different structure, modify the script
# according to the structure you have in place.
#
$filebegin = "false" ;
# Open and append to our output file; stop if it cannot be opened
open (OUTFILE, ">>newmailbox.txt") || die "Cannot open newmailbox.txt: $!\n";
while (<>)
{
s/\&gt\;/>/g; # decode &gt;
s/\&lt\;/</g; # decode &lt;
#
# Find out if we are entering a new Start section
#
if (/\<\!--\ received\=\"/)
{
print OUTFILE @body; # Print the current message buffer
# reset our flags
$isinheaders = "false";
$isinbody = "false";
$filebegin = "true" ;
$isintail = "false";
@body = ();
next;
}
if ($filebegin eq "true")
{
chomp; # strip the trailing newline only
if (/\<\!--\ sent\=\"/)
{
# Collect the sent date
s/.*\=\"//g;
s/\"\ -->.*//g;
$sentdate = $_;
next;
}
if (/\<\!--\ name\=\"/)
{
# Collect the RFC 822 Phrase (Personal name)
s/.*\=\"//g;
s/\"\ -->.*//g;
$personalname = $_;
next;
}
if (/\<\!--\ email\=\"/)
{
# Collect the RFC 822 email address
s/.*\=\"//g;
s/\"\ -->.*//g;
$from = $_;
next;
}
if (/\<\!--\ subject\=\"/)
{
# Collect the subject
s/.*\=\"//g;
s/\"\ -->.*//g;
$subject = $_;
next;
}
if (/\<\!--\ id\=\"/)
{
# Collect the Message Id
s/.*\=\"//g;
s/\"\ -->.*//g;
$messageid = $_;
next;
}
if (/\<\!--\ inreplyto\=\"/)
{
# Collect the inreplyto field
s/.*\=\"//g;
s/\"\ -->.*//g;
$inreplyto = $_;
next;
}
if (/\<\!--\ body\=\"start/)
{
$isinheaders = "true";
$filebegin = "false";
next;
}
}
if ( $isinheaders =~ /true/ )
{
chomp; # strip the trailing newline only
if (/^$/) # look out for that first blank line.
{
$isinheaders = "false";
$isinbody = "true";
next;
}
if (/^To:/)
{
# Collect the to addressees
s/To://g;
$to_addr = $_;
#
# Now that we have the to_addr and since it is the last
# item in the headers this means that we can now build
# the mbox style headers.
#
push (@body, "\nFrom $from $sentdate\n" ) ;
if ($messageid ne "") {
push (@body, "Message-id: <$messageid>\n" ) ;
}
push (@body, "Date: $sentdate\n");
if ($personalname =~ /\@/) {
push (@body, "From: $from\n");
} else {
push (@body, "From: $personalname <$from>\n");
}
push (@body, "To:$to_addr\n" );
push (@body, "Subject: $subject\n" );
if ($inreplyto ne "") {
push (@body, "In-Reply-to: <$inreplyto>\n" ) ;
}
push (@body, "\n" ) ; # a blank line always ends the headers
next;
}
}
if ($isinbody =~ /true/ )
{
if (/\<\!--\ body\=\"end\"\ --\>/)
{
$isintail = "true" ;
next;
}
next if (/\<h1\>\<center\>/);
next if (/\<\/center\>/);
next if ( $isintail =~ /true/) ;
# Replace each link with its link text
s/\<a\ href\=\"(.*?)\"\>(.*?)\<\/a\>/$2/g;
s/\<pre>//g; # remove pre
s/\<\/pre>//g;
s/\<i>//g; # remove italics
s/\<\/i>//g;
s/\<br\>//g; # remove linebreaks
s/\<b>//g; # remove bolds
s/\<\/b>//g;
s/\<hr.*>//g; # hr's
s/\&gt\;/>/g; # decode &gt;
s/\&lt\;/</g; # decode &lt;
s/\<p\>//g; # remove <p> tags
s/^From\ />From\ /g; # Watch out for forwarded or quoted mail.
# Collect the current line
push (@body, $_ ) ;
}
}
print OUTFILE @body;
print "Processing complete\n";
exit;