(Resent with scripts attached in a zip file to maintain formatting...)

Don't know about anyone else, but I've found that there's a lot of good
information in some Yahoo Groups. However, I'm pretty frustrated by
their search engine, so I thought it would be best to snarf a copy of
all the messages and use nmh to view everything. I found a utility
called "grabyahoogroup" on SourceForge and sucked all the messages from
a group into a folder in my nmh directory. (The regular expressions
needed a bit of tweaking, but I got it on the third try and messages
started showing up.) So far, so good.

However, Yahoo seems to strip the whitespace from the front of header
continuation lines, and nmh doesn't handle that properly. When I ran
scan on the newly downloaded files, I got bogus dates, no From field,
and no subject line. I started to look at m_getfld.c, but got impatient
(laziness, impatience, and hubris, right?) and decided to slap something
together outside of nmh. Here's what I came up with...

First, this script just prints all continuation lines in the header of
each file in the current directory:
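
printheader.py
--------------
#!/usr/bin/env python
import glob
import re

# Message files are named by number; process them in numeric order
filelist = glob.glob("[0-9]*")
filelist.sort(key=int)

# Messages start with "From "
fromfield = re.compile('^From ')
# Header keywords start with a capital letter and end with a colon
headerfield = re.compile('^[A-Z][A-Za-z_-]*?:')

for file in filelist:
    body = False
    infile = open(file, "r")
    for line in infile:
        if body:
            # Past the header; nothing more to report for this file
            pass
        elif line.rstrip() == "":
            # The first blank line ends the header
            body = True
        else:
            # Anything in the header that is neither the "From " line
            # nor a "Keyword:" line is a continuation line
            if not fromfield.match(line) and not headerfield.match(line):
                print("%s   %s" % (file, line.rstrip()))
    infile.close()
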
Second, this script consolidates header lines:
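
modheader.py
------------
#!/usr/bin/env python
import glob
import re

# Message files are named by number; process them in numeric order
filelist = glob.glob("[0-9]*")
filelist.sort(key=int)

# Messages start with "From "
fromfield = re.compile('^From ')
# Header keywords start with a capital letter and end with a colon
headerfield = re.compile('^[A-Z][A-Za-z_-]*?:')

for file in filelist:
    body = False
    infile = open(file, "r")
    for line in infile:
        if body:
            # Past the header; nothing more to report for this file
            pass
        elif line.rstrip() == "":
            # The first blank line ends the header
            body = True
        else:
            # Anything in the header that is neither the "From " line
            # nor a "Keyword:" line is a continuation line
            if not fromfield.match(line) and not headerfield.match(line):
                print("%s   %s" % (file, line.rstrip()))
    infile.close()
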
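As listed, modheader.py only reports the continuation lines, the same as
printheader.py. For illustration, here is a minimal sketch of what the
consolidation step could look like. It is not the original script: the
function name consolidate_header is made up for this example, and joining
with a single space is an assumption about how the pieces should be glued
together. It reuses the two patterns from the scripts above.

```python
import re

# Same patterns as in the scripts above
fromfield = re.compile('^From ')
headerfield = re.compile('^[A-Z][A-Za-z_-]*?:')


def consolidate_header(lines):
    """Join header continuation lines onto the preceding header line.

    `lines` is a list of strings with line endings already removed.
    Everything from the first blank line onward is returned unchanged.
    (Hypothetical helper, not part of nmh or grabyahoogroup.)
    """
    out = []
    body = False
    for line in lines:
        if body:
            out.append(line)
        elif line.strip() == "":
            # The first blank line ends the header
            body = True
            out.append(line)
        elif fromfield.match(line) or headerfield.match(line) or not out:
            # A normal header line (or the very first line of the file)
            out.append(line)
        else:
            # Continuation line: glue it onto the previous header line,
            # separated by one space (an assumption, see above)
            out[-1] = out[-1] + " " + line.strip()
    return out


if __name__ == "__main__":
    sample = ["Subject: a long", "subject line", "", "body text"]
    for fixed in consolidate_header(sample):
        print(fixed)
```

A continuation line that happens to look like "Keyword:" will be mistaken
for a new header field, so this is only as reliable as the headerfield
pattern.
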
Of course, the scripts could be made more efficient or combined, and the
second one could insert whitespace instead of concatenating, check line
length, handle temp files more gracefully, etc. Since I only needed to do
this conversion once, it wasn't worth a lot of time...

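To make two of those ideas concrete (inserting whitespace rather than
concatenating, and handling the temp file more carefully), here is a rough
sketch written for Python 3. The name refold_file is hypothetical, and the
choice of a tab as the restored leading whitespace is an assumption; any
space or tab should mark the line as a continuation. The file is read and
written as bytes so that odd character sets in old messages pass through
untouched.

```python
import os
import re
import shutil
import tempfile

# Byte-string versions of the patterns used in the scripts above
fromfield = re.compile(b'^From ')
headerfield = re.compile(b'^[A-Z][A-Za-z_-]*?:')


def refold_file(path):
    """Rewrite `path` so header continuation lines start with a tab.

    Hypothetical helper: the new contents go to a temp file in the same
    directory, which then replaces the original in one step.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmppath = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as outfile, open(path, "rb") as infile:
            body = False
            seen_header = False
            for line in infile:
                if not body:
                    if line.strip() == b"":
                        # The first blank line ends the header
                        body = True
                    elif (seen_header
                          and line[:1] not in (b" ", b"\t")
                          and not fromfield.match(line)
                          and not headerfield.match(line)):
                        # Continuation line that lost its leading
                        # whitespace: put a tab back in front of it
                        line = b"\t" + line
                    seen_header = True
                outfile.write(line)
        # Keep the original permission bits, then swap the files
        shutil.copymode(path, tmppath)
        os.replace(tmppath, path)
    except Exception:
        # Don't leave a stray temp file in the folder on failure
        os.remove(tmppath)
        raise
```

Run over a folder, that would be something like calling refold_file(name)
for each name in glob.glob("[0-9]*"). Line-length checking is left out.
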
Is this a general enough problem that there could be a need to do this
kind of thing within nmh?

Regards,
Doug