[Nmh-workers] More robust header parsing...? Yahoo groups problems. Header dump and mod utilities... (Resent with attachment.)

(Resent with scripts attached in a zip file to maintain formatting...)

Don't know about anyone else, but I've found that there's a lot of good
information in some Yahoo Groups. However, I'm pretty frustrated by
their search engine, so I thought it would be best to snarf a copy of
all the messages and use nmh to view everything. I found a utility
called "grabyahoogroup" on SourceForge and sucked all the messages from
a group into a folder in my nmh directory. (The regular expressions
needed a bit of tweaking, but I got it on the third try and messages
started showing up.) So far, so good.

However, Yahoo seems to strip the whitespace from the front of header
continuation lines, and nmh doesn't handle that properly. When I ran
scan on the newly downloaded files, I got bogus dates, no from field,
and no subject line. I started to look at m_getfld.c, but got impatient
(laziness, impatience and hubris, right?) and decided to slap something
together outside of nmh. Here's what I came up with...
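
For reference, RFC 5322 header folding expects every continuation line to begin with a space or tab. The following is a minimal sketch of a strict unfolder, not a description of how m_getfld.c works; the function name `unfold_strict` and the sample header lines are made up for illustration. It shows how a continuation line that has lost its leading whitespace stops being attached to its header:

```python
def unfold_strict(lines):
    """Join continuation lines (those starting with a space or tab)
    onto the preceding header line, as RFC 5322 folding expects."""
    fields = []
    for line in lines:
        if line[:1] in (" ", "\t") and fields:
            # Proper continuation: append to the previous header field
            fields[-1] += " " + line.strip()
        else:
            # Anything else is treated as the start of a new field
            fields.append(line.rstrip())
    return fields

# Hypothetical sample: a properly folded header, then the same header
# with the leading whitespace stripped from the continuation line.
folded = ["Subject: More robust", "  header parsing", "From: someone@example.com"]
stripped = [line.lstrip() for line in folded]

print(unfold_strict(folded))    # two fields, Subject is complete
print(unfold_strict(stripped))  # three entries, "header parsing" is orphaned
```

With the whitespace gone, "header parsing" is no longer recognizable as part of the Subject field by this rule alone, which is why a repair step has to guess based on what a header keyword looks like.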

First, this script just prints all continuation lines in the header of
each file in the current directory:

printheader.py
--------------
#!/usr/bin/env python

import glob
import re

# nmh message files are named by number; sort them numerically
filelist = glob.glob("[0-9]*")
filelist.sort(key=int)

# Messages start with "From "
fromfield = re.compile('^From ')
# Header keywords start with a capital letter and end with a colon
headerfield = re.compile('^[A-Z][A-Za-z_-]*?:')

for msgfile in filelist:
  infile = open(msgfile, "r")
  for line in infile:
    if line.rstrip() == "":
      # A blank line ends the header; skip the body
      break
    if not fromfield.match(line) and not headerfield.match(line):
      # Neither "From " nor a keyword line, so it is a continuation line
      print("%s   %s" % (msgfile, line.rstrip()))
  infile.close()


Second, this script consolidates header lines:

modheader.py
------------
#!/usr/bin/env python

import glob
import re

# nmh message files are named by number; sort them numerically
filelist = glob.glob("[0-9]*")
filelist.sort(key=int)

# Messages start with "From "
fromfield = re.compile('^From ')
# Header keywords start with a capital letter and end with a colon
headerfield = re.compile('^[A-Z][A-Za-z_-]*?:')

for msgfile in filelist:
  infile = open(msgfile, "r")
  for line in infile:
    if line.rstrip() == "":
      # A blank line ends the header; skip the body
      break
    if not fromfield.match(line) and not headerfield.match(line):
      # Neither "From " nor a keyword line, so it is a continuation line
      print("%s   %s" % (msgfile, line.rstrip()))
  infile.close()
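
The listing above only reports continuation lines. A minimal sketch of the consolidation step itself might look like the following. It assumes that any header line that is neither a "From " line nor a keyword line belongs to the header line before it. The function name `consolidate` is hypothetical, and joining with a single space is a choice made here for readability rather than what the attached script necessarily does.

```python
import re

# Same patterns as the scripts above
fromfield = re.compile('^From ')
headerfield = re.compile('^[A-Z][A-Za-z_-]*?:')

def consolidate(lines):
    """Return the message lines with header continuation lines appended
    to the header line they follow. Body lines pass through unchanged."""
    out = []
    body = False
    for line in lines:
        if body:
            out.append(line)
        elif line.rstrip() == "":
            # First blank line: end of header, start of body
            body = True
            out.append(line)
        elif out and not fromfield.match(line) and not headerfield.match(line):
            # Continuation: glue onto the previous header with one space
            out[-1] = out[-1].rstrip() + " " + line.strip() + "\n"
        else:
            out.append(line)
    return out

# Hypothetical sample message with a stripped continuation line
msg = ["Subject: a long\n", "subject line\n", "\n", "body text\n"]
print("".join(consolidate(msg)))
```

Working on a list of lines keeps the logic separate from file handling, so the same function could be applied to each file named by the glob in the scripts above.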

Of course, the scripts could be made more efficient, they could be
combined, the second one could insert whitespace instead of
concatenating, check line length, or handle temp files more gracefully,
etc. Since I only needed to do this conversion once, it wasn't worth a
lot of time...
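
As one possible take on two of those improvements (restoring whitespace instead of joining lines, and safer temp file handling), here is a rough sketch. The function name `refold_file` is hypothetical, the choice of a tab as the restored whitespace is an assumption, and latin-1 is used only so that arbitrary bytes survive the round trip. The regular expressions are the ones from the scripts above.

```python
import os
import re
import shutil
import tempfile

fromfield = re.compile('^From ')
headerfield = re.compile('^[A-Z][A-Za-z_-]*?:')

def refold_file(path):
    """Rewrite one message in place, putting a leading tab back on
    header continuation lines. Body lines are copied unchanged."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory so the final rename
    # stays on one filesystem.
    tmp = tempfile.NamedTemporaryFile(
        "w", encoding="latin-1", newline="", dir=directory, delete=False)
    try:
        with tmp, open(path, "r", encoding="latin-1", newline="") as infile:
            body = False
            first = True
            for line in infile:
                if not body:
                    if line.rstrip() == "":
                        body = True
                    elif (not first and not fromfield.match(line)
                          and not headerfield.match(line)):
                        # Continuation line: restore leading whitespace
                        line = "\t" + line.lstrip()
                first = False
                tmp.write(line)
        shutil.copymode(path, tmp.name)  # keep the original permissions
        os.replace(tmp.name, path)       # swap the repaired file into place
    except Exception:
        # Leave the original untouched and clean up the temp file
        if os.path.exists(tmp.name):
            os.remove(tmp.name)
        raise
```

Restoring the whitespace keeps the original line breaks, so line lengths stay as they were, and a failure partway through leaves the original message file as it was.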

Is this a general enough problem that there could be a need to do this
kind of thing within nmh?

Regards,
Doug

_______________________________________________
Nmh-workers mailing list
Nmh-workers(_at_)nongnu(_dot_)org
https://lists.nongnu.org/mailman/listinfo/nmh-workers
