nmh-workers

Re: [Nmh-workers] Thoughts: header/address parsing

2014-08-03 14:01:36
Ken Hornstein <kenh(_at_)pobox(_dot_)com> writes:
Again, more technical details here.

Address parsing in nmh is kind of a mess.  We still support RFC 733 syntax
"address at host", UUCP stuff, source routing ... a bunch of stuff.  This
should be fixed.

m_getfld() is the handler for generically parsing the headers of an email
message.  Everyone agrees that it pretty much sucks and is overused.
Thankfully the worst part of it (peeking inside of stdio internals) has
been fixed; thanks, David!

I've been thinking about biting the bullet and simply writing a header
parser in flex/bison (I'm assuming flex/bison because those have
features that make this a lot easier to implement; you don't need
either to build from a distribution, because Automake keeps around
the generated C files for the distribution tar file).  But practical
concerns rear their ugly heads again; for one, error recovery is kind of
complicated.  But it occurs to me that maybe I'm trying to bite off more
than I can chew, and maybe I should try breaking this down a bit.  It
occurs to me that there are really five distinct grammars that we should
think about:

- Parsing a sequence of message headers.  This is really what m_getfld()
does now.  This grammar could be pretty simple.  We could use this to
stuff headers inside of the "new" message API, discussed previously.
The headers wouldn't be interpreted yet.

- Parsing an address header.  This is by far the most complicated part
of the parser, but I think just taking the RFC 5322 ABNF and translating
it into a bison grammar shouldn't be too bad.

- Parsing a date header.  We have a lex parser that does this now; it occurs
to me that it should really be a bison grammar, but whatever.  Solvable
problem.

- Parsing a MIME header/param list.  Right now the parser for this is awful;
and I say that as someone who had to add support for parsing out the
RFC 2231 parameter extensions.  I'm not so crazy about blowing all of
that work up, but you know what?  I think it would just be easier
in the long run to deal with it if it was based on bison.

- Parsing a mhbuild directive.  These are kind of like a MIME header, but not
exactly.  The grammar for this is actually pretty weird and picky.  Right
now it's overloaded on the MIME header parser, but it occurs to me that
there's no reason that should be the case.
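
To make the first item a bit more concrete, here is a minimal sketch in C of
splitting a header block into uninterpreted fields, with folded continuation
lines joined back together.  This is not nmh code: struct hdr_field and
next_field() are hypothetical names, the input is assumed to already be in
memory as a NUL-terminated string, and obsolete syntax (for example,
whitespace before the colon) is not handled.  A flex/bison version would look
different; this only shows the shape of the result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: one header field, not yet interpreted. */
struct hdr_field {
    char *name;   /* field name, e.g. "Subject" */
    char *body;   /* unfolded body, leading blanks removed */
};

/*
 * Read one header field starting at *pos and advance *pos past it.
 * Returns 1 and fills in *out on success, 0 at the end of the header
 * block (blank line or end of input), and -1 on a line with no field
 * name/colon or on allocation failure.
 */
int
next_field(const char **pos, struct hdr_field *out)
{
    const char *p, *colon, *end, *src;
    size_t namelen, j = 0;

    assert(pos != NULL && *pos != NULL && out != NULL);
    p = *pos;

    if (p[0] == '\0' || p[0] == '\n' || (p[0] == '\r' && p[1] == '\n'))
        return 0;

    /* The field name runs up to the first colon on this line. */
    for (colon = p; *colon != '\0' && *colon != ':' && *colon != '\n'; colon++)
        ;
    if (*colon != ':' || colon == p)
        return -1;
    namelen = (size_t)(colon - p);

    /* The body ends at the first newline NOT followed by a space or tab. */
    for (end = colon + 1; *end != '\0'; end++) {
        if (*end == '\n' && end[1] != ' ' && end[1] != '\t')
            break;
    }

    /* Sizes come from the input itself; there is no fixed-size buffer. */
    out->name = malloc(namelen + 1);
    out->body = malloc((size_t)(end - colon));
    if (out->name == NULL || out->body == NULL) {
        free(out->name);
        free(out->body);
        return -1;
    }
    memcpy(out->name, p, namelen);
    out->name[namelen] = '\0';

    /* Unfold: drop line breaks, keep the whitespace that followed them. */
    for (src = colon + 1; src < end; src++) {
        if (*src == '\n' || (*src == '\r' && src[1] == '\n'))
            continue;
        if (j == 0 && (*src == ' ' || *src == '\t'))
            continue;   /* skip blanks right after the colon */
        out->body[j++] = *src;
    }
    out->body[j] = '\0';

    *pos = (*end == '\n') ? end + 1 : end;
    return 1;
}
```

A caller would loop on next_field() until it returns 0, hand each name/body
pair to whatever ends up storing headers, and free() both strings when done.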

The other headers ... well, I guess I don't see a reason why we need to parse
them.  If the message-id header doesn't match the RFC 5322 syntax, should
we care?  I say no.

Modern flex/bison implementations can handle multiple parsers in one
program, so that's not an issue.  This would also let us get rid of the
horrible fixed buffer sizes we have now.
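
On the buffer-size point, one possible replacement for a fixed-size array is a
string buffer that grows on demand.  The sketch below is an assumption about
how that could look, not existing nmh code; struct strbuf and strbuf_append()
are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string; initialize with { NULL, 0, 0 }. */
struct strbuf {
    char  *data;  /* NUL-terminated contents, or NULL while empty */
    size_t len;   /* bytes in use, not counting the NUL */
    size_t cap;   /* bytes allocated */
};

/*
 * Append n bytes of s, growing the allocation as needed.
 * Returns 0 on success, -1 on overflow or allocation failure
 * (the buffer is left unchanged in that case).
 */
int
strbuf_append(struct strbuf *sb, const char *s, size_t n)
{
    size_t need, cap;
    char *p;

    assert(sb != NULL && (s != NULL || n == 0));

    if (n > (size_t)-1 - sb->len - 1)
        return -1;                      /* len + n + 1 would overflow */
    need = sb->len + n + 1;

    if (need > sb->cap) {
        /* Double the capacity so repeated appends stay cheap. */
        cap = sb->cap ? sb->cap : 64;
        while (cap < need) {
            if (cap > (size_t)-1 / 2) {
                cap = need;
                break;
            }
            cap *= 2;
        }
        p = realloc(sb->data, cap);
        if (p == NULL)
            return -1;
        sb->data = p;
        sb->cap = cap;
    }

    if (n > 0)
        memcpy(sb->data + sb->len, s, n);
    sb->len += n;
    sb->data[sb->len] = '\0';
    return 0;
}
```

With something like this, a long header body or a large parameter value just
costs more memory rather than being truncated or overflowing; the caller
free()s sb.data when finished.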

Thoughts?  Completely open to ideas here.  I remember people saying that
they had a list of messages that nmh dealt poorly with; it would be nice
to try those out against a hypothetically-new nmh parser.

--Ken

I'm wondering if, in doing this, you might consider a new nmh command that would
parse message headers. I suppose that there are dozens of scripts out there that
do some of this. I'm guessing that they are mostly ad hoc and buggy.




    Norman Shapiro

