[Nmh-workers] Thoughts: header/address parsing

2014-08-02 20:51:42
Again, more technical details here.

Address parsing in nmh is kind of a mess.  We still support RFC 733 syntax
"address at host", UUCP stuff, source routing ... a bunch of stuff.  This
should be fixed.

m_getfld() is the handler for generically parsing the headers of an email
message.  Everyone agrees that it pretty much sucks and is overused.
Thankfully the worst part of it (peeking inside of stdio internals) has
been fixed; thanks, David!

I've been thinking about biting the bullet and simply writing a header
parser in flex/bison (I'm assuming flex/bison because those have
features that make this a lot easier to implement; you don't need
either to build from a distribution, because Automake keeps around
the generated C files for the distribution tar file).  But practical
concerns rear their ugly heads again; for one, error recovery is kind of
complicated.  But it occurs to me that maybe I'm trying to bite off more
than I can chew, and maybe I should try breaking this down a bit.  It
occurs to me that there are really five distinct grammars that we should
think about:

- Parsing a sequence of message headers.  This is really what m_getfld()
  does now.  This grammar could be pretty simple.  We could use this to
  stuff headers inside of the "new" message API, discussed previously.
  The headers wouldn't be interpreted yet.

- Parsing an address header.  This is by far the most complicated part
  of the parser, but I think just taking the RFC 5322 ABNF and translating
  it into a bison grammar shouldn't be too bad.

- Parsing a date header.  We have a lex parser that does this now; it occurs
  to me that it should really be a bison grammar, but whatever.  Solvable
  problem.

- Parsing a MIME header/param list.  Right now the parser for this is awful;
  and I say that as someone who had to add support for parsing out the
  RFC 2231 parameter extensions.  I'm not so crazy about blowing all of
  that work up, but you know what?  I think it would just be easier
  in the long run to deal with it if it was based on bison.

- Parsing a mhbuild directive.  These are kind of like a MIME header, but not
  exactly.  The grammar for this is actually pretty weird and picky.  Right
  now it's overloaded on the MIME header parser, but it occurs to me that
  there's no reason that should be the case.
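
To make the first item above a bit more concrete, here is a rough,
hand-written C sketch of what "parse a sequence of headers without
interpreting them" could produce.  This is not the flex/bison version
and not existing nmh code: struct hdr_field, struct hdr_list,
parse_headers() and free_headers() are all hypothetical names, and the
sketch assumes LF-only line endings and a NUL-terminated message.  The
point is only that everything is heap-allocated, so no fixed buffer
sizes are involved.

```c
#include <stdlib.h>
#include <string.h>

/* One uninterpreted header field (hypothetical type, not nmh API). */
struct hdr_field {
    char *name;   /* text before the colon */
    char *body;   /* raw text after the colon, folded lines kept as-is */
};

struct hdr_list {
    struct hdr_field *fields;
    size_t count;
};

/* Copy n bytes starting at s into a fresh NUL-terminated string. */
static char *dup_range(const char *s, size_t n)
{
    char *p = malloc(n + 1);
    if (p == NULL)
        return NULL;
    memcpy(p, s, n);
    p[n] = '\0';
    return p;
}

/* Release everything held by the list and reset it to empty. */
void free_headers(struct hdr_list *l)
{
    for (size_t i = 0; i < l->count; i++) {
        free(l->fields[i].name);
        free(l->fields[i].body);
    }
    free(l->fields);
    l->fields = NULL;
    l->count = 0;
}

/*
 * Split the header block at the start of msg into name/body pairs.
 * Stops at the first empty line or at end of input.  Returns 0 on
 * success, -1 on a line with no field name or on allocation failure.
 */
int parse_headers(const char *msg, struct hdr_list *out)
{
    const char *p = msg;

    out->fields = NULL;
    out->count = 0;

    while (*p != '\0' && *p != '\n') {
        const char *colon = p + strcspn(p, ":\n");
        if (*colon != ':' || colon == p)
            goto fail;

        /* The body ends at a newline NOT followed by a space or tab. */
        const char *end = colon + 1;
        for (;;) {
            end += strcspn(end, "\n");
            if (*end == '\0' || (end[1] != ' ' && end[1] != '\t'))
                break;
            end++;              /* continuation line: keep scanning */
        }

        /* Skip leading whitespace in the body. */
        const char *body = colon + 1;
        while (body < end && (*body == ' ' || *body == '\t'))
            body++;

        struct hdr_field *grown =
            realloc(out->fields, (out->count + 1) * sizeof *grown);
        if (grown == NULL)
            goto fail;
        out->fields = grown;

        struct hdr_field *f = &grown[out->count];
        f->name = dup_range(p, (size_t)(colon - p));
        f->body = dup_range(body, (size_t)(end - body));
        if (f->name == NULL || f->body == NULL) {
            free(f->name);
            free(f->body);
            goto fail;
        }
        out->count++;

        p = (*end == '\n') ? end + 1 : end;
    }
    return 0;

fail:
    free_headers(out);
    return -1;
}
```

A caller would hand this the start of a message, walk out->fields, and
pass only the address, date and MIME headers on to the more specific
parsers.  Error recovery here is deliberately crude (the whole block is
rejected on the first bad line); a real parser would probably want to
skip the bad line and keep going.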

The other headers ... well, I guess I don't see a reason why we need to parse
them.  If the message-id header doesn't match the RFC 5322 syntax, should
we care?  I say no.

Modern flex/bison implementations can handle multiple parsers in one
program, so that's not an issue.  This would also let us get rid of the
horrible fixed buffer sizes we have now.

Thoughts?  Completely open to ideas here.  I remember people saying that
they had a list of messages that nmh dealt poorly with; it would be nice
to try those out against a hypothetically-new nmh parser.

--Ken
