ietf-822

Re: RFC validation samples?

2004-08-05 19:04:20

On Aug 05 2004, Bruce Lilly wrote:

If you are going to classify message content, you'll need to be able
to parse the message body in addition to the header.  For MIME
messages, you'll also need to be able to handle MIME-part headers,
boundary delimiters, and body sections, including base64 and
quoted-printable encoded body content.

That is correct. In my reply to Keith Moore, I explained that I
already do this kind of thing, not perfectly, but sufficiently well for
statistical analysis, I think. At this point, I'd like to concentrate
on header parsing and data mining, which is where a lot of interesting
information is buried.
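
For readers following along, the MIME handling described above (part headers, boundary delimiters, base64 and quoted-printable bodies) can be sketched with Python's standard `email` package. This is only an illustration, not the parser either poster wrote; the sample message and the helper name `decoded_text_parts` are made up for the example.

```python
from email import message_from_string, policy

# Hypothetical two-part sample message; not taken from any RFC.
RAW = "\n".join([
    "From: alice@example.org",
    "To: bob@example.org",
    "Subject: MIME sample",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="XYZ"',
    "",
    "preamble, ignored by MIME readers",
    "--XYZ",
    'Content-Type: text/plain; charset="us-ascii"',
    "Content-Transfer-Encoding: base64",
    "",
    "aGVsbG8gd29ybGQ=",
    "--XYZ",
    'Content-Type: text/plain; charset="utf-8"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "caf=C3=A9",
    "--XYZ--",
    "",
])


def decoded_text_parts(raw):
    """Return (content type, transfer encoding, decoded text) per text part."""
    msg = message_from_string(raw, policy=policy.default)
    parts = []
    for part in msg.walk():
        # Multipart containers only hold other parts; skip them.
        if part.is_multipart():
            continue
        if part.get_content_maintype() == "text":
            cte = str(part.get("Content-Transfer-Encoding", "7bit")).lower()
            # get_content() undoes the transfer encoding and the charset.
            parts.append((part.get_content_type(), cte, part.get_content()))
    return parts


if __name__ == "__main__":
    for ctype, cte, text in decoded_text_parts(RAW):
        print(ctype, cte, repr(text))
```

The decoded text is what a statistical classifier would see; the part headers stay available on each `part` object for separate header mining.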

There is an open-source validating parser available at
http://users.erols.com/blilly/mparse.  It may do what you want,
and if not, the package does contain several dozen test messages,
many of them taken directly from the relevant RFCs (with
correction per the errata in most cases).

Argh! I could have used this if I hadn't written most of my parser already! 
This is very nice work! Is your code/library thread-safe? It does much more
than I need, but looks extremely complete. I shall certainly play with those
test messages.


I'm really a newbie on mail headers, but I've come across some
discrepancies in the RFC2821/RFC2822 grammars which led me to this
group.

There are indeed a few differences; some are because RFC 2821 is
SMTP-specific whereas RFC 2822 is intended to be a general message
format (so, for example, RFC 2822 has a very loose definition of
"domain-literal" and of the Received header field).  A very few
are genuine incompatibilities (likely to be corrected in the next
revisions, due soon).

I believe the easiest (also most maintainable) approach for me is to 
parse each line several times, according to "pure" 2822, 2821, etc.

The incompatibilities are actually a possible "feature".  For
instance, would it be reasonable to expect that a message, none of
whose Received: lines follow 2821 correctly, hasn't passed through an
internet SMTP server? I.e. I'm assuming that the majority of internet
SMTP servers are RFC 821/2821 compliant and follow the grammars therein
correctly.
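
A rough Python sketch of this "parse each line against several grammars" idea, applied to Received: values, might look like the following. The two matchers are deliberately loose approximations of the RFC 2822 name-value form and the RFC 2821 "from ... by ..." stamp, not the full ABNF; quoted strings and values containing spaces are not handled, and all function names are hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

ITEM_NAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*")
OPT_ORDER = ["via", "with", "id", "for"]  # optional 2821 clauses, in order


def strip_comments(text):
    """Remove parenthesized comments, innermost first (ignores quoting)."""
    prev = None
    while prev != text:
        prev = text
        text = re.sub(r"\([^()]*\)", " ", text)
    return text


def tokenize(value):
    """Split 'stamp ; date-time' and return the stamp tokens, or None."""
    stamp, sep, date = strip_comments(value).rpartition(";")
    if not sep:
        return None
    try:
        parsedate_to_datetime(date.strip())
    except (TypeError, ValueError):
        return None
    return stamp.split()


def matches_2822(tokens):
    # Loose RFC 2822 form: any sequence of item-name / item-value pairs.
    if len(tokens) % 2:
        return False
    return all(ITEM_NAME.fullmatch(name) for name in tokens[0::2])


def matches_2821(tokens):
    # Loose RFC 2821 form: "from X by Y", then optional clauses in order.
    names = [t.lower() for t in tokens[0::2]]
    if len(tokens) % 2 or names[:2] != ["from", "by"]:
        return False
    remaining = iter(OPT_ORDER)
    return all(name in remaining for name in names[2:])


def classify_received(value):
    """Return the set of grammar labels that this Received value satisfies."""
    tokens = tokenize(value)
    if tokens is None:
        return set()
    labels = set()
    if matches_2822(tokens):
        labels.add("rfc2822")
    if matches_2821(tokens):
        labels.add("rfc2821")
    return labels


def passed_smtp_hint(received_values):
    """Heuristic only: does any Received value look like an RFC 2821 stamp?"""
    return any("rfc2821" in classify_received(v) for v in received_values)
```

Keeping each grammar in its own function means the set of labels per header line can be fed directly to a classifier as features, rather than forcing a single pass/fail verdict.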

There are a few dozen relevant RFCs (depending on what types of
messages you want to be able to handle, at what level of detail,
whether or not you want to be able to handle "old" messages, and
if so, how old, etc.).  Also, don't forget to look at the RFC
Errata page (http://www.rfc-editor.org/errata.html), as there
are some errors in published RFCs.

Good advice, thanks.
-- 
Laird Breyer.

