Re: RFC validation samples?


On Aug 05 2004, Keith Moore wrote:

> My suggestion is that you ask yourself - do you want a parser that
> validates email or do you want a parser that is useful for email
> readers or do you want a parser that can _correct_ malformed email? It
> can be difficult to do more than one of these at the same time.


That's a good point. I guess I should explain my goals more
comprehensively (warning: a bit long-winded).

I already have a statistical text/email classifier which reads mail
messages. Normally, I point it to sets of mbox files and it creates a
word model. If I point it to a single mail message and give it a few
models, it attempts to pick the best model among alternatives.

I've written this in C. It accepts a hodgepodge of RFC-related rules
and decodes HTML and MIME structures. These capabilities have evolved
over time as random improvements. For example, MIME structures are not
decoded recursively; instead, anything that looks like a MIME separator
is considered putative, and if MIME attachment headers follow, then the
appropriate flags are set (e.g. decode QP until the next separator).
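
A minimal sketch of that two-step heuristic, not the classifier's actual
code. The function names are hypothetical, and the rule "a line starting
with two dashes plus at least one more character" is an assumption about
what "looks like a MIME separator" means:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test that s begins with prefix. */
static bool has_prefix_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return false;
        s++;
        prefix++;
    }
    return true;
}

/* Step 1 (assumed rule): "--" followed by at least one character,
   not counting the line ending. */
bool is_putative_separator(const char *line)
{
    assert(line != NULL);
    size_t len = strcspn(line, "\r\n");
    return len > 2 && line[0] == '-' && line[1] == '-';
}

/* Step 2: does a following header line ask for quoted-printable decoding? */
bool announces_quoted_printable(const char *line)
{
    assert(line != NULL);
    const char *name = "content-transfer-encoding:";
    if (!has_prefix_nocase(line, name))
        return false;
    const char *value = line + strlen(name);
    while (*value == ' ' || *value == '\t')
        value++;
    return has_prefix_nocase(value, "quoted-printable");
}
```

Step 1 alone also fires on a signature delimiter such as "-- ", which is
why a separator only counts once attachment headers actually follow it.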

The headers in my code are mostly treated as ordinary text lines, and
parsed for word tokens exactly like body lines. Some miscellaneous
rules are applied, such as decoding non-ASCII encodings, but again
it's rather ad hoc, and probably not fully standards-compliant as a whole.

For statistical analysis, if parsing is correct 90%+ of the time,
that's good enough, in my experience.

So why do I want to parse headers only? There are two types of goals
I'd like to try my hand at, both of which require an actual
understanding of the elements within the headers, hence parsing them
more or less properly (to repeat, so far I simply censor some fields
such as Received: and treat others such as Subject: as ordinary text).

The first type of goal is to data mine headers in a meaningful way, to
improve classification accuracy. This could be as simple as
preprocessing headers and adding a bogus MIME section with some
descriptive comments, which the statistical classifier can pick up on
as ordinary text tokens when it learns/classifies.

This kind of data mining has subtleties, because some types of data are
misleading. For example, the date-time element in some headers is a
bad statistical indicator. If you learn from messages which contain many
"Jul" tokens, then come August the statistical model will start to
wonder where the "Jul" tokens went.
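
One way to blunt that effect, sketched under assumptions: replace month
abbreviations from date-time elements with a neutral placeholder before
the tokens reach the model. The function names and the "MONTH"
placeholder are hypothetical; the twelve abbreviations are the ones the
RFC 2822 date-time syntax uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Month names as they appear in RFC 2822 date-time elements. */
static const char *const MONTHS[] = {
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
};

/* True if token is exactly one of the twelve abbreviations. */
bool is_month_token(const char *token)
{
    assert(token != NULL);
    for (size_t i = 0; i < sizeof MONTHS / sizeof MONTHS[0]; i++) {
        if (strcmp(token, MONTHS[i]) == 0)
            return true;
    }
    return false;
}

/* Returns the placeholder for month tokens, the token itself otherwise. */
const char *normalize_date_token(const char *token)
{
    return is_month_token(token) ? "MONTH" : token;
}
```

This is only meant for tokens taken from date-time elements; applied to
body text it would also swallow the ordinary word "May".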

The second type of goal is to see if there are enough indicators in
the headers to detect forgery and/or trace the message origin. For
example, there are some simple rules which can be used, such as: the
topmost Received: line is always authentic, since it is added by the
recipient's own mail server. I'd like to explore whether there are
others, e.g. can we analyse whether a message passed through the
internet (i.e. at least one compliant SMTP server outside the local
network)? Maybe this can't be detected, but perhaps its negation can
(i.e. this message never left the LAN).
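
As raw input for rules of this kind, a sketch that locates the topmost
Received: field and counts how many there are. The function name
count_received is hypothetical, and the hop count by itself proves
nothing about forgery or about whether the message left the LAN; it is
just the data such a rule would start from.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test that s begins with prefix. */
static bool starts_with_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return false;
        s++;
        prefix++;
    }
    return true;
}

/* Counts Received: fields in a header block and stores a pointer to the
   topmost one in *topmost (NULL if there is none). Scanning stops at the
   blank line that ends the headers. Folded continuation lines begin with
   whitespace, so they are never counted as separate fields. */
size_t count_received(const char *headers, const char **topmost)
{
    assert(headers != NULL && topmost != NULL);
    size_t count = 0;
    const char *line = headers;
    *topmost = NULL;
    while (*line != '\0') {
        if (line[0] == '\n' || (line[0] == '\r' && line[1] == '\n'))
            break;                      /* end of the header block */
        if (starts_with_nocase(line, "received:")) {
            if (count == 0)
                *topmost = line;
            count++;
        }
        const char *eol = strchr(line, '\n');
        if (eol == NULL)
            break;
        line = eol + 1;
    }
    return count;
}
```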

But these are long term goals, and people on this list have likely
thought about these problems very deeply (I would enjoy discussing
these questions on this list). Right now I have a basic parser which
is functional but buggy, which is why I'm looking for test cases.

I started out parsing only 2822 messages, but quickly realized that
2821 is just slightly different, and even so some sample header lines
I have don't appear to comply with either (e.g. in 2821, domains must
contain at least one period, so localhost is not allowed but
localhost.localdomain is). I believe the easiest solution is to parse
each line four times or more, strictly according to 2822, 2821, 822,
821 and special variations. That way the code for each pass can refer
to one existing document, rather than implementing a single complex
combination of these four documents which incorporates many subtleties
and variations.
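
A sketch of the one-checker-per-document idea, using the domain rule
mentioned above as the example. It is deliberately simplified: labels
are restricted to letters, digits and '-', address literals and obsolete
forms are ignored, and only 2821 and 2822 are covered. All names here
(count_labels, GRAMMARS, accepted_by) are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts dot-separated labels made of letters, digits and '-'.
   Returns 0 if a label is empty or contains any other character. */
static size_t count_labels(const char *domain)
{
    size_t labels = 0, label_len = 0;
    for (const char *p = domain; ; p++) {
        if (*p == '.' || *p == '\0') {
            if (label_len == 0)
                return 0;
            labels++;
            label_len = 0;
            if (*p == '\0')
                return labels;
        } else if (isalnum((unsigned char)*p) || *p == '-') {
            label_len++;
        } else {
            return 0;
        }
    }
}

/* RFC 2821 requires at least one period, so at least two labels. */
bool domain_ok_2821(const char *d) { assert(d != NULL); return count_labels(d) >= 2; }

/* RFC 2822 accepts a single label as well. */
bool domain_ok_2822(const char *d) { assert(d != NULL); return count_labels(d) >= 1; }

struct grammar {
    const char *name;
    bool (*domain_ok)(const char *);
};

static const struct grammar GRAMMARS[] = {
    { "RFC 2821", domain_ok_2821 },
    { "RFC 2822", domain_ok_2822 },
};

/* Runs every checker; bit i of the result is set if GRAMMARS[i] accepts. */
unsigned accepted_by(const char *domain)
{
    unsigned mask = 0;
    for (size_t i = 0; i < sizeof GRAMMARS / sizeof GRAMMARS[0]; i++) {
        if (GRAMMARS[i].domain_ok(domain))
            mask |= 1u << i;
    }
    return mask;
}
```

With this layout, localhost is accepted by the 2822 checker only, while
localhost.localdomain passes both, and each checker can be read against
a single document.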

> Or maybe people here could contribute to a list of common email
> format errors.


That would be very useful. If it's appropriate, I'm happy to discuss
on this list any format errors I come across.

-- 
Laird Breyer.