Re: RFC validation samples?


Laird Breyer wrote:

This kind of datamining has subtleties, because some types of data are
misleading. For example, the date-time element in some headers is a
bad statistical indicator. If you learn messages which contain many
"Jul" tokens, then come August the statistical model will start to
wonder where the "Jul" tokens went.


Beware that structured header fields aren't simply "text" (see RFCs 2277
and 1958 for "text" vs. "names"); protocol elements such as month names
are typically registered or otherwise well-defined (Jan, Feb, ... Nov,
Dec are valid month "names"; "Bob" is not) and are almost always case
insensitive (jAn, jAN, jaN, JaN, and jan are all valid month "names").
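
To make the "names" point concrete, here is a minimal Python sketch of a case-insensitive month lookup. The helper name month_number is made up for illustration; the twelve names are the ones in the RFC 2822 date-time grammar.

```python
# The twelve month names from the RFC 2822 date-time grammar.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Lower-cased lookup table so that matching ignores letter case.
_MONTH_INDEX = {name.lower(): i for i, name in enumerate(MONTHS, start=1)}

def month_number(token):
    """Return 1..12 for a valid month name in any letter case, else None."""
    return _MONTH_INDEX.get(token.strip().lower())
```

With this, month_number("jAN") and month_number("JaN") give the same answer, while month_number("Bob") returns None. A statistical tokenizer could use the same lookup to fold all month tokens into one class rather than learning "Jul" as a word.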

The second type of goal is to see if there are enough indicators in
the headers to detect forgery and/or trace the message origin. For
example, there are some simple rules which can be used, such as: the
topmost Received: line is always authentic. I'd like to explore if
there are others, e.g. can we analyse whether a message passed through
the internet (i.e. at least one compliant SMTP server outside the local
network)? Maybe this can't be detected, but perhaps its negation can
(i.e. this message never left the LAN).


In theory, one could examine Received header fields to establish a
consistent set of time stamps and "from" vs. "by" domains. In
practice the Received field is so frequently malformed as to make such
analysis impractical.
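
For what it's worth, the "in theory" check might look roughly like the Python sketch below. It assumes the Received field bodies are already unfolded and given topmost first; the function names and the loose regular expressions are my own illustration, not a conforming parser. Note that the "from" name is only what the client claimed, so a mismatch is a hint, not proof.

```python
import re
from email.utils import mktime_tz, parsedate_tz

# Deliberately loose patterns: take the first "from X" and "by Y" clauses.
_FROM = re.compile(r"\bfrom\s+(\S+)", re.IGNORECASE)
_BY = re.compile(r"\bby\s+(\S+)", re.IGNORECASE)

def parse_received(value):
    """Loosely extract (from_domain, by_domain, utc_seconds) from one field body."""
    clauses, sep, date_part = value.rpartition(";")
    if not sep:  # no ";" at all, so there is no date-time part
        clauses, date_part = value, ""
    m_from = _FROM.search(clauses)
    m_by = _BY.search(clauses)
    parsed = parsedate_tz(date_part.strip())
    try:
        stamp = mktime_tz(parsed) if parsed else None
    except (ValueError, OverflowError):
        stamp = None
    return (m_from.group(1).lower() if m_from else None,
            m_by.group(1).lower() if m_by else None,
            stamp)

def chain_problems(received_values):
    """Return (index, description) pairs for inconsistencies between hops."""
    hops = [parse_received(v) for v in received_values]
    problems = []
    for i, hop in enumerate(hops):
        if None in hop:
            problems.append((i, "missing from, by, or date"))
    for i in range(len(hops) - 1):
        newer, older = hops[i], hops[i + 1]  # topmost field is the newest hop
        if newer[2] is not None and older[2] is not None and newer[2] < older[2]:
            problems.append((i, "timestamp earlier than previous hop"))
        if newer[0] and older[1] and newer[0] != older[1]:
            problems.append((i, "from does not match previous by"))
    return problems
```

On real mail, expect the first loop to fire often, which is exactly the practical obstacle described above.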

I started out parsing only 2822 messages, but quickly realized that
2821 is just slightly different, and even so some sample header lines
I have don't appear to comply with either (e.g. in 2821, domains must
contain at least one period, so localhost is not allowed but
localhost.localdomain is).


RFC 2821 requires fully-qualified domain names, RFC 2822 does not.
In practice an RFC 2822-compliant message might be sent with partially
qualified domain names to an RFC 2476 submission server which qualifies
those domain names before sending via SMTP.
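
As a rough illustration of that qualification step, a submission server could do something like the following. This is a sketch only: the function name and the default_domain parameter are hypothetical, and "contains a period" is used as the test for "already qualified", following the rule quoted above.

```python
def qualify(domain, default_domain):
    """Append default_domain when domain has no period; otherwise return it unchanged."""
    if "." in domain:
        return domain
    return domain + "." + default_domain
```

So qualify("mailhost", "example.org") yields "mailhost.example.org", while a name that already contains a period passes through untouched.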

I believe the easiest solution is to parse each line four times or
more, strictly according to 2822, 2821, 822, 821 and special
variations, as the code can refer to existing documents rather than a
single complex combination of these four documents which incorporates
many subtleties and variations.


If performance and maintainability are not considerations, that might be
practical.  Liberal parsing with RFC-specific detection of issues is
an alternative approach.
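
A minimal Python sketch of that alternative, using a domain name as the example: split it once, liberally, then attach issues tagged with the RFC each one comes from. The function name and the issue wording are invented for illustration. The two label patterns reflect my reading of the RFC 2821 sub-domain rule (letters, digits, hyphens) and the RFC 2822 dot-atom rule (atext characters), and ignore address literals and obsolete syntax.

```python
import re

# RFC 2821 sub-domain: letters, digits, hyphens; no leading or trailing hyphen.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\Z")
# RFC 2822 dot-atom element: one or more atext characters.
_ATEXT_LABEL = re.compile(r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+\Z")

def domain_issues(domain):
    """Split a domain liberally; return (labels, [(rfc, description), ...])."""
    labels = domain.strip().split(".")
    issues = []
    if len(labels) < 2:
        issues.append(("RFC 2821", "domain is not fully qualified"))
    for label in labels:
        if not _LDH_LABEL.match(label):
            issues.append(("RFC 2821", "label %r is not letters, digits, hyphens" % label))
        if not _ATEXT_LABEL.match(label):
            issues.append(("RFC 2822", "label %r is not a valid atom" % label))
    return labels, issues
```

Here "localhost" parses fine but carries one RFC 2821 issue, "localhost.localdomain" carries none, and "my_host.example.org" is flagged under RFC 2821 only. The caller parses once and then decides which RFC's issues matter in the context at hand.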