Re: RFC validation samples?


On Aug 05 2004, Bruce Lilly wrote:

Laird Breyer wrote:

This kind of datamining has subtleties, because some types of data are
misleading. For example, the date-time element in some headers is a
bad statistical indicator. If you learn messages which contain many
"Jul" tokens, then come August the statistical model will start to
wonder where the "Jul" tokens went.


Beware that structured header fields aren't simply "text" (see RFCs 2277
and 1958 for "text" vs. "names"); protocol elements such as month names
are typically registered or otherwise well-defined (Jan, Feb, ... Nov,
Dec are valid month "names", "Bob" is not) and are almost always case
insensitive (jAn, jAN, jaN, JaN, and jan are valid month "names").


There's a funny connection between these types of protocol elements,
spam and statistical learning. 

The statistical learner picks up tokens, whether
true "text" or a defined protocol element, and associates observed
frequency counts. So month names such as Jan, Feb etc. become
important (for the statistical model) solely because of their observed
frequency. 

Coincidentally, the relative frequencies of such tokens over several
emails are normally high because they are machine generated during
transport, and transport often passes through the same small
collection of MTAs close to the destination.

On the other hand, there is much more variety in such tokens at the
spammer end, and some spam sending software may well include "jAn"
instead of "Jan" in a header, for some unknown reason. When this
happens frequently, the statistical learner may associate "jAn" with
spam from this spammer's software, even though the RFC protocol itself does not
distinguish "Jan" from "jAn". 
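
As a minimal sketch of the effect described above (this is not dbacl's
tokenizer; the sample Date lines and the function names are made up for
illustration), counting tokens with and without case folding shows how
"jAn" can become a separate feature:

```python
import re
from collections import Counter

# Hypothetical Date header lines. The "jAn" variant stands in for the
# output of some oddly behaved sending software.
ham_dates = [
    "Date: Mon, 05 Jan 2004 10:00:00 +0000",
    "Date: Tue, 06 Jan 2004 11:30:00 +0000",
]
spam_dates = [
    "Date: Mon, 05 jAn 2004 03:12:00 -0500",
    "Date: Tue, 06 jAn 2004 04:45:00 -0500",
]

def tokens(line, fold_case=False):
    """Split a header line into alphabetic tokens, optionally folding case."""
    found = re.findall(r"[A-Za-z]+", line)
    return [t.lower() for t in found] if fold_case else found

def count_tokens(lines, fold_case=False):
    """Accumulate token frequency counts over a list of header lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokens(line, fold_case))
    return counts

sensitive_ham = count_tokens(ham_dates)
sensitive_spam = count_tokens(spam_dates)
folded_ham = count_tokens(ham_dates, fold_case=True)
folded_spam = count_tokens(spam_dates, fold_case=True)

# Case sensitive: "Jan" is seen only in ham, "jAn" only in spam.
print(sensitive_ham["Jan"], sensitive_spam["Jan"], sensitive_spam["jAn"])  # 2 0 2
# Case folded: the two classes become indistinguishable on this token.
print(folded_ham["jan"], folded_spam["jan"])  # 2 2
```

Folding case follows the protocol's view of the element, but it throws
away exactly the distinction the learner was exploiting.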

Sometimes this case-sensitive distinction is a win, meaning it helps
the statistical classifier make fewer mistakes, and sometimes it
isn't; it's difficult to predict. In the example of the date-time
protocol element, my experience is that including an email's date
stamps in the learning data is on the whole counterproductive.
However, for tracing message originators and path forgeries,
date-time discrepancies are strong indicators of spam.

The second type of goal is to see if there are enough indicators in
the headers to detect forgery and/or trace the message origin. For
example, there are some simple rules which can be used, such as: the
topmost Received: line is always authentic. I'd like to explore if
there are others, e.g. can we analyse whether a message passed through
the internet (i.e. at least one compliant SMTP server outside the local
network)? Maybe this can't be detected, but perhaps its negation can
(i.e. this message never left the LAN).


In theory, one could examine Received header fields to establish a
consistent set of time stamps and "from" vs. "by" domains. In
practice the Received field is so frequently malformed as to make such
analysis impractical.
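
A minimal sketch of the time stamp part of such a check follows. It
assumes the date-time comes after the last ";" in each Received value
and that the values are listed in header order (topmost first). The
function names, the five minute slack for clock skew and the sample
values are all hypothetical. Malformed fields yield "cannot decide"
rather than a verdict, which is where the practical difficulty shows up.

```python
from datetime import timedelta, timezone
from email.utils import parsedate_to_datetime

def received_time(value):
    """Return the time stamp after the last ';' of a Received value, or None."""
    _, sep, stamp = value.rpartition(";")
    if not sep:
        return None
    try:
        parsed = parsedate_to_datetime(stamp.strip())
    except (TypeError, ValueError):
        return None
    # A "-0000" zone gives a naive datetime; treat it as UTC (an assumption)
    # so that all values can be compared with each other.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed

def timestamps_consistent(received_values, slack=timedelta(minutes=5)):
    """True/False if the hops are in plausible time order, None if undecidable."""
    times = [received_time(v) for v in received_values]
    if any(t is None for t in times):
        return None
    # A lower (earlier) hop should not be stamped later than the hop above it.
    return all(lower <= upper + slack for upper, lower in zip(times, times[1:]))

good = [
    "from mx.example.net by mail.example.org; Thu, 05 Aug 2004 10:00:05 +0000",
    "from client.example.com by mx.example.net; Thu, 05 Aug 2004 10:00:01 +0000",
]
suspicious = [
    "from mx.example.net by mail.example.org; Thu, 05 Aug 2004 10:00:05 +0000",
    "from client.example.com by mx.example.net; Fri, 06 Aug 2004 18:00:00 +0000",
]
malformed = ["from somewhere by nowhere (no date at all)"]

print(timestamps_consistent(good))        # True
print(timestamps_consistent(suspicious))  # False
print(timestamps_consistent(malformed))   # None
```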


I believe statistics can somewhat help with this, or at least it's worth
giving it a shot ;-) 

When analysing a large collection of mail with the same destination,
some Received: lines and their idiosyncrasies are more frequent than
others. Suppose you have a test which shows a header was forged, but
this test is only 60% accurate. You may have several such tests, and
statistics can combine them into a preponderance of evidence scheme,
kind of like SpamAssassin already does.
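
To make the "preponderance of evidence" idea concrete, here is a
minimal sketch that combines several weak tests by adding log-odds
(a naive Bayes style combination). It assumes the tests are
independent and symmetric, i.e. each one is right 60% of the time
whether or not the header is forged. That is an assumption made for
the example, not a description of how SpamAssassin or dbacl score
messages, and the function name is hypothetical.

```python
import math

def combine(prior, results, accuracy=0.6):
    """Posterior probability of forgery from independent, symmetric tests.

    prior    -- probability of forgery before looking at any test
    results  -- iterable of booleans, True meaning "this test says forged"
    accuracy -- probability that a single test gives the right answer
    """
    log_odds = math.log(prior / (1 - prior))
    step = math.log(accuracy / (1 - accuracy))
    for says_forged in results:
        # Each test shifts the log-odds by the same amount, up or down.
        log_odds += step if says_forged else -step
    return 1 / (1 + math.exp(-log_odds))

one_test = combine(0.5, [True])
three_tests = combine(0.5, [True, True, True])
split_vote = combine(0.5, [True, False])

print(round(one_test, 3))     # 0.6
print(round(three_tests, 3))  # 0.771
print(round(split_vote, 3))   # 0.5
```

Three agreeing 60% tests already give about 77% under these
assumptions, while two disagreeing tests cancel out.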

But SpamAssassin is network wide: its rules are chosen to work for
many users simultaneously, and because they are published, spammers
study them and find ways to bypass them. I believe that rules based
directly on published RFCs may be more universal and less "noisy",
and of course a statistical learner can combine them in ways specific
to the destination, which makes it much harder for spammers to get through.

I started out parsing only 2822 messages, but quickly realized that
2821 is just slightly different, and even so some sample header lines
I have don't appear to comply with either (e.g. in 2821, domains must
contain at least one period, so localhost is not allowed but
localhost.localdomain is).


RFC 2821 requires fully-qualified domain names, RFC 2822 does not.
In practice an RFC 2822-compliant message might be sent with partially
qualified domain names to an RFC 2476 submission server which qualifies
those domain names before sending via SMTP.


Such a submission server would show up in the trace fields, no?

I believe the easiest solution is to parse each line four times or
more, strictly according to 2822, 2821, 822, 821 and special
variations, as the code can refer to existing documents rather than a
single complex combination of these four documents which incorporates
many subtleties and variations.
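
The "parse it several times" approach could be organised roughly as
below. The two checkers are deliberately simplified stand-ins built
around the localhost example above. They are NOT the real grammars of
any RFC, and the names "2821-like" and "2822-like" are only labels for
this sketch.

```python
import re

# Simplified label pattern: letters, digits and inner hyphens only.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"

def dotted_domain(text):
    """Stand-in for the stricter rule: at least two dot-separated labels."""
    return re.fullmatch(rf"{LABEL}(?:\.{LABEL})+", text) is not None

def any_domain(text):
    """Stand-in for the looser rule: one or more dot-separated labels."""
    return re.fullmatch(rf"{LABEL}(?:\.{LABEL})*", text) is not None

# One strict checker per document; each can be read against its own spec.
CHECKERS = {
    "2821-like": dotted_domain,
    "2822-like": any_domain,
}

def accepted_by(text):
    """Return the names of all checkers that accept the given text."""
    return [name for name, check in CHECKERS.items() if check(text)]

print(accepted_by("localhost"))              # ['2822-like']
print(accepted_by("localhost.localdomain"))  # ['2821-like', '2822-like']
print(accepted_by("bad..name"))              # []
```

The list of accepting checkers is itself a feature that could be
handed to a statistical learner.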


If performance and maintainability are not considerations, that might be
practical.  Liberal parsing with RFC-specific detection of issues is
an alternative approach.
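
For comparison, a minimal sketch of the liberal alternative, using the
month names from earlier in the thread: a single tolerant parser
returns a value plus a list of detected issues. The issue strings and
the function name are hypothetical, chosen only for this example.

```python
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def parse_month(token):
    """Return (month number or None, list of issue strings)."""
    issues = []
    cleaned = token.strip()
    if cleaned != token:
        issues.append("surrounding whitespace")
    for number, name in enumerate(MONTHS, start=1):
        # Accept any capitalisation, but remember when it is unusual.
        if cleaned.lower() == name.lower():
            if cleaned != name:
                issues.append("non-canonical case")
            return number, issues
    issues.append("not a month name")
    return None, issues

print(parse_month("Jan"))  # (1, [])
print(parse_month("jAn"))  # (1, ['non-canonical case'])
print(parse_month("Bob"))  # (None, ['not a month name'])
```

Here the value stays usable for the protocol, while the issue list
keeps the oddities that a learner might still find informative.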


Performance is not a strong worry, at this stage. The code I have in
dbacl is pretty fast, even if I say so myself (it can handle about
100-200 typical mails per second depending on options, on a 500MHz
Pentium 3 Debian box, not counting process startup costs, and based on
classifying thousands of mails sequentially for cross-validation
simulations), and even so it probably spends half of its time parsing
mail bodies.

-- 
Laird Breyer.