ietf-822

Re: RFC validation samples?

2004-08-05 09:15:38

Keith Moore <moore(_at_)cs(_dot_)utk(_dot_)edu> writes:
My suggestion is that you ask yourself - do you want a parser that 
validates email or do you want a parser that is useful for email 
readers or do you want a parser that can _correct_ malformed email? It 
can be difficult to do more than one of these at the same time.

The last two can be combined to some extent.


If you want the latter, you generally need to parse things according to 
RFC 822 rather than 2822, because 822's grammar is simpler and more 
permissive and more representative of what is out there.  And you need 
to look not just at the specifications but also at common kinds of 
errors.  For instance, dates are often malformed (in a wide variety of 
ways), and "." often appears in a phrase before an address.

One way to get an idea of common kinds of errors would be to use a 
strict email syntax checker to validate a large body of stored email 
(say, from various mailing list archives).  Then you could look at the 
discrepancies you find and use that information to write a looser 
parser for use by email readers.  Or maybe people here could contribute 
to a list of common email format errors.
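
A rough sketch of that survey idea, for illustration only. It assumes
Python's standard `email` package as a stand-in for the checker; that
package is a lenient parser that records "defects" rather than a strict
RFC validator, and the function name `tally_defects` is made up here.

```python
import email
from email import policy
from collections import Counter

def tally_defects(raw_messages):
    """Count parser defect types across an iterable of raw message bytes."""
    counts = Counter()
    for raw in raw_messages:
        msg = email.message_from_bytes(raw, policy=policy.default)
        # Structural defects noticed while parsing the message as a whole.
        for defect in msg.defects:
            counts[type(defect).__name__] += 1
        # Header-level defects are attached to each parsed header value.
        for name, value in msg.items():
            for defect in getattr(value, "defects", ()):
                counts[f"{name.lower()}: {type(defect).__name__}"] += 1
    return counts
```

The raw messages could come from stored mailing list archives; sorting
the resulting counts would give a first draft of a list of common
format errors.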

Those are all sensible suggestions. Several widely used Perl modules
(used by SpamAssassin, among others) do this kind of thing. They have
been "trained" (i.e. manually tweaked until they stopped complaining)
on lots of mail and are now very robust.
If you can read Perl, their code would give hints on common problems.

My guess is that very little mail is strictly RFC2822 compliant yet.
(Spam is often less compliant than real mail - this can be used
 as a filter :-)).

The other thing a "useful" library for email readers would need to do 
is co-exist nicely with tools/libraries that understand MIME headers. 
The MIME headers are often mis-formatted.



Keith

I'm new to this list. I've started implementing a mail header
scanner/parser which will eventually be released under the GPL, as
part of a wider mail classification package I'm working on (homepage
on sourceforge: http://dbacl.sourceforge.net).

I'm really a newbie on mail headers, but I've come across some
discrepancies in the RFC2821/RFC2822 grammars which led me to this
group. My current subgoal is to validate each header line separately
according to the four standards 821/822/2821/2822 and any other
relevant ones. So each header line will be marked with all the
standards it satisfies.
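
To illustrate the marking idea, here is a minimal sketch for the Date
header only. The two patterns are deliberately rough approximations:
they mainly capture the year rule (two digits in RFC 822, four or more
in the non-obsolete RFC 2822 syntax) and ignore comments, folding and
the obsolete forms. All names in it are hypothetical.

```python
import re

_MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
_DOW = r"(?:(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s*)?"   # optional day of week
_TIME = r"\d{2}:\d{2}(?::\d{2})?"                    # seconds are optional

# Approximate per-standard patterns; not complete grammars.
DATE_CHECKERS = {
    # 2-digit year; numeric or alphabetic zone (alphabetic part is loose).
    "RFC 822": re.compile(
        rf"{_DOW}\d{{1,2}}\s+{_MONTH}\s+\d{{2}}\s+{_TIME}\s+"
        rf"(?:[+-]\d{{4}}|[A-Z]{{1,3}})"
    ),
    # 4-or-more-digit year; numeric zone only.
    "RFC 2822": re.compile(
        rf"{_DOW}\d{{1,2}}\s+{_MONTH}\s+\d{{4,}}\s+{_TIME}\s+[+-]\d{{4}}"
    ),
}

def standards_for_date(value):
    """Return the names of the standards whose pattern accepts the value."""
    value = value.strip()
    return [name for name, pat in DATE_CHECKERS.items() if pat.fullmatch(value)]

print(standards_for_date("Thu, 5 Aug 04 09:15:38 GMT"))
print(standards_for_date("Thu, 5 Aug 2004 09:15:38 +0000"))
```

A real implementation would need one such checker per header field and
per standard, built from the full grammars rather than from patterns
like these.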

What I would like to know is if there are any publicly available
validation sample messages which I can use to check correctness of my
parser.  Apologies if this has been discussed on the list before; I
have only skimmed the archives. Any other comments and pointers are
welcome.

Regards,
-- 
Laird Breyer.



