
Re: [ietf-smtp] [dispatch] BCP proposal: regular expressions for Internet Mail identifiers

2016-03-29 13:40:28
On 3/28/2016 6:18 AM, John C Klensin wrote:

--On Sunday, March 27, 2016 22:41 -0700 "Murray S. Kucherawy"
<superuser(_at_)gmail(_dot_)com> wrote:

And if what you're really producing is regular expressions
that match anything that the ABNFs in the mail RFCs will
legitimately produce, you might want to do a standards track
document that explicitly updates those documents where those
ABNFs are listed.

That captures my concern about this effort.  Based on prior
experience (including RFC 3696 and even the effort to make
RFCs 2821 and 5321 internally consistent), it is _really_ easy
to express a requirement in two different ways and have them be
_almost_ the same.   That is a problem because different people
will read different docs.

It seems to me that it would be much better to either do this as
an Informational document that is clearly identified as Sean's
opinion about regular expressions that impose the same
requirements as 5321/5322 but that those continue to control or
to do a standards-track document that contains both the regular
expressions and ABNF, makes clear which one is primary, and
updates the syntax requirements of the base specs.

As Dale expressed (thanks!), "BCPs are *standards* not for protocols but for *things that people do*. So in regard to [draft-seantek-mail-regexen], the "thing that people do" is "write code that validates e-mail addresses for further processing". And the point [...] is that people need to write correct code for validating e-mail addresses."

Sean's opinion about regular expressions for Mail Identifiers (email addresses, Message-IDs) is not interesting. If my opinion were all that interesting, I would just publish it on Stack Overflow and call it a day (see SO Questions [46155] and [201323]). What is interesting is the IETF's vetted and (rough)-consensus view on the topic.

This topic is a favorite pet project of programmers. It tends to go:
1) "oh, I know what an email address is! It has dots and alphas and maybe a hyphen" (WRONG),
2) "oh, I'll just read RFC 5322 and roll my own" (also wrong, but in more subtle ways...for one, RFC 5322 has distinct syntax from RFC 5321), or
3) "I'm lazy, let's just copy whatever regex shows up on Google first" (pragmatic, usually not right).

Wouldn't it be better if programmers could uniformly go:
4) "Given my email address recognition problem, I'll just copy the regex from BCP xyz", rather than spending dozens if not hundreds of hours poring over email standards documents and testing them against millions of arcane email address combinations.
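
To make approach 1 concrete, here is a minimal sketch in Python. The pattern NAIVE is a hypothetical stand-in for the "dots and alphas and maybe a hyphen" idea; it is not taken from draft-seantek-mail-regexen. The sample addresses are ones I believe are valid RFC 5321 mailboxes (atext allows digits, "+", "'" and "_", and a quoted-string local part may contain a space):

```python
import re

# Hypothetical "dots, alphas and maybe a hyphen" pattern (approach 1).
# This is an illustration of the mistake, not a recommended regex.
NAIVE = re.compile(r"[A-Za-z.-]+@[A-Za-z.-]+")

# Addresses with syntactically valid RFC 5321 local parts.
valid = [
    "user+tag@example.com",      # "+" is atext
    "o'brien@example.com",       # "'" is atext
    "user_name@example.com",     # "_" is atext
    '"john smith"@example.com',  # quoted-string local part
    "x1@example.com",            # digits are atext
]

# Collect every valid address that the naive pattern refuses.
rejected = [a for a in valid if NAIVE.fullmatch(a) is None]

for address in rejected:
    print("naive pattern rejects valid address:", address)
```

Every sample address is rejected, even though none of them is exotic.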

The current draft-seantek-mail-regexen is pretty clear that it does not attempt to change the Mail standards. If folks want to change those documents, may I suggest a separate Standards Track document that does exactly that.

Just because a document is labeled "BCP" (or, for that matter, "Standards Track") does not mean that every last statement in the document is normative and error-free. Otherwise, the RFC 3280 and RFC 5280 PKIX standards, which say that you are supposed to compare an entire email address case-insensitively (Section of RFC 3280, Section of RFC 5280), would have overridden RFCs 5322, 5321, 2822, 2821, etc. We have an errata process.
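
For contrast, a minimal sketch of a comparison in the spirit of RFC 5321, which treats the local part as case-sensitive and the domain as case-insensitive. The function name same_mailbox is hypothetical, and the sketch assumes ASCII-only domains and does no normalization beyond lowercasing the domain:

```python
def same_mailbox(a: str, b: str) -> bool:
    """Compare two addresses: local part exactly, domain ignoring ASCII case.

    Assumption: both inputs are already syntactically valid addresses.
    Splitting on the last "@" is safe because a domain cannot contain "@".
    """
    local_a, _, domain_a = a.rpartition("@")
    local_b, _, domain_b = b.rpartition("@")
    return local_a == local_b and domain_a.lower() == domain_b.lower()


# Domain case differs only: treated as the same mailbox.
print(same_mailbox("User@Example.COM", "User@example.com"))
# Local-part case differs: treated as different mailboxes.
print(same_mailbox("User@example.com", "user@example.com"))
```

A whole-address case-insensitive comparison would report both pairs as equal, which is the discrepancy noted above.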

Basically if the regular expressions are wrong, they need to be made right. One can complain about problems, or one can fix them.

Turns out that regular expressions and ABNF are homomorphic under certain conditions. As shown in draft-seantek-mail-regexen, "deliverable email addresses" (RFC 5321 + RFC 6531) certainly fall within that definition, as they can be expressed in a regular language (i.e., recognized by a finite state automaton). Therefore, translating between the two is basically computationally verifiable. The results may not look pretty but they will work.

Perhaps a bigger problem is one's view as to how normative ABNF is in the context of IETF standards documents. It is possible to have ABNF that says somename = *(ALPHA / DIGIT) but then have normative text that says that <somename> is limited to 31 characters and MUST start with an alphabetic character. Moreover, some ABNF (RFC 5321 / RFC 5322 in particular) includes "obsolete syntax"; whether to admit such syntax is a highly context-sensitive engineering decision. Addressing all of these points requires rubbing more than two brain cells together.
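
The somename example can be sketched in Python as two patterns: one transcribed from the ABNF alone, and one that also encodes the normative prose (31 characters at most, first character alphabetic). The names ABNF_ONLY, WITH_PROSE and accepts are mine, chosen for illustration:

```python
import re

# somename = *(ALPHA / DIGIT), transcribed literally from the ABNF.
# ALPHA is A-Z / a-z and DIGIT is 0-9 (RFC 5234 core rules).
ABNF_ONLY = re.compile(r"[A-Za-z0-9]*")

# The same rule with the normative text folded in:
# one leading ALPHA, then up to 30 more ALPHA / DIGIT (31 total).
WITH_PROSE = re.compile(r"[A-Za-z][A-Za-z0-9]{0,30}")


def accepts(pattern: re.Pattern, candidate: str) -> bool:
    """Return True when the whole candidate string matches the pattern."""
    return pattern.fullmatch(candidate) is not None


for candidate in ["abc123", "9lives", "", "a" * 31, "a" * 32]:
    print(repr(candidate[:12]), len(candidate),
          accepts(ABNF_ONLY, candidate), accepts(WITH_PROSE, candidate))
```

The two patterns disagree on "9lives", on the empty string, and on the 32-character input, which is exactly the gap between "what the ABNF produces" and "what the specification requires".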

[46155]: [201323]:

Perhaps a BCP that recommends use of strings that are clearly a
proper subset of what the standard allows would be ok, but it
needs to be frightfully clear that it is a recommended subset,
not a requirement.

I am not really interested in subsets, except those subsets driven by the standards themselves. (ASCII-only vs. EAI is a reasonable subset, provided that both expressions are provided. I would rather do EAI-only but we can be pragmatic about that.)

Best regards,


