Re: [ietf-smtp] [dispatch] BCP proposal: regular expressions for Internet Mail identifiers
2016-03-29 13:40:28
On 3/28/2016 6:18 AM, John C Klensin wrote:
> --On Sunday, March 27, 2016 22:41 -0700 "Murray S. Kucherawy"
> <superuser(_at_)gmail(_dot_)com> wrote:
>> ...
>> And if what you're really producing is regular expressions
>> that match anything that the ABNFs in the mail RFCs will
>> legitimately produce, you might want to do a standards track
>> document that explicitly updates those documents where those
>> ABNFs are listed.
>
> Murray,
>
> That captures my concern about this effort. Based on prior
> experience (including RFC 3696 and even the effort to make
> RFCs 2821 and 5321 internally consistent), it is _really_ easy
> to express a requirement in two different ways and have them be
> _almost_ the same. That is a problem because different people
> will read different docs.
>
> It seems to me that it would be much better to either do this as
> an Informational document that is clearly identified as Sean's
> opinion about regular expressions that impose the same
> requirements as 5321/5322 but that those continue to control or
> to do a standards-track document that contains both the regular
> expressions and ABNF, makes clear which one is primary, and
> updates the syntax requirements of the base specs.

As Dale expressed (thanks!), "BCPs are *standards* not for protocols but
for *things that people do*. So in regard to
[draft-seantek-mail-regexen], the "thing that people do" is "write code
that validates e-mail addresses for further processing". And the point
[...] is that people need to write correct code for validating e-mail
addresses."
Sean's opinion about regular expressions for Mail Identifiers (email
addresses, Message-IDs) is not interesting. If my opinion were all that
interesting, I would just publish it on Stack Overflow and call it a day
(see SO Questions [46155] and [201323]). What is interesting is the
IETF's vetted and (rough)-consensus view on the topic.
This topic is a favorite pet project of programmers. It tends to go:
1) "oh, I know what an email address is! It has dots and alphas and
maybe a hyphen" (WRONG),
2) "oh, I'll just read RFC 5322 and roll my own" (also wrong, but in
more subtle ways...for one, RFC 5322 has distinct syntax from RFC 5321), or
3) "I'm lazy, let's just copy whatever regex shows up on Google first"
(pragmatic, usually not right).
Wouldn't it be better if programmers could uniformly go:
4) "Given my email address recognition problem, I'll just copy the regex
from BCP xyz", rather than spending dozens if not hundreds of hours
poring over email standards documents and testing them against millions
of arcane email address combinations?
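
To make failure mode (1) concrete, here is a minimal sketch. The NAIVE
pattern is a hypothetical stand-in for the "dots and alphas and maybe a
hyphen" intuition; it is not taken from draft-seantek-mail-regexen or
from any particular library. The sample addresses rely only on the
atext characters and the Dot-string rule of RFC 5321.

```python
import re

# Hypothetical "step 1" pattern: alphas, digits, dots, maybe a hyphen.
# This is an illustration of the naive intuition, not a recommended regex.
NAIVE = re.compile(r"[A-Za-z0-9.-]+@[A-Za-z0-9.-]+")

# Local parts that use characters permitted by atext ("+", "'", "_").
VALID_BUT_REJECTED = [
    "user+tag@example.com",
    "o'brien@example.com",
    "first_last@example.com",
]

# A Dot-string may not contain consecutive dots, so this is not valid.
INVALID_BUT_ACCEPTED = "a..b@example.com"


def naive_accepts(addr: str) -> bool:
    """Return True if the naive pattern matches the whole string."""
    return NAIVE.fullmatch(addr) is not None


# Addresses the naive pattern wrongly turns away.
rejected = [a for a in VALID_BUT_REJECTED if not naive_accepts(a)]
```

The naive pattern is wrong in both directions: it rejects all three
addresses in VALID_BUT_REJECTED and accepts INVALID_BUT_ACCEPTED.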
The current draft-seantek-mail-regexen is pretty clear that
it does not attempt to change the Mail standards. If folks want to
change those documents, may I suggest a separate Standards Track
document that does exactly that.
Just because a document is labeled "BCP" (or, for that matter,
"Standards Track") does not mean that every last single statement in the
document is normative and error-free. Otherwise, the RFC 3280 and RFC
5280 PKIX standards that say that you are supposed to compare an entire
email address case-insensitively (Section 4.1.2.6 of RFC 3280, Section
4.2.1.6 of RFC 5280) would have overridden RFCs 5322, 5321, 2822,
2821, etc. We have an errata process.
Basically if the regular expressions are wrong, they need to be made
right. One can complain about problems, or one can fix them.
Turns out that regular expressions and ABNF are homomorphic under
certain conditions. As shown in draft-seantek-mail-regexen, "deliverable
email addresses" (RFC 5321 + RFC 6531) certainly fall in that
definition, as they can be expressed in a regular language (i.e.,
computed with a finite state automaton). Therefore, translating between
the two is basically computationally verifiable. The results may not
look pretty but they will work. Perhaps a bigger problem is one's view
as to how normative ABNF is in the context of IETF standards documents.
It is possible to have ABNF that says somename = *(ALPHA / DIGIT) but
then have normative text that says that <somename> is limited to 31
characters and MUST start with an alphabetic character. Moreover, some
ABNF (RFC 5321 / RFC 5322 in particular) have "obsolete syntax"; whether
to admit such syntax is a highly context-sensitive engineering decision.
Addressing all of these points requires rubbing more than two brain
cells together.
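
As a rough sketch of the <somename> example above: the first pattern is
a direct translation of the ABNF alone, and the second folds in the two
prose constraints (at most 31 characters, first character alphabetic).
The names ABNF_ONLY, ABNF_PLUS_PROSE and matches are mine, chosen for
illustration only.

```python
import re

# Direct translation of the ABNF rule:  somename = *(ALPHA / DIGIT)
ABNF_ONLY = re.compile(r"[A-Za-z0-9]*")

# Same rule with the normative prose folded in: the first character is
# ALPHA, followed by at most 30 more ALPHA / DIGIT (31 characters total).
ABNF_PLUS_PROSE = re.compile(r"[A-Za-z][A-Za-z0-9]{0,30}")


def matches(pattern: re.Pattern, s: str) -> bool:
    """Return True if the pattern matches the whole string."""
    return pattern.fullmatch(s) is not None


# Strings the bare ABNF admits but the prose constraints rule out:
# the empty string, a leading digit, and a 32-character name.
samples = ["", "9lives", "a" * 32]
abnf_only_hits = [s for s in samples if matches(ABNF_ONLY, s)]
with_prose_hits = [s for s in samples if matches(ABNF_PLUS_PROSE, s)]
```

All three samples satisfy the ABNF-only pattern and none satisfies the
pattern with the prose constraints, which is the gap a regex derived
mechanically from the ABNF would miss.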
[46155]:
http://stackoverflow.com/questions/46155/validate-email-address-in-javascript
[201323]:
http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address
> Perhaps a BCP that recommends use of strings that are clearly a
> proper subset of what the standard allows would be ok, but it
> needs to be frightfully clear that it is a recommended subset,
> not a requirement.

I am not really interested in subsets, except those subsets driven by
the standards themselves. (ASCII-only vs. EAI is a reasonable subset,
provided that both expressions are provided. I would rather do EAI-only
but we can be pragmatic about that.)
Best regards,
Sean
_______________________________________________
ietf-smtp mailing list
ietf-smtp(_at_)ietf(_dot_)org
https://www.ietf.org/mailman/listinfo/ietf-smtp