[ietf-smtp] BCP proposal: regular expressions for Internet Mail identifiers

2016-03-22 17:53:13
Greetings IETF-SMTP Gods and Denizens (and dispatch):

Over the winter I worked on a new Internet-Draft that I would like to propose that the IETF adopt: Regular Expressions for Internet Mail. The draft focuses on two identifiers: email addresses and Message-IDs.

The purpose of this standard (proposed as a Best Current Practice) is to have *IETF-vetted* expressions that implementers and non-mail standards authors can plug-and-chug without futzing with trying to interpret 40 years of (occasionally conflicting and arcane) RFCs and implementation lore. There are many non-mail systems out there (read: nearly every web app, reservation system, customer database, etc. on Earth) that use or consume email addresses as identifiers, and their inability to accept the most obvious valid characters (like "+" or even "-"; I have used apps that do not even accept "-") is a great source of interoperability problems. (This document is also relevant to some other threads about the nature of email address identifiers in security artifacts such as certificates, PGP keys, and DNS records: anyone who is vouching for an email address ought to be sure that they are recording something that actually is a valid email address in the first place.) We should get this right now, before Unicode/EAI makes interoperability issues 50000x more expensive to correct.

The document is not meant to modify the mail standards, but merely to reflect and track them as they are updated over time.

As a first draft, the document is in rough shape and has extensive notes about issues that came up during R&D but have yet to be addressed. Significant areas that need adequate treatment include:
1. the impact of Unicode (EAI) on identifiers.
2. handling domain names, which comprise 50% of an email address, but perhaps 85% of the complexity when Unicode gets involved.
3. "deliverable email address" (complying with the modern SMTP infrastructure) vs. other kinds of email addresses (Internet Message Format, historic forms).
4. regular expression engines and grammars (i.e., which grammars to use, which are widely used and produce uniform results).
5. efficiency of the regular expressions.
6. different expressions for validation (testing), part extraction (capturing groups), decoding, encoding, and searching through text.
7. test vectors.
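
To illustrate the difference between a validation expression and a part-extraction expression, and what a small set of test vectors might look like, here is a rough Python sketch. It is not taken from the draft. It assumes an ASCII-only address with a dot-atom local part (the atext characters of RFC 5322) and a domain of at least two letter-digit-hyphen labels. Quoted local parts, domain literals, EAI, and length limits are deliberately left out, and the names VALIDATE, EXTRACT, is_valid, split_address, and TEST_VECTORS are made up for this example.

```python
import re

# Assumption: ASCII-only "atext" characters, per RFC 5322 section 3.2.3.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# Dot-atom: runs of atext separated by single dots (no leading/trailing dot).
DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"
# Assumption: LDH labels, 1-63 characters, no leading or trailing hyphen.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
# Assumption: at least two labels, so single-label hosts are rejected.
DOMAIN = rf"{LABEL}(?:\.{LABEL})+"

# Validation: a yes/no test, no capturing groups.
VALIDATE = re.compile(rf"{DOT_ATOM}@{DOMAIN}")
# Part extraction: the same grammar with named capturing groups.
EXTRACT = re.compile(rf"(?P<local>{DOT_ATOM})@(?P<domain>{DOMAIN})")


def is_valid(addr):
    """Return True if the whole string matches the simplified grammar."""
    return VALIDATE.fullmatch(addr) is not None


def split_address(addr):
    """Return (local_part, domain), or None if the string does not match."""
    m = EXTRACT.fullmatch(addr)
    return (m.group("local"), m.group("domain")) if m else None


# Illustrative test vectors: (input, expected validity under this sketch).
TEST_VECTORS = [
    ("user+tag@example.com", True),
    ("first-last@example.com", True),
    ("a.b@example.co.uk", True),
    (".leading@example.com", False),
    ("double..dot@example.com", False),
    ("user@-bad.example.com", False),
    ("no-at-sign.example.com", False),
]

if __name__ == "__main__":
    for addr, expected in TEST_VECTORS:
        status = "ok" if is_valid(addr) == expected else "MISMATCH"
        print(f"{status}: {addr!r} expected valid={expected}")
```

Both patterns are built from the same fragments so the validation and extraction forms cannot drift apart. A real set of expressions would also have to settle the open questions listed above, such as which engine grammar to target.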

Hopefully the adoption of this work as an IETF item, coupled with input from those with extensive experience

(Thanks to John Levine, Pete Resnick, and others for taking initial questions and discussion on the topic.)
Discussion welcome. Thanks.


   Internet Mail identifiers are used ubiquitously throughout computing
   systems as building blocks of online identity. Unfortunately,
   incomplete understandings of the syntaxes of these identifiers have
   led to interoperability problems and poor user experiences. Many
   users use specific characters in their addresses that are not
   properly accepted on various systems. This document prescribes
   normative regular expression (regex) patterns for all Internet-
   connected systems to use when validating or parsing Internet Mail
   identifiers, with special attention to regular expressions that work
   with popular languages and platforms.

