[openpgp] User ID conventions (it's not really a RFC2822 name-addr)

Hey OpenPGP folks--

I'd like to have a clearer understanding of the actual conventions
for OpenPGP User IDs in the context of e-mail.  The standards currently
say that the convention is an RFC2822 "name-addr", but (as detailed
below), that does not appear to be the actual convention in practice.

While we're updating RFC 4880, we should fix the standards to reflect
reality.  There are two proposals at the end that i'd love feedback on.
I prefer proposal 2.

Claims about name-addr
----------------------

RFC 4880 says the following:

    5.11.  User ID Packet (Tag 13)

    A User ID packet consists of UTF-8 text that is intended to represent
    the name and email address of the key holder.  By convention, it
    includes an RFC 2822 [RFC2822] mail name-addr, but there are no
    restrictions on its content.  The packet length in the header
    specifies the length of the User ID.

RFC4880bis repeats the above, and adds:

    5.13.2.  User ID Attribute Subpacket

    […]

    A User ID Attribute subpacket, just like a User ID packet, consists
    of UTF-8 text that is intended to represent the name and email
    address of the key holder.  By convention, it includes an RFC 2822
    [RFC2822] mail name-addr, but there are no restrictions on its
    content.  For devices using OpenPGP for device certificates, it may
    just be the device identifier.  The packet length in the header
    specifies the length of the User ID.

Both of these references to RFC 2822 are problematic.  Real User IDs
don't look like this, and existing implementations don't parse them this
way either, so implementers might be led astray by this
documentation.

User ID convention is not a name-addr
-------------------------------------

Here are a few concrete reasons why the convention is not actually an
RFC 2822 name-addr:

 a) name-addr in RFC 2822 is defined to be a US-ASCII field, potentially
    charset-switched with RFC 2047 extensions in at least the
    display-name part.  But User IDs are native UTF-8.  For example,
    compare the following strings:

     1) Björn Björnson <bjoern@example.net>
     2) Bj=?utf-8?q?=C3=B6?=rn Bj=?utf-8?q?=C3=B6?=rnson <bjoern@example.net>

    We expect User IDs to look like (1), even though (2) is technically
    an RFC 2822 name-addr.  We don't want people to generate user IDs
    like (2), and we don't want implementations to try to apply RFC 2047
    decoding to the contents of a user ID packet to be able to display
    it.

 b) name-addr doesn't allow non-quoted internal commas or apostrophes,
    so the following common User ID patterns are not technically
    name-addrs either, though implementations generate them, and people
    use them just fine in the real world:

     3) Acme Industries, Inc. <info@acme.example>
     4) Michael O'Brian <obrian@example.biz>
     5) Smith, John <jsmith@example.com>

 c) in RFC 2822, a <name-addr> is not the same as a "mailbox" -- a
    "mailbox" is either a "name-addr" (which contains an "addr-spec") or
    an "addr-spec" on its own.  But we have many examples in flight
    today of user IDs that are just a raw "addr-spec" (without
    angle-brackets), and those tend to be accepted by many OpenPGP
    implementations:

     6) mariag@example.org

 d) the "display-name" part of an RFC 2822 "name-addr" is a "phrase" (a
    series of "word"s, which are either "atom"s or "quoted-string"s).
    An "atom" cannot contain the "@" symbol, so the display-name cannot
    contain an unquoted @.  However, due to infelicities in common
    interfaces, we also see a large number of user IDs that simply
    replicate the addr-spec as though it were the display-name.  This
    is not a valid name-addr, but it is accepted by most OpenPGP
    implementations.

    For example:

     7) joe@example.net <joe@example.net>

These differences between RFC 2822's name-addr and the actual user IDs
in use today suggest that the guidance that they are "by convention" a
name-addr is a mistake, and a potentially damaging one at that.  It's
likely to cause implementers to build costly implementations of the
complex name-addr syntax, and then to carve out exceptions when they
encounter all the real-world counterexamples.

At the same time, we don't want implementers to each have their own
arbitrary deviations from the convention -- the more uniform we can make
the convention, the more likely we'll be to have interoperability.

Goals
-----

AFAICT, there is one main, uncontroversial technical goal for an
e-mail-focused OpenPGP implementation when dealing with user IDs:

 A) extract the addr-spec

    If an implementation can't figure out the addr-spec, it can't use
    the certificate to learn how to contact the key holder.  And if the
    implementation can't index internally by addr-spec, it can't find
    the appropriate certificate to use when trying to contact a given
    e-mail address.
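
As a rough illustration of goal A (a sketch only, not part of either
proposal below), the extraction can be done without any RFC 2822 parsing.
The function name and the exact acceptance rules here are assumptions
made for this example:

```python
def extract_addr_spec(uid):
    """Return the e-mail address embedded in a User ID, or None.

    Hypothetical helper: takes whatever sits between the last "<" and a
    trailing ">", or the whole string if there are no angle brackets.
    """
    uid = uid.strip()
    if uid.endswith(">"):
        start = uid.rfind("<")
        if start == -1:
            return None
        candidate = uid[start + 1:-1]
    else:
        candidate = uid
    # Require a non-empty local-part and domain, and no whitespace.
    local, at, domain = candidate.rpartition("@")
    if not at or not local or not domain:
        return None
    if any(ch.isspace() for ch in candidate):
        return None
    return candidate
```

This deliberately ignores everything before the "<", so forms like
"Smith, John <jsmith@example.com>" and "joe@example.net <joe@example.net>"
are handled the same way as a plain name.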

What we really want is for every implementation to do this in a robust
and predictable way, including for all of the common non-name-addr forms
described above.

Are there any other goals that people think this convention should
cover?

Some (possibly-contentious) additional goals:

 B) accepting UTF-8 addr-specs

    recent RFCs about internationalization accept non-ASCII characters
    in domain names and local-parts of the addr-spec:

    https://tools.ietf.org/html/rfc6530#section-10.1
    https://tools.ietf.org/html/rfc6532#section-3.2

    do we expect user agents to be able to extract addr-specs that look
    like:

        иван.сергеев@пример.рф
        Dörte@Sörensen.example.com

    (These examples are from
    https://en.m.wikipedia.org/wiki/International_email)

 C) accepting really unusual addr-specs:

    the addr-spec definition formally includes some really bizarre
    structures that (while probably in use on some legacy systems) are a
    really bad idea.  For example, localparts that are wrapped in
    double-quotes but otherwise contain forbidden characters can be
    problematic:

       "Abc@def"@example.com
       "Fred Bloggs"@example.com

    It looks to me like RFC 5322 even allows CFWS in the local-part,
    ugh.  Do we expect user agents to do anything sensible with these
    addresses?
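
To make the difficulty in goal C concrete, here is a small sketch
(hypothetical helper name, not taken from any implementation) showing why
a naive split on "@" goes wrong for quoted local-parts, and how splitting
on the last "@" copes with the two examples above.  It still makes no
attempt to handle CFWS or domain-literals:

```python
def split_addr_spec(addr_spec):
    """Split an addr-spec into (local-part, domain) at the LAST "@".

    Hypothetical helper for illustration only.
    """
    local, at, domain = addr_spec.rpartition("@")
    if not at:
        raise ValueError("no @ in addr-spec")
    return local, domain


# Splitting on the first "@" cuts a quoted local-part in half.
naive_local = '"Abc@def"@example.com'.split("@")[0]

# Splitting on the last "@" keeps the quoted local-part intact.
local, domain = split_addr_spec('"Abc@def"@example.com')
```

The same helper works unchanged on the UTF-8 addresses from goal B,
since Python strings are sequences of code points rather than bytes.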

A non-goal (does anyone want this?):

 D) be able to distinguish the "comment" from the "name" in display-name:

    Despite several implementations appearing to distinguish "Comment"
    from "Name" in the display-name, it's not clear that anyone *does*
    anything with that information, so it's mainly clutter and
    confusion.  On top of that, there are probably more useless comments
    than useful ones, so i'd be happy to let this misfeature die out.


Proposal 1: unicode maybe-wrapped addr-spec
-------------------------------------------

We can address goals A, B, and C with some sort of language that
acknowledges reality if we accept the following:

 * addr-spec from RFC 5322 is augmented by the definitions
   in RFC 6532 section 3

 * there is no structure that we care about in what we would have called
   the "display-name" part of the supposed name-addr.

Then the user ID convention becomes (again, assuming atext as augmented
by 6532 §3):

    pgp-uid-prefix-char    = atext / specials

    pgp-uid-convention     = addr-spec /
                             *pgp-uid-prefix-char "<" addr-spec ">"


Proposal 2: simplify, simplify
------------------------------

Proposal 1 is still pretty ugly due to the inherent complexities of
addr-spec itself.

We can simplify the formal addr-spec greatly if:

 - we don't allow CFWS or quoted-string in the local-part, and
 - we don't allow CFWS or domain-literal addresses in the domain, and
 - we drop all the obsolete variants ("obs-*" labels in RFC 5322 ABNF)

CFWS is "comments and folding whitespace".  Dropping comments is
justified by the argument that comments can go elsewhere in the user ID.
Folding-whitespace isn't necessary due to the structure of the user
ID itself -- we're not in an e-mail message header.  Dropping obsolete
parts is justified because they're obsolete.  Dropping quoted-string is
justified because it's rarely used, and likely to break in reality.  And
dropping domain-literal parts is justified because no one delivers
e-mail to raw IP addresses anyway.

Note that yes, this will discard some legitimate (if odd) addresses
(e.g. ones with CFWS or quoted-string), and it may fail to recognize
some legacy (odd) user IDs (obs-* or domain-literal).  But we're
describing a convention here, not making a normative statement, and we
can do much better than the convention described earlier, which pretty
much every implementation fails to follow.

Using the definitions in RFC 5322 and RFC 5234, as augmented by RFC 6532
section 3, we can implement this simplification like so:

    pgp-addr-spec          = dot-atom-text "@" dot-atom-text
    
    pgp-uid-prefix-char    = atext / specials
    
    pgp-uid-convention     = pgp-addr-spec /
                             *pgp-uid-prefix-char "<" pgp-addr-spec ">"

Note that every pgp-addr-spec is by definition an addr-spec (though not
all addr-specs are a pgp-addr-spec).
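
The ABNF above translates fairly directly into a regular expression.
The following is a sketch using Python's stdlib re module, not a
reference implementation.  One assumption is made loudly: atext and
specials in RFC 5322 do not include the space character, so a literal
translation of pgp-uid-prefix-char would reject "Björn Björnson <...>";
this sketch adds a plain space to the prefix class.  The variable and
function names are invented for the example.

```python
import re

# atext from RFC 5322, plus every non-ASCII code point (RFC 6532 section 3).
ATEXT = r"A-Za-z0-9!#$%&'*+/=?^_`{|}~\-\u0080-\U0010FFFF"
# specials from RFC 5322.
SPECIALS = r'()<>\[\]:;@\\,."'

DOT_ATOM_TEXT = rf"[{ATEXT}]+(?:\.[{ATEXT}]+)*"
PGP_ADDR_SPEC = rf"{DOT_ATOM_TEXT}@{DOT_ATOM_TEXT}"

# ASSUMPTION: the trailing " " in the prefix class is not in the ABNF;
# it is added here so that names containing spaces match.
PGP_UID_CONVENTION = re.compile(
    rf"(?:[{ATEXT}{SPECIALS} ]*<(?P<wrapped>{PGP_ADDR_SPEC})>"
    rf"|(?P<bare>{PGP_ADDR_SPEC}))"
)


def pgp_addr_spec_of(uid):
    """Return the pgp-addr-spec of a conventional User ID, else None."""
    m = PGP_UID_CONVENTION.fullmatch(uid)
    if m is None:
        return None
    return m.group("wrapped") or m.group("bare")
```

With this pattern the seven example User IDs from earlier in this
message all yield an address, while quoted-string local-parts and
domain-literals yield None, which is the trade-off described above.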


I believe that proposal 2 is closer to what most implementations do
today, and it handles goals A and B.  I don't mind it failing at goal
C because of how much simpler the matching rule is.

Conclusion
----------

My preference is to replace the text about User ID conventions in RFC
4880bis with proposal 2, but i'd be open to hearing other suggestions if
anyone has them.

        --dkg

PS in researching other ways to solve this problem, i came up with an
   approach that relies on Unicode character properties, in particular
   Grapheme_Base and Grapheme_Extend as a way to exclude control chars
   and other non-printables.  This is a more sophisticated/nuanced
   approach than the RFC 6532 ABNF extensions to atext.  But specifying
   it requires a character class set subtraction operation (you want to
   subtract "<" and ">" and "@" and " " from the Grapheme_* classes),
   which isn't listed in IETF's ABNF definition in RFC 5234.  And
   implementing it requires a toolkit capable of discerning and acting
   on Unicode properties (e.g. the python regex module from PyPI, but
   not the re module from python's stdlib).  That's too bad, because
   6532 §3 effectively makes things like U+200B ZERO WIDTH SPACE
   allowable within dot-atom-text, which is uncomfortable and weird.
   But other implementers reliant on 6532 might accept such a localpart
   anyway. These costs don't appear to be worth the minor gain compared
   to proposal 2, so i've stopped attempting to document that approach.
   If anyone wants to take a crack at it though, i'm happy to share my
   notes.
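
   For anyone curious what the set-subtraction idea looks like in
   practice, here is a rough approximation using only python's stdlib.
   It does NOT use Grapheme_Base or Grapheme_Extend (the stdlib doesn't
   expose them); instead it assumes that excluding a handful of General
   Categories is close enough for illustration, so it may differ from
   the real properties for some characters.  All names are made up for
   this sketch.

```python
import unicodedata

# ASSUMPTION: these General Categories stand in for "not Grapheme_Base
# and not Grapheme_Extend": controls, format characters, surrogates,
# private use, unassigned, line separators, paragraph separators.
EXCLUDED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp"}

# Characters subtracted explicitly because they delimit the User ID.
EXCLUDED_CHARS = set("<>@ ")


def is_uid_atom_char(ch):
    """True if ch would be allowed in this approximate character class."""
    if ch in EXCLUDED_CHARS:
        return False
    return unicodedata.category(ch) not in EXCLUDED_CATEGORIES
```

   Under this approximation U+200B (General Category Cf) is rejected,
   whereas the 6532 §3 extension to atext accepts it.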
