Hey OpenPGP folks--
I'd like to have a clearer understanding about the actual conventions
for OpenPGP User IDs in the context of e-mail. The standards currently
say that the convention is an RFC2822 "name-addr", but (as detailed
below), that does not appear to be the actual convention in practice.
While we're updating RFC 4880, we should fix the standards to reflect
reality. There are two proposals at the end that i'd love feedback on.
I prefer proposal 2.
Claims about name-addr
RFC 4880 says the following:
5.11. User ID Packet (Tag 13)
A User ID packet consists of UTF-8 text that is intended to represent
the name and email address of the key holder. By convention, it
includes an RFC 2822 [RFC2822] mail name-addr, but there are no
restrictions on its content. The packet length in the header
specifies the length of the User ID.
RFC4880bis repeats the above, and adds:
5.13.2. User ID Attribute Subpacket
A User ID Attribute subpacket, just like a User ID packet, consists
of UTF-8 text that is intended to represent the name and email
address of the key holder. By convention, it includes an RFC 2822
[RFC2822] mail name-addr, but there are no restrictions on its
content. For devices using OpenPGP for device certificates, it may
just be the device identifier. The packet length in the header
specifies the length of the User ID.
Both of these references to RFC 2822 are problematic. Real user IDs
don't look like this, and implementations won't parse things this
way either, so implementers might be led astray by this guidance.
User ID convention is not a name-addr
Here are a few concrete reasons why the convention is not actually an
RFC 2822 name-addr:
a) name-addr in RFC 2822 is defined to be a US-ASCII field, potentially
charset-switched with RFC 2047 extensions in at least the
display-name part. But User IDs are native UTF-8. For example,
compare the following strings:
1) Björn Björnson <bjoern@example.net>
2) Bj=?utf-8?q?=C3=B6?=rn Bj=?utf-8?q?=C3=B6?=rnson <bjoern@example.net>
We expect User IDs to look like (1), even though (2) is technically
an RFC 2822 name-addr. We don't want people to generate user IDs
like (2), and we don't want implementations to try to apply RFC 2047
decoding to the contents of a user ID packet to be able to display
it.
b) name-addr doesn't allow non-quoted internal commas or apostrophes,
so the following common User ID patterns are not technically
name-addrs either, though implementations generate them, and people
use them just fine in the real world:
3) Acme Industries, Inc. <firstname.lastname@example.org>
4) Michael O'Brian <obrian@example.biz>
5) Smith, John <jsmith@example.com>
c) in RFC 2822, a "name-addr" is not the same as a "mailbox" -- a
"mailbox" is either a "name-addr" (which contains an "addr-spec") or
an "addr-spec" on its own. But we have many examples in flight
today of user IDs that are just a raw "addr-spec" (without
angle-brackets), and those tend to be accepted by many OpenPGP
implementations.
d) the "display-name" part of an RFC 2822 "name-addr" is a "phrase" (a
series of "word"s, which are either "atom"s or "quoted-string"s).
An "atom" cannot contain the "@" symbol, so the display-name cannot
contain an unquoted @. However, due to infelicities in common
interfaces, we also see a large number of user IDs that simply
replicate the addr-spec as though it were the display-name. This is
not a valid name-addr, but it is accepted by most OpenPGP
implementations:
7) joe@example.net <joe@example.net>
These differences between RFC 2822's name-addr and the actual user IDs
in use today suggest that the guidance that they are "by convention" a
name-addr is a mistake, and a potentially damaging one at that. It's
likely to cause implementers to do expensive implementations of the
complex name-addr syntax, which they then have to make exceptions for
when they encounter all the real-world counterexamples.
At the same time, we don't want implementers to each have their own
arbitrary deviations from the convention -- the more uniform we can make
the convention, the more likely we'll be to have interoperability.
AFAICT, there is one main, uncontroversial technical goal for an
e-mail-focused OpenPGP implementation when dealing with user IDs:
A) extract the addr-spec
If the implementation can't figure out the addr-spec, it can't use
the certificate to learn how to contact the key holder, and if the
implementation can't index internally by addr-spec, then it can't
find the appropriate certificate to use when trying to contact a
given e-mail address.
What we really want is for every implementation to do this in a robust
and predictable way, including for all of the common non-name-addr forms.
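A minimal sketch of goal A in Python, assuming only the two shapes discussed above: a bare addr-spec, or arbitrary text followed by an addr-spec in angle brackets. The function name extract_addr_spec is made up for illustration and the checks are deliberately loose; it is not taken from any existing OpenPGP implementation.

```python
from typing import Optional


def extract_addr_spec(uid: str) -> Optional[str]:
    """Best-effort extraction of the addr-spec from a User ID string.

    Hypothetical helper for illustration only.
    """
    uid = uid.strip()
    if uid.endswith(">") and "<" in uid:
        # Wrapped form: take whatever sits inside the final "<...>".
        candidate = uid[uid.rindex("<") + 1:-1]
    else:
        # Bare form: the whole User ID is the candidate addr-spec.
        candidate = uid
    local, at, domain = candidate.rpartition("@")
    # Require an "@" with a non-empty local-part and domain.
    if not at or not local or not domain:
        return None
    # Reject candidates that still contain whitespace or angle brackets.
    if any(c.isspace() or c in "<>" for c in candidate):
        return None
    return candidate


if __name__ == "__main__":
    for uid in (
        "Björn Björnson <bjoern@example.net>",
        "Smith, John <jsmith@example.com>",
        "joe@example.net <joe@example.net>",
        "jsmith@example.com",
        "Just A Name",
    ):
        print(repr(uid), "->", extract_addr_spec(uid))
```

Taking the final "<...>" rather than the first means a display part that itself contains "@" or "<" does not confuse the extraction. Whether to normalize case before indexing is left open here.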
Are there any other goals that people think this convention should address?
Some (possibly-contentious) additional goals:
B) accepting UTF-8 addr-specs
recent RFCs about internationalization accept non-ASCII characters
in domain names and local-parts of the addr-spec:
do we expect user agents to be able to extract addr-specs that look
(These examples are from
C) accepting really unusual addr-specs:
the addr-spec definition formally includes some really bizarre
structures that (while probably in use on some legacy systems) are a
really bad idea. For example, localparts that are wrapped in
double-quotes but otherwise contain forbidden characters can be
It looks to me like RFC 5322 even allows CFWS in the local-part,
ugh. Do we expect user agents to do anything sensible with these
A non-goal (does anyone want this?):
D) be able to distinguish the "comment" from the "name" in display-name:
Despite several implementations appearing to distinguish "Comment"
from "Name" in the display-name, it's not clear that anyone *does*
anything with that information, so it's mainly clutter and
confusion. On top of that, there are probably more useless comments
than useful ones, so i'd be happy to let this misfeature die out.
Proposal 1: unicode maybe-wrapped addr-spec
We can address goals A, B, and C with some sort of language that
acknowledges reality if we accept the following:
* addr-spec from RFC 5322 is augmented by the definitions
in RFC 6532 section 3
* there is no structure that we care about in what we would have called
the "display-name" part of the supposed name-addr.
Then the user ID convention becomes (again, assuming atext as augmented
by 6532 §3):
    pgp-uid-prefix-char = atext / specials
    pgp-uid-convention = addr-spec /
                         *pgp-uid-prefix-char "<" addr-spec ">"
Proposal 2: simplify, simplify
Proposal 1 is still pretty ugly due to the inherent complexities of addr-spec.
We can simplify the formal addr-spec greatly if:
- we don't allow CFWS or quoted-string in the local-part, and
- we don't allow CFWS or domain-literal addresses in the domain, and
- we drop all the obsolete variants ("obs-*" labels in RFC 5322 ABNF)
CFWS is "comments and folding whitespace". Dropping comments is
justified by the argument that comments can go elsewhere in the user ID.
Folding-whitespace isn't necessary due to the structure of the user
ID itself -- we're not in an e-mail message header. Dropping obsolete
parts is justified because they're obsolete. Dropping quoted-string is
justified because it's rarely used, and likely to break in reality. And
dropping domain-literal parts is justified because no one delivers
e-mail to raw IP addresses anyway.
Note that yes, this will discard some legitimate (if odd) addresses
(e.g. ones with CFWS or quoted-string), and it may fail to recognize
some legacy (odd) user IDs (obs-* or domain-literal). But we're
describing a convention here, not making a normative statement, and we
can do much better than the convention described earlier, which
pretty much every implementation fails to follow.
Using the definitions in RFC 5322 and RFC 5234, as augmented by RFC 6532
section 3, we can implement this simplification like so:
    pgp-addr-spec = dot-atom-text "@" dot-atom-text
    pgp-uid-prefix-char = atext / specials
    pgp-uid-convention = pgp-addr-spec /
                         *pgp-uid-prefix-char "<" pgp-addr-spec ">"
Note that every pgp-addr-spec is by definition an addr-spec (though not
all addr-specs are a pgp-addr-spec).
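To make the matching rule concrete, here is a rough Python translation of the Proposal 2 ABNF into a regular expression. It is a sketch under stated assumptions, not a reference implementation: UTF8-non-ascii from RFC 6532 is approximated as "any code point from U+0080 up", the names are invented for illustration, and one deviation is marked in the code. As written, pgp-uid-prefix-char contains no whitespace, so a plain space is added to the prefix class so that User IDs of the "Name <addr>" shape match at all.

```python
import re
from typing import Optional

# atext punctuation and specials, from RFC 5322 section 3.2.3.
ATEXT_PUNCT = "!#$%&'*+-/=?^_`{|}~"
SPECIALS = "()<>[]:;@\\,.\""

# RFC 6532 section 3 adds UTF8-non-ascii to atext; on a Python str this
# is approximated as any code point from U+0080 upward.
_ATEXT_SET = "A-Za-z0-9" + re.escape(ATEXT_PUNCT) + "\u0080-\U0010FFFF"
_ATEXT = "[" + _ATEXT_SET + "]"

# dot-atom-text = 1*atext *("." 1*atext)
_DOT_ATOM_TEXT = _ATEXT + r"+(?:\." + _ATEXT + "+)*"

# pgp-addr-spec = dot-atom-text "@" dot-atom-text
_PGP_ADDR_SPEC = _DOT_ATOM_TEXT + "@" + _DOT_ATOM_TEXT

# ASSUMPTION: the ABNF has no whitespace in pgp-uid-prefix-char; a plain
# space is added here so that "Name <addr>" style User IDs can match.
_PREFIX_CHAR = "[" + _ATEXT_SET + re.escape(SPECIALS) + " ]"

# pgp-uid-convention = pgp-addr-spec /
#                      *pgp-uid-prefix-char "<" pgp-addr-spec ">"
PGP_UID_CONVENTION = re.compile(
    f"{_PREFIX_CHAR}*<({_PGP_ADDR_SPEC})>|({_PGP_ADDR_SPEC})"
)


def pgp_uid_addr_spec(uid: str) -> Optional[str]:
    """Return the pgp-addr-spec if uid matches the convention, else None."""
    m = PGP_UID_CONVENTION.fullmatch(uid)
    if m is None:
        return None
    return m.group(1) or m.group(2)


if __name__ == "__main__":
    for uid in (
        "Acme Industries, Inc. <firstname.lastname@example.org>",
        "joe@example.net <joe@example.net>",
        "joe@example.net",
        '"odd local"@example.net',
        "joe@[192.0.2.1]",
    ):
        print(repr(uid), "->", pgp_uid_addr_spec(uid))
```

Quoted-string local-parts and domain-literals come back as None, which is exactly the ground Proposal 2 gives up on goal C. Because the non-ASCII range is accepted wholesale, a local-part containing U+200B also matches.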
I believe that proposal 2 is closer to what most implementations do
today, and it handles goals A and B. I don't mind it failing at goal
C because of how much simpler the matching rule is.
My preference is to replace the text about User ID conventions in RFC
4880bis with proposal 2, but i'd be open to hearing other suggestions if
anyone has them.
PS in researching other ways to solve this problem, i came up with an
approach that relies on Unicode character properties, in particular
Grapheme_Base and Grapheme_Extend as a way to exclude control chars
and other non-printables. This is a more sophisticated/nuanced
approach than the RFC 6532 ABNF extensions to atext. But specifying
it requires a character class set subtraction operation (you want to
subtract "<" and ">" and "@" and " " from the Grapheme_* classes),
which isn't listed in IETF's ABNF definition in RFC 5234. And
implementing it requires a toolkit capable of discerning and acting
on Unicode properties (e.g. the python regex module from PyPI, but
not the re module from python's stdlib). That's too bad, because
6532 §3 effectively makes things like U+200B ZERO WIDTH SPACE
allowable within dot-atom-text, which is uncomfortable and weird.
But other implementers reliant on 6532 might accept such a localpart
anyway. These costs don't appear to be worth the minor gain compared
to proposal 2, so i've stopped attempting to document that approach.
If anyone wants to take a crack at it though, i'm happy to share my