Re: [openpgp] User ID conventions (it's not really a RFC2822 name-addr)

I'm considering using the following "grammar".  (I've put grammar in
scare quotes, because it is not a valid grammar according to RFC 5322
due to several ambiguities.  In particular, the production "*WS
[name] *WS" is ambiguous when applied to a string containing a single
whitespace character: the whitespace character could match the first
WS or the second one.  In practice, this ambiguity doesn't matter,
because we only care about what the "name", "comment-content" and
"addr-spec" productions match.)

     WS                 = 0x20 (space character)

     comment-specials   = "<" / ">" /   ; RFC 2822 specials minus "(" and ")"
                          "[" / "]" /
                          ":" / ";" /
                          "@" / "\" /
                          "," / "." /
                          DQUOTE

     atext-specials     = "(" / ")" /   ; RFC 2822 specials minus "<" and ">"
                          "[" / "]" /
                          ":" / ";" /
                          "@" / "\" /
                          "," / "." /
                          DQUOTE

     atext              = ALPHA / DIGIT /   ; Any character except controls,
                          "!" / "#" /       ;  SP, and specials.
                          "$" / "%" /       ;  Used for atoms
                          "&" / "'" /
                          "*" / "+" /
                          "-" / "/" /
                          "=" / "?" /
                          "^" / "_" /
                          "`" / "{" /
                          "|" / "}" /
                          "~" /
                          \u{80}-\u{10ffff} ; Non-ascii, non-control UTF-8

     name-char-start    = atext / atext-specials

     name-char-rest     = atext / atext-specials / WS

     name               = name-char-start *name-char-rest

     comment-char       = atext / comment-specials / WS

     comment-content    = *comment-char

     comment            = "(" *WS comment-content *WS ")"

     addr-spec          = dot-atom-text "@" dot-atom-text

     pgp-uid-convention = addr-spec /
                          *WS [name] *WS [comment] *WS "<" addr-spec ">" /
                          *WS name *WS [comment] *WS
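
A rough Python sketch of how the grammar above could be matched with
regular expressions.  This is one possible reading, not a reference
implementation.  Assumptions: dot-atom-text is taken from RFC 2822
(1*atext *("." 1*atext)) using the atext defined above, the three
alternatives are tried in the order written, and lazy matching is used
so that ambiguous input prefers a comment over a longer name.  The
names parse_user_id and ParsedUserId are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Character classes transcribed from the grammar (regex class bodies).
ATEXT = r"A-Za-z0-9!#$%&'*+\-/=?^_`{|}~\u0080-\U0010ffff"
ATEXT_SPECIALS = r'()\[\]:;@\\,."'      # specials minus "<" and ">"
COMMENT_SPECIALS = r'<>\[\]:;@\\,."'    # specials minus "(" and ")"

NAME_START = "[" + ATEXT + ATEXT_SPECIALS + "]"
NAME_REST = "[" + ATEXT + ATEXT_SPECIALS + " ]"
COMMENT_CHAR = "[" + ATEXT + COMMENT_SPECIALS + " ]"

# Assumption: dot-atom-text as in RFC 2822, built from the atext above.
DOT_ATOM = "[" + ATEXT + r"]+(?:\.[" + ATEXT + "]+)*"
ADDR = "(?P<addr>" + DOT_ATOM + "@" + DOT_ATOM + ")"

# Lazy quantifiers keep surrounding whitespace out of the captures.
NAME = "(?P<name>" + NAME_START + NAME_REST + "*?)"
COMMENT = r"(?:\( *(?P<comment>" + COMMENT_CHAR + r"*?) *\))"

# The three alternatives of pgp-uid-convention, in grammar order.
ALTERNATIVES = [
    re.compile(ADDR),
    re.compile(" *" + NAME + "?? *" + COMMENT + "? *<" + ADDR + ">"),
    re.compile(" *" + NAME + " *" + COMMENT + "? *"),
]


class ParsedUserId(NamedTuple):
    name: Optional[str]
    comment: Optional[str]
    addr: Optional[str]


def parse_user_id(uid: str) -> Optional[ParsedUserId]:
    """Return the matched components, or None if no alternative matches."""
    for pattern in ALTERNATIVES:
        match = pattern.fullmatch(uid)
        if match is not None:
            groups = match.groupdict()
            return ParsedUserId(
                groups.get("name"), groups.get("comment"), groups.get("addr")
            )
    return None
```

With this sketch, parse_user_id("Alice (work) <alice@example.org>")
yields the name "Alice", the comment "work" and the address
"alice@example.org", while a string that matches none of the
alternatives simply yields None.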

Beyond being more fleshed out, this grammar is different from the
grammar in dkg's second proposal in a few ways.

First, it matches comments.  dkg made this a non-goal.  Given that
people who add comments intend them as comments and not as part of
their name, it seems reasonable to me to not display comments in
places where only the user's name is desired.  And, since it turns out
that matching non-nested comments is relatively straightforward, why
not?  Note: doing this might actually help deprecate comments, because
they won't be shown as often.

The grammar more carefully handles whitespace.  It ignores whitespace
at the beginning of the User ID (this is what motivates the
name-char-start production) and between the individual components in
the pgp-uid-convention production.  As is, the grammar only ignores
the 0x20 space character.  We may also want to include the tab
character, Unicode's NO-BREAK SPACE (U+00A0) character and its
IDEOGRAPHIC SPACE (U+3000) character for thoroughness.  But, since
software will normally concatenate the individual components, just
recognizing the ASCII space character here is probably fine.  Whatever
the case, I think we can safely ignore the rest of Unicode's
whitespace characters:

  https://en.wikipedia.org/wiki/Whitespace_character
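
If the wider set were adopted, trimming it could look roughly like the
following Python sketch.  The helper name strip_uid_whitespace is
hypothetical.  Note that Python's str.strip() without an argument
removes every Unicode whitespace character, which is broader than what
is suggested here, so the set is spelled out explicitly.

```python
# ASCII space (the grammar's WS) plus the three extra candidates:
# tab, NO-BREAK SPACE (U+00A0) and IDEOGRAPHIC SPACE (U+3000).
UID_WHITESPACE = " \t\u00a0\u3000"


def strip_uid_whitespace(text: str) -> str:
    """Strip only the listed characters from both ends of text."""
    return text.strip(UID_WHITESPACE)
```

Any other Unicode whitespace, such as EM SPACE (U+2003), is left in
place by this helper.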

My pgp-uid-convention production also matches User IDs without email
addresses, e.g., "Daniel Kahn Gillmor".  This is convenient.  Instead
of having to figure out why parsing failed (is it not valid UTF-8? is
it just missing an addr-spec?), we explicitly cover this common
pattern in the grammar.  I think this will significantly simplify code
that uses this interface: if there is an error, then the code can just
assume the User ID is trash and can be ignored.

In RFC 2822, "specials" are only allowed in a display name if they are
quoted.  dkg removes this requirement.  I think this is mostly
sensible, but it means that we can have User IDs like
"<foo@example.org> <foo@example.org>", where the first
<foo@example.org> is the display name and the second is the
addr-spec.  I think we should exclude angle brackets from the display
name.  In my grammar, I have an "atext-specials" production, which is
just the RFC 2822 specials without the angle brackets.
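
As an illustration, here is a small Python sketch of the "name"
production on its own, checked character by character.  The function
names are made up for this example, and the character sets are
transcribed from the atext and atext-specials productions.

```python
ATEXT_PUNCT = "!#$%&'*+-/=?^_`{|}~"
ATEXT_SPECIALS = "()[]:;@\\,.\""   # RFC 2822 specials minus "<" and ">"


def is_name_char(ch: str, first: bool = False) -> bool:
    """True if ch may appear in a name (a space may not come first)."""
    if ch == " ":
        return not first
    if ord(ch) >= 0x80:
        return True                # the grammar's non-ASCII range
    if ch.isalnum():
        return True                # ASCII ALPHA / DIGIT
    return ch in ATEXT_PUNCT or ch in ATEXT_SPECIALS


def is_name(text: str) -> bool:
    """True if text matches name-char-start *name-char-rest."""
    return (
        bool(text)
        and is_name_char(text[0], first=True)
        and all(is_name_char(ch) for ch in text[1:])
    )
```

Under this reading, is_name("Daniel Kahn Gillmor") holds, while
is_name("<foo@example.org>") does not, because "<" and ">" are in
neither atext nor atext-specials.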


I'm a bit concerned about allowing the backslash character: with this
grammar, it is just a normal character, but for an RFC 2822 parser,
it's an escape character.  Since User IDs may be used in contexts
where RFC 2822 things are expected, we should be careful.  But, I fear
that if we reject it, we'll end up gratuitously rejecting some
emojis.  ¯\_(ツ)_/¯.
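
To make the concern concrete, here is a deliberately simplified Python
stand-in for quoted-pair handling.  The helper is hypothetical and not
taken from any mail library, and a real RFC 2822 parser only treats
the backslash as an escape inside quoted strings, comments and domain
literals.  The point is only that a consumer applying escape semantics
sees a different string than the one stored in the User ID.

```python
import re


def unescape_quoted_pairs(text: str) -> str:
    """Replace each backslash-plus-character pair with the character."""
    return re.sub(r"\\(.)", r"\1", text)


shrug = "¯\\_(ツ)_/¯"            # the literal User ID text, one backslash
seen_by_escaping_consumer = unescape_quoted_pairs(shrug)
```

Here the backslash disappears from the result, so the two views of the
same User ID no longer agree.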


:) Neal
