Re: Last Call: <draft-ietf-dane-openpgpkey-07.txt>

Executive summary for those who don't like my long and detailed
messages: In order to accommodate the extended mailboxes
permitted by RFC 6530 etc., this spec allows non-ASCII
addresses.  However, the use of almost-arbitrary Unicode strings
(in UTF-8) introduces issues of ambiguity in lower casing and
normalization, issues that are not addressed by either the
document or Edwin's proposal.  The document itself appears to me
to be unacceptable, even for publication as a recommended
experiment, unless those issues are addressed.

Details inline.
 

--On Sunday, February 14, 2016 3:04 PM +0000 E Taylor
<hagfish(_at_)hagfish(_dot_)name> wrote:

...
On the topic of lowercasing, it seems that there are still
differing opinions, and this is potentially reflected by the
implementations that now exist.  For example, the author has
pointed out some examples of clients which force lowercasing,
and I've checked that Mail.de (which supports adding OpenPGP
key information to the DNS[0]) replaces uppercase letters with
their lowercase equivalent when choosing a username at sign-up
(so presumably stores only an entry for the lowercase version
in the DNS).  The online tester at openpgpkey.info[1] (run by
Mail.de) also forces email addresses to lowercase before
searching.
...
My suggestion for a consensus, therefore, is that the draft
recommend that clients attempt the case sensitive lookup
first, and then fall back to a lowercase lookup if that fails
(ideally informing the user that it has done this).  For the
rare situation where a user specifies an email address with
uppercase characters in, this will result in an extra query,
but in the rarer situation that the lowercase version doesn't
exist (or represents a different user) then this provides a
worthwhile security benefit.  Moreover, I think that if the
draft doesn't mention the possibility of lowercasing, then
client implementers will either force lowercasing out of
habit, or make their software search for both just to be sure,
as I have outlined above.


Temporarily and for purposes of discussion, assume I agree with
the above as far as it goes (see below).   Given that, what do
you, and the systems you have tested, propose to do about
addresses that contain non-ASCII characters in the local-part
(explicitly allowed by the present spec)?  Given that lowercasing
[1] and case folding are different and produce different results,
and that both are language-sensitive in a number of cases, what
specifically do you think the spec should recommend?
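
To make the lowercasing versus case folding distinction concrete,
here is a minimal Python sketch.  It is my illustration only, not
something taken from the draft.  It uses the built-in str.lower()
and str.casefold() methods, which apply Unicode's default,
locale-independent mappings; the sample local-parts are
hypothetical.

```python
# Illustration only: Python's built-in, locale-independent case mappings.
# The sample local-parts below are hypothetical.
samples = ["STRASSE", "Straße", "İsmail"]


def compare(local_part):
    """Return the (lowercased, case-folded) forms of a local-part."""
    return local_part.lower(), local_part.casefold()


results = {s: compare(s) for s in samples}

for original, (lowered, folded) in results.items():
    verdict = "same" if lowered == folded else "DIFFERENT"
    print(repr(original), "lower:", repr(lowered),
          "casefold:", repr(folded), verdict)
```

With these defaults, "Straße" lowercases to itself but case-folds
to "strasse", and neither method knows that a Turkish user would
expect "I" to map to a dotless lower case letter.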

Also, do you think it is acceptable to publish this document
with _any_ suggestions about lower-casing or "try this, then try
something else" search without at least an "Internationalization
Considerations" section that would discuss the issues [1] and/or
some more specific recommendation than "try lowercase" (more on
that, with a different problem case, below)?

Dropping that assumption of agreement for discussion, I
personally believe that this document could be acceptable _as an
Experimental spec_ with any of the following three models, but
not without at least one of them:

 (i) The present "MUST not try to guess" text.

 (ii) A recommendation about lowercasing along the lines
        you have outlined but with a clear discussion of i18n
        issues and how to handle them [2].

 (iii) A clear statement that the experiment is just an
        experiment and that, for the purposes of the experiment,
        addresses that contain non-ASCII characters in the local
        part are not acceptable (note that would also require
        pulling the UTF-8 discussion out of Section 3 and
        dropping the references to RFC 6530 and friends).

To be sure I understand what you are suggesting and save a
separate note, neither the EAI specs (RFC 6530 et al.) nor the
text in Section 3 of the current document specifies that
local-part strings are required to be normalized.   For such
strings, even when they are entirely in lower case when
presented by the user, there may be multiple different forms,
e.g., 
   U+0066 U+006F U+0308 U+006F   and
   U+0066 U+00F6 U+006F
are perfectly good (and SMTPUTF8-valid) representations of the
string "föo".

Using the same theory as your lower case approach, would you
recommend trying first one of those and then the other [3]?
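
For anyone who wants to see the effect, the following Python
sketch compares the two sequences above.  The label_for() helper
is hypothetical: it assumes, purely for illustration, that the
lookup label is the hexadecimal form of a SHA2-256 hash of the
UTF-8 local-part truncated to 28 octets.  Any scheme that hashes
the raw octets would behave the same way.

```python
import hashlib
import unicodedata

# The two representations of the same visible string from the text above.
decomposed = "\u0066\u006F\u0308\u006F"   # f, o, COMBINING DIAERESIS, o
precomposed = "\u0066\u00F6\u006F"        # f, o-with-diaeresis, o


def label_for(local_part):
    """Hypothetical label: SHA2-256 over UTF-8, cut to 28 octets (56 hex digits)."""
    digest = hashlib.sha256(local_part.encode("utf-8")).hexdigest()
    return digest[:56]


# The strings differ as code point sequences and as UTF-8 octets...
print(decomposed == precomposed)
print(decomposed.encode("utf-8").hex(), precomposed.encode("utf-8").hex())

# ...but they are identical once both are normalized to NFC.
same_after_nfc = (unicodedata.normalize("NFC", decomposed)
                  == unicodedata.normalize("NFC", precomposed))
print(same_after_nfc)

# Without normalization, the two forms yield two unrelated labels.
print(label_for(decomposed))
print(label_for(precomposed))
```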

The more I think about it, the more I'm convinced that the
specification and allowance for UTF-8 [4] in the first bullet of
Section 3 is unacceptable without either text there that much
more carefully describes (and specifies what to do about) these
cases or an "Internationalization Considerations" section that
provides the same information.  I suggest that anyone
contemplating writing such text carefully study (not just
reference) Section 10.1 of RFC 6530.   Of course, simply
excluding non-ASCII local-parts from the experiment, as
suggested in (iii) above, would be an alternative.  I have mixed
feelings about whether it would be an acceptable one for an
experiment.  I am quite sure it would not be acceptable for a
standards-track document when the EAI work and/or the IETF
commitment to diversity are considered.

john


[1] For the benefit of those who are blissfully unaware of these
problems and sticking to Latin script as an example, consider
the lower case form of "A".   Define "lower case form" as
characters that can produce "A" when upper-cased under at least
some circumstances.  The examples toward the right are less likely
than those to the left, but all are lower case forms for "A".
   a à á â ã ä

If that example doesn't cause either an insight or an adequately
bad headache, consider the lower case possibilities for "I",
including whether they are dotted or dotless and that either the
dotted or dotless forms can appear in combination with some of
the diacritical markings above.
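
A small Python check of the dotted/dotless case, offered only as
an illustration and using nothing beyond the built-in str.upper()
and str.lower() methods:

```python
# Both the dotted and the dotless lower case letters upper-case to plain "I"
# under Unicode's default mappings, so "I" alone cannot tell them apart.
dotted = "i"
dotless = "\u0131"   # LATIN SMALL LETTER DOTLESS I

candidates = [c for c in (dotted, dotless) if c.upper() == "I"]

# The default, locale-independent lowercasing always picks the dotted form.
default_lower = "I".lower()

print([hex(ord(c)) for c in candidates], hex(ord(default_lower)))
```

The accented forms of "a" above are a different matter: they map
to "A" only by typographic convention, not through any Unicode
case mapping, so no library call will enumerate them.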

[2] I note that, historically, the DNS community has been very
reluctant to accept techniques that depend on or imply multiple
lookups for a single perceived object and, separately, for
"guess at this, try it, and, if that does not work, guess at
something else" approaches.  Unless those concerns have
disappeared, the potential for combinatorial explosion when
lower-casing characters that may lie outside the ASCII
repertoire is truly impressive.

[3] In case it isn't clear, while "föoēy" has only one NFC and
one NFD form as a string, if unnormalized forms are allowed, it
would imply the potential for up to four lookups, not two.
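
The count can be checked with a short Python sketch.  The
unnormalized_variants() helper is hypothetical and deliberately
simple: it only chooses, character by character, between the
precomposed and the fully decomposed form, and ignores other
sources of variation.

```python
import itertools
import unicodedata


def unnormalized_variants(text):
    """Return every string obtained by independently keeping each character
    of the NFC form precomposed or replacing it with its NFD decomposition."""
    choices = []
    for ch in unicodedata.normalize("NFC", text):
        # A set collapses to one entry when the character has no decomposition.
        choices.append({ch, unicodedata.normalize("NFD", ch)})
    return {"".join(combo) for combo in itertools.product(*choices)}


variants = unnormalized_variants("f\u00F6o\u0113y")

for v in sorted(variants, key=len):
    print(" ".join(f"U+{ord(c):04X}" for c in v))
```

Each additional composable character doubles the number of
candidates, which is the growth referred to in [2].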

[4] Nit: "encoded in UTF-8" is not actually a sufficient
statement.  The correct statement would be similar to "Unicode
encoded in UTF-8".