APPSDIR review of draft-farrell-decade-ni-07

2012-06-05 04:43:41
Hello everybody,

[For replies, please trim the cc list, thanks!]


I have been selected as the Applications Area Directorate reviewer for this draft (for background on appsdir, please see http://trac.tools.ietf.org/area/app/trac/wiki/ApplicationsAreaDirectorate ).


Please resolve these comments along with any other Last Call comments you may receive. Please wait for direction from your document shepherd or AD before posting a new version of the draft.


Document: draft-farrell-decade-ni-07
Title: Naming Things with Hashes
Reviewer: Martin Dürst
Review Date: 2012-06-03 (written up 2012-06-04/05)
IETF Last Call Date: started 2012-06-04, ends 2012-07-02


Summary: This draft addresses a real generic need, but the current form of the draft is the result of adding more and more special cases without a clear overall view and a firm hand to separate the wheat from the chaff. This shows both in the technical issues as well as in many of the editorial issues below. This draft is not ready for publication without some serious additional work, but that work is mostly straightforward and should be easy to complete quickly.



Major design issue:

The draft defines two schemes, which differ only slightly, and mostly just gratuitously (see also editorial issues). These are the ni: and the nih: scheme. As far as I understand, they differ as follows:
                                    ni:                nih:
authority:                          optional           disallowed
ascii-compatible encoding:          base64url          base16
check digit:                        disallowed         optional
query part:                         optional           disallowed
decimal presentation of algorithm:  disallowed         possible

The usability of URIs is strongly influenced by the number of different schemes: the fewer, the better. As a somewhat made-up example, if the original URIs had been separated into httph: for HTML pages and httpi: for images, or any other arbitrary subdivision that one can envision, that would have hurt the growth and extensibility of the Web. Creating new URI schemes is occasionally necessary, and the ideas that led to this draft definitely seem to warrant a new scheme (*), but there's no reason for two schemes. [(*) I know people who would claim that the .well-known http/https thing is completely sufficient, no new scheme needed at all.]

More specifically, if the original URIs had been separated into httpm: (for machines) and httph: (for humans), the Web for sure wouldn't have grown at the speed it did (and does) grow. In practice, there are huge differences in human 'speakability' for URIs (and IRIs, for that matter); compare e.g. http://google.com with http://www.google.co.jp/#sclient=psy-ab&hl=en&site=&source=hp&q=hash&oq=hash&aq=f&aqi=g4&aql= (which I have significantly shortened to hopefully eliminate potential privacy issues), or compare the average mailto: URI with the average data: URI. However, what's important is that there never has been a strong dividing line between machine-only and human-only URIs or schemes, the division has always been very gradual. Short and mainly human-oriented URIs have of course been handled by machines, and on the other hand, very long URIs have been spoken when really necessary. "Speakability" has been maintained to some extent by scheme designers, and to some extent by "survival of the fittest" (URIs that weren't very speakable (or spellable/memorizable/guessable/...), and their Web sites, might just die out slowly).

It should also be noted that the resistance against multiple URI schemes may have been low because there are so many different ways to express hashes in the draft anyway, and one more (the nih: section is the last one before the examples section) didn't seem like much of a deal anymore. But when it comes to URIs, one less is a lot better than one more.

In the above ni:/nih: distinction, nih: seems to have been added as an afterthought after realizing that reading an ni: URI aloud over the phone may be somewhat suboptimal because there is a need for repeated "upper case" - "lower case" (surely very quickly shortened to "upper" - "lower" and then to "up" - "low" or something similar). It is not a bad idea to try to make sure that IETF technology, and URIs in particular, are accessible to people with certain kinds of dyslexia. (There are indeed people who have tremendous difficulties with distinguishing upper- and lower-case letters, and this may or may not be connected with other aspects of dyslexia.) It is however totally unclear to this reviewer why this has to lead to two different URI schemes with other gratuitous differences.

Finding a solution is rather easy (of course, other solutions may also be possible): Merge the schemes, so that authority, check digit, and query part are all optional (an authority part and/or a query part may very well be very useful in human communication, and a check digit won't hurt when transmitted electronically) and the decimal presentation of the algorithm is always allowed, and use base32 (http://tools.ietf.org/html/rfc4648) as the encoding. This leads to a 16.6% less efficient encoding of the value part of the ni: URI, but given that other URI-related encodings, e.g. the %-encoding resulting when converting an IRI to an URI, are much less efficient, and that URI infrastructure these days can handle URIs with more than 1000 bytes, this should not be a serious problem. Also, there's a separate binary format (section 6) that is more compact already.
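
To make the size trade-off concrete, here is a minimal sketch (not taken from the draft) that counts the characters needed for a SHA-256 value in each of the three encodings discussed above. The helper name encoded_lengths is made up for illustration, and padding characters are stripped on the assumption that the URI form would not carry them.

```python
import base64
import hashlib


def encoded_lengths(raw: bytes) -> dict:
    """Return the character count of each (unpadded) encoding of raw."""
    b64url = base64.urlsafe_b64encode(raw).rstrip(b"=")  # 6 bits per character
    b32 = base64.b32encode(raw).rstrip(b"=")             # 5 bits per character
    b16 = base64.b16encode(raw)                          # 4 bits per character
    return {"base64url": len(b64url), "base32": len(b32), "base16": len(b16)}


# "Hello World!" is the short-string example used in the draft.
digest = hashlib.sha256(b"Hello World!").digest()  # 32 bytes
lengths = encoded_lengths(digest)
print(lengths)
```

For a full 256-bit hash, base32 lands between the other two: longer than base64url, but case-insensitive and noticeably shorter than base16.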



(relatively) Minor technical issues:

Section 2, "When the input to the hash algorithm is a public key value": Is it absolutely clear that this will work for any and all public key values, existing and future, and not only for what's currently around? After all, as far as I understand, the concept of a public key is a fairly general one.

"Other than in the above special case where public keys are used, we do not specify the hash function input here. Other specifications are expected to define this.": Do you really expect that to happen? Wouldn't it be better to limit variability here as much as possible, and to use media types to identify different kinds of data? This would also work for public keys: If there's a MIME media type for a SubjectPublicKeyInfo, then the fact that this media type is the preferred way to transfer a public key becomes an application convention rather than a special case in the spec. If a better way (or just another way) to encode/transfer public keys became popular at a later date, there would be no need to change the spec.

Related, in Section 3:
   The "val" field MUST contain the output of base64url encoding the
   result of applying the hash function ("alg") to its defined input,
   which defaults to the object bytes that are expected to be returned
   when the URI is dereferenced.
How do I know whether the default applies or not? The URI doesn't tell you. Deducing from context is a bad idea.

Section 3: "Thus to ensure interoperability, implementations SHOULD NOT generate URIs that employ URI character escaping": This is wrong and needs to be fixed. Characters such as "&", "=", "#", and "%", as well as ASCII characters not allowed in URIs and non-ASCII characters MUST be %-encoded if they appear in query parameter values in URIs (or in query parameter tags, which is however less likely). It would be better if the spec here deferred to the URI spec rather than trying to come up with its own rules.
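
As a small illustration of why %-encoding cannot simply be avoided, the following sketch uses Python's standard urllib.parse to encode a query parameter value containing the characters listed above. The helper name encode_query_value and the sample value are made up for this example; they do not come from the draft.

```python
from urllib.parse import quote, unquote


def encode_query_value(value: str) -> str:
    """Percent-encode everything except RFC 3986 unreserved characters."""
    return quote(value, safe="")


# A made-up value containing "&", "=", "#", and "%".
raw_value = "a&b=c#d%"
encoded = encode_query_value(raw_value)
print(encoded)

# Decoding restores the original value unchanged.
assert unquote(encoded) == raw_value
```

Left unencoded, the "&" and "=" would be read as parameter delimiters and the "#" would start a fragment, so the value would not survive a round trip.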

Section 3: "The Named Information URI adapts the URI definition from the URI Generic Syntax [RFC3986].": This sounds as if this were a voluntary decision (and the text should be changed to avoid such an impression), but if you don't conform to RFC 3986 syntax, you're not an URI. This is the first time I have seen an URI scheme definition starting explicitly with the top ABNF rule from RFC 3986 (http://tools.ietf.org/html/rfc3986#appendix-A). This is completely unnecessary. Just make sure your production conforms to the generic URI syntax, and mention all the ABNF rules from RFC3986 that you use.

Also, using the "URI" production from RFC 3986, and then silently dropping the #fragment part, is technically wrong. Scheme definitions have nothing to do with the fragment (including the question of whether there's a fragment or not; the semantics of fragments are defined by the MIME media type that you get when you resolve). This may not be completely clear in RFC 4395, but the IRI WG is working on an update of RFC 4395 where this will be made clearer (see also http://trac.tools.ietf.org/wg/iri/trac/ticket/126; thanks for giving me a chance to remember that I had to create a new issue in the tracker for this :-).

Section 3, ABNF:
            ni-hier-part   = "//" authority path-algval
                             / path-algval
This gives you ni://example.com/sha-256;f4OxZX_x_FO5... (//authority/) and ni:/sha-256;f4OxZX_x_FO5... (one slash only), but the examples show ni:///sha-256;f4OxZX_x_FO5... (three slashes). It looks like the ABNF you want is:
            ni-hier-part   = "//" authority path-algval
                           / "//" path-algval
(aligning "=" and "/" helps!)
or more simply:
            ni-hier-part   = "//" [authority] path-algval
or even more simply:
            ni-hier-part   = "//" authority path-algval
because authority can be empty; let's show this:
   authority     = [ userinfo "@" ] host [ ":" port ]
If we can show that host can be empty, we're done:
   host          = IP-literal / IPv4address / reg-name
If we can show that any one of these can be empty, we're done, let's pick reg-name:
   reg-name      = *( unreserved / pct-encoded / sub-delims )
* means "zero or more", thus reg-name can be empty. QED.
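
The same argument can be checked mechanically. The sketch below transcribes the simplified rule (ni-hier-part = "//" authority path-algval) into a regular expression. It is only an approximation under stated assumptions: authority is modelled by reg-name alone (no userinfo, port, or IP literals), alg and val are assumed to be one or more unreserved characters, and the value "f4OxZX_x_FO5" is the truncated prefix quoted above, not a complete digest. The function name parse_ni is made up.

```python
import re

UNRESERVED = r"[A-Za-z0-9\-._~]"
SUB_DELIMS = r"[!$&'()*+,;=]"
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# reg-name = *( unreserved / pct-encoded / sub-delims ), so it may be empty.
REG_NAME = rf"(?:{UNRESERVED}|{PCT_ENCODED}|{SUB_DELIMS})*"
# Assumption: alg and val are each 1*unreserved.
ALG_VAL = rf"{UNRESERVED}+;{UNRESERVED}+"

NI_URI = re.compile(rf"ni://(?P<authority>{REG_NAME})/(?P<algval>{ALG_VAL})")


def parse_ni(uri: str):
    """Return (authority, alg-val) if uri matches the sketch grammar, else None."""
    match = NI_URI.fullmatch(uri)
    if match is None:
        return None
    return match.group("authority"), match.group("algval")


print(parse_ni("ni://example.com/sha-256;f4OxZX_x_FO5"))  # authority present
print(parse_ni("ni:///sha-256;f4OxZX_x_FO5"))             # empty authority
print(parse_ni("ni:/sha-256;f4OxZX_x_FO5"))               # one slash only
```

With this single rule, the three-slash form from the examples is accepted with an empty authority, while the one-slash form is rejected.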

Section 4:
   The HTTP(S) mapping MAY be used in any context where clients without
   support for ni URIs are needed without loss of interoperability or
   functionality.
What is meant by "support for ni"? There's nowhere in the spec where this is explained clearly. If I were a browser maker, or writing an URI library,..., what would I do to support the ni scheme? The only thing I have come up with is to convert ni to the .well-known format, then use HTTP(S). In that case, the above text seems wrong, as it says that .well-known is used when there's no support for ni, not in order to support ni.
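
To show what such "support" might amount to in a URI library, here is a rough sketch of the conversion. The target layout /.well-known/ni/<alg>/<val> is an assumption made for this example (the review only refers to "the .well-known format"), the function name ni_to_well_known is made up, and "f4OxZX_x_FO5" is again the truncated value from the examples rather than a complete digest.

```python
from urllib.parse import urlsplit


def ni_to_well_known(ni_uri: str, scheme: str = "http") -> str:
    """Map an ni URI with an authority to an HTTP(S) URL (assumed layout)."""
    parts = urlsplit(ni_uri)
    if parts.scheme != "ni":
        raise ValueError("not an ni URI")
    if not parts.netloc:
        raise ValueError("no authority, so there is no server to ask")
    alg, sep, val = parts.path.lstrip("/").partition(";")
    if not (sep and alg and val):
        raise ValueError("path is not of the form /alg;val")
    # Assumed layout: /.well-known/ni/<alg>/<val>
    url = f"{scheme}://{parts.netloc}/.well-known/ni/{alg}/{val}"
    if parts.query:
        url += "?" + parts.query
    return url


print(ni_to_well_known("ni://example.com/sha-256;f4OxZX_x_FO5"))
```

If this is all that "support for ni" means, then the mapping is the way to support ni, not a fallback for clients that lack it.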

Section 5: This defines an "URL segment format". It seems to be limited to path components in HTTP URIs. What if I want to use this in a query part, or maybe even as a fragment identifier? What if I want to use this as a path component in an FTP URI? Or in some other scheme? It would be better to define the alg-val (see next point) part as such (before the other things), with an explanation along the following lines: "This is defined here both for use in other sections of this document as well as for use in other places where it may be helpful, such as HTTP URI path segments,..."

Section 5 (and Section 3): "To do this one simply uses the "alg;val" production": There is no "alg;val" production. Please change to "To do this one simply uses the <alg-val> production" and fix the ABNF in section 3 to
            path-algval = "/" alg-val
            alg-val     = alg ";" val
It's probably even better to fold this in with the changes to ni-hier-part, resulting e.g. in:
            ni-hier-part   = "//" authority "/" alg-val
            alg-val     = alg ";" val

Section 9.4: Status can be 'empty' or 'deprecated'. I suggest to replace 'empty' with something positive, such as 'valid' or 'active'. This will help people who go to the IANA page and start to ask "well, it doesn't have a status, what does that mean". Also, I strongly suggest to add an additional status 'reserved', and remove the current "Reserved" hash name string from the entries with IDs 0 and 32.

Section 9.4: "The Suite ID value 32 is reserved for compatibility with ORCHIDs [RFC4843].": How will compatibility be kept for future changes/additions in ORCHID?



Major editorial issues:

Title and abstract (and the spec itself) use the wording "Naming Things". While in a security context, it may be that there is an implicit assumption that there are only digital things, in a wider context, this is of course not true. Research on the Internet of Things and efforts such as the Semantic Web/Linked Data try to deal with things in the real world. People in these areas will be confused by the title, abstract, and text, unless you can show (me and) them an ni: hash for a person, an apple, a building, or an elephant. Therefore, while it may be possible to keep the catchy title, the abstract has to be fixed to avoid such misunderstandings, e.g. by changing "to identify a thing" to "to identify a digital object" or some such in the abstract, and likewise in the main text of the spec.

"Human-speakable" (e.g. ), "human-readable" (e.g. section title of section 7), and "for humans" (e.g. section title of section 9.2): These terms are used throughout the spec, but are imprecise and confusing. First, there's the problem of interpreting "for humans" in the sense of the previous paragraph, which of course has to be fixed. But the main problem is that none of the "ni:" URIs are "non-human-readable" or "non-human-speakable". Reading them aloud is only somewhat more tedious, but not at all impossible. And because the value part of the nih: form is 50% longer, and people quickly develop conventions for shortening things such as "upper case" and "lower case", it's not even clear that reading aloud the nih: form will necessarily take that much less time. Therefore, I strongly recommend to change all occurrences of "Human-speakable", "human-readable", "for humans", and the like, to the more precise "more easily read out aloud by humans" or something equivalent.

Abstract and further on: "specifying URI, URL": By all URx theories (see e.g. http://www.w3.org/TR/uri-clarification/), URLs are a subset of URIs, and therefore saying that the spec specifies an URI and an URL is somewhat confusing. I'd propose using wording along the following lines: "specifying an URI scheme and a way to map these URIs to http".

Section 2, "When the input to the hash algorithm is a public key value", and example section: It took me a while to understand that the "public key" stuff was not yet another way to present a hash, and also not a way to mix in a public key to the hash in order to obtain some specific security property (I wasn't able to figure out how that would work, but draft-hallambaker-decade-ni-params contains something similar involving digital signatures and a public key). The document would be much easier to understand if there was a section e.g. entitled "Forms of input to hash", with subsections e.g. "general data", "public keys", "other stuff (not defined in this document)". As it is written, the relevant paragraphs in section 2 look like an afterthought, and it's not clear to what. Also, the example section should be fixed as follows: 1) Say upfront that there will be two examples, one for a short string and another for a public key. 2) Make sure both examples exercise all forms (the public key example seems to be pretty complete, but the "Hello World!" example seems to be incomplete). 3) Use the same form of presentation (either a table in both cases or short paragraphs in both cases).
The caption on Figure 7 is also way too unspecific.

Section 9.4: "Hash Name Algorithm Registry", and later "a new registry for hash algorithms as used in the name formats specified here": IANA will be helped tremendously if your draft comes with an easy-to-understand and unambiguous name for the new registry. "Hash Name Algorithm Registry" may be okay, but is probably not specific enough. The circumscription at the start of the section is definitely not good enough because you're not registering hash algorithms, but names of hash algorithms and their truncations.



Minor editorial issues:

Introduction: It would be good to have a general reference to hashing (for security purposes) for people not utterly familiar with the subject.

Intro: After reading the whole document, the structure of the Intro seems to make some sense, but it didn't on first reading (where it's actually more important). The main problem I was able to identify was that after a general outlook in paragraph 1, the Intro drops into a list of examples without saying what they are good for. I suggest to, after the sentence "This document specifies standard ways to do that to aid interoperability.", add a sentence along the lines: "The next few paragraphs give usage examples for the various ways to include a hash in a name or identifier as they are defined later in this document.". It may also make sense to further streamline the following paragraphs, so that it is clearer which pieces of text refer each to one of the "standard ways".

There are two instances of the term "binary presentation". Looking around, it seems that they are supposed to mean the same as "binary format". Please replace all instances of "binary presentation" with "binary format" to avoid misunderstandings and useless search time.

Section 3: "A Named Information (ni) URI consists of the following components:": It would be good to know exactly where the list ended. One way to do this would be to say "consists of the following nine components".

Section 3: "Note that while the ni names with and without an authority differ syntactically, both names refer to the same object if the digest algorithm and value are the same.": What about cases with different authority? The text seems to apply by transitivity, but this may be easy to miss for an implementer. I suggest changing to: "Note that while ni names with and without an authority, and ni names with different authorities, differ syntactically, they all refer to the same object if the digest algorithm and value are the same.".

Section 3: "Consequently no special escaping mechanism is required for the query parameter portion of ni URIs.": Does this mean "no escaping mechanism at all"? Or "nothing besides %-encoding"? Or something else? Please clarify.

Figure 3: the "=" characters of the various rules should be aligned as much as possible to make it easier to scan the productions (see http://tools.ietf.org/html/rfc3986#appendix-A for an example).

Section 3:
            unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
                ;  directly from RFC 3986, section 2.3
                ; "authority" and "pct-encoded" are also from RFC 3986
Please don't copy productions. Please don't copy half (or one-third, actually) of the productions you use, and reference the rest. Please don't say what productions you copy from where in a comment, and even less in a comment for an unrelated production. Please before the ABNF, say which productions are used from another spec.

Section 4:
   The HTTP(S) mapping MAY be used in any context where clients without
   support for ni URIs are needed without loss of interoperability or
   functionality.
This is difficult to understand. If some new functionality is proposed, it's usually a client *with* the new functionality that's needed, not one without. Also, the "without loss of interoperability or functionality" is unclear: Sure if ni isn't supported, there's a loss in interoperability. So I suggest to rewrite this as:
   The HTTP(S) mapping MAY be used in any context where clients with
   support for ni URIs are not available.
(but see also the comment in minor technical issues)

Section 6: "binary format name": Why 'name'? Why not just "binary format"? The latter is completely clear in the context of the document or together with an indication of the document; for something that can be used independently, even "binary format name" isn't enough.

Section 6: "suite ID": The word "suite" seems out of place here. In the general use of the term, it refers to "a group of things forming a unit or constituting a collection" (see http://www.merriam-webster.com/dictionary/suite). A good definition that works for the uses I'm familiar with in digital security would be "An algorithm suite is a coherent collection of cryptographic algorithms for performing operations such as signing, encryption, generating message digests, and so on." (http://fusesource.com/docs/framework/2.4/security/MsgProtect-SOAP-SpecifyAlgorithmSuite.html; disclaimer: I'm in no way a SOAP fan). The use here is not for a collection, but for a single truncated-length variant of a single hash algorithm. I seriously hope you can find a better name.

Section 6: "Note that a hash value that is truncated to 120 bits will result in the overall name being a 128-bit value which may be useful with certain use-cases.": This left me really wondering: Is there something magic to 128 bits in computer/internet security? What are the "certain use cases"? Or is this just an example to make sure the reader got the relationships, and it could have been as well "Note that a hash value that is truncated to 64 bits will result in the overall name being a 72-bit value which may be useful with certain use-cases." (or whatever other value that's registered in section 9)?
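
For readers who want to check the arithmetic, here is a minimal sketch of how I read the relationship. It assumes the binary format is a single header octet carrying the suite ID followed by the truncated hash, which is what the 120 + 8 = 128 figure implies. The function name binary_name and the suite ID value 3 are placeholders invented for this example, not values taken from the registry.

```python
import hashlib


def binary_name(data: bytes, suite_id: int, truncate_bits: int) -> bytes:
    """One header octet (assumed) followed by the left-most truncate_bits of SHA-256."""
    if truncate_bits % 8 != 0 or not 0 < truncate_bits <= 256:
        raise ValueError("truncate_bits must be a multiple of 8 between 8 and 256")
    if not 0 <= suite_id <= 255:
        raise ValueError("suite_id must fit in one octet")
    digest = hashlib.sha256(data).digest()
    return bytes([suite_id]) + digest[: truncate_bits // 8]


# Placeholder suite ID; the real assignments are in the draft's section 9.
name_128 = binary_name(b"Hello World!", suite_id=3, truncate_bits=120)
name_72 = binary_name(b"Hello World!", suite_id=3, truncate_bits=64)
print(len(name_128) * 8, len(name_72) * 8)
```

Nothing in this calculation singles out 128 bits, which is why the "certain use-cases" deserve to be named.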

Section 7: Just for the highly unfortunate case that this doesn't disappear, it would be very helpful if the presentation of this section paralleled section 3.

Section 7: "contain the ID value as a UTF-8 encoded decimal number": I'm an internationalization expert with a strong affection for UTF-8, but even for me, this should be "contain the ID value as an ASCII encoded decimal number".

Section 9: The registration templates refer to sections. This is fine for readers of the draft, but not if the template is standalone. I suggest using a format such as that at http://tools.ietf.org/html/rfc6068#section-8.1, which in draft stage may look e.g. like http://tools.ietf.org/html/draft-duerst-eai-mailto-03#section-8.1.

Section 9.3: "Assignment of Well Known URI prefix ni" and later (and elsewhere in the draft) "URI suffix": Are we dealing with a prefix or a suffix here?

Section 9.4: "This registry has five fields, the binary suite ID,...":
Better to remove the word "binary", because the actual number is decimal.

Section 9.4: "The expert SHOULD seek IETF review before approving a request to mark an entry as "deprecated." Such requests may simply take the form of a mail to the designated expert (an RFC is not required). IETF review can be achieved if the designated expert sends a mail to the IETF discussion list. At least two weeks for comments MUST be allowed thereafter before the request is approved and actioned.": I'm at a loss to see why asking the IETF at large is a SHOULD, but if it's done, then the two weeks period is a MUST.

Section 9.4: The registry initialization in Fig. 8 refers to RFC4055 many times. But RFC 4055 does in no way define SHA-256. It looks like the actual spec is http://tools.ietf.org/html/rfc4055#ref-SHA2 (National Institute of Standards and Technology (NIST), FIPS 180-2: Secure Hash Standard, 1 August 2002.) I think this should be cited, in particular because there is a "Specification Required" requirement, and this sure should mean that there is a Specification for the actual algorithm, and not just a specification that mentions some labels. So using RFC4055 as a reference could be taken as creating bad precedent.

Section 9.4: "The designated expert is responsible for ensuring that the document referenced for the hash algorithm is such that it would be acceptable were the "specification required" rule applied.": Why all this circumscription? Why not just say something like: "The designated expert is responsible for ensuring that the document referenced for the hash algorithm meets the "specification required" rule."



Nits:

Author list: Last time I heard about this, there was a general limit of 5 authors per RFC. I'm not sure whether this still exists, and what'd be needed to get around it, but I just wanted to point out that this may be a potential problem or additional work (hoops to get through).

Intro: "Since, there is no standard" -> "Since there is no standard"

Intro: "for these various purposes" -> "for these purposes" or "for various purposes" (the indefinite 'various' is incompatible with the definite 'these').

"2. Hashes are what Count" -> "2. Hashes are what Counts" (the former may look logically correct, but 'what' requires a singular verb form).

Section 2: "the left-most or most significant in network byte order N bits from the binary representation of the hash value" -> "the left-most (or most significant in network byte order) N bits from the binary representation of the hash value" or "the left-most N bits, or the N most significant bits in network byte order, from the binary representation of the hash value" (the current text is virtually unparsable).

Figure 1: The 0x notation is never explained. A short clause or phrase is all that would be needed, but it would be better if this were spelled out.

Section 3, Query Parameter separator: "The query parameter separator acts a separator between" -> "The query parameter separator acts *as* a separator between".

Section 3, Query Parameters: "A tag=value list of optional query parameters as are used with HTTP URLs" -> "A tag=value list of optional query parameters as used with HTTP URLs" (or "A tag=value list of optional query parameters as they are used with HTTP URLs").

Section 4: "the object named by the ni URI will be available at the corresponding HTTP(S) URL" -> "the object named by the ni URI will be available via the corresponding HTTP(S) URL" (via stresses the point that this should be done via (sic) redirection)

Section 4: "so there may still be reasons to use" -> "so there can still be reasons to use" (better to use can because non-normative; the document otherwise does a good job on this)

Section 10: "Note that fact that" -> "Note the fact that", or much better: "Note that".


Regards,     Martin.