ietf-822

Re: 2 MIME questions re: message/rfc822

2004-11-06 18:47:26

On Fri November 5 2004 13:28, ned+ietf-822@mrochek.com wrote:

What's really needed is a generic way of computing a hash of a MIME object
that takes as many of these issues as possible into account. I've had the
specification of such a thing on my to-do list literally for years but I
never seem to find the time to finish writing it up.

Basically what you want to do is define a hash methodology that computes
separate hashes on leaf nodes in the MIME object and then combines those
separate hashes along with hashes of canonicalized headers and the MIME
structure itself in a specific way to arrive at a single result. The
advantages of this approach are numerous:

1. The "combines" part is likely to pose some problems because there
    are some conflicting characteristics which one would like to have:

    * it should be easy to combine and separate the hashes

Only hash combination is needed, not separation.

    * the result should be sensitive to reordering of objects

This is quite simple.
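
A toy illustration of why (SHA-256 here purely as an example, not a choice
the eventual spec would necessarily make): feeding the child digests into
the parent hash in sequence makes the combined result depend on their order.

    import hashlib

    h1 = hashlib.sha256(b"first part").digest()
    h2 = hashlib.sha256(b"second part").digest()

    # Concatenating the child digests in order before hashing again means
    # a reordered message yields a different combined digest.
    assert hashlib.sha256(h1 + h2).digest() != hashlib.sha256(h2 + h1).digest()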

    * it should be possible to determine if some object and its
       corresponding hash have been removed (or if some content
       has been added)

This one is next to impossible to do IMO, and falls far outside
the goals I would have for such a mechanism.

    * for large objects, a hash represents data reduction; as the
       ratio of object size to hash length increases, there is a
       reduction in sensitivity (ability to detect changes). The act
       of combining hashes should not result in additional decrease
       in sensitivity

Sufficiently strong hashes make this a nonissue, I believe.

    * for small objects, a hash may be larger than the object itself;
       as a message is split into a greater number of smaller objects
       which are individually hashed, the total size of the hashes
       grows, resulting in undesirable increased overhead

You're assuming that all the separate hashes are present; this is not how the
scheme works. Rather, the hashes are themselves hashed.
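
To make that concrete, here is a minimal sketch of the shape such a scheme
could take, using Python's standard email and hashlib modules purely for
illustration (the real specification would pin canonicalization down far
more carefully): each part contributes a digest of its crudely canonicalized
headers, a leaf contributes a digest of its decoded content, and a
multipart's digest is computed over its children's digests in order, so the
final result is a single digest no matter how many parts there are.

    import hashlib
    from email.message import Message

    def canonical_headers(part: Message) -> bytes:
        # Crude placeholder canonicalization: lowercase the field name,
        # unfold and collapse whitespace in the value.
        lines = []
        for name, value in part.items():
            lines.append(f"{name.lower()}:{' '.join(value.split())}")
        return ("\n".join(lines) + "\n").encode("utf-8")

    def mime_digest(part: Message) -> bytes:
        h = hashlib.sha256()
        h.update(canonical_headers(part))
        if part.is_multipart():
            # The children's digests are themselves hashed, in order, so
            # the result stays one digest long and is order-sensitive.
            for child in part.get_payload():
                h.update(mime_digest(child))
        else:
            # Hash the decoded leaf content, not its wire form.
            h.update(part.get_payload(decode=True) or b"")
        return h.digest()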

2. Canonicalization of headers presents several problems:
    a) it presupposes that the syntax and structure of each field are known,
        which can become a snag as new fields are defined, when user-
        defined or experimental fields are used, as new protocol element
        keywords (charset names, media types and subtypes, etc.) are
        registered, etc. A similar problem exists for RFC 2047 encoded-
        words; because they can appear only in certain contexts (some
        comments, in phrases, and in unstructured fields), one needs to
        know detailed field syntax to determine if something which
        looks like an encoded-word is in fact an encoded-word.
    b) Unfolding, normalization and compression of whitespace are
        probably reasonable for structured fields, but differences in line
        folding, tabs vs. spaces, and quantity of whitespace characters
        may be significant in some instances in unstructured fields
        (Subject, Comments, Content-Description, etc.).
    c) Some objects are case-sensitive, others case-insensitive; that
        should be taken into account during canonicalization. However,
        in some instances it is simply not possible to determine whether
        some field text is a case-insensitive object or not.

I'm well aware that in general there is no way to do header canonicalization
and hashing perfectly. The question is how far is far enough.
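
One possible answer to "how far is far enough", sketched here purely as an
illustration: only fields on a short allow-list of known structured fields
get aggressive normalization, while unknown, experimental, and unstructured
fields are merely unfolded, so their syntax never has to be guessed at.

    import re

    # Hypothetical allow-list; a real spec would enumerate this carefully.
    STRUCTURED = {"content-type", "content-disposition", "mime-version"}

    def canonicalize_field(name: str, value: str) -> str:
        name = name.lower()
        # Undo line folding in every field.
        value = re.sub(r"\r?\n[ \t]+", " ", value)
        if name in STRUCTURED:
            # Collapse whitespace runs and case-fold only where the syntax
            # is known; this still glosses over case-sensitive parameter
            # values, which is part of the corner-case cost.
            value = " ".join(value.split()).lower()
        return f"{name}:{value}"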

(1) Encodings can be changed without breaking signatures. (This can help
    with handling whitespace, and it makes it possible for signatures to
    survive 8->7 conversion.)

Maybe. A change in encoding should result in a change to the
Content-Transfer-Encoding field. How would one handle that
change to the MIME-part header associated with the encoded
part, bearing in mind that once an object has been encoded, there
is no record of whether it was encoded from an original specified
as 8bit or as binary (which might or might not have been mislabeled)?

Simple: You exclude the CTE field from the hash. And yes, I know this has some
issues and corner cases, but IMO the benefits far outweigh the costs.
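
Concretely, refining the leaf hashing from the sketch above (still just an
illustration): the decoded bytes are hashed together with every header
field except Content-Transfer-Encoding, so re-encoding a part, say during
an 8bit-to-quoted-printable downgrade, leaves its digest alone.

    import hashlib
    from email.message import Message

    EXCLUDED = {"content-transfer-encoding"}  # deliberately left unhashed

    def part_digest(part: Message) -> bytes:
        h = hashlib.sha256()
        for name, value in part.items():
            if name.lower() in EXCLUDED:
                continue
            h.update(f"{name.lower()}:{' '.join(value.split())}\n".encode())
        # Decoded content hashes the same whether it travelled as 8bit,
        # base64, or quoted-printable.
        h.update(part.get_payload(decode=True) or b"")
        return h.digest()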

(2) Boundary markers can be changed without breaking signatures. (How
    to handle preamble and postamble text is an interesting side issue here.)

Would changing boundary markers not also change the MIME-part
header (boundary parameter in Content-Type field)?

Content-type is the one field that simply must be canonicalized in a way
that avoids this problem.
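
For Content-Type specifically, one obvious (and, again, merely
illustrative) canonicalization is to parse the field, drop the boundary
parameter, case-fold what is case-insensitive, and serialize the remaining
parameters in a fixed order:

    from email.message import Message

    def canonical_content_type(part: Message) -> str:
        ctype = part.get_content_type()  # type/subtype, already lowercased
        params = []
        # get_params() returns (name, value) pairs; the first entry is the
        # type/subtype itself, so skip it.
        for key, value in part.get_params(failobj=[])[1:]:
            if key.lower() == "boundary":
                continue  # the boundary text is free to change
            params.append((key.lower(), value))
        params.sort()  # fixed parameter order
        return "; ".join([ctype] + [f'{k}="{v}"' for k, v in params])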

[...]
So, is it time for me to finish the specification for this?

I suspect that 2a above might be a show-stopper.

I disagree.

Does anybody
care, and more to the point, will anybody actually implement it?

And if so, will there be multiple interoperable implementations (which
continue to interoperate as new fields, charsets, media types and
subtypes, etc. develop)?   Would such a method interoperate with
current methods?  How would the different method be indicated:
MIME-Version: 2.0?  Won't there still be problems if the original
isn't in canonical form when signed (and won't more complex
rules add to that problem)?  Won't there still be problems with
non-MIME-aware message handlers and with legacy MIME
implementations?

This has nothing to do with MIME proper and everything to do with specific
signature schemes. No changes to MIME are needed or expected for this. And yes,
I realize this approach has the same problems as the separate specification of
multipart/signed, and for the same reasons. But as I said before, some things
are too late to change.

                                Ned