ietf-822

Re: Is 8BIT ESMTP really needed

2001-05-10 07:44:36
> In <200105081322.JAA26827@astro.cs.utk.edu>
> Keith Moore <moore@cs.utk.edu> writes:

> > actually I think it's clear (or should be) that the MD5 computation
> > should be completely independent of canonicalization.

> That was the intention of RFC 1864, but as its text is written it
> unfortunately does not deliver. It provides for all "line endings" to be
> canonicalized into CRLF before computing the hash, but it does not define
> what is or is not a "line ending". For text/* types, that is no problem,
> of course, but what about random application types?

What about them? AFAIK there has never been an application type defined which
specified that some sort of line ending canonicalization is to be performed.

If you're dealing with application data encoded as QP or base64 you can only
upconvert it to binary. If it starts out as 7bit or 8bit you can downconvert to
QP or base64, but once there it isn't going back.
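
As a rough illustration of why that conversion is one-way (the helper below is
hypothetical, not from any particular implementation), here is a Python sketch
of which content-transfer-encodings can legitimately carry a given body:

    def allowed_transfer_encodings(octets: bytes) -> list:
        # Rough sketch: binary material (bare CR or LF, NULs, over-long lines)
        # can only travel as quoted-printable or base64; only "clean" data can
        # be labelled 8bit, and only all-ASCII clean data can be labelled 7bit.
        encodings = ["quoted-printable", "base64"]
        lines = octets.split(b"\r\n")
        bare_cr_or_lf = any(b"\r" in ln or b"\n" in ln for ln in lines)
        has_nul = b"\x00" in octets
        too_long = any(len(ln) > 998 for ln in lines)
        if not (bare_cr_or_lf or has_nul or too_long):
            encodings.append("8bit")
            if all(b < 0x80 for b in octets):
                encodings.append("7bit")
        return encodings

    print(allowed_transfer_encodings(b"plain text\r\n"))
    # ['quoted-printable', 'base64', '8bit', '7bit']
    print(allowed_transfer_encodings(b"%PDF-1.4\n\x00binary..."))
    # ['quoted-printable', 'base64']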

> For some of them, the concept of "line ending" is perfectly clear. For
> others it is not (someone mentioned PDF).

PDF is pure binary material. There are no line endings in it that can be safely
canonicalized. The same is true of PostScript -- if you try to canonicalize
the line endings in PostScript, as often as not you'll break it.
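
A tiny, hypothetical illustration of why: a bare LF inside a PDF stream is
data, not a line ending, so rewriting it changes the byte count and invalidates
the declared lengths and offsets that follow:

    payload = b"A\nB\nC"                        # 5 data octets inside a PDF stream object
    mangled = payload.replace(b"\n", b"\r\n")   # naive line-ending "canonicalization"
    print(len(payload), len(mangled))           # 5 7: the declared /Length and the
                                                # xref offsets after it are now wrong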

> > that is, even if the canonicalization was done improperly, the MD5 needs
> > to be computed over the form of the body part that exists *after*
> > canonicalization and *prior* to any content-transfer-encoding.

> Yes, but the content-transfer-encoding provides a second opportunity to
> canonicalize LF into CRLF (the encoding engine is likely separate from the
> Content-MD5 engine) and so may introduce some CRLFs not appropriate for
> that application type, and strange things may then arise upon upconverting.

No it doesn't. The steps are clear: First canonicalize, then hash, then encode.
Subsequent re-encoding cannot, and in practice does not, involve
re-canonicalization.
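
A minimal Python sketch of that ordering (the names here are hypothetical; the
canonicalization step applies to text/* parts, per RFC 1864):

    import base64
    import hashlib

    def canonicalize_text(body: bytes) -> bytes:
        # For a text/* part, RFC 1864 wants line endings in canonical CRLF form
        # before the hash is taken (for application/* there is no such step,
        # which is the ambiguity argued about above).
        return (body.replace(b"\r\n", b"\n")
                    .replace(b"\r", b"\n")
                    .replace(b"\n", b"\r\n"))

    def content_md5(canonical: bytes) -> str:
        # The header value is the base64 of the MD5 digest of the canonical form.
        return base64.b64encode(hashlib.md5(canonical).digest()).decode("ascii")

    # First canonicalize, then hash, then encode:
    canonical = canonicalize_text(b"foo\nbar\n")
    header = content_md5(canonical)           # computed before any CTE
    wire_body = base64.b64encode(canonical)   # the CTE is applied last

    # An agent that later re-encodes decodes back to the identical canonical
    # octets; it has no reason to re-canonicalize, so the hash still verifies.
    assert content_md5(base64.b64decode(wire_body)) == header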

> I think the only safe way is to encode all CRLF in application types as
> =0D=0A, giving
>       foo=0Dbar=0Abax=0D=0A=CRLF
> thus not relying on the CRLF at the end of the CTE lines for any meaning.

Sure, and this is exactly what implementations do in practice.

> But encoding engines do not currently work this way :-(.

In my experience they most certainly do work this way. The only funkiness I've
ever seen with QP in practice is that sometimes you get text that's encoded
using =0D=0A=CRLF. This isn't by the book, but it doesn't break hashes and
decodes correctly, so why worry about it?
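
For what it's worth, here is a hypothetical sketch of such an encoder in
Python -- CR and LF always become =0D=0A, and soft line breaks exist only to
keep the encoded lines short:

    def qp_encode_binary(octets: bytes, max_line: int = 76) -> bytes:
        # Hypothetical binary-safe QP encoder: CR, LF, '=' and other
        # non-printable octets become =XX; '=' + CRLF soft line breaks are
        # inserted only to keep encoded lines short, so the CRLFs between
        # encoded lines carry no meaning of their own.
        out, line = [], b""
        for byte in octets:
            literal = 0x20 <= byte <= 0x7E and byte != 0x3D   # printable, not '='
            chunk = bytes([byte]) if literal else b"=%02X" % byte
            if len(line) + len(chunk) > max_line - 1:         # leave room for '='
                out.append(line + b"=\r\n")                   # soft line break
                line = b""
            line += chunk
        out.append(line)
        return b"".join(out)

    print(qp_encode_binary(b"foo\rbar\nbax\r\n"))
    # b'foo=0Dbar=0Abax=0D=0A'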

> So it seems to me that a Content-MD5 engine (assuming it has no opportunity
> to alter the document, or to influence the encoding) is forced to try to
> second-guess what the encoding is going to do (or has already done).

I have seen no need to do this in practice. I check Content-MD5 hashes all the
time, and on the rare occasions when they don't match it is due to something
silly like trailing spaces being removed from text.
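
A small illustration of that benign failure mode (hypothetical example data):
stripping trailing whitespace changes the canonical octets, so the hash no
longer matches even though the text looks the same:

    import base64
    import hashlib

    def content_md5(body: bytes) -> str:
        return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")

    original = b"line one   \r\nline two\r\n"   # trailing spaces present
    stripped = b"line one\r\nline two\r\n"      # a gateway quietly removed them
    print(content_md5(original) == content_md5(stripped))   # False: hash mismatch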

                                Ned