ietf-822

Re: [ietf-822] Base64 encoding details

2020-09-18 09:31:38
Hi all,

it has become customary to encode text messages in base64.  This is presumably
an attempt to avoid breaking the body hashes of DKIM signatures.

The usual reason it's done is so that you don't have to fix your HTML
generator to produce shorter lines, or lines at all.

I have not noted any increase in doing this, although YMMV.

However, notwithstanding RFC 2045 saying:

    Care must be taken to use the proper octets for line breaks if base64
    encoding is applied directly to text material that has not been
    converted to canonical form.  In particular, text line breaks must be
    converted into CRLF sequences prior to base64 encoding.

I note that, apparently based on the OS where the encoding is done, text is
often encoded using LF line endings, instead of CRLF.  This list, for one, does 
so.
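
To make the difference concrete, here is a minimal sketch in Python. The helper
name canonicalize_newlines and the choice of UTF-8 are assumptions made for
illustration only; a real implementation would use the charset declared for the
part. It shows that the same text encodes to different base64 depending on
whether line breaks were converted to CRLF first:

```python
import base64


def canonicalize_newlines(text: str) -> bytes:
    """Normalize CRLF, bare CR, and bare LF line breaks to CRLF, then encode."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # UTF-8 is an assumption here; use the part's declared charset in practice.
    return normalized.replace("\n", "\r\n").encode("utf-8")


text = "Hi all,\nsecond line\n"  # LF-only line endings

lf_encoded = base64.b64encode(text.encode("utf-8"))
crlf_encoded = base64.b64encode(canonicalize_newlines(text))

# The encoded forms differ, and so do the octets a receiver gets back.
print(lf_encoded != crlf_encoded)
print(base64.b64decode(crlf_encoded))
```

Only the second form decodes to the CRLF-terminated text that RFC 2045 asks for.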

Would it make sense to note this feature explicitly in the header?  For
example, one could write:

     Content-Transfer-Encoding: base64; nocr; column-width=76

That data would allow one to reproduce the exact encoding that was signed, which
was the reason to use base64 in the first place.

Huh? Why are you decoding and then re-encoding the parts of your message, while
attempting to preserve a signature over the encoding? This entire process is
bound to fail, as there is no required canonical form for base64 output, let
alone quoted-printable.
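
As a small illustration of the lack of a canonical form, the following Python
sketch (the wrap helper is a made-up name, and the widths are arbitrary) builds
three different base64 texts for the same octets. All of them decode to
identical data, so a hash over the encoded text cannot be expected to survive a
decode and re-encode cycle:

```python
import base64


def wrap(encoded: str, width: int, eol: str) -> str:
    """Split a base64 string into lines of at most `width` characters."""
    return eol.join(encoded[i:i + width] for i in range(0, len(encoded), width))


payload = bytes(range(100))
one_line = base64.b64encode(payload).decode("ascii")

variants = [
    one_line,                    # no line breaks at all
    wrap(one_line, 76, "\r\n"),  # 76 columns, CRLF
    wrap(one_line, 60, "\n"),    # 60 columns, bare LF
]

# By default b64decode discards characters outside the base64 alphabet,
# which here means the CR and LF characters.
decoded = [base64.b64decode(v) for v in variants]
print(len(set(variants)), all(d == payload for d in decoded))
```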

And for that matter, why are you attempting to preserve DKIM signatures while
playing games with the message content? DKIM is intended to sign messages in
transit. It is not intended, or designed for, other signature applications.

All that said, if you really want to allow this sort of thing, the way to do it
isn't to try and figure out how to describe the myriad representations allowed
in our current set of encodings. That's never going to work, because people
will never stop coming up with variations you haven't thought of.

What you do instead is sign the material under the encoding, eliminating the
encoding variations from the signature. And while you're at it, you do it in
stages, so that when you're dealing with really large parts you can reuse
previously computed hashes.

More specifically, you sign a MIME message body by first hashing the unencoded
leaf body parts together with the part headers, leaving out the CTE fields and
removing the boundary= parameters from the CT headers. You then compute a Merkle
tree of the hashes corresponding to the MIME tree structure. (There are some
details missing here, and there are various security issues that have to be
handled with some care, but this is the basic idea.)

You now have a hash that will survive the sort of manipulations you seem to
want to do.
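
A rough sketch of the idea in Python, using the standard email package, might
look like the following. This is not the scheme as originally proposed, whose
details are not given here. The function names, the choice of SHA-256, the
restriction to Content-* headers, the header normalization, and the "N"/"L"
tags are all assumptions made for illustration, and none of the security
issues mentioned above are addressed:

```python
import base64
import hashlib
from email import message_from_bytes
from email.message import Message


def header_bytes(part: Message) -> bytes:
    """Serialize Content-* headers, skipping CTE and the boundary= parameter."""
    lines = []
    for name, value in part.items():
        lname = name.lower()
        if not lname.startswith("content-") or lname == "content-transfer-encoding":
            continue
        if lname == "content-type":
            params = [(k.lower(), v)
                      for k, v in part.get_params(header="content-type")[1:]
                      if k.lower() != "boundary"]
            params.sort(key=lambda kv: kv[0])
            value = part.get_content_type() + "".join(f";{k}={v}" for k, v in params)
        lines.append(f"{lname}:{' '.join(str(value).split())}")
    return "\n".join(sorted(lines)).encode("utf-8", "replace")


def merkle_hash(part: Message) -> bytes:
    """Hash a MIME part; interior nodes hash the hashes of their children."""
    h = hashlib.sha256()
    h.update(hashlib.sha256(header_bytes(part)).digest())
    if part.is_multipart():
        h.update(b"N")  # interior node
        for child in part.get_payload():
            h.update(merkle_hash(child))
    else:
        h.update(b"L")  # leaf: hash the body with its transfer encoding undone
        body = part.get_payload(decode=True) or b""
        h.update(hashlib.sha256(body).digest())
    return h.digest()


def build(boundary: str, cte: str, body: bytes) -> Message:
    """Build a one-part multipart message for the demonstration."""
    raw = (
        f'Content-Type: multipart/mixed; boundary="{boundary}"\r\n\r\n'
        f"--{boundary}\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        f"Content-Transfer-Encoding: {cte}\r\n\r\n"
    ).encode("ascii") + body + f"\r\n--{boundary}--\r\n".encode("ascii")
    return message_from_bytes(raw)


text = b"Hello, world!\r\n"
as_base64 = build("AAA", "base64", base64.b64encode(text))
as_7bit = build("BBB", "7bit", text)
tampered = build("AAA", "7bit", b"Hello, World!\r\n")

print(merkle_hash(as_base64) == merkle_hash(as_7bit))
print(merkle_hash(as_base64) == merkle_hash(tampered))
```

The first two messages differ in boundary string and transfer encoding but
carry the same content, so their hashes should agree, while the third should
not match. Because each subtree has its own hash, the hash of a large
unchanged part can be cached and reused.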

Nick Shelness and I proposed this scheme more than 20 years ago, and I doubt we
were the first to come up with it - given the tree structure of MIME, a Merkle
tree is an obvious fit. The problem with it is the benefits have never been
seen to outweigh the costs. Sure, there are cases where not having to
compute the hash is a real win, or not having to store the original message is
a win, but given hardware advances it never seems to be enough.

Thoughts?

First and foremost, changing the syntax of the CTE header in an incompatible
way, as this proposal does, would cause no end of trouble.
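
As one example of the kind of trouble to expect, consider an existing parser
that compares the field value against the known encoding names. The sketch
below uses Python's standard email package with its default parsing policy. The
behavior shown is what that package appears to do as far as I know, and other
implementations may react differently:

```python
from email import message_from_bytes

BODY = b"SGVsbG8sIHdvcmxkIQ0K\r\n"  # base64 of b"Hello, world!\r\n"

plain = message_from_bytes(
    b"Content-Type: text/plain\r\n"
    b"Content-Transfer-Encoding: base64\r\n\r\n" + BODY
)
extended = message_from_bytes(
    b"Content-Type: text/plain\r\n"
    b"Content-Transfer-Encoding: base64; nocr; column-width=76\r\n\r\n" + BODY
)

# With the plain field the body is decoded. With the extended field the
# value no longer matches "base64", so the body is expected to come back
# without being decoded.
print(plain.get_payload(decode=True))
print(extended.get_payload(decode=True))
```

A reader using such software would see raw base64 text instead of the message.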

As far as line terminators go, the standard is clear. If you don't want to
follow it you're depending on other implementations' tolerance of your crap.

Tolerating this sort of crap is fairly easy, which of course is why folks can
get away with it enough of the time that they aren't forced to fix it. And
tolerating this sort of crap isn't made any easier by providing a means
of announcing it. So there's no benefit to your scheme there.

So even if this proposal were modified to, say:

  Content-transfer-encoding-notes: crap=nocr; column-width=76

all it really does is introduce possible silly states.

Of course nothing prevents you from doing all this locally. If you want to take
messages apart and store only the pieces, nothing prevents you from writing
down the format details of the encodings that were used so you can try and
reproduce them later. You aren't really helped at all by asking the people
generating the encodings to produce these sorts of labels.
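
A minimal sketch of such local bookkeeping, in Python, could look like this.
Base64Layout, describe_layout, and reencode are hypothetical names, and the
sketch assumes a regular layout (one line width, one kind of line ending), which
real messages are not obliged to have. That is why the result is compared
against the original before the original is thrown away:

```python
import base64
from dataclasses import dataclass


@dataclass(frozen=True)
class Base64Layout:
    line_width: int     # length of the first encoded line, 0 if empty
    eol: bytes          # line terminator seen in the original
    trailing_eol: bool  # whether the last line was terminated


def describe_layout(encoded: bytes) -> Base64Layout:
    """Record the line width and line ending used by an encoded body."""
    eol = b"\r\n" if b"\r\n" in encoded else b"\n"
    lines = encoded.split(eol)
    trailing = len(lines) > 1 and lines[-1] == b""
    if trailing:
        lines = lines[:-1]
    return Base64Layout(len(lines[0]), eol, trailing)


def reencode(data: bytes, layout: Base64Layout) -> bytes:
    """Try to reproduce the original encoding from the decoded data."""
    flat = base64.b64encode(data)
    width = layout.line_width or len(flat) or 1
    out = layout.eol.join(flat[i:i + width] for i in range(0, len(flat), width))
    return out + layout.eol if layout.trailing_eol else out


data = bytes(range(100))
flat = base64.b64encode(data)
original = b"\n".join(flat[i:i + 60] for i in range(0, len(flat), 60)) + b"\n"

layout = describe_layout(original)
rebuilt = reencode(base64.b64decode(original), layout)

# Only discard the original encoding if it can be reproduced exactly.
print(layout, rebuilt == original)
```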

                                Ned

_______________________________________________
ietf-822 mailing list
ietf-822(_at_)ietf(_dot_)org
https://www.ietf.org/mailman/listinfo/ietf-822