Re: Content-MD5

I have been trying to write a program to generate Content-MD5 headers
for MIME objects, and am having some difficulty interpreting RFC 1864.

What it says there is that I am to compute the MD5 algorithm on "the
canonical form of the MIME entity's object", which means the form
before any Content-Transfer-Encoding (or after decoding same, if at the
receiving end). So far so good.

It then says:

"For textual data, this means the MD5 algorithm must be computed on data
in which the canonical form for newlines applies, that is, in which each
newline is represented by a CR-LF pair."

But what is textual data?


The basic rule I use is that anything under the top-level text type, or
anything that's encoded using 7bit or 8bit, is textual. (The full set of rules
I use is actually a lot more complex, but this is to deal with operating
systems that support more complex file organizations than simple streams.)

This is basically just common sense: Agents routinely mess with
the line terminators of text subtypes and things encoded as 7bit or
8bit, so these are the cases that need to be canonicalized.
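
As a rough illustration of that rule, here is a minimal Python sketch. The
function names are made up for this example, and treating a bare CR as a line
terminator is my assumption about the local newline convention, not something
RFC 1864 spells out. The header value is the base64 encoding of the 128-bit
MD5 digest, as RFC 1864 specifies.

```python
import base64
import hashlib
import re


def is_textual(content_type: str, cte: str) -> bool:
    """Rule of thumb: top-level type 'text', or a 7bit/8bit transfer encoding."""
    major = content_type.split("/", 1)[0].strip().lower()
    return major == "text" or cte.strip().lower() in ("7bit", "8bit")


def canonicalize_newlines(data: bytes) -> bytes:
    # Map CRLF, bare LF, and bare CR to CRLF. Handling bare CR is an
    # assumption; adjust this to whatever the local convention is.
    return re.sub(rb"\r\n|\r|\n", b"\r\n", data)


def content_md5(data: bytes, content_type: str, cte: str) -> str:
    """Return a Content-MD5 value for the unencoded object in `data`."""
    if is_textual(content_type, cte):
        data = canonicalize_newlines(data)
    # Anything else is hashed byte for byte, exactly as it is.
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
```

With this, a text/plain object gives the same value whether the local copy
uses LF or CRLF, while a binary object sent as base64 does not.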

Now I can see that Content-Type:
text/plain is textual, and doubtless text/html likewise. And
application/some-binary-executable is clearly not textual (and arbitrary
changes of CRLF to LF, or whatever the local notation demanded, would be
disastrous).

But what about application/postscript? That is certainly readable as
text,


Hardly. In general PostScript is NOT text. It can contain arbitrary binary
sequences, and even the parts that look to you like text can be sensitive to
what line terminators are used. (The format includes multiline byte-counted
strings.) Unless PostScript is being carried around as 7bit or 8bit text, you
have to treat it as binary.

This is described in the MIME RFCs, BTW.

and there is no special need to encode it as base64.


Sometimes there most certainly is such a need on anything short of a
binary transport.

Or image/fig
(don't know whether that is a recognised application type, but fig is
certainly a way of specifying images, and it comes as text). Or any
application/foo which the recipient might not understand, but would at
least like to check that the MD5 agrees?


In this case it would again depend on whether or not image/fig is sent
as 7bit or 8bit.

So I did some experimentation with Sun's dtmail (which has known bugs in
its Content-MD5, but at least seems to get it right for attachments). I
gave it a shell script: it decided it was text/plain, and put the CRs in
before computing the MD5. I then constructed a PostScript file (draws
a little red circle) which, of course, on my Solaris system has lines
terminated with LF only. It recognised that application/postscript was
needed, it computed the MD5 on the LF version, and then encoded it in
base64 (which seems a neat way to pass the problem on to someone else).
But is it correct?


Yes it is.

And what if I choose to leave the encoding at 7bit?


Then you risk the PostScript content being damaged during transport.

Or if I receive an application/postscript in 7bit and want to check the
MD5?


Canonicalize and see if it matches, but understand that in this case the
content may have been damaged as the document was sent (and effectively before
the Content-MD5 field was added).
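
The receiving side can be sketched along the same lines. This is only an
illustration: verify_content_md5 and its parameters are hypothetical names,
the body is assumed to have been extracted from the message already, and the
bare-CR handling is again my assumption.

```python
import base64
import hashlib
import quopri
import re


def verify_content_md5(body: bytes, content_type: str, cte: str,
                       header_value: str) -> bool:
    """Check a received body against the value of its Content-MD5 field."""
    cte = cte.strip().lower()
    # Step 1: undo the Content-Transfer-Encoding, if there is one.
    if cte == "base64":
        data = base64.b64decode(body)
    elif cte == "quoted-printable":
        data = quopri.decodestring(body)
    else:
        data = body  # 7bit, 8bit, binary: nothing to decode
    # Step 2: canonicalize newlines only for text types and 7bit/8bit bodies.
    major = content_type.split("/", 1)[0].strip().lower()
    if major == "text" or cte in ("7bit", "8bit"):
        data = re.sub(rb"\r\n|\r|\n", b"\r\n", data)
    # Step 3: the field carries the base64 of the raw 16-byte digest.
    expected = base64.b64decode(header_value.strip())
    return hashlib.md5(data).digest() == expected
```

A match on a 7bit application/postscript body only says the canonicalized
text agrees with what was hashed. It does not prove that the line terminators
inside the document survived the trip.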

There are two attachments to this message. One is that PostScript in
base64, and the other is exactly the same file without encoding (I
may have some difficulty in persuading my system to send it without
encoding,


Sounds like your system knows what it is doing.

                                Ned