Re: Content-MD5

But what is textual data?

The basic rule I use is that anything under the top-level text type, or
anything that's encoded using 7bit or 8bit, is textual. (The full set of rules
I use is actually a lot more complex, but this is to deal with operating
systems that support more complex file organizations than simple streams.)
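
As a rough illustration of the basic rule above (not the full rule set, and
not any particular agent's implementation), here is a minimal Python sketch.
The function names is_textual, to_canonical and content_md5 are made up for
this example; the only assumptions are that canonical form for text means
CRLF line terminators and that the Content-MD5 value is the base64 encoding of
the MD5 digest.

```python
import base64
import hashlib
import re


def is_textual(content_type: str, cte: str) -> bool:
    """Basic rule: top-level type is text, or the encoding is 7bit/8bit."""
    top_level = content_type.split("/", 1)[0].strip().lower()
    return top_level == "text" or cte.strip().lower() in ("7bit", "8bit")


def to_canonical(data: bytes) -> bytes:
    """Rewrite bare CR, bare LF and CRLF as CRLF (assumed canonical form)."""
    return re.sub(rb"\r\n|\r|\n", b"\r\n", data)


def content_md5(data: bytes, content_type: str, cte: str) -> str:
    """Digest textual data in canonical form, everything else as raw bytes."""
    if is_textual(content_type, cte):
        data = to_canonical(data)
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
```

Under this sketch a text/plain part gets the same digest whether the local
copy uses LF or CRLF, while application/postscript carried as base64 is
digested byte for byte.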

OK, I see that makes sense. Presumably anything sent (or to be sent) Q-P
or Base64 is computed as it decodes (whether that turns out to be
canonical or not). Anything that is sent 8bit, and then encoded into
something different en route will presumably have been computed and sent
canonically, so the encoding will be on the canonical form and should be
checked as such. The only problem I can foresee is if some intermediate
agent decides to decode the base64, then canonicalises it before passing
it onwards. That might cause an MD5 failure at the ultimate destination,
but I cannot understand why an intermediate site should think of doing
such a thing, even if it could see Content-Type text/plain (but, you never
know :-( ).


I believe this is done in anticipation of agents that can deal with 8bit but
not MIME encodings. And regardless of whether or not you or I think this is a
good thing, it apparently happens fairly often. I see messages like
"autoconverted from quoted-printable to 8bit" in message headers associated
with text parts all the time. (Such messages are especially amusing when the
message is subsequently reencoded, as sometimes happens.) Of course it is only
a problem if the text wasn't in canonical form to begin with.

I can also see a problem if my postscript with naked NLs is decoded from
base64 at the far end, and then immediately canonicalized (because that's
what the remote postscript interpreter expects).


Alas, any attempt to automatically canonicalize Postscript as if it were text
is doomed to failure. Even examination of the source to determine what
terminators to use doesn't work: Since Postscript is a general-purpose
programming language with the ability to completely replace or augment the
process of reading program data, figuring out what line terminators are "right"
for a given Postscript object is equivalent to the halting problem. And as a
practical matter real-world Postscript programs can be so obfuscatory in nature
that approximate solutions don't work all that well either.

The common practice of embedding one PostScript object inside of another also
leads to situations where different line terminators are needed at different
points in the same document. There are standards for doing this correctly
(so-called encapsulated Postscript) but in practice they often aren't
followed.

One needs to be sure that
the MD5 checking is done by the mail agents before that happens, and not
by any postscript agent. But that is as it should be.
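
To make the ordering issue concrete, here is a small Python sketch of the
scenario described above. The sample PostScript bytes and the variable names
are invented for illustration; the sketch assumes the sender computed the
digest over the raw bytes because the part was sent as base64.

```python
import base64
import hashlib


def md5_b64(data: bytes) -> str:
    """Base64 of the MD5 digest, the form carried in a Content-MD5 header."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


# Sender: PostScript with naked NLs, treated as binary and sent as base64.
original = b"%!PS\n(hello) show\nshowpage\n"
declared = md5_b64(original)
wire = base64.b64encode(original)

# Receiving mail agent: decode, then check the digest straight away.
decoded = base64.b64decode(wire)
check_before = md5_b64(decoded) == declared

# Later consumer: rewrites line terminators for its own purposes.
canonicalized = decoded.replace(b"\n", b"\r\n")
check_after = md5_b64(canonicalized) == declared
```

Here check_before is true and check_after is false, which is why the check
belongs in the mail agent, before anything downstream touches the bytes.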


Yes.

However, all this ought really to have been spelled out explicitly in the
RFC, which certainly seems to be ambiguous as written, though your
interpretation would seem to be the only one which makes sense.


Probably so.

Hardly. In general Postscript is NOT text. It can contain arbitrary binary
sequences and even the parts that look to you like text can be sensitive to
what line terminators are used. (The format includes multiline byte counted
strings.) Unless PostScript is being carried around as 7bit or 8bit text you
have to treat it as binary.

Hmmm! All the postscript binary I have ever seen has been textualized in
hex or base64 or something, and provided with a postscript procedure to
read it in. I grant that one could write postscript that included pure
binary (and the means to read it in), but I have never seen any like that.


Well, all I can say is that you've been lucky. I encounter such usage
regularly. It is especially common when dealing with Postscript that contains
large, detailed images.

This is described in the MIME RFCs, BTW.

I saw nothing relevant in RFC2046 where application/postscript is
described.


Section 4.5.2, bullet item (7) calls out the possible binary nature of
Postscript explicitly.

BTW, and off topic, there are dire warnings in there about nasty
side effects of blindly obeying postscript sent by mail. Is it the case
that anything sent as PDF rather than postscript is immune from that?


I believe PDF is supposed to be immune to this sort of nonsense; however, I've
never done a careful inventory of the PDF format like the one I did for
Postscript. And FWIW, PDF uses internal offsets heavily, which could result in
security problems unless interpreters are coded to range check offset values
carefully.
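
For what it's worth, the kind of range check meant here can be sketched in a
few lines of Python. This is a generic illustration, not code from any PDF
interpreter; read_at is a hypothetical helper name and the buffer stands in
for the file contents.

```python
def read_at(buf: bytes, offset: int, length: int) -> bytes:
    """Return buf[offset:offset+length], rejecting out-of-range requests.

    Plain slicing would silently accept a negative offset or truncate an
    over-long read, so an untrusted offset is validated first.
    """
    if offset < 0 or length < 0 or offset > len(buf) - length:
        raise ValueError(
            f"offset {offset} length {length} outside buffer of {len(buf)} bytes"
        )
    return buf[offset:offset + length]
```

An offset taken from the document is then never followed without passing
through a check of this sort.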

                                Ned