Re: Content-Canonicalization: crlf?

I am certainly not advocating sending anything but canonical CRLF line
terminated text on the wire.  I've been around enough to have seen this
discussion before and implemented enough MIME stuff to know that it is an
impractical and useless thing to do and will lead to each MUA having to
implement a multitude of text formats.


Then I don't understand what you are proposing. Unless you are proposing the
addition of some sort of indicator in the on-wire message format this is
entirely outside of our scope here.

And if you are proposing the insertion of such a field, then I don't know what
it's for unless you also define it in such a way as to have some sort of
interaction with the actual content. As a receiver of a given object, I could
not care less what the transmitter had to do in order to get it into canonical
form, and I certainly would not trust such information to tell me how to deal
with an object.

The issue I was trying to address is how a MUA knows whether or not it
should convert CRLF into the local line convention for a given MIME
leaf part.  An example of where a problem would occur with the current
practice would be some type under application which is textual data that
should be converted to the local line format.  One example that comes to
mind might be some scripting language that is locally executed (security
implications aside) as well as edited.


MUAs have to know how to convert local material of a given type into whatever
canonical format is used for that material and vice versa. This is something
you have to "just know" -- there is no way a header field can provide this
information, since it cannot exist before the object exists! Similarly, there
is no way the sender can "just know" what the receiver should do, since the
receiver's notion of a local form for a given type may be (and often is)
completely different from that used by the sender.
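
As a rough illustration of what "just knowing" looks like on the receiving
side, here is a minimal Python sketch. The table, the function names, and the
choice of LF as the local line end are all assumptions made up for this
example; a different receiver would have a different table, which is exactly
the point.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Convert canonical CRLF line breaks to an assumed LF-based local form."""
    return data.replace(b"\r\n", b"\n")


def identity(data: bytes) -> bytes:
    """Leave the canonical form untouched (e.g. for opaque binary types)."""
    return data


# Hypothetical table owned by the receiver: it maps a media type (or a whole
# top-level type) to the conversion this particular system wants to apply.
LOCAL_CONVERTERS = {
    "text": crlf_to_lf,
    "message/rfc822": crlf_to_lf,
}


def to_local_form(media_type: str, canonical: bytes) -> bytes:
    """Pick the conversion from the receiver's own table, not from the sender."""
    full = media_type.lower()
    major = full.split("/", 1)[0]
    convert = LOCAL_CONVERTERS.get(full) or LOCAL_CONVERTERS.get(major) or identity
    return convert(canonical)
```

Nothing in this lookup depends on what the sender did to reach canonical
form; it depends only on the media type and on this receiver's local
conventions.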

To be concrete, the sender of type application/script-z follows Appendix G
and first creates the local form, a plain text file with script code in it.
Then it converts the line ends from the local format to CRLF, applies B64
encoding and sends it.  The receiver removes the B64 encoding and then does
not know whether or not to convert the CRLFs to the local line convention.


I understand the problem. What I do not understand is how adding such a header
provides any sort of solution whatsoever. Let's say the sender canonicalized
the object before sending it. Fine. Does this tell me anything? No it doesn't,
since whether or not I have to convert to local form depends on what my local
form is, not what the sender's was.

If what you're proposing is that the sender provide a suggestion to the
receiver as to whether or not conversion from canonical form will be needed,
then the information is useless since the sender has no way of knowing what
the receiver's local form is.

Finally, if what you're proposing is an indicator of whether or not a given
object has been canonicalized to CRLF format, then all this does is introduce
redundant information, since the requirements of a particular media type
dictate whether or not CRLFs need to be present and what they mean if they
are. All this does is introduce a silly state where the field might
possibly be different from what the receiver knows should be the case.

My suggestion for Content-Canonicalization was that the field have only two
values: "crlf" or none (meaning binary) to indicate whether or not
canonicalization has been applied.  Maybe other canonicalizations could be
added later if one becomes clear for a set of content types, but it's
probably best to ignore that completely right now.


OK -- let's suppose I implement this on VMS. I send out two text files. The
first happens to be in variable length record format, so I canonicalize
it and say so. The second is a stream file, already in CRLF format, so I
say "no, I did not canonicalize it".

How is this helpful to the recipient of the material?

I could go through the other possible uses of such a facility, but surely
you get the idea.

Larry's message suggested another possibility -- that the indication that
canonicalization has been applied is implied by the content-type.  This
does seem reasonable and it seems to be the current practice, though I
haven't found any place that this is explicitly stated.


The canonicalization rules required are implicit in the definition of
the content-type format itself. When you talk about this operation or
that operation you inevitably end up assuming that certain local forms
are being used.

This is the essential problem with all such prose -- dictating a transformation
from local to canonical form requires a statement of what the local form is.
And this we do not know, plus it doesn't belong in a standards document anyway.
A Best Current Practices document, maybe, but not a standard.

Appendix G only describes the encoding process and not the decoding process
where this becomes an issue.


First of all, use of the term "encoding" is inappropriate here. The process
may include encoding as one of its steps, but it isn't the only thing that
is done.

Second, to use your terminology, the "decoding" process is simply the inverse
of the "encoding" process. What could be simpler?
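
A minimal Python sketch of that inverse relationship, for a type whose
canonical form uses CRLF line breaks. The function names are made up for
illustration, and the local form is assumed to use bare LF line ends with no
stray CRs; another system would substitute its own local convention on each
side.

```python
import base64

LOCAL_EOL = b"\n"  # assumption: an LF-based local form; other systems differ


def to_wire(local_bytes: bytes) -> bytes:
    """Sender: local form -> canonical form (CRLF) -> base64."""
    canonical = local_bytes.replace(LOCAL_EOL, b"\r\n")
    return base64.encodebytes(canonical)


def from_wire(wire_bytes: bytes) -> bytes:
    """Receiver: base64 -> canonical form -> this system's local form."""
    canonical = base64.decodebytes(wire_bytes)
    return canonical.replace(b"\r\n", LOCAL_EOL)
```

Each step on the receiving side undoes the corresponding step on the sending
side, in reverse order. The only thing the two ends share is the canonical
form in the middle.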

I have no problem with stating this explicitly, and I have added prose
along these lines to the current MIME drafts.

If canonicalization is implied by the content
type, it is not stated anywhere.


The process you are talking about is the inverse of an existing process, and
the fact that how canonicalization is done is implied by the content type is
mentioned, I believe.

For example, the description of
content-type text doesn't say that its canonical representation has CRLF
line endings.


I beg to differ. This is stated, not once but several times. Quoting
directly from MIME part 2:

  The canonical form of any MIME text type MUST represent a line break as a CRLF
  sequence.  Similarly, any occurrence of CRLF in text MUST represent a line
  break.  Use of CR and LF outside of line break sequences is also forbidden.

I don't think this could be any clearer.
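
For what it's worth, the rule is mechanical enough to check in a few lines.
The following Python sketch (the function name is hypothetical) tests whether
a decoded text body contains any CR or LF outside of a CRLF pair:

```python
import re

# Matches a CR that is not followed by LF, or an LF that is not preceded by CR.
_BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")


def is_canonical_text(body: bytes) -> bool:
    """Return True if every CR and LF in body is part of a CRLF sequence."""
    return _BARE_CR_OR_LF.search(body) is None
```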

Right now there are differences in implementations.  Munpack
converts CRLF for type text/* only, and Pine converts for type text/* and
message/rfc822.  It seems to me some text that describes the decoding
process and a statement that says text/* should always have CRLF line
endings would be a good thing.


Sure there are differences, but the differences are in the local form
that is being supported, not in the canonical representation. Munpack
is apparently aiming for the situation where message/rfc822 is treated
as regular text, whereas Pine is aiming for the situation where the
message will then be handled as a message in the local mailbox format. In
fact I would not be at all surprised if the rules change with different
versions of Pine and Munpack on different platforms -- if they don't they
certainly should!

Going on a bit further, I think there are some things about current
practice that make this issue more confusing: implementations which store
822 messages as text files usually convert them to local end of line
conventions before any MIME parsing or decoding is ever performed.  This is
true of sendmail and smail implementations on UNIX and thus true for any
POP clients talking to UNIX servers.  The result is that *any* content type
that is not B64 encoded will have all its CRLFs converted to the local
line convention.  Thus far this isn't a huge problem, but it is probably not
well understood.
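
The effect can be simulated in a few lines of Python. This is only a sketch:
store_as_local_text is a made-up stand-in for a delivery agent that rewrites
the whole message to LF line ends, not the behavior of any particular
sendmail or smail version.

```python
import base64


def store_as_local_text(message: bytes) -> bytes:
    """Simulate rewriting every CRLF to LF before any MIME parsing happens."""
    return message.replace(b"\r\n", b"\n")


payload = b"line one\r\nline two\r\n"  # canonical form of some object

# Sent without an encoding: the payload's own line ends are exposed to the
# rewrite, so the stored bytes no longer match the canonical form.
unencoded_on_disk = store_as_local_text(payload)

# Sent with base64: only the line breaks of the encoding itself are
# rewritten, and the decoder ignores those, so the payload survives intact.
encoded_on_wire = base64.encodebytes(payload).replace(b"\n", b"\r\n")
encoded_on_disk = store_as_local_text(encoded_on_wire)
recovered = base64.decodebytes(encoded_on_disk)
```

Here unencoded_on_disk has lost its CRLFs, while recovered is byte-for-byte
the original payload.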


Actually, I think it is very well understood, even including its implications
in regards to binary MIME.

There is even an explicit discussion of how the conversion process may need
to be short-circuited to deal with these and similar issues in the
canonicalization model section.

Also, I believe it is possible that implementors of some
content types may try to take advantage of this.  For example if
application/script-z mentioned above were to not use B64 encoding and leave
its line endings exposed, they would get converted to the local line
format.  This is probably somewhat perverse, but I think it may be assumed
by new MIME implementors that look at how things currently work and don't
think about the spec carefully.

The use of security multiparts (or encryption of any MIME data) is the
thing that first brought this issue to my attention because there may be no
end of line canonicalization on the whole MIME-gram before the MIME
decoding.


I don't know what this means. The security multiparts and MOSS documents
state quite explicitly that they deal with material in canonical format.

I believe not performing conversion to the local format until
MIME parsing is complete may actually be correct, but it is different from
what mostly happens with UNIX implementations today.  The fact that Pine
currently converts message/rfc822 to local line endings seems to be an
emulation of what currently happens on UNIX implementations and may be
wrong.  (I actually wrote the first version of this in Pine, so it could be
my fault that message/rfc822 is converted -- it's been too long and I can't
remember.)

So, I do think something needs to be clarified (aside from my own messages).


Well, I agree to the extent that what you call "decoding" needs to be
called out as the inverse of "encoding". I always thought this was
obvious, but a clear statement of it could not hurt.

But that's as far as I go. Much of what you see as clarification I see as
obscuring the issues. The bottom line is that MIME plus the collection of
defined types presents a single, consistent format for material. Extensive
discussion of the various shortcuts that are used in dealing with such material
may be both interesting and useful, but has no place in the standard itself.

                                Ned