Comments on May draft


Hi there.  Here are Vincent Lau's and my comments on the recent
RFC-xxxx draft that Nathaniel put out.  The page numbers are from
the postscript version, but since there are three versions I'll include
a section number and some context too.

We've also followed Nathaniel's recommendation for marking comments
with nit/show-stopper/argument.

        Neil Katin

----------------------

Global comments

#1 (argument)
We feel that there has been way too much overloading of the
Content-type field.  It currently holds a class description of the
data, a data-type indicator, file names, character set descriptions,
etc.  It mixes keyword identifiers with implicit positional
semantic values.  This has essentially introduced another layer of
header parsing on top of RFC 822, with no corresponding benefit.

The content-type header should be unrolled into separate, distinct
headers to allow easier parsing, generation, and understanding of
the data involved.
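
As a rough illustration of the parsing argument, here is a minimal
Python sketch.  It assumes the header names Content-class and
Content-character-set suggested elsewhere in these comments; neither
those names nor the sample values come from the draft.  With one value
per header, an ordinary 822 header parser is all a UA needs:

```python
from email import message_from_string

# Hypothetical message using "unrolled" headers. The header names and
# values are assumptions made for illustration, not part of the draft.
RAW = (
    "Content-class: text\n"
    "Content-type: frame\n"
    "Content-character-set: US-ASCII\n"
    "\n"
    "body\n"
)


def describe(raw):
    """Read each property from its own header; no second parsing layer."""
    msg = message_from_string(raw)
    return {
        "class": msg["Content-class"],
        "type": msg["Content-type"],
        # Assumed default when the header is absent.
        "charset": msg.get("Content-character-set", "US-ASCII"),
    }
```

Calling describe(RAW) yields one dictionary entry per header, with no
positional or keyword sub-syntax to interpret inside any single value.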

----------------------

Page 7, section 2 (the content type header field).

#2 (argument)
We're trying to fathom the relationship between the content type's
type and subtype fields.  In particular, several portions of the
specification are unclear.  Are sub-types unique, or can they
be repeated under different classes with different interpretations?
For example, there might be a text processor format (Frame and
Interleaf come to mind) that has both a nearly human-readable
representation and a binary, machine-dependent version.
Would you have a text-plus/frame and a binary/frame format?

So, to word the question in a different way, what is the semantic
relationship between types, subtypes, and the data?  How should
a receiving UA treat this information?

We don't see the benefit of the type/subtype concept.  Instead,
we recommend splitting the two into a Content-class and
Content-type field.  The Content-class field is a hint
to the receiving UA as to the type of the data; it is most
useful if the Content-type is not understood.  The Content-type
field definitively identifies the data type of the body-part.

One thing that is missing from the spec is a description of
what a "content-type" is supposed to represent.  How exact
should this description be?  We recommend the following
interpretation: The purpose of the content-type is to describe
the data fully enough that the receiving UA can pick
an appropriate agent (viewer?) to deal with the data.
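
To make the proposed split concrete, here is a small Python sketch of
how a receiving UA might use the two fields.  The table contents and
agent names are hypothetical; only the idea of Content-type as the
definitive identifier and Content-class as a fallback hint comes from
the text above.

```python
# Hypothetical dispatch tables; every name here is an illustration only.
AGENT_BY_TYPE = {
    "frame": "frame-viewer",
    "troff": "troff-formatter",
}
FALLBACK_BY_CLASS = {
    "text": "plain-pager",       # at least show the characters
    "binary": "save-to-file",    # nothing useful to display
}


def pick_agent(content_class, content_type):
    """Prefer the exact data type; fall back to the class hint."""
    if content_type in AGENT_BY_TYPE:
        return AGENT_BY_TYPE[content_type]
    return FALLBACK_BY_CLASS.get(content_class, "save-to-file")
```

The class is consulted only when the type is not understood, which is
the one case where a hint is useful.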


----------------------

Page 8, the nine predefined content-types

#3 (argument)

There seems to be no difference between audio, video, image, and
binary -- if you don't understand the particular subtype, then
there is nothing you can do with the data anyway.  And, in general,
these multimedia types will probably want to "inherit" all the
normal fields for binary anyway (such as file-name).  This is
a case where the bundling of type/subtype together seems inferior
to splitting things into a class hint and an actual data type.


----------------------

Page 9, content-transfer-encoding, the compressed encoding

#4 (argument)

We guess we're going to lose this argument.  We've tried to make
it before, but without any success.  But here goes:

"Compression" should *not* be a content encoding.  It should be
a separate pass in the process (the binary class suggests
the "conversions" identifier).  It is an important concept to
have, but this is the wrong place to put it.

Why not?  

1. Because it has nothing to do with the described purpose of
content-transfer-encoding.  The beginning of section 3 says that
transfer-encoding is to render messages capable of being transported
over 7 bit links.  Compression is a completely orthogonal issue.

2. Transfer-encoding is determined by the transport path chosen.  In
particular, RFC 821 and various non-compliant MTAs define a de facto 7 bit
messaging path, limited to around 80 characters per line, etc.  This set of
constraints is completely independent of the data being represented.
However, the right compression algorithm to use is exceedingly data
dependent, and has nothing to do with the transport path picked.
Why combine these two concepts?  Let's leave them separate, and
then deal with compression as a data specific conversion.
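
The separation argued for above can be sketched in Python as two
independent passes.  This is only an illustration: zlib stands in for
whatever compression suits the data, standard base64 stands in for the
transfer encoding, and the function names are hypothetical.

```python
import base64
import zlib


def compress(data):
    """Data-specific conversion, chosen by looking at the data."""
    return zlib.compress(data)


def transfer_encode(data):
    """Path-specific encoding, chosen by looking at the transport:
    7-bit characters only, lines of at most 76 characters."""
    return base64.encodebytes(data)


def send(data, conversions):
    # Pass 1: apply whatever conversions suit this particular data.
    for convert in conversions:
        data = convert(data)
    # Pass 2: make the result safe for the transport path.
    return transfer_encode(data)


def receive(wire, inverses):
    # Undo the transfer encoding, then undo conversions in reverse order.
    data = base64.decodebytes(wire)
    for invert in reversed(inverses):
        data = invert(data)
    return data
```

For example, send(body, [compress]) followed by
receive(wire, [zlib.decompress]) returns the original body, and
changing the compression never touches transfer_encode.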


----------------------

Page 14, section 3.2, base64 transport-encoding, the end-of-line marker

#5 (argument)

The motivation for adding the end-of-line concept is clear.
However, it is not clear what interpretation a receiving
UA should put on the end-of-line indication or, more specifically,
what interpretation it should put on its absence.

Is the receiving UA prohibited from making CR/LF conversions if the
comma is not present?  If not, have we really added any semantic
information?

Personally, we don't think that the ',' convention is really compelling.
It is, in the abstract, motivated by a good goal, but we believe that
binary standards must either be transmissions between like systems
or have mutually understood, well-defined internal structures
to render information understandable.  For example, consider other
possible problems like the byte ordering of the binary file.  Therefore
simply knowing which CR/LF pairs are textual is not enough information
if you don't understand the rest of the file anyway.
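
The byte-ordering problem is easy to demonstrate.  In this short
Python sketch (the sample bytes are made up), the same four decoded
bytes yield different integers depending on which byte order the
receiver assumes:

```python
import struct

# Four bytes exactly as they would arrive after base64 decoding.
raw = b"\x00\x00\x00\x01"

# The same bytes read as a 32-bit unsigned integer in each byte order.
as_big_endian = struct.unpack(">I", raw)[0]
as_little_endian = struct.unpack("<I", raw)[0]
```

Here as_big_endian is 1 and as_little_endian is 16777216; no
end-of-line marker tells the receiver which reading was intended.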

Finally, will changing base64 encoding by adding a comma obsolete
RFC-1113, or will RFC-1113 change too?  (It would be unfortunate to have
two encodings that are different.)

----------------------

Page 16, section 4.2, "Content-description"

#6 (nit)

When we proposed this field, we had intended that it be a file name,
not a free-form text field.  If the filename attribute becomes
available to more than just the binary type, then we'll just use it.

Since this field name may be mistaken for the "comment" field
in Content-Type, we recommend either changing this field name or
stating the intention of this field clearly.


----------------------

Page 16, section 4.3, Optional content-size

#7 (nit)

"In such cases, boundary delimiters...  The size may be measured in either
bytes or lines."  This should also have "bits", right?


----------------------

Page 17, section 4.3, Optional content-size

#8 (nit to argument)

Is it legal to have more than one content size?  We foresee the wish
to mark messages with both number of bytes (for internal efficiency)
and number of lines (in order to avoid having to scan the message for
line count).

Hopefully, the answer is "yes" to this question; it would be nice if
the spec represented this.


----------------------

Page 19, 5.1, the Text content-type

#9 (show-stopper)

Character sets are currently identified for the text class of content-types.
The character set is represented as the sub-class of the "text" class,
but nowhere else are character sets identified.  There are several
problems with this approach:

Classes other than text also need to identify character sets.
For example, troff docs need this indication.  A
binary "tar" file could also use the indication.  You might even
need it for a "binary" compound doc file that has embedded strings.

While it is true that many text document systems do embody a character
set identification within the document, not all do.  Since this is
reality, I think that a separate Content-character-set field should
be allowed for all body-parts; this field would default to US-ASCII
if not present.

This proposal costs nothing, and allows a broader set of applications
to use RFC-XXXX for non-US languages.

----------------------

Page 19, 5.1, the Text content-type

#10 (argument)

This is another character set comment.  It refers to both ISO-10646
and ISO-2022.

Identifying a character set is a "good thing".  But then you have to ask
what the UA (either end-user or gateway) is going to do with this
information.

Alas, ISO-10646 and ISO-2022 are less "character sets" in the traditional
sense than they are character-set-registry-and-encoding standards.
What this means is that these two ISO standards say how to switch
between character sets (sometimes called pages).  To find out which
character sets a document actually uses, you typically have to scan
the entire document.  It would be very useful to indicate,
at the top level of a body part, which actual character sets are
needed in order to display the document; this way the software can
tell if it can successfully translate (in the case of a gateway) or
display the document.
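
To show what "scan the entire document" means in practice, here is a
Python sketch of the work a UA or gateway must do when no top-level
indication exists.  It assumes a small table of ISO-2022 style
designation escape sequences (the four commonly used for Japanese
text); the table, the function name, and the sample bytes are
illustrative only and are not taken from the draft.

```python
import re

# Assumed table of designation escape sequences, for illustration.
DESIGNATIONS = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201 Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}

# ESC, then '(' or '$', then a final byte naming the character set.
ESCAPE = re.compile(rb"\x1b[($][@BJ]")


def charsets_used(body):
    """Walk the whole body and list each designated set once,
    in order of first appearance."""
    found = []
    for match in ESCAPE.finditer(body):
        name = DESIGNATIONS.get(match.group(), "unknown")
        if name not in found:
            found.append(name)
    return found
```

A header listing the same names up front would let the software decide
whether it can translate or display the part before reading the body.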

----------------------

Page 21, 5.2, The Multipart content-type

#11 (nit to argument)

I don't think this will be controversial, but it's hard to be sure.
The end of the first paragraph in this section discourages the use
of any field that doesn't begin with "Content".  I think the intended
purpose is to hint as to which fields are legal in a multipart header;
it has the unfortunate side effect of discouraging "X-" headers;
could we soften up the last two sentences in the paragraph?


----------------------

Page 21, 5.2, paragraph that begins "The Content-type field for multipart"

#12 (argument)

We've put in a version identifier for the multipart type.  But we
didn't say how it should be interpreted.  In particular, what should
a UA do if it sees something other than "1-S" or "1-P"?  Is the
first part numeric?  Should the UA parse the value based on a '-'
character?

We think we should tighten up the spec with respect to the interpretation
of the version number (like, "interpret as a number, and it has
to be strictly upwards compatible").

To make this easier to parse, we recommend splitting the serial/parallel
indication to a separate field ("Content-Parallel", if present, would
do fine).
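
As an example of what a tightened rule could look like, here is a
Python sketch of one possible reading: a decimal version number,
optionally followed by '-' and a single letter, with S taken to mean
serial and P parallel.  That reading is our assumption; the draft does
not define it, and the function name is hypothetical.

```python
def parse_multipart_version(value):
    """Return (version_number, is_parallel) under the assumed rule:
    digits, optionally followed by '-S' (serial) or '-P' (parallel)."""
    number, dash, mode = value.strip().partition("-")
    if not (number.isascii() and number.isdigit()):
        raise ValueError("version is not a decimal number: %r" % value)
    mode = mode.upper()
    if dash and mode not in ("S", "P"):
        raise ValueError("unknown serial/parallel flag: %r" % value)
    # With no flag at all, this sketch assumes serial presentation.
    return int(number), mode == "P"
```

With a separate field for the serial/parallel indication, the function
would reduce to a single integer conversion.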


----------------------

Page 22, "Overall, the body of a multipart message may be..."

#13 (argument)

This section makes text prefix and postfix areas legal, even
though the descriptive text discourages their use.  We would
like to strengthen this discouragement by having the specification
insist that the prefix and postfix areas be empty, with a
recommendation that receiving s/w not die if there is text
in these areas.

We believe that this will make it less likely that there will
actually be text here, which should simplify the design decisions
in making an X.400 gateway (X.400 has no provisions for the
prefix and postfix concept).


----------------------

Page 24, 5.3, "Text-plus/RichMail"

In general, we like the concept of having rich mail, but agree
with other commenters that it belongs in another RFC.

#14, (repeat of previous point)

This was covered in #9 above, but we disagree with the 2nd paragraph
in this section which tries to explain away the need for content
character sets in the text-plus category.  One of the examples
given (troff) contradicts the later assertion that you can get
along with only internal character set identification!


----------------------


Page 25, 5.3, richmail syntax # 3

#15 (nit)

Nathaniel's code fragment clears this up, but from the syntax rules
it is not clear that %foo(text) commands can nest.  In particular, rule
#3 is misleading -- it does not mention the need for ignoring ')'
characters until the inner commands are processed.
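
To illustrate the nesting issue, here is a Python sketch of a parser
that handles nested commands.  It is not Nathaniel's code and it
simplifies the syntax: it assumes every '%' starts a command of the
form %name(text) and that every ')' inside a command closes the
innermost open one.  The command names in the example are hypothetical.

```python
def parse(text, pos=0, depth=0):
    """Return (nodes, next_pos). A node is either a plain string or a
    (command_name, child_nodes) tuple."""
    nodes, buf = [], []

    def flush():
        if buf:
            nodes.append("".join(buf))
            buf.clear()

    while pos < len(text):
        ch = text[pos]
        if ch == "%":
            open_paren = text.find("(", pos)
            if open_paren == -1:
                raise ValueError("command without '(' at %d" % pos)
            flush()
            name = text[pos + 1:open_paren]
            # Recurse so that an inner ')' closes the inner command only.
            children, pos = parse(text, open_paren + 1, depth + 1)
            nodes.append((name, children))
        elif ch == ")" and depth > 0:
            flush()
            return nodes, pos + 1
        else:
            buf.append(ch)
            pos += 1
    if depth > 0:
        raise ValueError("unclosed command")
    flush()
    return nodes, pos
```

For the input "%bold(a %italic(b) c)" this returns one bold node whose
children are "a ", an italic node containing "b", and " c".  A scanner
that stopped at the first ')' would instead end the bold text after
"b", which is the misreading that rule #3 invites.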


----------------------

Page 26, 5.3, "Richmail also differentiates between hard and soft..."

#16 (nit)

This paragraph says that soft line breaks are ignored.  Do they
still get treated as "word breaks"?  This would be the conventional
interpretation, but the spec is not quite clear on this point.

----------------------

Page 27, 5.4, Message content type, multipart messages

#17 (nit)

The "oc" field currently appears to be random text.  Is there
a reason why we don't suggest making it like a normal message-id
field?

Also, as a repeat of #1 and #2, putting this info in the content
type header really strikes us as ugly -- what is the motivation for
having positionally defined headers here?

----------------------

Page 34, Appendix I

#18 (nit)

The bullets all say "strongly discouraged".  The surrounding text
makes it clear that these are really rules about coping with
non-compliant MTAs, but the flip side of "strongly discouraged"
is "discouraged but OK...".

You might want to think about changing the wording to "Prohibited, but
done by some non-compliant MTAs".

----------------------


Well, you made it to the end.  Thanks for listening...