ietf-822

Re: Character set registration

1995-12-18 20:25:57
It was my understanding (correct me if I'm wrong) that the requirement
for CRLF processing in text/* media types came from the fact that
recipients might be called upon to process MIME objects with
unrecognized text/* media types and/or unrecognized charset values,
and that by default, most recipients, even if they do not translate
charset values in any other way, might do end-of-line processing of
the received data before saving it to disk; thus, any charset that
does not have standard CRLF end-of-line convention might be subject to
mangling.

Pretty close. The key issue for email is the reality of present-day mailbox
formats, which almost always require conversion of CRLF to local line
terminator sequences. This conversion is usually done unconditionally on
unencoded material, usually before the receiver ever gets it and definitely
before the receiver's display capabilities are known. That, together with the
very real possibility that bare CRs, bare LFs, or out-of-band sequences will
appear in place of CRLF on the local system, is the primary justification for
CRLF being the only allowable line terminator sequence, and for CR and LF
being available for use only in this context and no other.
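
As a rough illustration of why the unconditional conversion matters, here is a
minimal Python sketch of a delivery step that rewrites CRLF to a local
terminator. The name store_locally and the choice of LF as the local
terminator are assumptions made for the example, not a description of any
particular mail system.

```python
def store_locally(wire_body: bytes, local_eol: bytes = b"\n") -> bytes:
    """Convert wire-format CRLF line terminators to the local convention.

    Hypothetical helper: real message stores differ, but many perform
    something equivalent unconditionally on unencoded text.
    """
    return wire_body.replace(b"\r\n", local_eol)


# A body that (improperly) carries a bare LF as data alongside real CRLFs.
wire = b"one\r\ntwo\nthree\r\n"
stored = store_locally(wire)

# After conversion the bare LF and the converted CRLFs look identical,
# so the original wire form can no longer be reconstructed.
print(stored)                                  # b'one\ntwo\nthree\n'
print(stored.replace(b"\n", b"\r\n") == wire)  # False
```

Once the store has done this, nothing downstream can tell which LFs were line
breaks and which were data, which is why the text rules forbid the latter.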

In the context of real time connection between sender and recipient
and the ability of the recipient to indicate the allowable media types
and charset values (such as found in HTTP), this is not an issue.

Actually, I tend to disagree -- I don't think HTTP is as insulated from this
problem as you do. While the negotiation facilities are quite useful, it isn't
really acceptable to refuse to get something when there's no charset for it
that you happen to recognize. Viewers capable of doing powerful character set
translation operations are available and can be used in this situation. And
when they are used, it's important that the rules for storing text locally not
clash with the rules for text on the wire.

Thus, in the HTTP context, it is usually deemed reasonable to allow
text/* media types to be sent with the assumption that the end-of-line
might be signified by CR, LF, CRLF, or even a charset-specific end of
line convention, with the recipient required to accept all three end
of line mechanisms at least, and to only request other charset
encodings if the recipient understands that encoding.

This is actually a strong argument *for* the requirement that CR and LF only
appear in the context of a line break. It's this restriction that lets you
redefine the meaning of bare CR and bare LF this way -- they have no definition
in MIME text outside of this usage, so you're free to assign one in a given
context.
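
The lenient reading described above could be sketched roughly as follows in
Python. The name normalize_line_breaks is hypothetical, and the sketch assumes
the body is a sequence of octets in which CR and LF appear only as (parts of)
line breaks, which is exactly what the MIME restriction guarantees.

```python
import re

# CRLF is listed first so that it is consumed as a single line break
# rather than as a bare CR followed by a bare LF.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def normalize_line_breaks(body: bytes, terminator: bytes = b"\r\n") -> bytes:
    """Map CRLF, bare CR, and bare LF to one terminator (hypothetical helper)."""
    # A function is used as the replacement so the terminator is inserted
    # literally, without any backslash-escape processing.
    return _LINE_BREAK.sub(lambda _match: terminator, body)


print(normalize_line_breaks(b"a\rb\nc\r\nd"))  # b'a\r\nb\r\nc\r\nd'
```

This is only safe because bare CR and bare LF carry no other meaning in the
text being processed.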

Note that mail systems take exactly the same liberties for similar reasons --
this is normal and expected behavior. The only difference is that they don't do
it on the wire!

Without these restrictions on the meaning of CR and LF in the top-level text
type it is quite possible for bare CR and bare LF to have other meanings in a
given character set. (As it happens I use formats all the time where this is the
case.) It is also possible for some sequence other than CRLF to be the line
terminator (e.g. CR-NULL-LF-NULL). Either of these has the effect of turning
the lax HTTP way of doing business into a landmine for the unwary.
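
To make the landmine concrete, here is a small Python sketch. It uses Python's
utf-16-le codec purely as a stand-in for an encoding whose line terminator
comes out as CR-NULL-LF-NULL on the wire; lenient_normalize is a hypothetical
octet-level normalizer of the kind the lax approach implies.

```python
import re


def lenient_normalize(body: bytes) -> bytes:
    """Treat every CRLF, bare CR, or bare LF octet sequence as a line break."""
    return re.sub(rb"\r\n|\r|\n", lambda _match: b"\r\n", body)


# One line break between "a" and "b"; on the wire it is CR NULL LF NULL.
original = "a\r\nb".encode("utf-16-le")
mangled = lenient_normalize(original)

print(original.hex(" "))  # 61 00 0d 00 0a 00 62 00
print(mangled.hex(" "))   # 61 00 0d 0a 00 0d 0a 00 62 00

# The CR and LF octets were each "fixed" independently, which shifts the
# two-octet alignment, so the text no longer decodes to what was sent.
print(mangled.decode("utf-16-le") == "a\r\nb")  # False
```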

In other words, I see the present language in the MIME specification as
essential to this sort of HTTP usage. I see no problem with HTTP declaring
that, "Since bare CR and bare LF have no assigned meaning in the context of
MIME text, they are hereby assigned semantics equivalent to a CRLF when they
appear in the body of a text/* in conjunction with HTTP." Consider what the
problems would be if MIME instead defined CR to mean, say, "return to left
margin without an index operation", or perhaps allowed CR and LF to be used as
graphic characters, or as part of multibyte sequences, or whatever.

I personally believe this is an HTTP issue and not an HTML issue.

I'd say this is a fair assessment.

However, you might note the wording of section 4.2.2 of RFC1866.TXT:

It seems to be saying more or less the same thing as what I've said here, but I
agree with you that this is more of an HTTP thing than an HTML thing.

                                Ned