The following sections from the latest HTTP 1.0 spec (the version of
September 5, 1995) seem to be relevant:
-----------------------------------------
3.6.1 Canonicalization and Text Defaults

Media types are registered in a canonical form. In general, entity
bodies transferred via HTTP must be represented in the appropriate
canonical form prior to transmission. If the body has been encoded via
a Content-Encoding, the data must be in canonical form prior to that
encoding. However, HTTP modifies the canonical form requirements for
media of primary type "text" and for "application" types consisting of
text-like records.

HTTP redefines the canonical form of text media to allow multiple octet
sequences to indicate a text line break. In addition to the preferred
form of CRLF, HTTP applications must accept a bare CR or LF alone as
representing a single line break in text media. Furthermore, if the
text media is represented in a character set which does not use octets
13 and 10 for CR and LF respectively, as is the case for some
multi-byte character sets, HTTP allows the use of whatever octet
sequence(s) is defined by that character set to represent the
equivalent of CRLF, bare CR, and bare LF. It is assumed that any
recipient capable of using such a character set will know the
appropriate octet sequence for representing line breaks within that
character set.

    Note: This interpretation of line breaks applies only to the
    contents of an Entity-Body and only after any Content-Encoding
    has been removed. All other HTTP constructs use CRLF exclusively
    to indicate a line break. Content codings define their own line
    break requirements.

A recipient of an HTTP text entity should translate the received entity
line breaks to the local line break conventions before saving the
entity external to the application and its cache; whether this
translation takes place immediately upon receipt of the entity, or only
when prompted by the user, is entirely up to the individual
application.

HTTP also redefines the default character set for text media in an
entity body. If a textual media type defines a charset parameter with a
registered default value of "US-ASCII", HTTP changes the default to be
"ISO-8859-1". Since the ISO-8859-1 [18] character set is a superset of
US-ASCII [17], this has no effect upon the interpretation of entity
bodies which only contain octets within the US-ASCII set (0 - 127). The
presence of a charset parameter value in a Content-Type header field
overrides the default.

It is recommended that the character set of an entity body be labelled
as the lowest common denominator of the character codes used within a
document, with the exception that no label is preferred over the labels
US-ASCII or ISO-8859-1.
---------------------------
and (from 3.4):
--------------------
HTTP character sets are identified by case-insensitive tokens. The
complete set of tokens are defined by the IANA Character Set registry
[15]. However, because that registry does not define a single,
consistent token for each character set, we define here the preferred
names for those character sets most likely to be used with HTTP
entities. These character sets include those registered by RFC 1521 [5]
-- the US-ASCII [17] and ISO-8859 [18] character sets -- and other
names specifically recommended for use within MIME charset parameters.

    charset = "US-ASCII"
            | "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3"
            | "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6"
            | "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9"
            | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR"
            | "UNICODE-1-1" | "UNICODE-1-1-UTF-7" | "UNICODE-1-1-UTF-8"
            | token
----------------------
In other words, HTTP specifically allows the use of multibyte character
sets in which a line break is not the octet pair 13, 10 (CRLF), and it
explicitly lists 16-bit Unicode (UNICODE-1-1) among them. By modifying
the canonical form requirements for text media, the spec also
recognizes that this differs from the behavior specified by MIME.
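
To make the consequence for a recipient concrete, here is a rough sketch
in Python of how the two rules quoted above might be applied: fall back
to ISO-8859-1 when no charset parameter is present, and look for line
breaks only after decoding, so that CR and LF are found whatever octets
the character set uses for them. The function names are mine, not
anything from the spec. Python has no codec called "UNICODE-1-1", so
"utf-16-be" stands in for 16-bit Unicode here, which is an assumption
made purely for illustration. The Content-Type parsing is deliberately
simplistic.

```python
import re

# HTTP's default charset for text media (section 3.6.1).
HTTP_DEFAULT_CHARSET = "ISO-8859-1"


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value.

    Falls back to the HTTP default when no charset parameter is given.
    Hypothetical helper: it ignores quoting subtleties in parameters.
    """
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip('"')
    return HTTP_DEFAULT_CHARSET


def text_lines(body, content_type):
    """Decode an entity body, then split on CRLF, bare CR, or bare LF.

    Splitting happens on decoded characters, not on raw octets, so it
    works for character sets that do not encode CR and LF as 13 and 10.
    """
    text = body.decode(charset_from_content_type(content_type))
    return re.split(r"\r\n|\r|\n", text)


# Assumption: "utf-16-be" stands in for the spec's UNICODE-1-1 token.
body = "a\r\nb\rc\nd".encode("utf-16-be")
lines = text_lines(body, "text/plain; charset=utf-16-be")
```

Note that the octet pair 13, 10 never appears in the encoded body above,
because each character occupies two octets. A recipient that scanned the
raw octets for CRLF would find no line breaks at all, which is why the
decoding has to come first.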
David Goldsmith
Senior Scientist
Taligent, Inc.
10201 N. DeAnza Blvd.
Cupertino, CA 95014-2233
david_goldsmith(_at_)taligent(_dot_)com