Re: Character set registration

1995-12-18 21:18:57
The following sections from the latest HTTP 1.0 spec (the version of
September 5, 1995) seem to be relevant:

3.6.1 Canonicalization and Text Defaults

Media types are registered in a canonical form. In general, entity
bodies transferred via HTTP must be represented in the appropriate
canonical form prior to transmission. If the body has been encoded via
a Content-Encoding, the data must be in canonical form prior to that
encoding. However, HTTP modifies the canonical form requirements for
media of primary type "text" and for "application" types consisting of
text-like records.

HTTP redefines the canonical form of text media to allow multiple octet
sequences to indicate a text line break. In addition to the preferred
form of CRLF, HTTP applications must accept a bare CR or LF alone as
representing a single line break in text media. Furthermore, if the
text media is represented in a character set which does not use octets
13 and 10 for CR and LF respectively, as is the case for some
multi-byte character sets, HTTP allows the use of whatever octet
sequence(s) is defined by that character set to represent the
equivalent of CRLF, bare CR, and bare LF. It is assumed that any
recipient capable of using such a character set will know the
appropriate octet sequence for representing line breaks within that
character set.

       Note: This interpretation of line breaks applies only to the
       contents of an Entity-Body and only after any Content-Encoding
       has been removed. All other HTTP constructs use CRLF
       exclusively to indicate a line break. Content codings define
       their own line break requirements.
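
To make the line break rule concrete, here is a minimal Python sketch,
not anything taken from the spec. The function name split_text_lines and
the choice to decode before splitting are assumptions made for
illustration. Decoding first means the same pattern works for a
character set in which CR and LF are not the single octets 13 and 10,
such as a 16-bit encoding.

```python
import re

# CRLF is listed first so that it counts as one line break, not two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_text_lines(body: bytes, charset: str = "iso-8859-1") -> list:
    """Split a text entity body into lines.

    Assumes any Content-Encoding has already been removed and that
    `charset` names a codec Python knows about (hypothetical parameter).
    """
    text = body.decode(charset)
    return _LINE_BREAK.split(text)
```

For example, split_text_lines(b"a\r\nb\rc\nd") yields four lines, and a
body encoded as "utf-16-be" splits the same way even though its CR is
the two octets 0 and 13.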

A recipient of an HTTP text entity should translate the received entity
line breaks to the local line break conventions before saving the
entity external to the application and its cache; whether this
translation takes place immediately upon receipt of the entity, or only
when prompted by the user, is entirely up to the individual application.
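
The translation to local conventions might look roughly like the
following Python sketch. The name to_local_line_breaks is hypothetical,
and the sketch assumes the entity has already been decoded to a string.

```python
import os
import re

# CRLF is listed first so that it is replaced once, not twice.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def to_local_line_breaks(text: str, local: str = os.linesep) -> str:
    """Rewrite every CRLF, bare CR, or bare LF as the local line break."""
    # A function is used as the replacement so that `local` is inserted
    # literally, with no backslash-escape processing.
    return _LINE_BREAK.sub(lambda _match: local, text)
```

When writing the result to a file from Python, opening the file with
newline="" avoids a second, implicit translation of the line breaks.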

HTTP also redefines the default character set for text media in an
entity body. If a textual media type defines a charset parameter with a
registered default value of "US-ASCII", HTTP changes the default to be
"ISO-8859-1". Since the ISO-8859-1 [18] character set is a superset of
US-ASCII [17], this has no effect upon the interpretation of entity
bodies which only contain octets within the US-ASCII set (0 - 127). The
presence of a charset parameter value in a Content-Type header field
overrides the default.
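
As a rough illustration of the charset default, the Python sketch below
picks the character set for a Content-Type value. The function name
effective_charset is hypothetical, and the parsing is deliberately
naive: it splits on ";" and does not handle a quoted parameter value
that itself contains a semicolon.

```python
from typing import Optional

# The HTTP default for text media, per the quoted section.
HTTP_DEFAULT_TEXT_CHARSET = "ISO-8859-1"


def effective_charset(content_type: str) -> Optional[str]:
    """Return the charset to assume for a Content-Type header value.

    An explicit charset parameter wins; otherwise text/* falls back to
    ISO-8859-1. Other media types with no charset yield None.
    """
    media_type, *params = [part.strip() for part in content_type.split(";")]
    for param in params:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            # Charset tokens are case-insensitive; normalize to upper case.
            return value.strip().strip('"').upper()
    if media_type.lower().startswith("text/"):
        return HTTP_DEFAULT_TEXT_CHARSET
    return None
```

So "text/html" with no parameter is treated as ISO-8859-1, while
"text/plain; charset=unicode-1-1" is treated as UNICODE-1-1.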

It is recommended that the character set of an entity body be labelled
as the lowest common denominator of the character codes used within a
document, with the exception that no label is preferred over the labels
US-ASCII or ISO-8859-1.
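
One way a sender might apply the labelling recommendation is sketched
below in Python. Everything here is an assumption for illustration: the
function name choose_charset_label, the candidate list, and in
particular the mapping of the label UNICODE-1-1-UTF-8 onto Python's
"utf-8" codec, which is only an approximation.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical fallbacks as (label, Python codec), narrowest first.
DEFAULT_CANDIDATES = (
    ("ISO-8859-2", "iso8859_2"),
    ("UNICODE-1-1-UTF-8", "utf-8"),  # approximate mapping, an assumption
)


def choose_charset_label(
    text: str,
    candidates: Sequence[Tuple[str, str]] = DEFAULT_CANDIDATES,
) -> Optional[str]:
    """Pick the narrowest label that can represent `text`.

    Returns None when the text fits in ISO-8859-1 (and therefore also
    when it is pure US-ASCII), since no label is preferred in that case.
    """
    try:
        text.encode("latin-1")
        return None
    except UnicodeEncodeError:
        pass
    for label, codec in candidates:
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue
        return label
    raise ValueError("no candidate character set can represent the text")
```

Plain ASCII or Latin-1 text gets no label at all; text that needs
characters outside Latin-1 gets the first candidate that can encode it.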

and (from 3.4):

HTTP character sets are identified by case-insensitive tokens. The
complete set of tokens are defined by the IANA Character Set registry
[15]. However, because that registry does not define a single,
consistent token for each character set, we define here the preferred
names for those character sets most likely to be used with HTTP
entities. These character sets include those registered by RFC 1521
[5] -- the US-ASCII [17] and ISO-8859 [18] character sets -- and other
names specifically recommended for use within MIME charset parameters.

     charset = "US-ASCII"
             | "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3"
             | "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6"
             | "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9"
             | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR"
             | "UNICODE-1-1" | "UNICODE-1-1-UTF-7" | "UNICODE-1-1-UTF-8"
             | token
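
Since the tokens are case-insensitive, a comparison against the
preferred names has to fold case first. A small Python sketch follows;
the names same_charset and is_preferred_charset are hypothetical, and
the set simply copies the names listed in the grammar above.

```python
# Preferred names copied from the quoted grammar.
PREFERRED_CHARSETS = frozenset({
    "US-ASCII",
    "ISO-8859-1", "ISO-8859-2", "ISO-8859-3",
    "ISO-8859-4", "ISO-8859-5", "ISO-8859-6",
    "ISO-8859-7", "ISO-8859-8", "ISO-8859-9",
    "ISO-2022-JP", "ISO-2022-JP-2", "ISO-2022-KR",
    "UNICODE-1-1", "UNICODE-1-1-UTF-7", "UNICODE-1-1-UTF-8",
})


def same_charset(a: str, b: str) -> bool:
    """Compare two charset tokens without regard to case."""
    return a.upper() == b.upper()


def is_preferred_charset(token: str) -> bool:
    """True if `token` is one of the preferred names, ignoring case.

    Any other token is still syntactically allowed by the grammar; it
    is just not one of the names singled out here.
    """
    return token.upper() in PREFERRED_CHARSETS
```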

In other words, HTTP specifically allows the use of multibyte character
sets which do not represent CRLF as the octets 13 and 10, in particular
16-bit Unicode (UNICODE-1-1). It also recognizes that this differs from
the behavior specified by MIME.

David Goldsmith
Senior Scientist
Taligent, Inc.
10201 N. DeAnza Blvd.
Cupertino, CA 95014-2233
