ietf-xml-mime

Fwd: Default of the charset parameter

2000-03-22 09:46:00
In the W3C XML SIG, Kurt Conrad and I wrote this summary for the 
discussion of XML media types.


Kurt Conrad wrote...
Proposal:

The default of the charset parameter of text/xml and
application/xml is UTF-8 rather than US-ASCII (RFC 2045) or
ISO-8859-1 (RFC 2068 [HTTP/1.1]).

Criteria:

The default of this parameter is an interesting issue.
There are conflicting RFCs and a recommendation.

In RFC 2046 (MIME: Media types), the default is US-ASCII.

4.1.2.  Charset Parameter [RFC 2046]
[snip]
  Unlike some other parameter values, the values of the charset
  parameter are NOT case sensitive.  The default character set, which
  must be assumed in the absence of a charset parameter, is US-ASCII.

In RFC 2068 (HTTP/1.1), the default is ISO-8859-1.

3.7.1 Canonicalization and Text Defaults [RFC 2068]
[snip]
  The "charset" parameter is used with some media types to define the
  character set (section 3.4) of the data. When no explicit charset
  parameter is provided by the sender, media subtypes of the "text"
  type are defined to have a default charset value of "ISO-8859-1" when
  received via HTTP. Data in character sets other than "ISO-8859-1" or
  its subsets MUST be labeled with an appropriate charset value.

HTML 4.0 further overrides this default.

5.2.2 Specifying the character encoding  [HTML 4.0]
[snip]
The HTTP protocol ([RFC2068], section 3.7.1) mentions
ISO-8859-1 as a default character encoding when the
"charset" parameter is absent from the "Content-Type" header
field. In practice, this recommendation has proved useless
because some servers don't allow a "charset" parameter to be
sent, and others may not be configured to send the
parameter. Therefore, user agents must not assume any
default value for the "charset" parameter.

To address server or configuration limitations, HTML
documents may include explicit information about the
document's character encoding; the META element can be used
to provide user agents with this information.

For example, to specify that the character encoding of the
current document is "EUC-JP", a document should include the
following META declaration:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

The META declaration must only be used when
the character encoding is organized such that ASCII
characters stand for themselves (at least until the META
element is parsed). META declarations should appear as early
as possible in the HEAD element.

For cases where neither the HTTP protocol nor the META
element provides information about the character encoding of
a document, HTML also provides the charset attribute on
several elements. By combining these mechanisms, an author
can greatly improve the chances that, when the user
retrieves a resource, the user agent will recognize the
character encoding.
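
As an illustration of the META mechanism quoted above, here is a
minimal Python sketch of how a receiver might look for a charset
declaration before it knows the encoding. It relies only on the
property the HTML 4.0 text requires, namely that ASCII characters
stand for themselves up to the META element. The function name
sniff_meta_charset and the 1024-byte scan limit are assumptions
made for this example, not anything HTML 4.0 specifies, and a real
user agent would parse the markup rather than use a regular
expression.

```python
import re
from typing import Optional

# Matches a charset=... token inside a META tag. The pattern works on raw
# bytes because the encoding is not yet known when the scan happens.
META_CHARSET = re.compile(
    rb"<meta[^>]*\bcharset\s*=\s*[\"']?([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(document: bytes, limit: int = 1024) -> Optional[str]:
    """Return the charset named in an early META declaration, if any.

    `limit` (hypothetical value) bounds how far into the document we look,
    since the declaration should appear as early as possible in HEAD.
    """
    match = META_CHARSET.search(document[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii")


page = (
    b"<HTML><HEAD>\n"
    b'<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">\n'
    b"</HEAD><BODY></BODY></HTML>"
)
print(sniff_meta_charset(page))
```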

RFC 2130 (The Report of the IAB Character Set Workshop)
provides a guideline for the use of character sets on the
Internet.  RFC 2130 recommends UTF-8 as the default for new
protocols.

0: Executive summary [RFC 2130]
  This report recommends the use of ISO 10646 as the default Coded
  Character Set, and UTF-8 as the default Character Encoding Scheme in
  the creation of new protocols or new version of old protocols which
  transmit text. These defaults do not deprecate the use of other
  character sets when and where they are needed; they are simply
  intended to provide guidance and a specification for interoperability.

Since XML is a new application on the Internet, the best
default is UTF-8, as recommended by RFC 2130.  There is no
need to change existing HTTP/1.1 Web servers.  There is no
need to consider backward compatibility with already installed
XML documents.  We can start from scratch.
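
To make the competing defaults concrete, the following Python
sketch resolves the effective charset for a Content-Type value.
It is only an illustration of the proposal, not a normative
algorithm: the function name effective_charset and the via_http
flag are assumptions introduced for this example. An explicit
charset parameter always wins; otherwise text/xml and
application/xml get UTF-8 as proposed, and other text types get
the RFC 2068 default over HTTP or the MIME default elsewhere.

```python
from email.message import Message
from typing import Optional

# Media types covered by the proposal in this message.
XML_TYPES = {"text/xml", "application/xml"}


def effective_charset(content_type: str, via_http: bool = True) -> Optional[str]:
    """Return the charset a receiver would assume for a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type

    # An explicit charset parameter always takes precedence.
    # get_content_charset() returns it lowercased, or None if absent.
    explicit = msg.get_content_charset()
    if explicit is not None:
        return explicit

    # Proposed default for the XML media types.
    if msg.get_content_type() in XML_TYPES:
        return "utf-8"

    # Existing defaults for other text types: ISO-8859-1 when received
    # via HTTP (RFC 2068), US-ASCII under MIME.
    if msg.get_content_maintype() == "text":
        return "iso-8859-1" if via_http else "us-ascii"

    # No charset default is assumed here for non-text types.
    return None


print(effective_charset("text/xml"))
print(effective_charset("text/xml; charset=Shift_JIS"))
print(effective_charset("text/plain"))
print(effective_charset("text/plain", via_http=False))
```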

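One potential drawback is fallback to text/plain.  A user
agent that does not recognize text/xml may treat it as
text/plain.  Since the default of HTTP/1.1 is ISO-8859-1,
unlabeled UTF-8 data would then be decoded as ISO-8859-1,
and non-ASCII characters would be corrupted.  However, we
do not think that this is a major problem.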
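
The kind of corruption at issue can be shown with a short Python
sketch. The sample string is an assumption chosen for the
example; it simply contains one non-ASCII character. Text
restricted to ASCII survives the fallback unchanged, which is
part of why the problem is limited.

```python
# A sender serializes text as UTF-8 without a charset parameter,
# relying on the proposed default.
original = "caf\u00e9"            # "cafe" with an acute accent on the e
wire_bytes = original.encode("utf-8")

# A receiver that falls back to text/plain applies the HTTP/1.1
# default and decodes the same bytes as ISO-8859-1.
misread = wire_bytes.decode("iso-8859-1")

print(len(original), len(misread))   # the accented letter became two characters
print(misread == "caf\u00c3\u00a9")  # True: U+00E9 was read as U+00C3 U+00A9

# Pure ASCII content is identical under both encodings.
ascii_only = "<doc/>".encode("utf-8").decode("iso-8859-1")
print(ascii_only == "<doc/>")        # True
```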


References:

HTML 4.0 Specification
   http://www.w3.org/TR/REC-html40/

RFC 2130
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2130.txt

RFC 2045
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2045.txt

RFC 2068
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2068.txt



----
MURATA Makoto  muraw3c(_at_)attglobal(_dot_)net
