Fwd: Determination of encoding/character set

In the W3C XML SIG, Kurt Conrad and I wrote this summary for the 
discussion of XML media types.


Kurt Conrad wrote...
Proposal:

MIME types (text/xml and application/xml) for XML documents
have the charset parameter.  The encoding method is determined
by this parameter only.  All other information is for error
recovery only.
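
As a rough illustration of the proposal, the Python sketch below decodes
a body using nothing but the charset parameter of the Content-Type
header.  The function name decode_xml_entity is hypothetical, and the
choice to raise an error when the parameter is missing is an assumption
of this sketch (error recovery is left out).  The encoding declaration
inside the document is deliberately not consulted.

```python
from email.message import Message


def decode_xml_entity(content_type: str, body: bytes) -> str:
    """Decode body using only the charset parameter of content_type.

    Hypothetical helper: raises ValueError when no charset parameter is
    present, because error recovery is outside the scope of this sketch.
    """
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset()  # lowercased name, or None
    if charset is None:
        raise ValueError("no charset parameter in Content-Type")
    return body.decode(charset)


# The in-document declaration says iso-8859-1, but the header says utf-8.
# Under the proposal the header is the only thing that counts.
body = '<?xml version="1.0" encoding="iso-8859-1"?><p>\u00e9</p>'.encode("utf-8")
text = decode_xml_entity("text/xml; charset=utf-8", body)
```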


Criteria:

This issue has been very controversial.  Which should determine
encoding, the charset parameter of the MIME header, or the encoding
PI (or BOM)?  

This issue is closely related to the next issue
(text/xml and/or application/xml?), because the top-level type
"text" already provides the charset parameter.

There are two relevant RFCs, namely RFC 2130 (The Report of the IAB
Character Set Workshop) and RFC 2046 (MIME Part Two: Media Types).

RFC 2130 provides a guideline for the use of character sets on the
Internet.  For XML to be a good citizen on the Internet, we have to
follow this guideline wherever possible.

RFC 2130 treats the determination of character encoding as a
protocol issue.  It clearly recommends the use of MIME
headers to determine the character encoding (the Character Encoding
Scheme, in the terminology of RFC 2130).

3.3:  Determining which values of CCS, CES, and TES are used [RFC 2130]

  To completely specify which CCS, CES, and TES are used in a specific
  text transmission, there needs to be a consistent set of labels for
  specifying which CCS, CES, and TES are used.  Once the appropriate
  mechanisms have been selected, there are six techniques for attaching
  these labels to the data.

  The labels themselves are named and registered, either with IANA
  [IANA] or with some other registry.  Ideally, their definitions are
  retrievable from some registration authority.

  Labels may be determined in one of the following ways:

  -  Determined by guessing, where the receiver of the text has to
     guess the values of the CCS, CES, and TES. For example: "I got
     this from Sweden so it's probably  ISO-8859-1."  This is
     obviously not a very foolproof way to decode text.
  -  Determined by the standard, where the protocol used to transmit
     the data has made documented choices of CCS, CES, and TES in the
     standard. Thus, the encodings used are known through the
     access protocol, for example HTTP [HTTP] uses (but is not
     limited to) ISO-8859-1, SMTP uses US-ASCII.
  -  Attached to the transfer envelope, where the descriptive labels are
     attached to the wrapper placed around the text for transport.
     MIME headers are a good example of this technique.
  -  Included in the data stream, where the data stream itself has
     been encoded in such a way as to signal the character set used.
     For example, ISO-2022 encodes the data with escape sequences to
     provide information on the character subset currently being used.
  -  Agreed by prior bilateral agreement, where some out-of-band
     negotiation has allowed the text transmitter and receiver to
     determine the CCS, CES, and  TES for the transmitted text.
  -  Agreed to by negotiation during some phase, typically
     initialization of the protocol.

3.3.1:  Recommendations for value specification mechanisms [RFC 2130]

  While each of these techniques (with the  exception of guessing) is
  useful in particular situations, interoperability requires a more
  consistent set of techniques.  Thus, we recommend that MIME
  registered values be used for all tagging of character sets and
  languages UNLESS there is an existing mechanism for determining the
  required information using one of the other techniques (except
  guessing).  This recommendation will require a fair bit of work on
  the part of protocol designers, implementors, the IETF, the IESG, and
  the IAB.


The top-level media type "text" already provides the charset
parameter (RFC 2046).  Thus, if we use text/*, the encoding
should be determined by this parameter only.

4.1.2.  Charset Parameter [RFC 2046]

  A critical parameter that may be specified in the Content-Type field
  for "text/plain" data is the character set.  This is specified with a
  "charset" parameter, as in:

    Content-type: text/plain; charset=iso-8859-1

  Unlike some other parameter values, the values of the charset
  parameter are NOT case sensitive.  The default character set, which
  must be assumed in the absence of a charset parameter, is US-ASCII.

  The specification for any future subtypes of "text" must specify
  whether or not they will also utilize a "charset" parameter, and may
  possibly restrict its values as well.  For other subtypes of "text"
  than "text/plain", the semantics of the "charset" parameter should be
  defined to be identical to those specified here for "text/plain",
  i.e., the body consists entirely of characters in the given charset.
  In particular, definers of future "text" subtypes should pay close
  attention to the implications of multioctet character sets for their
  subtype definitions.

  The charset parameter for subtypes of "text" gives a name of a
  character set, as "character set" is defined in RFC 2045.  The rules
  regarding line breaks detailed in the previous section must also be
  observed -- a character set whose definition does not conform to
  these rules cannot be used in a MIME "text" subtype.
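
Two points in the passage above, the case-insensitivity of charset
values and the US-ASCII default, can be sketched in Python as follows.
The function name effective_charset is hypothetical, and applying the
text/plain default to every text/* subtype is an assumption of this
sketch rather than something RFC 2046 requires.

```python
from email.message import Message


def effective_charset(content_type: str) -> str:
    """Return the charset that applies to a Content-Type value (sketch)."""
    header = Message()
    header["Content-Type"] = content_type
    # get_content_charset() lowercases the value, which is safe because
    # charset values are not case sensitive.
    charset = header.get_content_charset()
    if charset is not None:
        return charset
    if header.get_content_maintype() == "text":
        # Assumption: text/* subtypes share the text/plain default.
        return "us-ascii"
    raise ValueError("no charset parameter and no default for this type")


explicit = effective_charset("text/plain; charset=ISO-8859-1")
defaulted = effective_charset("text/xml")
```

Here explicit is "iso-8859-1" and defaulted is "us-ascii".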


We have to use the top-level type "application" to transmit
XML documents in UTF-16 or UCS-2 via SMTP, because of the
line termination rule of MIME.  Even in this case, however,
RFC 2046 (section 4.1.2) suggests using the charset
parameter:

  Other media types than subtypes of "text" might choose to employ the
  charset parameter as defined here, but with the CRLF/line break
  restriction removed.  Therefore, all character sets that conform to
  the general definition of "character set" in RFC 2045 can be
  registered for MIME use.
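
The line termination problem is easy to see at the byte level.  The
following Python sketch (illustrative only, using the built-in codecs)
shows that a line break encoded in UTF-16 does not contain the two
adjacent octets CR LF that the MIME rule for "text" expects.

```python
line_break = "\r\n"

utf8_bytes = line_break.encode("utf-8")       # b'\r\n'
utf16_bytes = line_break.encode("utf-16-be")  # b'\x00\r\x00\n'

# CR LF as two adjacent octets, 0x0D 0x0A.
crlf = b"\x0d\x0a"

crlf_in_utf8 = crlf in utf8_bytes    # True
crlf_in_utf16 = crlf in utf16_bytes  # False: a 0x00 octet separates CR and LF
```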


HTML 4.0 already uses the charset parameter.

5.2 Character encodings [HTML 4.0]

What this specification calls a character encoding is known
by different names in other specifications (which may cause
some confusion). However, the concept is largely the same
across the Internet. Also, protocol headers, attributes, and
parameters referring to character encodings share the same
name -- "charset" -- and use the same values from the [IANA]
registry (see [CHARSETS] for a complete list).

The "charset" parameter identifies a character encoding,
which is a method of converting a sequence of bytes into a
sequence of characters. This conversion fits naturally with
the scheme of Web activity: servers send HTML documents to
user agents as a stream of bytes; user agents interpret them
as a sequence of characters. The conversion method can range
from simple one-to-one correspondence to complex switching
schemes or algorithms.
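
To make the byte-to-character conversion concrete, here is a small
Python sketch (the sample octets are made up for illustration).  The
same stream of bytes yields different characters depending on which
charset the receiver applies, which is why the label matters.

```python
octets = b"caf\xc3\xa9"

# Interpreted as UTF-8, the last two octets form one character (e-acute).
as_utf8 = octets.decode("utf-8")

# Interpreted as ISO-8859-1, every octet is one character, so the same
# two octets become two unrelated characters.
as_latin1 = octets.decode("iso-8859-1")

lengths = (len(as_utf8), len(as_latin1))  # (4, 5)
```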



How do we specify the charset parameter?  HTML 4.0 
talks about server configuration.

5.2.2 Specifying the character encoding [HTML 4.0]

How does a server determine which character encoding applies
for a document it serves? Some servers examine the first few
bytes of the document, or check against a database of known
files and encodings. Many modern servers give Web masters
more control over charset configuration than old servers do. 
Web masters should use these mechanisms to send out a
"charset" parameter whenever possible, but should take care
not to identify a document with the wrong "charset"
parameter value.


It has been argued that casual users cannot set the charset
parameter.  However, the most popular WWW server, namely
Apache, allows casual users to set the charset parameter
easily.  A casual user only has to create a file named .htaccess
in his or her directory and add a line such as the following:

        AddType  'text/xml; charset=utf-8'    xml

(See http://www.apache.org/docs/mod/mod_mime.html#addtype).

Some WWW servers do not provide this feature (.htaccess),
but it is usually possible to use file extensions to specify
the charset parameter.  For example, the file extension
"xml8" specifies the charset parameter "utf-8" if the WWW
server configuration file contains a line such as the following:

   type="text/xml; charset=utf-8" exts=xml8
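
What such an extension-based configuration amounts to can be modelled in
a few lines of Python.  This is only a sketch of the idea, not the
behaviour of any particular server: the table EXTENSION_TYPES, the
function content_type_for, and the fallback type are all hypothetical.

```python
import posixpath

# Hypothetical table mirroring the two configuration lines shown above.
EXTENSION_TYPES = {
    "xml": "text/xml; charset=utf-8",
    "xml8": "text/xml; charset=utf-8",
}


def content_type_for(path: str, default: str = "application/octet-stream") -> str:
    """Pick the Content-Type value for a file from its extension."""
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return EXTENSION_TYPES.get(extension, default)


header_value = content_type_for("docs/report.xml8")
```

With this table, header_value is "text/xml; charset=utf-8", so the
charset parameter is fixed by the file name alone and the author never
has to touch the server configuration again.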


References:

Apache HTTP Server Version 1.3 / Module mod_mime / Directive AddType
   http://www.apache.org/docs/mod/mod_mime.html#addtype

HTML 4.0 Specification
   http://www.w3.org/TR/REC-html40/

RFC 2130
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2130.txt

RFC 2045
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2045.txt

RFC 2046
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2046.txt




----
MURATA Makoto  muraw3c(_at_)attglobal(_dot_)net