In the W3C XML SIG, Kurt Conrad and I wrote this summary for the
discussion of XML media types.
Kurt Conrad wrote...
Proposal:
MIME types (text/xml and application/xml) for XML documents
have the charset parameter. The encoding method is determined
by this parameter only. All other information is for error
recovery only.
Criteria:
This issue has been very controversial. Which should determine
encoding, the charset parameter of the MIME header, or the encoding
PI (or BOM)?
This issue is closely related to the next issue
(text/xml and/or application/xml?), as the top-level type
"text" already provides the charset parameter.
There are two relevant RFCs, namely RFC 2130 (The Report of the IAB
Character Set Workshop) and RFC 2046 (MIME Part Two: Media Types).
RFC 2130 provides a guideline for the use of character sets on the
Internet. For XML to be a good citizen on the Internet, we have to
follow this guideline wherever possible.
RFC 2130 treats determination of character encoding as a
protocol issue. RFC 2130 clearly recommends the use of MIME
headers to determine character encoding (Character Encoding
Scheme in the terminology of RFC 2130).
3.3: Determining which values of CCS, CES, and TES are used [RFC 2130]
To completely specify which CCS, CES, and TES are used in a specific
text transmission, there needs to be a consistent set of labels for
specifying which CCS, CES, and TES are used. Once the appropriate
mechanisms have been selected, there are six techniques for attaching
these labels to the data.
The labels themselves are named and registered, either with IANA
[IANA] or with some other registry. Ideally, their definitions are
retrievable from some registration authority.
Labels may be determined in one of the following ways:
- Determined by guessing, where the receiver of the text has to
guess the values of the CCS, CES, and TES. For example: "I got
this from Sweden so it's probably ISO-8859-1." This is
obviously not a very foolproof way to decode text.
- Determined by the standard, where the protocol used to transmit
the data has made documented choices of CCS, CES, and TES in the
standard. Thus, the encodings used are known through the
access protocol, for example HTTP [HTTP] uses (but is not
limited to) ISO-8859-1, SMTP uses US-ASCII.
- Attached to the transfer envelope, where the descriptive labels are
attached to the wrapper placed around the text for transport.
MIME headers are a good example of this technique.
- Included in the data stream, where the data stream itself has
been encoded in such a way as to signal the character set used.
For example, ISO-2022 encodes the data with escape sequences to
provide information on the character subset currently being used.
- Agreed by prior bilateral agreement, where some out-of-band
negotiation has allowed the text transmitter and receiver to
determine the CCS, CES, and TES for the transmitted text.
- Agreed to by negotiation during some phase, typically
initialization of the protocol.
3.3.1: Recommendations for value specification mechanisms [RFC 2130]
While each of these techniques (with the exception of guessing) is
useful in particular situations, interoperability requires a more
consistent set of techniques. Thus, we recommend that MIME
registered values be used for all tagging of character sets and
languages UNLESS there is an existing mechanism for determining the
required information using one of the other techniques (except
guessing). This recommendation will require a fair bit of work on
the part of protocol designers, implementors, the IETF, the IESG, and
the IAB.
The top-level media type "text" already provides the charset
parameter (RFC 2046). Thus, if we use text/*, encoding
should be determined by this parameter only.
4.1.2. Charset Parameter [RFC 2046]
A critical parameter that may be specified in the Content-Type field
for "text/plain" data is the character set. This is specified with a
"charset" parameter, as in:
Content-type: text/plain; charset=iso-8859-1
Unlike some other parameter values, the values of the charset
parameter are NOT case sensitive. The default character set, which
must be assumed in the absence of a charset parameter, is US-ASCII.
The specification for any future subtypes of "text" must specify
whether or not they will also utilize a "charset" parameter, and may
possibly restrict its values as well. For other subtypes of "text"
than "text/plain", the semantics of the "charset" parameter should be
defined to be identical to those specified here for "text/plain",
i.e., the body consists entirely of characters in the given charset.
In particular, definers of future "text" subtypes should pay close
attention to the implications of multioctet character sets for their
subtype definitions.
The charset parameter for subtypes of "text" gives a name of a
character set, as "character set" is defined in RFC 2045. The rules
regarding line breaks detailed in the previous section must also be
observed -- a character set whose definition does not conform to
these rules cannot be used in a MIME "text" subtype.
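As a minimal sketch of what "determined by this parameter only" means for a receiver, the Python fragment below decodes an entity body using nothing but the charset parameter, falling back to the US-ASCII default quoted above. The function name decode_text_entity is hypothetical, and the sketch assumes the charset label is one that Python's codec registry recognizes. Note that it never looks at the encoding PI or a BOM.

```python
from email.message import Message


def decode_text_entity(content_type: str, body: bytes) -> str:
    """Decode a text/* body using only the charset parameter (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type
    # RFC 2046 default for "text" when no charset parameter is present.
    charset = msg.get_param("charset", "us-ascii")
    # Charset labels are not case sensitive; Python's codec lookup is
    # case-insensitive as well, so the label is passed through as-is.
    return body.decode(charset)


# The encoding PI says iso-8859-1, but the octets are UTF-8 and the MIME
# header says so; the header wins and the PI is ignored for decoding.
body = '<?xml version="1.0" encoding="iso-8859-1"?><a>\u00e9</a>'.encode("utf-8")
text = decode_text_entity("text/xml; charset=UTF-8", body)
```

Under this proposal, the encoding PI in the example would be consulted only for error recovery, not for choosing the decoder.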
We have to use the top-level type "application" for
transmitting XML documents in UTF-16 or UCS-2 via the SMTP
protocol, because of the line termination rule of MIME.
However, even in this case, RFC 2046 suggests using the charset
parameter (Section 4.1.2).
Other media types than subtypes of "text" might choose to employ the
charset parameter as defined here, but with the CRLF/line break
restriction removed. Therefore, all character sets that conform to
the general definition of "character set" in RFC 2045 can be
registered for MIME use.
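The line termination problem can be illustrated with a small Python sketch (the helper name has_crlf_octets is made up for this example). In UTF-8, a line break written as CR LF appears as the octet pair 0D 0A that MIME "text" expects; in UTF-16, each character takes two octets, so that pair never appears as such.

```python
def has_crlf_octets(data: bytes) -> bool:
    """Return True if the octet sequence CR LF (0D 0A) occurs in data."""
    return b"\r\n" in data


document = "<a>\r\n</a>"

utf8_octets = document.encode("utf-8")       # ... 3E 0D 0A 3C ...
utf16_octets = document.encode("utf-16-be")  # ... 00 3E 00 0D 00 0A 00 3C ...

utf8_ok = has_crlf_octets(utf8_octets)
utf16_ok = has_crlf_octets(utf16_octets)
```

Here utf8_ok is True and utf16_ok is False, which is why a UTF-16 document cannot be labelled as a "text" subtype for SMTP transport.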
HTML 4.0 already uses the charset parameter.
5.2 Character encodings [HTML 4.0]
What this specification calls a character encoding is known
by different names in other specifications (which may cause
some confusion). However, the concept is largely the same
across the Internet. Also, protocol headers, attributes, and
parameters referring to character encodings share the same
name -- "charset" -- and use the same values from the [IANA]
registry (see [CHARSETS] for a complete list).
The "charset" parameter identifies a character encoding,
which is a method of converting a sequence of bytes into a
sequence of characters. This conversion fits naturally with
the scheme of Web activity: servers send HTML documents to
user agents as a stream of bytes; user agents interpret them
as a sequence of characters. The conversion method can range
from simple one-to-one correspondence to complex switching
schemes or algorithms.
How do we specify the charset parameter? HTML 4.0
discusses server configuration.
5.2.2 Specifying the character encoding [HTML 4.0]
How does a server determine which character encoding applies
for a document it serves? Some servers examine the first few
bytes of the document, or check against a database of known
files and encodings. Many modern servers give Web masters
more control over charset configuration than old servers do.
Web masters should use these mechanisms to send out a
"charset" parameter whenever possible, but should take care
not to identify a document with the wrong "charset"
parameter value.
It has been argued that casual users cannot set the charset
parameter. However, the most popular WWW server, namely
Apache, allows casual users to set the charset parameter
easily. A casual user only has to create a file named .htaccess
in his or her directory and add a line such as the following:
AddType 'text/xml; charset=utf-8' xml
(See http://www.apache.org/docs/mod/mod_mime.html#addtype).
Some WWW servers do not provide this feature (.htaccess),
but it is usually possible to use file extensions to specify
the charset parameter. For example, the file extension
"xml8" specifies the charset parameter "utf-8", if the WWW
server configuration file contains a line such as the following:
type="text/xml; charset=utf-8" exts=xml8
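Conceptually, both configuration lines above amount to a table from file extension to a Content-Type value that carries the charset parameter. The Python sketch below shows that lookup only; it is not how Apache or any other server is actually implemented, and the names EXTENSION_TYPES and content_type_for, as well as the fallback type, are assumptions made for illustration.

```python
import os

# Hypothetical table mirroring the two configuration lines above.
EXTENSION_TYPES = {
    ".xml": "text/xml; charset=utf-8",
    ".xml8": "text/xml; charset=utf-8",
}


def content_type_for(path: str, default: str = "application/octet-stream") -> str:
    """Pick the Content-Type header value from the file extension."""
    extension = os.path.splitext(path)[1].lower()
    return EXTENSION_TYPES.get(extension, default)


header_value = content_type_for("report.xml8")
```

The point is that the author of the document, not the receiver, chooses the label, simply by choosing the file name or editing one configuration line.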
References:
Apache HTTP Server Version 1.3 / Module mod_mime / Directive AddType
http://www.apache.org/docs/mod/mod_mime.html#addtype
HTML 4.0 Specification
http://www.w3.org/TR/REC-html40/
RFC 2130
http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2130.txt
RFC 2045
http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2045.txt
RFC 2046
http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2046.txt
----
MURATA Makoto muraw3c(_at_)attglobal(_dot_)net