In [1], Tim wrote:
- there is the whole issue of the charset header.
0. Introduction
There has been a lot of discussion about the encoding issue. Rather
than repeating it, we have to begin with a better understanding of the
current situation. Here is my attempt to clarify the current (messy)
status. Although this memo is still incomplete, I hope it makes some
useful points.
1. Textual resources
Many types of WWW resources are textual. Since many charsets are
in use, we have to determine the charset of each textual resource
so as to handle it correctly.
1.1 Documents
XML, HTML, and CSS of W3C have textual representations. Plain text
is certainly textual.
1.2 Programs
Source programs are textual and written in some encoding. When we
transmit source programs or compile/execute them on the fly, encoding
issues will arise.
- VBScript received from the server,
- JavaScript received from the server,
- JSP source files at the server,
- Perl programs at the server, etc.
1.3 Generation of documents by programs
On the WWW server, programs generate documents on the fly. These
programs have to specify the encoding of such documents. APIs of most
programming languages allow charset specification.
Furthermore, programs have to embed in-band signatures when
necessary. Most APIs and programming languages do not provide any
support for this. Rather, programmers have to use "print" carefully
so as to create in-band signatures (e.g., meta tags).
- CGI programs,
- Servlets,
- Applets,
- XSLT stylesheets
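The following Python fragment is a minimal sketch of what such a generating program has to do by hand. It derives the charset parameter, the META tag, and the byte encoding of the output from one value so that they cannot disagree. The names CHARSET and render_page are hypothetical, and the response format assumes a CGI-style program.

```python
import html

CHARSET = "Shift_JIS"  # hypothetical choice; any charset known to Python works


def render_page(title: str, body: str, charset: str = CHARSET) -> bytes:
    """Return a CGI-style response: header block, blank line, encoded HTML."""
    # In-band signature: the META tag has to be printed by hand.
    document = (
        "<html><head>\n"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head><body>\n"
        f"<p>{html.escape(body)}</p>\n"
        "</body></html>\n"
    )
    # Out-of-band signature: the charset parameter of the media type.
    header = f"Content-Type: text/html; charset={charset}\r\n\r\n"
    # The same charset must also be used to encode the output stream.
    return header.encode("ascii") + document.encode(charset)
```

Nothing in the language enforces that the three uses of the charset agree; the sketch only makes them agree by construction.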
1.4 Form data
Finally, text typed in forms of HTML and sent as multipart/form-data
via HTTP also requires encoding information.
- text typed in <textarea> of HTML,
- text typed in <input type="text"> of HTML, and
- files uploaded by <input type="file"> of HTML.
2. Current situation
There are already too many methods for determining the encoding.
Below I list such methods and then show which is used for which
type of resource.
A: the charset parameter of MIME entities
B: in-band declaration (META tags of HTML, @charset of CSS,
encoding declarations of XML),
C: the charset attribute of the referring element (XML or HTML)
in the referring resource,
D: the charset of the referring resource (typically HTML),
E: the charset of the HTML document containing the <input> or
<textarea> element,
F: guessing based on bit patterns,
G: configuration files,
H: manual intervention
2.1 XML documents received from the HTTP server
A: the charset parameter of media types such as
text/xml and application/xml
B: the encoding declaration in XML documents
Note: RFC 3023 clearly says that A takes precedence over B.
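The precedence of A over B can be sketched in Python as follows. This is an illustration under stated assumptions, not a conforming implementation: the function names are hypothetical, the regular expression only recognizes declarations in ASCII-compatible encodings, and the default that RFC 3023 assigns to text/xml without a charset parameter (US-ASCII) is deliberately left out.

```python
import re
from email.message import Message
from typing import Optional

# Matches the encoding pseudo-attribute of an XML declaration. The pattern
# assumes an ASCII-compatible encoding, so it cannot see UTF-16 declarations.
_XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def charset_param(content_type: str) -> Optional[str]:
    """Method A: extract the charset parameter from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    value = msg.get_param("charset")
    return value if isinstance(value, str) else None


def xml_encoding(content_type: str, body: bytes) -> Optional[str]:
    """Apply A > B: the charset parameter wins over the encoding declaration."""
    from_header = charset_param(content_type)
    if from_header:
        return from_header
    match = _XML_DECL.match(body)
    return match.group(1).decode("ascii") if match else None
```

For example, a document that declares encoding="Shift_JIS" but is served as "application/xml; charset=utf-8" is treated as UTF-8 by this sketch, which is exactly the case where the two signatures conflict.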
2.2 HTML documents received from the HTTP server
A: the charset parameter of the media type
text/html
B: Meta tags
F: Some browsers sniff the charset.
H: the menu for choosing the encoding
Note: The HTML 4.01 recommendation blesses both A and B, but
RFC 2854 (text/html) strongly recommends A only. However,
RFC 2854 itself refers to HTML 4.
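A receiver that honours both A and B for HTML might look like the Python sketch below. It assumes the priority given in HTML 4.01 (HTTP charset parameter first, META declaration second) and leaves out sniffing (F) and the user menu (H). The names html_charset, _MetaScanner and _charset_of are hypothetical, and scanning only the first 1024 bytes is an arbitrary choice.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional


def _charset_of(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    value = msg.get_param("charset")
    return value if isinstance(value, str) else None


class _MetaScanner(HTMLParser):
    """Record the charset of the first <meta http-equiv="Content-Type">."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if (attributes.get("http-equiv") or "").lower() == "content-type" and content:
            self.charset = _charset_of(content)


def html_charset(content_type: str, body: bytes) -> Optional[str]:
    from_header = _charset_of(content_type)  # method A
    if from_header:
        return from_header
    scanner = _MetaScanner()
    # ISO-8859-1 maps every byte to a character, so this decode never fails.
    scanner.feed(body[:1024].decode("iso-8859-1"))
    return scanner.charset  # method B
```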
2.3 CSS stylesheets received from the HTTP server
A: the charset parameter of media types such as
text/css
B: @charset in CSS stylesheets
C: the attribute "charset" of LINK elements of HTML 4.01;
the charset attribute of the stylesheet-linking PI.
F: Some browsers sniff the charset.
Note: The CSS recommendation blesses both A and B,
but RFC 2318 (text/css) merely mentions A.
2.4 XSLT stylesheets received from the HTTP server
A: the charset parameter of media types such as
text/xml and application/xml
B: the encoding declaration in XML documents
C: the charset attribute of the stylesheet-linking PI
Note: Use of C is incorrect.
2.5 plain text received from the HTTP server
A: the charset parameter of the media type text/plain
F: Some browsers sniff the charset.
H: the menu for choosing the encoding
2.6 XML documents which are stored at the server but have
not been transmitted to the client yet
B: the encoding declaration in XML documents
G: Apache provides the directive AddCharset for configuration
files.
2.7 HTML documents which are stored at the server but have
not been transmitted to the client yet
B: META tags in this document
C: This document may be referenced by some anchor elements
of HTML 4.01, which specify the charset attribute.
G: Apache provides the directive AddCharset for configuration
files.
2.8 An HTML document that is generated at the server on the fly
but has not been transmitted to the client yet
B: META tags specified in this document
Note: Generating programs typically specify the charset
*TWICE*: once for the encoding of the output
stream and once for generating meta tags.
2.9 An HTML document temporarily created at the client by XSLT
B: META tags specified in this document
Note: The encoding attribute of xsl:output can specify the charset.
Moreover, when the output method is HTML, this attribute
also generates an appropriate META tag.
2.10 CSS stylesheets which are stored at the server but have
not been transmitted to the client yet
B: @charset in CSS stylesheets
C: the attribute "charset" of LINK elements of HTML 4.01;
the charset attribute of the stylesheet-linking PI.
G: Apache provides the directive AddCharset for configuration
files.
2.11 XSLT stylesheets which are stored at the server but have
not been transmitted to the client yet
B: the encoding declaration in XSLT stylesheets
C: the charset attribute of the stylesheet-linking PI.
G: Apache provides the directive AddCharset for configuration
files.
Note: Use of C is incorrect.
2.12 plain text which is stored at the server but has not been
transmitted to the client yet
G: Apache provides the directive AddCharset for configuration
files.
2.13 text typed in <textarea> or <input type="text"> of HTML and
transmitted via HTTP
A: Each part of a multipart/form-data entity should have the charset parameter.
E: As the charset of such text, browsers typically use the charset
of the HTML page.
Note: Unfortunately, the charset parameter for parts of multipart/form-data
is not widely implemented.
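For reference, a part that does carry the charset parameter could be serialized as in this Python sketch. The function name form_data_part is hypothetical, and the sketch assumes that the field name is plain ASCII without quotation marks.

```python
def form_data_part(boundary: str, name: str, value: str, charset: str) -> bytes:
    """Serialize one text field of a multipart/form-data body."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{name}"\r\n'
        # Method A: the part labels its own charset.
        f"Content-Type: text/plain; charset={charset}\r\n"
        "\r\n"
    )
    # The value bytes must use the charset announced in the part header.
    return head.encode("ascii") + value.encode(charset) + b"\r\n"
```

When the Content-Type line is omitted, as the note above says is common in practice, the receiver has nothing but method E to fall back on.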
2.14 file uploaded by <input type="file"> of HTML
A: Each part of a multipart/form-data entity should have the charset parameter.
Note: Unfortunately, the charset parameter for parts of multipart/form-data
is not widely implemented.
2.15 JavaScript, VBScript, etc. received from the HTTP server
C: Script elements of HTML 4.01 provide the charset attribute.
D: the charset of the referring resource (typically HTML)
F: Some browsers sniff the charset.
Note 1: Since there are no media types for such programming languages,
the charset parameter is not available.
Note 2: Since scripts in such programming languages contain
many ASCII characters and a small number of non-ASCII
characters, guessing almost always fails.
Note 3: The referring resource may be an HTML document
temporarily created by XSLT at the client side.
Even when users create everything in Shift_JIS,
the client creates UTF-16 HTML documents and assumes that
the referenced JavaScript is UTF-16.
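The difficulty described in Note 2 can be illustrated with a small Python sketch: a script that is mostly ASCII decodes without error under more than one charset, so a decoder has little evidence to choose between them. The candidate list and the function name plausible_charsets are illustrative assumptions, and real sniffers use statistical heuristics rather than this simple trial decoding.

```python
CANDIDATES = ["shift_jis", "euc_jp", "utf-8", "iso-8859-1"]  # illustrative list


def plausible_charsets(data: bytes, candidates=CANDIDATES) -> list:
    """Return every candidate charset under which data decodes without error."""
    plausible = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        plausible.append(name)
    return plausible


# A short script with one non-ASCII string literal, stored as Shift_JIS.
script = 'var greeting = "日本語"; alert(greeting);'.encode("shift_jis")
print(plausible_charsets(script))
```

A script with no non-ASCII characters at all passes every candidate in the list, which is the extreme case of the same problem.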
2.16 E-mail sent via SMTP
A: the charset parameter of MIME entities,
F: content sniffing
Note: The encoding of E-mail received by and stored at the
SMTP client is up to the mail program.
2.17 JSP pages
G: The pageEncoding attribute of the page directive of JSP 1.2.
3. Misc
3.1 Database
Typically, web servers are front ends for database systems.
Encoding issues will arise especially because legacy data
are in legacy encodings.
3.2 Content negotiation
We also have to consider content negotiation issues. If
configuration of the charset parameter is difficult, the
same thing applies to configuration for negotiation.
- charset negotiation,
- language negotiation,
- media type negotiation,
- CONNEG
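As a rough indication of what charset negotiation involves on the server side, the Python sketch below orders the charsets of an HTTP Accept-Charset header by their q values. The function name parse_accept_charset is hypothetical, and the sketch ignores details such as the "*" wildcard.

```python
def parse_accept_charset(header: str) -> list:
    """Return (charset, q) pairs sorted by descending q value."""
    preferences = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # default quality value when no q parameter is given
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q value as "not acceptable"
        preferences.append((name.strip().lower(), q))
    # sorted() is stable, so charsets with equal q keep their header order.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)
```

The server still needs to know the charset of each stored variant before it can match it against this list, which brings back the configuration problem described above.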
4. Concluding remarks
Unfortunately, the encoding issue is complicated, inconsistent, and
incomprehensible. Furthermore, different patch levels of WWW browsers
behave slightly differently. As a result, it is extremely difficult
to internationalize Web applications. Many WWW developers in Japan
suffer.
I agree that we have to change the current situation. However, I also
think that we can easily worsen the situation by shortsighted
"improvements". I believe that we strongly need a long-term plan.
In my understanding, I18N people at IETF and the I18N WG have always
believed in authoritative use of the charset parameter. I believe that a
long-term solution is to design an XML-based language for WWW server
configuration and to reference such configuration files from all
WWW technologies, and that we should avoid ad-hoc solutions wherever
possible. I feel that further promotion of meta tags, @charset,
and encoding declarations merely makes the situation worse.
P.S. I don't know which mailing list or working group is best
for this discussion. Probably, the I18N WG of W3C?
Cheers,
IBM Tokyo Research Lab / International University of Japan, Research Institute
MURATA Makoto (FAMILY Given)
-----------------------------------------------------------------------
[1] http://lists.w3.org/Archives/Public/www-tag/2002Jan/0177.html