xsl-list

Re: German character set problem(Umlaute)

2002-12-19 15:58:50
Thanks for your detailed answer. Indeed, I hadn't noticed that the form was UTF-8 encoded. I inserted <xsl:output method="html" encoding="iso-8859-1"/> into the XSL file, and everything works fine.

Greetings,
Andreas



Mike Brown wrote:
Andreas Schlegel wrote:

Hi,

we have the following problem with our internet application.
If the user enters "müller" in a plain HTML form, the server (Java servlets with Tomcat 4.0.3) gets "müller".


Not always. The encoding of the HTML document containing the form determines
(by convention, not standard) how the form data is escaped and sent in the
HTTP request to the servlet.

So if your HTML with the form contains <meta http-equiv="Content-Type"
content="text/html;charset=utf-8"> and the user hasn't overridden the encoding
in their browser, then the form is submitted with data encoded like
m%C3%BCller because byte pair C3 BC is how ü is represented in UTF-8. If the
form is iso-8859-1 encoded then you get m%FCller, because byte FC is how ü is
represented in iso-8859-1.
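
To make the two escapings concrete, here is a minimal sketch that uses java.net.URLEncoder to approximate what a browser does. The class and method names are made up for illustration, and the name is written as "m\u00FCller" so the result doesn't depend on how the source file itself is encoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEscapingDemo {

    // Percent-escape a form value using the given charset, roughly as a
    // browser would for a page served in that charset.
    static String escape(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 and ISO-8859-1 are required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String name = "m\u00FCller"; // "müller"
        System.out.println(escape(name, "UTF-8"));      // m%C3%BCller
        System.out.println(escape(name, "ISO-8859-1")); // m%FCller
    }
}
```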

In the request, there's typically no indication of which encoding was used as
the basis for the %-escaping, so when converting this data to a String for
access as a "parameter" of the request, Tomcat makes a guess: iso-8859-1, last
I checked (someone correct me if they've changed it). "Parameter" is a heavily
overloaded term; I try not to use it when talking about HTML form data.

So as long as your HTML form is iso-8859-1 encoded and the user isn't doing
anything unusual, Tomcat tells you that it got a String like "m\u00FCller".


If the user enters the same text in an HTML form that was generated by the TransformerFactory of the package javax.xml.transform (j2sdk1.4.0_01), the server receives the String "mÃ?ller"!


Apparently your form is UTF-8 encoded, and the browser knows that, and
is sending the data like m%C3%BCller. Tomcat doesn't know about UTF-8
being used, so it thinks C3 and BC are iso-8859-1 bytes that map to
separate characters.
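
The mismatch can be reproduced outside Tomcat. The sketch below uses java.net.URLDecoder as a stand-in for the container's parameter decoding; that substitution is an assumption, since Tomcat's own code path is not shown here. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class WrongCharsetDemo {

    // Decode percent-escaped form data using the given charset.
    static String decode(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 and ISO-8859-1 are required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String sent = "m%C3%BCller"; // what the browser sends for a UTF-8 form

        // Wrong guess: each of the bytes C3 and BC becomes its own character,
        // so the String has 7 characters instead of 6.
        String wrong = decode(sent, "ISO-8859-1");
        System.out.println(wrong.length()); // 7

        // Right charset: C3 BC is decoded as the single character U+00FC.
        String right = decode(sent, "UTF-8");
        System.out.println(right.length()); // 6
    }
}
```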

Either change your transformation to output the HTML form as iso-8859-1, or
have your servlet re-encode the String as iso-8859-1 bytes, then decode it
back into a String using utf-8.
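
The second option could look roughly like this. It is only a sketch: the helper name is made up, and it assumes the container really did decode the request as iso-8859-1. If it used some other charset, this round trip will not recover the original text.

```java
import java.nio.charset.StandardCharsets;

public class ParameterRepair {

    // Undo an iso-8859-1 mis-decoding of UTF-8 form data: turn the String
    // back into the original bytes, then decode those bytes as UTF-8.
    static String latin1ToUtf8(String misdecoded) {
        if (misdecoded == null) {
            return null; // parameter was absent from the request
        }
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fromContainer = "m\u00C3\u00BCller"; // what the servlet sees
        String repaired = latin1ToUtf8(fromContainer);
        System.out.println(repaired.equals("m\u00FCller")); // true
    }
}
```

In a servlet this would be applied to each value read from the request, e.g. latin1ToUtf8(request.getParameter("name")), where "name" is a placeholder field name.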

Mike




XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


