xsl-list

Re: Fw: Select entire XML doc [FURTHER]

2003-02-28 15:15:10
Wow... that was "overwhelmingly" excellent.
Karl

Errr... I think I shall learn how to post XML from the client using
JavaScript and the XML DOM ;)

Karl


----- Original Message -----
From: "Mike Brown" <mike(_at_)skew(_dot_)org>
To: <xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com>
Sent: Friday, February 28, 2003 2:56 PM
Subject: Re: [xsl] Fw: Select entire XML doc [FURTHER]


Karl Stubsjoen wrote:
Wow... that was most awesome.  Thanks for the help, it really made a lot
of sense.  And indeed, I do need to be careful of HTML tags becoming
malformed. Once the XML has been properly serialized in a textarea
element, what is the proper way to deserialize it?

Do you mean you want to turn

<someXmlData>&lt;tag&gt;chardata&lt;/tag&gt;</someXmlData>

into

<someXmlData><tag>chardata</tag></someXmlData>

?

...This is a FAQ and is generally beyond the scope of what XML should be
used for, or what XSLT can do without extension functions. But if you
insist, you will need to write an extension function that takes the
content of the someXmlData element (or any string, really), passes it
into an XML parser, and converts the parser's results to a node-set or
result tree fragment. See your XSLT processor docs for how to write an
extension function (it varies). Your processor may already have such a
function available (but likely not).
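
As a rough illustration of what such an extension function would do
internally, here is a sketch outside XSLT, in Python with the
standard-library ElementTree parser. The variable names are made up for
this example, and how you would hook this into a given XSLT processor is
not shown, since that varies. The point is only that the escaped content
is a plain string that gets handed to a parser a second time:

```python
import xml.etree.ElementTree as ET

# The outer document carries the inner XML as escaped character data.
outer = ET.fromstring(
    "<someXmlData>&lt;tag&gt;chardata&lt;/tag&gt;</someXmlData>"
)

# After the first parse, the text content is the unescaped markup string.
inner_markup = outer.text

# Second parse: turn that string into real element nodes.
inner = ET.fromstring(inner_markup)

# Replace the text with the parsed element.
outer.text = None
outer.append(inner)

result = ET.tostring(outer, encoding="unicode")
print(result)
```

This only works when the string is well-formed XML with a single root
element; anything else makes the second parse raise a ParseError.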

Or do you mean after the HTML has been rendered in the browser, and the
user submits the form having the textarea with the possibly-edited XML?
That's a whole 'nother can of worms, due to encoding issues, which I am
all too happy to write about, although it is technically off-topic for
this list.

First, in general, you should not be passing XML around in HTML form
data if the intent is to have a general-purpose XML editing system,
although as long as you stick to pure ASCII, or just treat it as an
uneditable binary file, then things should be fine.

The problems begin with how form data is handled. A browser transmits
the form data, which is Unicode, encoded as if it were going into a URL.
This means that certain characters in the ASCII range (code points 0 to
127) and all characters beyond the ASCII range (code points 128 to
1114111) are first encoded as bytes, and each byte is then represented
by the ASCII characters "%xx", where xx is the hexadecimal value of the
byte. The ASCII-range characters always use the us-ascii encoding as the
basis for the %-escaping, while the non-ASCII characters typically (it's
not enforced by any standard) use the encoding *of the HTML document
containing the form from which this data was submitted*.

So for example, if you have in your textarea the character data
"¡Hola amigo!", and the HTML with the form was utf-8 encoded, and the
browser user didn't override the interpreted encoding on their end, then
the form will be submitted using utf-8 as the basis for the %-escaped
form data:

  %C2%A1Hola%20amigo!

whereas if the HTML were iso-8859-1 encoded, it would be coming through as

  %A1Hola%20amigo!
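
If you want to see the two escapings side by side, a small Python
sketch reproduces them. Here urllib.parse.quote() merely stands in for
the browser (real browsers commonly send "+" rather than "%20" for a
space in form data), and the variable names are mine:

```python
from urllib.parse import quote

text = "\u00a1Hola amigo!"  # the string "¡Hola amigo!"

# safe="!" leaves the exclamation mark unescaped.
# The encoding argument picks the bytes that get %-escaped.
as_utf8 = quote(text, safe="!", encoding="utf-8")
as_latin1 = quote(text, safe="!", encoding="iso-8859-1")

print(as_utf8)    # %C2%A1Hola%20amigo!
print(as_latin1)  # %A1Hola%20amigo!
```

The inverted exclamation mark is U+00A1, which is the single byte A1 in
iso-8859-1 and the two bytes C2 A1 in utf-8.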

On the receiving end, the form data needs to be decoded. Most servers
provide an API for receiving decoded form data in your application, be
it CGI environment variables or getParameter() methods on HTTP request
objects or what have you. But since most browsers do not communicate the
details of what encoding they used as the basis for the %-escaping, the
server makes a guess, and usually guesses wrong. So for example, while

   %C2%A1Hola%20amigo!

unambiguously means bytes

   C2 A1 48 6F 6C 61 20 61 6D 69 67 6F 21

...the API might mistakenly assume that these are iso-8859-1 and will
decode it for you into the string "Â¡Hola amigo!". In fact, this happens
quite often. So you'll have to be prepared to transcode: re-encode the
string using the same encoding that the server assumed, and then decode
it using the encoding that you know the HTML form used (you might send
the latter in a hidden form field). Either that, or pull the raw data
out of the HTTP request and properly decode it yourself.
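
Both repair routes can be sketched in a few lines of Python. This
assumes the form was utf-8 and the server API wrongly assumed
iso-8859-1; unquote() plays the part of the server API here, and the
variable names are illustrative only:

```python
from urllib.parse import unquote, unquote_to_bytes

submitted = "%C2%A1Hola%20amigo!"

# What a server API hands you if it wrongly assumes iso-8859-1.
misdecoded = unquote(submitted, encoding="iso-8859-1")

# Route 1, transcode: re-encode with the encoding the server assumed,
# then decode with the encoding the form actually used.
repaired = misdecoded.encode("iso-8859-1").decode("utf-8")

# Route 2: take the raw bytes and decode them yourself.
raw = unquote_to_bytes(submitted)
decoded = raw.decode("utf-8")

print(repr(misdecoded))
print(repaired)
print(decoded)
```

Route 1 is lossless only because iso-8859-1 maps every byte value to a
character; if the server had assumed an encoding with unmapped bytes,
the round trip could not be relied on and route 2 would be the safer
choice.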

Once you have the properly decoded string, you can feed it to an XML
parser as a Unicode string, so that the parser will ignore the encoding
declaration in the XML's prolog. If you were to feed the raw bytes (the
C2 A1 48 etc. above) to the parser, you would have to declare the
encoding externally, because there's a chance that the declaration in
the prolog has become inaccurate while it was edited and re-encoded.
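
A sketch of both cases, using Python's standard-library ElementTree
(which sits on expat). The document and names are invented for the
example, and this describes how this particular parser behaves; other
parsers may treat a Unicode string that carries an encoding declaration
differently:

```python
import xml.etree.ElementTree as ET

# A prolog that claims iso-8859-1, although the bytes will be utf-8.
doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    "<msg>\u00a1Hola amigo!</msg>"
)

# Case 1: feed the already-decoded Unicode string.
# The stale declaration in the prolog is not used for decoding.
from_text = ET.fromstring(doc)

# Case 2: feed raw utf-8 bytes and declare the encoding externally,
# overriding whatever the prolog says.
raw = doc.encode("utf-8")
parser = ET.XMLParser(encoding="utf-8")
parser.feed(raw)
from_bytes = parser.close()

print(from_text.text)
print(from_bytes.text)
```

Feeding the same bytes without the external encoding would leave the
parser to trust the prolog, which is exactly the failure described
above.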

You didn't know what you were getting into, did you? Like I said, in
general, HTML forms and the server-side APIs for processing them are
just not equipped to be a general-purpose XML editing system, at least
not in an idiot-proof way. The culprits are really HTTP and MIME; HTML
is just working around their restrictions. And browser vendors choose
the path of least disruption, choosing not to implement some of HTML's
features that could easily work around some of these issues (e.g., they
do have a way of transmitting encoding info, but they just don't do it,
to "keep people's scripts from breaking").

--
  Mike J. Brown   |  http://skew.org/~mike/resume/
  Denver, CO, USA |  http://skew.org/xml/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list



