Re: Fw: Select entire XML doc [FURTHER]

Wow... that was "overwhelmingly" excellent.
Karl

Errr... I think I shall learn how to post XML from the client using
javascript and the XML dom ; )

Karl


----- Original Message -----
From: "Mike Brown" <mike(_at_)skew(_dot_)org>
To: <xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com>
Sent: Friday, February 28, 2003 2:56 PM
Subject: Re: [xsl] Fw: Select entire XML doc [FURTHER]

Karl Stubsjoen wrote:

Wow... that was most awesome.  Thanks for the help, it really made a lot of
sense.  And indeed, I do need to be careful of HTML tags becoming malformed.
Once the XML has been properly serialized in a text area element, what is the
proper way to deserialize it?


Do you mean you want to turn

<someXmlData>&lt;tag&gt;chardata&lt;/tag&gt;</someXmlData>

into

<someXmlData><tag>chardata</tag></someXmlData>

?

...This is a FAQ and is generally beyond the scope of what XML should be used
for, or what XSLT can do without extension functions. But if you insist, you
will need to write an extension function that takes the content of the
someXmlData element (or any string, really), passes it into an XML parser, and
converts the parser's results to a node-set or result tree fragment. See your
XSLT processor docs for how to write an extension function (it varies). Your
processor may already have such a function available (but likely not).
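
To make the "parse the string content into nodes" step concrete, here is a
minimal sketch in Python rather than in any particular XSLT processor's
extension API. The function name parse_embedded_xml is made up for this
illustration, and it assumes the element's text is a single well-formed XML
fragment with one root element.

```python
import xml.etree.ElementTree as ET

def parse_embedded_xml(wrapper_xml: str) -> ET.Element:
    """Replace an element's escaped-markup text with real child nodes.

    Hypothetical helper: assumes the text content is one well-formed
    XML fragment with a single root element.
    """
    wrapper = ET.fromstring(wrapper_xml)
    # The first parse has already turned &lt; and &gt; back into < and >,
    # so the text content is now ordinary markup held in a string.
    inner_markup = wrapper.text or ""
    inner = ET.fromstring(inner_markup)  # second parse: string -> nodes
    wrapper.text = None
    wrapper.append(inner)
    return wrapper

doc = parse_embedded_xml(
    "<someXmlData>&lt;tag&gt;chardata&lt;/tag&gt;</someXmlData>"
)
result = ET.tostring(doc, encoding="unicode")
print(result)
```

An XSLT extension function would do the same two things (parse the string,
hand back nodes), but how it is registered and what it returns depends on
your processor.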

Or do you mean after the HTML has been rendered in the browser, and the user
submits the form having the textarea with the possibly-edited XML? That's a
whole 'nother can of worms, due to encoding issues, which I am all too happy
to write about, although it is technically off-topic for this list.

First, in general, you should not be passing XML around in HTML form data, if
the intent is to have a general-purpose XML editing system, although as long
as you stick to pure ASCII, or just treat it as an uneditable binary file,
then things should be fine.

The problems begin with how form data is handled. A browser transmits the form
data, which is Unicode, encoded as if it were going into a URL. This means
that certain characters in the ASCII range (code points 0 to 127) and all
characters beyond the ASCII range (code points 128 to 1114111) are first
encoded as bytes, and each byte is then represented by the ASCII characters
"%xx", where xx is the hexadecimal value of that byte. The ASCII-range
characters always use the us-ascii encoding as the basis for the %-escaping,
while the non-ASCII characters typically (it's not enforced by any standard)
use the encoding *of the HTML document containing the form from which this
data was submitted*.

So for example if you have in your textarea the character data "¡Hola amigo!",
and the HTML with the form was utf-8 encoded, and the browser user didn't
override the interpreted encoding on their end, then the form will be
submitted using utf-8 as the basis for the %-escaped form data:

  %C2%A1Hola%20amigo!

whereas if the HTML were iso-8859-1 encoded, it would be coming through as

  %A1Hola%20amigo!
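
If you want to see the two escapings side by side, here is a small Python
sketch. It uses urllib.parse.quote as a stand-in for what the browser does;
keeping "!" unescaped via the safe argument is my choice so the output lines
up with the examples in this message.

```python
from urllib.parse import quote

text = "¡Hola amigo!"  # "¡" is code point U+00A1

# Same characters, two different byte encodings as the basis for %-escaping.
as_utf8 = quote(text, safe="!", encoding="utf-8")
as_latin1 = quote(text, safe="!", encoding="iso-8859-1")

print(as_utf8)    # escaped form when the page was utf-8
print(as_latin1)  # escaped form when the page was iso-8859-1
```

The ASCII part ("Hola", the space, the "!") comes out identical either way;
only the non-ASCII character differs.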

On the receiving end, the form data needs to be decoded. Most servers provide
an API for receiving decoded form data in your application, be it CGI
environment variables or getParameter() methods on HTTP request objects or
what have you. But since most browsers do not communicate the details of what
encoding they used as the basis for the %-escaping, the server makes a guess,
and usually guesses wrong. So for example, while

   %C2%A1Hola%20amigo!

unambiguously means bytes

   C2 A1 48 6F 6C 61 20 61 6D 69 67 6F 21

...the API might mistakenly assume that these are iso-8859-1 and will decode
it for you into the string "Â¡Hola amigo!". In fact, this happens quite often.
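
The wrong guess is easy to reproduce. A rough Python sketch, with
unquote_to_bytes playing the role of the server's %-unescaping step:

```python
from urllib.parse import unquote_to_bytes

# Undo the %-escaping only; at this point we just have bytes.
raw = unquote_to_bytes("%C2%A1Hola%20amigo!")
hex_dump = raw.hex(" ").upper()

right = raw.decode("utf-8")       # what the browser actually meant
wrong = raw.decode("iso-8859-1")  # what a server guessing iso-8859-1 yields

print(hex_dump)
print(right)
print(wrong)
```

Note that the iso-8859-1 decode never fails, since every byte maps to some
character, so nothing warns you that the guess was wrong.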

So you'll have to be prepared to transcode: re-encode the string using the
same encoding that the server assumed, and then decode it using the encoding
that you know the HTML form used (you might send the latter in a hidden form
field). Either that, or pull the raw data out of the HTTP request and properly
decode it yourself.
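
The transcoding step might look something like this in Python. The function
name transcode and the form_charset variable (standing for the value of a
hidden form field) are invented for the example; you still have to know or
find out which encoding your server assumed.

```python
def transcode(misdecoded: str, assumed_by_server: str, actual: str) -> str:
    """Undo a wrong decode: back to the original bytes, then decode properly."""
    original_bytes = misdecoded.encode(assumed_by_server)
    return original_bytes.decode(actual)

# Hypothetical: the form carried its own encoding in a hidden field.
form_charset = "utf-8"

# "\u00c2\u00a1" is the two-character mojibake a server produces when it
# reads the utf-8 bytes C2 A1 as iso-8859-1.
fixed = transcode("\u00c2\u00a1Hola amigo!", "iso-8859-1", form_charset)
print(fixed)
```

This round trip is only lossless when the server's assumed encoding maps
every byte to a character, as iso-8859-1 does.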

Once you have the properly decoded string, you can feed it to an XML parser as
a Unicode string, so that the parser will ignore the encoding declaration in
the XML's prolog. If you were to feed the raw bytes (the C2 A1 48 etc above)
to the parser, you would have to declare the encoding externally, because
there's a chance that the declaration in the prolog has become inaccurate
while it was edited and reencoded.
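
Here is a sketch of both options using Python's xml.etree.ElementTree, which
is just one parser among many; check how yours treats Unicode strings and
external encoding overrides. The sample document, with a prolog that claims
iso-8859-1 while the bytes are really utf-8, is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Properly decoded text whose prolog is now stale.
xml_text = ('<?xml version="1.0" encoding="iso-8859-1"?>'
            '<greeting>\u00a1Hola amigo!</greeting>')

# Option 1: feed the Unicode string; the declared encoding is not applied.
from_string = ET.fromstring(xml_text).text

# Option 2: feed raw bytes and state the real encoding externally.
raw_bytes = xml_text.encode("utf-8")
parser = ET.XMLParser(encoding="utf-8")  # overrides the prolog
parser.feed(raw_bytes)
from_bytes = parser.close().text

# For contrast: raw bytes with no override, so the stale prolog wins.
trusting_prolog = ET.fromstring(raw_bytes).text

print(from_string, from_bytes, trusting_prolog, sep="\n")
```

The third parse succeeds without complaint and simply hands back mangled
text, which is why an inaccurate prolog is so easy to miss.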

You didn't know what you were getting into, did you? Like I said, in general,
HTML forms and the server-side APIs for processing them are just not equipped
to be a general-purpose XML editing system, at least not in an idiot-proof
way. The culprits are really HTTP and MIME; HTML is just working around their
restrictions. And browser vendors choose the path of least disruption,
choosing not to implement some of HTML's features that could easily work
around some of these issues (e.g., they do have a way of transmitting encoding
info, but they just don't do it, to "keep people's scripts from breaking").


--
  Mike J. Brown   |  http://skew.org/~mike/resume/
  Denver, CO, USA |  http://skew.org/xml/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


