xsl-list

Re: [xsl] Character 150 with Windows-1252 output

2006-04-20 13:52:35
I should probably expand a little.

An author has intended to write a dash in a document that exists in a
CMS and gets saved as XML.  How the author has written that dash I
don't know (probably a cut and paste from somewhere), but the actual
bytes in the XML file show that the character 150 is in there:

foo–bar

In between foo and bar is the character 0x96 (#150) - I don't know how
it will be handled by the mailers.

From what I've read, character references are always resolved using
Unicode code points, so #150 becomes U+0096 "START OF GUARDED AREA",
which is one of the C1 controls.  Those are allowed in XML 1.0 but must
be escaped in XML 1.1.

So at this point the author's dash is no more - when the XML is parsed
the dash has become a non-displayed control character that does
nothing.  Confusingly though, if it gets serialised back out as #150,
both IE and Firefox render it as a dash - even for XHTML...
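
A possible explanation for the browser behaviour, offered as an
assumption rather than something established in this thread: HTML
parsers treat numeric references in the 128-159 range as Windows-1252
for compatibility, and XHTML served as text/html goes through that same
parser.  Python's html.unescape follows the HTML5 rules, so it can serve
as a rough stand-in for what a browser does with the reference:

```python
import html

# An HTML-style parser resolving the numeric reference &#150;.
# Under the HTML5 rules, references 128-159 are remapped via Windows-1252.
as_html = html.unescape("foo&#150;bar")

dash = as_html[3]
print(hex(ord(dash)))  # 0x2013, i.e. EN DASH rather than U+0096
```

A conforming XML parser does no such remapping, which would account for
the same reference meaning two different things depending on who reads
it.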

Compounding the issue, the XML prolog states the encoding to be
ISO-8859-1 (which doesn't contain #150) whereas I think the actual
encoding is Windows-1252 (which does contain #150) - I would've
expected the parser to complain about the byte not being in the
encoding, but it seems fine with it.
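
A small sketch of why a decoder might not complain, using Python's
codecs as a stand-in for the XML parser (an assumption; the parser in
question may behave differently).  Python's iso-8859-1 codec maps every
byte to the code point of the same value, so 0x96 decodes without error
to the C1 control, whereas cp1252 decodes it to the dash:

```python
import unicodedata

raw = b"foo\x96bar"  # the bytes as they appear in the file

as_latin1 = raw.decode("iso-8859-1")  # what the prolog claims
as_cp1252 = raw.decode("cp1252")      # what the author's tool probably wrote

latin1_char = as_latin1[3]
cp1252_char = as_cp1252[3]

# General category "Cc" means a control character.
print(hex(ord(latin1_char)), unicodedata.category(latin1_char))  # 0x96 Cc
print(hex(ord(cp1252_char)), unicodedata.name(cp1252_char))      # 0x2013 EN DASH
```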

I can deal with the #150 by replacing it with #8211, as I think it's
safe to assume that anywhere #150 is used in an XML document the real
intention was for #8211, and then get it fixed at source, so there
isn't really a problem, just lots of confusion (which is par for the
course when it comes to encoding).
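
The replacement could look something like this in Python (a minimal
sketch only; fix_dash and C1_DASH_FIX are made-up names, and it assumes
every U+0096 in the parsed text really was meant as an en dash):

```python
# Map the C1 control U+0096 (#150) to EN DASH U+2013 (#8211).
C1_DASH_FIX = {0x96: 0x2013}

def fix_dash(text: str) -> str:
    """Replace every U+0096 in already-decoded text with an en dash."""
    return text.translate(C1_DASH_FIX)

# Simulate what the parser hands back when it trusts the ISO-8859-1 prolog.
parsed = b"foo\x96bar".decode("iso-8859-1")
fixed = fix_dash(parsed)

# Serialise with a character reference so the output encoding no longer matters.
print(fixed.encode("ascii", "xmlcharrefreplace"))  # b'foo&#8211;bar'
```

Writing the dash out as a numeric reference sidesteps the question of
whether the output encoding has a byte for it.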

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--