Re: [xsl] Safe-guarding codepoints-to-string() from wrong input

Andrew Welch wrote:


If you are receiving strings containing literal control characters
then they're almost definitely encoded in Windows-1252 - just parse
them using that and you'll be ok.

No, that's not it. The codepoints are encoded using literal numeric
hexadecimal strings (compare &#x0A; in XML, which would be [0A] in the
original example).


If the string contains control characters as character references,
then it's a bit harder because the references get expanded using
Unicode codepoints, and not those specified in the Windows-1252
mappings...  So you need to parse/serialize the string to expand the
references (I personally use JTidy with the CharEncoding set to
Configuration.RAW, which forces Tidy to output the bytes instead of
a reference)

It's a pain....


Well, that's encouraging ;)

The project contains strings that are "escaped" in several ways (texts are literal):

C-style:     \x0ASome text \x22between quotes\x22
Local style: <0A>Some text <22>between quotes<22>
Other style: Text with <22,24,54> multiple special chars
XML-like:    &0A;Some text &22;between quotes&22;
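
To make the four styles concrete, here is a rough sketch of pulling the hexadecimal codepoints out of such strings. It is in Python rather than XSLT only to keep it self-contained; the same patterns should carry over to xsl:analyze-string. The names ESCAPE and extract_codepoints are made up for illustration, and the sketch assumes every codepoint is written as exactly two hex digits, as in the examples above. Real data may not be that tidy.

```python
import re

# One alternative per escape style from the examples above.
# Assumption: each codepoint is exactly two hexadecimal digits.
ESCAPE = re.compile(
    r"\\x([0-9A-Fa-f]{2})"                      # C-style: \x0A
    r"|<([0-9A-Fa-f]{2}(?:,[0-9A-Fa-f]{2})*)>"  # <0A> or <22,24,54>
    r"|&([0-9A-Fa-f]{2});"                      # XML-like: &0A;
)


def extract_codepoints(text):
    """Split text into literal strings and lists of integer codepoints."""
    parts = []
    pos = 0
    for m in ESCAPE.finditer(text):
        if m.start() > pos:
            # Literal text between the previous escape and this one.
            parts.append(text[pos:m.start()])
        # Exactly one of the three groups matched; it holds the hex digits.
        hex_digits = m.group(1) or m.group(2) or m.group(3)
        parts.append([int(h, 16) for h in hex_digits.split(",")])
        pos = m.end()
    if pos < len(text):
        parts.append(text[pos:])
    return parts


print(extract_codepoints("Text with <22,24,54> multiple special chars"))
```

Keeping the codepoints as integers at this stage means the decision about which ones are legal can be made separately, before anything is turned back into characters.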

In short: the input is rubbish. But we know for a fact how to get the codepoints. However, in the past, users have made mistakes. The original application simply ignored those mistakes, replacing the illegal codepoints with nothingness.
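
For what it's worth, the "replace with nothingness" behaviour could look roughly like the sketch below (Python again, purely as illustration). It assumes that "illegal" means "outside the XML 1.0 Char production", which is the condition under which codepoints-to-string() raises an error. The names is_xml_char and safe_codepoints_to_string are hypothetical, and the original application may well have used a different rule.

```python
def is_xml_char(cp):
    """True if cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def safe_codepoints_to_string(codepoints):
    """Build a string from codepoints, silently dropping illegal ones.

    Assumption: "illegal" means not a legal XML 1.0 character.
    """
    return "".join(chr(cp) for cp in codepoints if is_xml_char(cp))


# 0x00 and 0x1F are not legal XML 1.0 characters, so they are dropped.
print(repr(safe_codepoints_to_string([0x22, 0x00, 0x41, 0x1F, 0x0A])))
```

In XSLT 2.0 the same idea would be a predicate on the codepoint sequence before it is handed to codepoints-to-string().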


The good news is: all codepoints are Unicode codepoints.

Thanks,
-- Abel

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--