xsl-list

Re: [xsl] text extraction

2006-10-12 09:05:50
On 10/12/06, Abel Braaksma <abel(_dot_)online(_at_)xs4all(_dot_)nl> wrote:
Andrew Welch wrote:
> On 10/12/06, mus47(_at_)voila(_dot_)fr <mus47(_at_)voila(_dot_)fr> wrote:
>> I also want to know how the output file encoding can be set to
>> ISO-8859-1 instead of UTF-8.
>> I use the xsltproc tool.
>
> You can set the output encoding using the encoding attribute of
> <xsl:output/>, e.g. <xsl:output encoding="ISO-8859-1"/>

But it is not guaranteed that the processor supports anything different
from UTF-8/UTF-16.

Are you sure?  Interestingly the spec states:

"The value of the encoding attribute provides the value of the
encoding parameter to the serialization method. The default value is
implementation-defined, but in the case of the xml and xhtml methods
it must be either UTF-8 or UTF-16."

(http://www.w3.org/TR/xslt20/#element-output)

...which took me a little by surprise - It seems to say that when the
output method is xml or xhtml the encoding MUST be either UTF-8 or
UTF-16?  Saxon doesn't seem to mind...

Also note that the first 128 code points (0 to 127) are encoded
identically in ISO-8859-1 and UTF-8. Only code point 128 (the euro sign
in Windows-1252, which is often confused with ISO-8859-1, so you may see
something different: €) and above are treated differently.
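
As a rough illustration of the byte-level difference described above, here
is a small sketch in Python (not XSLT, and not anything xsltproc itself
runs). The sample characters and the helper name encode_or_none are
assumptions chosen for the example; the codecs are Python's standard
"iso-8859-1" and "utf-8".

```python
# Compare ISO-8859-1 and UTF-8 byte sequences for a few sample code points.
# The samples are arbitrary picks on either side of the 127/128 boundary.
samples = ["A", "\x7f", "\x80", "é", "€"]


def encode_or_none(ch, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent ch."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        return None


for ch in samples:
    latin1 = encode_or_none(ch, "iso-8859-1")
    utf8 = encode_or_none(ch, "utf-8")
    print(f"U+{ord(ch):04X}  iso-8859-1={latin1!r}  utf-8={utf8!r}  "
          f"same={latin1 == utf8}")
```

Code points up to U+007F come out as the same single byte in both
encodings. From U+0080 upward UTF-8 switches to multi-byte sequences, and
the euro sign (U+20AC) cannot be encoded in ISO-8859-1 at all.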

Note that the ISO-8859-1 character repertoire is far smaller than that of
UTF-8 (256 characters versus all of Unicode), so you may end up with
missing or replaced characters (not sure what they will be replaced with
though, when they don't exist) in the output stream.

No, you don't end up with missing or replaced characters... Any
characters not in the encoding should be output as character
references.  It's a well-known technique to use an output encoding of
US-ASCII so that all non-ASCII characters get output as character
references, which gets around encoding problems when the file is read
further down the pipe.
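
The effect can be sketched outside XSLT as well. The snippet below is only
an analogy, assuming Python's built-in "xmlcharrefreplace" codec error
handler as a stand-in for what an XSLT serializer does with characters
missing from the output encoding; the sample string is made up for the
example.

```python
import html

# Sample text with one character that ISO-8859-1 can hold (é) and one it
# cannot (the euro sign).
text = "café costs 5 €"

# US-ASCII output: every non-ASCII character becomes a numeric reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)   # b'caf&#233; costs 5 &#8364;'

# ISO-8859-1 output: é is written as a raw byte, only the euro sign is escaped.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(latin1_bytes)  # b'caf\xe9 costs 5 &#8364;'

# A consumer that resolves character references recovers the original text.
restored = html.unescape(ascii_bytes.decode("ascii"))
print(restored == text)  # True
```

Because the ASCII form contains only bytes below 128, it reads the same
whether a downstream tool assumes US-ASCII, ISO-8859-1 or UTF-8.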

cheers
andrew
