RE: [xsl] accessing the input XML's doctype

2008-07-17 07:51:31
Thanks everyone for your response.  

Darcy - Fortunately, I have the meat of the transform working (accepting
splits and joins, too).  The article looks interesting.  

David - I like the idea of default attributes, but ideally I want the
transform to be truly universal.  Maybe the transform could first check
for those attributes and, if they don't exist, fall back to my current
plain-text parsing method.

Michael - Writing a custom SAX filter is a bit beyond my current
abilities, but it would be a good learning project when I have time.

If I ever get anything more sophisticated or elegant working, I'll post
it to the list.

Thanks,

-James





-----Original Message-----
From: Michael Kay [mailto:mike(_at_)saxonica(_dot_)com] 
Sent: Wednesday, July 16, 2008 6:08 PM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: RE: [xsl] accessing the input XML's doctype

One thing you could try doing - I've had it in mind for years - is to
write a filter between the XML parser and the XSLT processor, using SAX
interfaces, that gets notification of the DTD events from the parser and
translates them into things the XSLT processor understands, like
elements and attributes in some special namespace.

This seems much cleaner architecturally than reading the document as
unparsed text and trying to parse it yourself.

Michael Kay
http://www.saxonica.com/ 
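
A rough sketch of the kind of filter described above, offered as an illustration only and not taken from any message in this thread. It assumes a JAXP SAX parser that supports the standard SAX2 lexical-handler property, and it copies the DOCTYPE identifiers onto the root element as attributes. The class name DoctypeFilter, the helper injectedDoctype, the namespace URI, and the "dt" prefix are all invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Sketch: copies the DOCTYPE identifiers onto the root element as namespaced attributes. */
class DoctypeFilter extends XMLFilterImpl implements LexicalHandler {
    // Hypothetical namespace and prefix, invented for this sketch.
    static final String NS = "http://example.com/ns/doctype";
    static final String PREFIX = "dt";

    private String publicId;
    private String systemId;
    private int depth;

    DoctypeFilter(XMLReader parent) throws SAXException {
        super(parent);
        // Standard SAX2 property for receiving DTD and other lexical events.
        parent.setProperty("http://xml.org/sax/properties/lexical-handler", this);
    }

    @Override
    public void startDocument() throws SAXException {
        // Reset state so the filter can be reused for another parse.
        publicId = null;
        systemId = null;
        depth = 0;
        super.startDocument();
    }

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        // Reported by the parser before the root element's startElement event.
        this.publicId = publicId;
        this.systemId = systemId;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depth++ == 0) {
            // Root element: add the identifiers as extra attributes.
            AttributesImpl extended = new AttributesImpl(atts);
            if (publicId != null) {
                extended.addAttribute(NS, "public", PREFIX + ":public", "CDATA", publicId);
            }
            if (systemId != null) {
                extended.addAttribute(NS, "system", PREFIX + ":system", "CDATA", systemId);
            }
            super.startPrefixMapping(PREFIX, NS);
            super.startElement(uri, local, qName, extended);
        } else {
            super.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        super.endElement(uri, local, qName);
        if (--depth == 0) {
            super.endPrefixMapping(PREFIX);
        }
    }

    // The remaining LexicalHandler events are not needed here.
    @Override public void endDTD() {}
    @Override public void startEntity(String name) {}
    @Override public void endEntity(String name) {}
    @Override public void startCDATA() {}
    @Override public void endCDATA() {}
    @Override public void comment(char[] ch, int start, int length) {}

    /** Parses xml through the filter; returns the injected {public, system} values. */
    static String[] injectedDoctype(String xml) {
        final String[] found = new String[2];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            DoctypeFilter filter = new DoctypeFilter(factory.newSAXParser().getXMLReader());
            // Demo only: supply an empty DTD so the real one is never fetched.
            // A real run would drop this, or DTD default attributes are lost.
            filter.setEntityResolver((pub, sys) -> new InputSource(new StringReader("")));
            filter.setContentHandler(new DefaultHandler() {
                private boolean first = true;
                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes atts) {
                    if (first) {
                        first = false;
                        found[0] = atts.getValue(NS, "public");
                        found[1] = atts.getValue(NS, "system");
                    }
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE book PUBLIC \"-//Example//DTD Book//EN\" "
                   + "\"http://example.com/dtd/book.dtd\"><book/>";
        String[] ids = injectedDoctype(xml);
        System.out.println(ids[0] + " | " + ids[1]);
    }
}
```

To feed a transformation, the filter could presumably be wrapped in a javax.xml.transform.sax.SAXSource and passed to the processor in place of the raw file. The stylesheet would then read the two attributes from the root element and would need to avoid copying them to the output.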

-----Original Message-----
From: James Sulak [mailto:jsulak(_at_)jonesmcclure(_dot_)com] 
Sent: 16 July 2008 20:40
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] accessing the input XML's doctype

Hello All,

I'm trying to write a transform that gives the output XML 
file the same document type as the input XML file.  
(Specifically, it's a transform to remove Arbortext Editor's 
change-tracking markup).  I'm not happy with the method I'm 
using now, namely regexing the input XML as an unparsed 
document to extract the public and system identifiers from 
the doctype declaration.

I have fairly limited knowledge of how an XSLT processor (we're using
Saxon) interacts with the XML parser.  But my understanding 
is that the parser reads in the XML, resolves any default 
attribute values, and then passes the document tree to the 
XSLT processor.  The XSLT processor itself doesn't know or 
care about the doctype information.  Is this correct?

If it is, that would seem to imply that what I'm asking is 
impossible without writing an extension function.  Is this 
the case?  Since our implementation is already dependent on 
several Saxon extension functions, that's an acceptable 
solution.  Has anyone attempted anything like this, or have 
any suggestions on how to proceed?  Could I call Xerces (or 
another parser) from an extension function and get the public 
and system identifiers?

Here's the relevant part of my current method:

   <xsl:param name="doctype.public"
              select="f:input-doctype(document-uri(.))[1]"/>
   <xsl:param name="doctype.system"
              select="f:input-doctype(document-uri(.))[2]"/>

   <xsl:function name="f:input-doctype">
      <xsl:param name="document-uri"/>
      <xsl:variable name="unparsed-document"
                    select="unparsed-text($document-uri)"/>
      <xsl:variable name="regex">
         <xsl:text>DOCTYPE
                                 [\s]*
                                 ([a-zA-Z0-9]+)
                                 [\s]*
                                 PUBLIC
                                 [\s]*
                                 "(.+)"
                                 [\s]*
                                 "([0-9a-zA-Z/]+\.dtd)"
         </xsl:text>
      </xsl:variable>
      <xsl:analyze-string select="$unparsed-document" regex="{$regex}"
                          flags="msx">
         <xsl:matching-substring>
            <xsl:sequence select="regex-group(2), regex-group(3)"/>
         </xsl:matching-substring>
      </xsl:analyze-string>
   </xsl:function>

   <xsl:output method="xml" version="1.0" encoding="utf-8"/>

   <xsl:template match="/">
      <xsl:result-document doctype-public="{$doctype.public}"
                           doctype-system="{$doctype.system}">
         <xsl:apply-templates/>
      </xsl:result-document>
   </xsl:template>


Thanks,

-James


-----
James Sulak
Electronic Publishing Developer
Jones McClure Publishing




--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--