xsl-list

Parsing plain text - xml application specifying parser

2005-09-11 00:23:02
Hi.

Is it feasible to use an XML application to specify a parser
that, when translated into XSLT 2.0, turns plain text into XML
according to that specification? Is something like this
expected for XSL 3.0, skipping the use of a separate
XML application?

My searches on Google and through this list's archives
didn't turn up any information on this approach. My
next step is to hack at it myself, but my knowledge of
how parsers work is minimal. If XSLT can mimic a
parser, though, this might work as a two-step process:

parser_specification.xml + parser_application.xsl ->
parser.xsl 

parser.xsl + plain.txt -> fully_parsed.xml

Maybe plain.txt is read with XSLT 2.0's unparsed-text()
function (document() only loads XML), applied to a
parameter passed to parser.xsl when it's processed.
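
As a rough illustration of the two steps (specification -> parser, then parser + text -> XML), here is a minimal sketch in Python rather than XSLT. It is only an assumption about how such a translation might work, not a real tool: it handles a cut-down specification containing just the "entity" rule, and the names compile_spec and apply_parsers are made up for the example.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical, cut-down specification: only the "entity" rule from the post.
SPEC = """<specification>
  <rule name="entity"><either_or>cow<or/>herd<or/>farm<or/>herd-owner<or/>farm-owner</either_or></rule>
</specification>"""

def compile_spec(spec_xml):
    """Step 1: turn each literal either_or rule into a (tag name, regex) pair."""
    parsers = []
    for rule in ET.fromstring(spec_xml).iter("rule"):
        choice = rule.find("either_or")
        if choice is None:
            continue
        # Alternatives are the text before the first <or/> plus the text
        # that follows each <or/> (its "tail" in ElementTree terms).
        words = [choice.text or ""] + [sep.tail or "" for sep in choice.findall("or")]
        # Longest alternatives first, so "herd-owner" is tried before "herd".
        words = sorted((w.strip() for w in words if w.strip()), key=len, reverse=True)
        # The lookarounds keep a match from starting or ending inside a longer word.
        pattern = r"(?<![\w-])(?:" + "|".join(map(re.escape, words)) + r")(?![\w-])"
        parsers.append((rule.get("name"), re.compile(pattern)))
    return parsers

def apply_parsers(parsers, text):
    """Step 2: wrap every match in an element named after its rule."""
    out = escape(text)  # protect &, < and > in the plain text
    for name, regex in parsers:
        out = regex.sub(lambda m: "<%s>%s</%s>" % (name, m.group(0), name), out)
    return out

parsers = compile_spec(SPEC)
result = apply_parsers(
    parsers,
    "About each herd, we can remember its name, its herd-owner, and its farm.",
)
print(result)
```

This only tags literal words. With more than one rule, a later pattern could match inside tags written by an earlier one, so a real translator would need to build a proper tree instead of substituting in a string.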

This idea was sparked for me by:
- reading an online article (that I've lost) that
  discusses how an XML file preserves the parse tree
  using its tags
- Michael Kay's writing (in his XSLT 2.0 Programmer's
  Reference) about analyzing plain text for hidden
  structure using XSLT 2.0 regex.

It seems like a natural fit to me that XSLT could do
this directly, turning plain text into XML without
difficulty. I wouldn't be surprised if this approach
(or something much better) is slated for a later XSLT
release.

In case it helps explain what I mean, below is an
artificial example parser_specification.xml file that
transforms an input plain.txt file into a
fully_parsed.xml file. I'm just a student, not a
programming expert. If the example is raw or just
plain awful, sorry.

Anyway, I'll appreciate any information that anyone
can provide.

-Noah
-------------------------------------------------------


-----------parser_specification.xml--------------------
<?xml version="1.0"?>
<specification ignore-white-space="yes">

<first-rule name="entities">
<either_or><rule name="identifier_listing"
/><or/><rule name="descriptor_listing" /></either_or>
</first-rule>

<rule name="identifier_listing">Each <rule
name="entity" /> is identified by <optional>the
combination of</optional><rule name="descriptors"
/><optional> and <rule name="descriptors"
/></optional>
</rule>

<rule name="descriptor_listing">About each <rule
name="entity" />, we can remember <rule
name="descriptors" count="1+" /><optional> and <rule
name="descriptors" /></optional>
</rule>

<rule name="descriptors" tag-output="no">
its <rule name="descriptor" count="1"
tag-output="yes"/><either_or>,<or/>.</either_or>
</rule>

<rule name="descriptor">
<either_or><rule name="entity" or-preference="1"
/><or/><rule name="attribute" or-preference="2"
/></either_or>
</rule>

<rule
name="entity"><either_or>cow<or/>herd<or/>farm<or/>herd-owner<or/>farm-owner</either_or>
</rule>

<rule name="attribute"><regex value="\w[[:alnum:]]*\w"
/>
</rule>

</specification>
-------------------------------------------------------
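
One possible reading of the "descriptor" rule above is that or-preference="1" means "try entity first, and fall back to attribute". The sketch below shows that reading in Python. It is an assumption, not part of the specification: it treats a descriptor as the single word after "its", and the name tag_descriptors is invented for the example.

```python
import re

# Entity names taken from the "entity" rule in the specification.
ENTITIES = {"cow", "herd", "farm", "herd-owner", "farm-owner"}

# Assumption: a descriptor is the single word (hyphens allowed) after "its".
DESCRIPTOR = re.compile(r"\b(its\s+)([\w-]+)")

def tag_descriptors(sentence):
    """Tag each word after "its" as an entity if known, else as an attribute."""
    def wrap(match):
        prefix, word = match.group(1), match.group(2)
        # Preference 1: entity. Preference 2: attribute.
        tag = "entity" if word in ENTITIES else "attribute"
        return "%s<%s>%s</%s>" % (prefix, tag, word, tag)
    return DESCRIPTOR.sub(wrap, sentence)

tagged = tag_descriptors(
    "About each cow, we can remember its name, its breed, its weight, and its herd."
)
print(tagged)
```

The entity after "each" is left untagged here, because this sketch covers only the "descriptors" part of a sentence.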


----------------------plain.txt------------------------
About each cow, we can remember its name, its breed,
its weight, and its herd.
Each cow is identified by the combination of its name,
and its herd.
About each herd, we can remember its name, its
herd-owner, and its farm.
Each herd is identified by the combination of its
name, and its farm.
About each farm we can remember its farm-owner, its
name, and...
.
.
.
-------------------------------------------------------


------------------fully_parsed.xml---------------------
<?xml version="1.0"?>
<entities>

<descriptor_listing>About each <entity>cow</entity>,
we can remember its <attribute>name</attribute>, its
<attribute>breed</attribute>, its
<attribute>weight</attribute>, and its
<entity>herd</entity>.
</descriptor_listing>

<identifier_listing>Each <entity>cow</entity> is
identified by the combination of its
<attribute>name</attribute>, and its
<entity>herd</entity>.</identifier_listing>

<descriptor_listing>About each <entity>herd</entity>,
we can remember its <attribute>name</attribute>, its
<entity>herd-owner</entity>, and its
<entity>farm</entity>.</descriptor_listing>

<identifier_listing>Each <entity>herd</entity> is
identified by the combination of its
<attribute>name</attribute>, and its
<entity>farm</entity>.</identifier_listing>

<descriptor_listing>About each <entity>farm</entity>
we can remember its <entity>farm-owner</entity>, its
<attribute>name</attribute>, and 
.
.
.
</entities>
-------------------------------------------------------

