Re: [xsl] Function for determining one XPath as subset of another

Thanks to everyone for their comments and suggestions :-) I am
cross-posting from xquery-talk my results for anyone who is
interested...

Okay so for those that are interested, here is the solution that I
ended up with:

1) I wrote my own XPath 2 parser as I wanted a simplified AST to start
operating from. The project I am working on needs to be under GPL2 and
I could not find a decent Java library that was compatible with that
license - https://github.com/exquery/xpath2-parser/

2) I then wrote some utility functions for descending through the Path
expressions and making subset comparisons -
https://github.com/adamretter/TEI-Completer/blob/master/src/main/java/org/humanistika/oxygen/tei/completer/XPathUtil.java#L79
Yes, this has many limitations and is most likely not complete yet,
but it serves me well for the small subset of XPath path expressions
that I want to support.

3) You can see in my test-cases how I consider one path expression as
a subset of another path expression:
https://github.com/adamretter/TEI-Completer/blob/master/src/test/java/org/humanistika/oxygen/tei/completer/XPathUtilTest.java#L40


On 26 January 2016 at 21:35, Liam R. E. Quin liam(_at_)w3(_dot_)org
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

On Tue, 2016-01-26 at 16:15 +0000, Adam Retter
adam(_dot_)retter(_at_)googlemail(_dot_)com wrote:

Given two simple XPaths, say:

1. //w

2. /x/y/z/w[@a = 'v']
[...]
I wondered if there might be an algorithm or library that someone
already had or has written which might be able to give me the answer?


There's quite a bit of literature on optimizing XPath in the context of
databases (e.g. in proc. VLDB, in "XML-Based Data Management and
Multimedia Engineering" and elsewhere) but I don't know of a single
coherent summary.

This may be at least in part because what's useful and practical
depends on the implementation and the situation.

For example, a system with an element occurrence index might solve //w
in O(n_w) time, where n_w is the number of occurrences of w elements in
the corpus being searched; one using a tree structure might do better
with the explicit path.

A path like //a//b//c//d is much harder using tree navigation, but
rewritten as d[ancestor::c[ancestor::b[ancestor::a]]] against an
element index will likely run in time close to O(n_d log D) where D is
the average depth of the tree. A system smart enough to notice that
there are only three "c" elements in the corpus could do better than
O(n_d) though (my text retrieval system did/does that to speed up
multi-word phrase searches, for example).

We encountered each of these implementation strategies during the
development of XQuery.  The cost of building an element index for a
non-database-backed XPath implementation is in some systems amortized
by doing it during parsing, while the input XML is being read; systems
that run XPath against a stored or in-memory tree might not have that
luxury, or might have the even greater luxury of a precomputed index.

I don't know of anything off the shelf, and although open source XQuery
implementations might be a place to look, it's likely that
optimizations are done on an internal representation of the XPath
expression.


I realise that I can only probably cover a subset of XPath itself,
but
it is only the path steps with predicates which I am interested in.

Ideally I am looking for something in Java.




-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk
--~----------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
EasyUnsubscribe: http://lists.mulberrytech.com/unsub/xsl-list/1167547
or by email: xsl-list-unsub(_at_)lists(_dot_)mulberrytech(_dot_)com
--~--