Hello everybody,
I have the following problem:
I need to find all one- and two-character words in my
document, like "L", "GG", "Bz", "mm", but also entities
representing a single character, like "∇" (an inverted
delta), etc. Since any such combination is possible,
enumerating them would make a very long list. Once found,
I'd like to surround each of these strings with an element,
like this:
<sym>L</sym>, <sym>GG</sym>, <sym>∇</sym> ...
Furthermore, I am not interested in these character
sequences when they occur inside certain elements. For
example, in <xref refid="abc">Part B</xref> I do not want
to tag the "B". There's a limited number of such exclusions.
My understanding is that handling this in XSLT (1.0, at
least) is not possible. I cannot currently switch to 2.0,
so I thought the best way would be to use regular
expressions (as an ant task) that accomplish the same
goal.
While I have no trouble creating a regex that finds all
one- or two-character words, I have not yet found a way to
express the contextual constraints.
The following is a "pseudo regex" expressing this idea:
------8<------
not following <xref[^>]+> or <syd1[^>]+> ...
(.*)
< ==> word start
([a-zA-Z] | [a-zA-Z][a-zA-Z]) ==> target
> ==> word end
(.*)
not before </xref> or </syd1> ...
------8<------
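To make the idea more concrete, here is a minimal sketch in
Python of the behaviour I am after. It is only an
illustration, not the ant task itself, and it does not use
a single regex: it splits the input into tags and text runs
and keeps a counter for the excluded elements. The names
EXCLUDED, SYMBOL_ENTITIES and tag_symbols are made up for
this example, and it assumes simple markup (no ">" inside
attribute values, no CDATA sections).

```python
import re

# Assumption: elements whose text content must be left untouched.
EXCLUDED = {"xref", "syd1"}
# Assumption: entity references that count as one-character symbols.
SYMBOL_ENTITIES = {"&nabla;"}

# Splits markup into tags (kept by the capturing group) and text runs.
TAG = re.compile(r"(<[^>]+>)")
# Reads the optional "/" and the element name at the start of a tag.
TAG_NAME = re.compile(r"<(/?)([A-Za-z_][\w.-]*)")
# An entity reference, a one- or two-letter word, or a literal nabla.
# Entities are matched first so that "lt" in "&lt;" is not seen as a word.
TARGET = re.compile(r"&#?\w+;|\b[A-Za-z]{1,2}\b|∇")


def wrap(match):
    token = match.group(0)
    # Leave ordinary entity references such as &lt; or &amp; alone.
    if token.startswith("&") and token not in SYMBOL_ENTITIES:
        return token
    return "<sym>" + token + "</sym>"


def tag_symbols(markup):
    out = []
    depth = 0  # number of excluded elements currently open
    for piece in TAG.split(markup):
        if piece.startswith("<"):
            m = TAG_NAME.match(piece)
            # Self-closing tags do not change the nesting depth.
            if m and m.group(2) in EXCLUDED and not piece.endswith("/>"):
                depth += -1 if m.group(1) else 1
            out.append(piece)
        elif depth == 0:
            out.append(TARGET.sub(wrap, piece))
        else:
            out.append(piece)  # text inside an excluded element
    return "".join(out)


sample = '<p>Force L and GG, see <xref refid="abc">Part B</xref> ∇</p>'
print(tag_symbols(sample))
# <p>Force <sym>L</sym> and <sym>GG</sym>, see <xref refid="abc">Part B</xref> <sym>∇</sym></p>
```

Note that ordinary short words such as "a" or "is" would be
tagged as well, so a real solution would probably need a
stop list on top of this.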
Again, I am aware that this may be regarded as off-topic.
Also, if there is an XSL-based solution, or a different
approach altogether, I would be happy to learn about it.
Thanks in advance.
Cheers,
Jakob.