Re: [xsl] text replacement with mixed content

2011-08-31 10:56:00
On Wed, 2011-08-31 at 13:23 +0200, Geert Bormans wrote:
[...]
> - there will be no tags inside words, though I have found non
> breaking spaces and soft hyphens at unpleasant locations
> something I have to take into account when I dynamically generate the
> regular expression
> - there will be no matching across paragraphs (I can rely on some 5
> or 6 elements that can have patterns to be matched, but will bear
> them completely)

In doing document up-conversion from plain text/OCR output to XML, I
tend to use a mix of languages - this particular problem is more about
the text than the markup, and I'd probably use Perl rather than XSLT.
However, I *would* run XML validation (or at least well-formedness
checking) on the output!

Some techniques that may help --

Temporarily removing markup:
    $text =~ s{
       (<[^>]+>)
    }{
       hide($1)
    }xeg;

where hide() is a function like this:

    my @stash;

    sub hide($)
    {
        my ($input) = @_;

        # remember the tag, and return a marker carrying its index
        push @stash, $input;
        return '///' . $#stash . '___';
    }

And of course we can restore the markup like this:

    $text =~ s{
       ///(\d+)___
    }{
       $stash[$1]
    }xeg;

(It's a good idea to check that the ///NNN___ marker does not occur in
the input first, and also that no marker is left in the output!)

Now you can handle phrases easily, since you have the constraint that
tags don't occur in the middle of words:

    $text =~ s{
      \b # match only at a word boundary
        (    # save in a group
          black
          (?: # non-capturing group
              (?:///\d+___)  # a hidden tag
             | \s
          )+  # so, any amount of space or hidden tags
          socks
         )\b
    }{
      elem("glamorous", $1)
    }xeg;

and then do the unhiding.

Here, elem is a function to make an XML element:

    sub elem($$)
    {
        my ($name, $content) = @_;

        # interpolate the element name into both tags
        return "<$name>" . $content . "</$name>";
    }

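You could use
    {<glamorous>$1</glamorous>} as the replacement instead (in which
case drop the "e" flag, since the replacement is then plain text
rather than an expression).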


The same techniques work in other languages, although Perl's regular
expressions have the advantage of (1) the "x" flag, allowing whitespace
and comments, and (2) the "e" flag, allowing expressions instead of
text.
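
As an illustration of the same technique in another language, here is a
minimal Python sketch of the hide / match / restore sequence. It is a
port for illustration only, not code I have used in production: the
function name mark_up and the sample inputs are assumptions, and the
marker is written ///N/// rather than ///N___ so that it ends in a
non-word character and \b still matches when a hidden tag sits directly
before the phrase. Python's re.VERBOSE plays the role of the "x" flag,
and passing a function to re.sub plays the role of "e".

```python
import re

stash = []  # hidden tags, indexed by marker number


def hide(match):
    """Stash one tag and return a numbered marker for it."""
    stash.append(match.group(1))
    return "///%d///" % (len(stash) - 1)


def elem(name, content):
    """Wrap content in an XML element called name."""
    return "<%s>%s</%s>" % (name, content, name)


PHRASE = re.compile(r"""
    \b                    # match only at a word boundary
    (                     # save in a group
      black
      (?:                 # non-capturing group
          ///\d+///       # a hidden tag
        | \s
      )+                  # so, any amount of space or hidden tags
      socks
    )\b
""", re.VERBOSE)


def mark_up(text):
    """Hide the tags, wrap the phrase, then restore the tags."""
    if "///" in text:
        raise ValueError("marker sequence already present in input")
    del stash[:]
    text = re.sub(r"(<[^>]+>)", hide, text)
    text = PHRASE.sub(lambda m: elem("glamorous", m.group(1)), text)
    text = re.sub(r"///(\d+)///", lambda m: stash[int(m.group(1))], text)
    return text


print(mark_up("<p>He wore black <br/>socks.</p>"))
```

Note that a tag hidden inside the matched phrase ends up inside the new
element, so an unbalanced start or end tag there gives ill-formed
output; that is one reason to run a well-formedness check afterwards.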

The command "perldoc perlre" (on Linux, OS X and Unix) gives (a lot)
more information, although you may need to install perl-doc.

I find that doing this sort of change in XSLT or XQuery can lead to a
lot of confusion, but I'm not as clear-thinking as some of the others on
this list, I suppose.

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/

