Re: [xsl] Text based stage play scripts to XML

On Mon, 2011-01-24 at 14:37 +0200, Jacobus Reyneke wrote:

Take any input file and output a similar output file. While doing so
however, look for text located between identifiable patterns. Surround
this text with tags.

If input file contains:
a b c d e f g h i j

Pattern description:
any string that follows after the string "c d" and is followed by the
string "g h"

If pattern found:
Surround with <found-you>

Result:
a b c d<found-you> e f </found-you>g h i j


Others have mentioned some XSLT approaches, and that's generally a good
way to go.  Of course, if you don't mind learning a programming
language, Perl is the king (or at least a princess) of transformations
where you don't yet have XML, but want to add markup. Use XML-aware
tools as early in the process as possible, though!

while (<>) { # for each line of input
    s{c d\K e f (?=g h)}{ # replace with the value of...:
        element(
            "found-you",  # element name
            $&,           # what was matched (" e f " here)
            # optional attributes:
            "rule" => "31",
            "before" => "c d"
        )
    }e;  # "e" flag means the replacement is an expression, not text

    print; # print the line whether or not it was changed
}

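Given the input a b c d e f g h
this produces
a b c d<found-you rule="31" before="c d"> e f </found-you>g h
(the order of the two attributes may vary, since Perl does not keep
hash keys in any particular order)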

To process a whole file at once, you can use the rather odd Perl idiom:
my $text;
{
    local $/; # slurp mode
    $text = <>;
}

# and then do the substitution:
$text =~ s{as before}{as before}gme;

At that point you might (or might not) want to use \s+ rather than a
space between the tokens in the input, to match one or more whitespace
characters.  Start by normalizing the text though -- look for lines
ending with spaces, for example, and trim them.
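
As a rough sketch of those two steps (trimming first, then matching with
\s+), something like the following should work. The name normalize_text
and the sample input are made up for illustration, and the tag is
written out literally so the fragment runs without the element function.
Note that \s also matches a newline, which is usually what you want once
the whole file is in one string.

```perl
use strict;
use warnings;

# Hypothetical helper: strip trailing spaces and tabs from every line.
sub normalize_text {
    my ($text) = @_;
    $text =~ s/[ \t]+$//mg;   # /m lets $ match at the end of each line
    return $text;
}

# Sample input with irregular spacing, a tab, and trailing spaces.
my $text = normalize_text("a b c  d   e f\tg h   \ni j\n");

# \s+ between the tokens, so any run of whitespace still matches.
$text =~ s{c\s+d\K\s+e\s+f\s+(?=g\s+h)}{<found-you>$&</found-you>}g;

print $text;
```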

Adding an attribute showing which pattern put a tag in place can
considerably aid debugging the process.  It also helps to be consistent
in your markup, e.g. *always* use double quotes for attribute values.
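
One way to keep the attributes consistent is to build them in a single
place. This is only a sketch: attributes_to_string is a hypothetical
name, and sorting the keys is my assumption, made so that the output is
the same on every run.

```perl
use strict;
use warnings;

# Hypothetical helper: turn name/value pairs into ' name="value"' strings,
# sorted by name and always using double quotes.
sub attributes_to_string {
    my (%attributes) = @_;
    return join "", map {
        my $value = $attributes{$_};
        $value =~ s/&/&amp;/g;    # escape & first, before adding entities
        $value =~ s/</&lt;/g;
        $value =~ s/"/&quot;/g;   # so double quotes are always safe
        qq{ $_="$value"};
    } sort keys %attributes;
}

print attributes_to_string("rule" => "31", "before" => "c d"), "\n";
```

With the two attributes used in this message, that prints the "before"
attribute first and then "rule", each with a leading space.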

A simple definition of the "element" function follows - I have tried to
avoid "clever" Perl, and I have left a couple of items in place that
help debugging.  For production it should probably also escape the
special characters & < > in content, in addition to the " in attribute
values that it already handles.
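
A minimal sketch of that content escaping might look like this. The
name escape_content is hypothetical and is not part of the script in
this message. It assumes the content is still plain text; if an earlier
rule has already added tags to it, escaping again would damage them, so
it is probably safest to escape once, before any markup is added.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special
# in XML element content.
sub escape_content {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # first, or the entities below get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print escape_content("Tom & Jerry <exit>"), "\n";
```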

It's relatively straightforward using this approach to get files that
can be processed further with XML tools, although even then I sometimes
use Perl, e.g. because of its more powerful regular expressions, or
because I can more easily check for filenames...

You could have a separate file of patterns that are loaded and matched
against. On Linux, run the command perldoc perlre for documentation on
Perl regular expressions.
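
Here is one possible shape for such a patterns file, as a sketch only.
The file format (element name, a tab, then the regular expression), the
names load_rules and apply_rules, and the use of the line number as the
"rule" attribute are all assumptions of mine. The patterns are read from
a string here so that the fragment is self-contained; a real script
would open a file instead.

```perl
use strict;
use warnings;

# Assumed format: "element-name<TAB>regular expression", one rule per
# line; blank lines and lines starting with # are skipped.
sub load_rules {
    my ($fh) = @_;
    my @rules;
    my $line_number = 0;
    while (my $line = <$fh>) {
        $line_number++;
        chomp $line;
        next if $line =~ /^\s*$/;   # blank line
        next if $line =~ /^\s*#/;   # comment
        my ($name, $pattern) = split /\t/, $line, 2;
        push @rules, [ $name, qr/$pattern/, $line_number ];
    }
    return @rules;
}

# Apply every rule in turn, recording which rule added each tag.
sub apply_rules {
    my ($text, @rules) = @_;
    foreach my $rule (@rules) {
        my ($name, $regex, $number) = @$rule;
        $text =~ s{$regex}{<$name rule="$number">$&</$name>}g;
    }
    return $text;
}

# Stand-in for the patterns file: a comment line, then one rule.
my $patterns = "# name\tpattern\nfound-you\tc d\\K e f (?=g h)\n";
open my $fh, '<', \$patterns or die "cannot read patterns: $!";
my @rules = load_rules($fh);
close $fh;

print apply_rules("a b c d e f g h i j\n", @rules);
```

Rules are applied in file order, so a later pattern sees the tags added
by earlier ones; that may or may not be what you want.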

Liam

#! /usr/bin/perl -w
use warnings;
use strict;

sub element($$;%)
{
    my ($name, $content, %attributes) = @_;

    sub quotedattvalue($$)
    {
        my ($name, $value) = @_;

        # print STDERR "q $name, $value\n";
        $value =~ s/"/\&quot;/g; # so we can safely use quotes
        return '"' . $value . '"';
    }

    # make a list of att="value" pairs, each with a leading space:
    # (could use join and map to do this too more succinctly,
    # see perldoc -f map)
    my $atts = "";
    if (%attributes) {
        foreach (keys %attributes) {
            $atts .= " " .
                $_ . '=' .  quotedattvalue($_, $attributes{$_})
            ;
        }
    }

    return "<${name}${atts}>${content}</${name}>";
}

my $text;
{
    local $/;
    $text = <>;
};

$text =~ s{c d\K e f (?=g h)}{
        element(
            "found-you",
            $&,
            "rule" => "31",
            "before" => "c d"
        )
    }gme;
print $text;

# end

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org

