On Mon, 2011-01-24 at 14:37 +0200, Jacobus Reyneke wrote:
Take any input file and output a similar output file. While doing so
however, look for text located between identifiable patterns. Surround
this text with tags.
If input file contains:
a b c d e f g h i j
Pattern description:
any string that follows the string "c d" and is followed by the
string "g h"
If pattern found:
Surround with <found-you>
Result:
a b c d<found-you> e f </found-you>g h i j
Others have mentioned some XSLT approaches, and that's generally a good
way to go. Of course, if you don't mind learning a programming
language, Perl is the king (or at least a princess) of transformations
where you don't yet have XML, but want to add markup. Use XML-aware
tools as early in the process as possible, though!
    while (<>) { # for each line of input
        s{c d\K e f (?=g h)}{ # replace with the value of...:
            element(
                "found-you", # element name
                $&, # what was matched (" e f " here)
                # optional attributes:
                "rule" => "31",
                "before" => "c d"
            )
        }e; # "e" flag means the replacement is an expression, not text
        print; # print the line whether or not it was changed
    }
Given the input
    a b c d e f g h
this produces (the attribute order may vary, since Perl hashes are
unordered)
    a b c d<found-you rule="31" before="c d"> e f </found-you>g h
To process a whole file at once, you can use the rather odd Perl idiom,
    my $text;
    {
        local $/; # slurp mode
        $text = <>;
    }
    # and then do the substitution:
    $text =~ s{as before}{as before}gme;
At that point you might (or might not) want to use \s+ rather than a
space between the tokens in the input, to match one or more whitespace
characters. Start by normalizing the text though -- look for lines
ending with spaces, for example, and trim them.
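
The normalizing step and the \s+ matching described above could be
sketched roughly as follows. The helper names normalize_text and
tag_between are hypothetical, and matching "anything up to g h" with a
lazy (.*?) is an assumption about what the pattern should capture; \K
needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: trim trailing spaces and tabs on every line.
sub normalize_text {
    my ($text) = @_;
    $text =~ s/[ \t]+$//mg;
    return $text;
}

# Hypothetical helper: tag whatever lies between "c d" and "g h",
# allowing any run of whitespace (including newlines) inside each marker.
sub tag_between {
    my ($text) = @_;
    $text =~ s{c\s+d\K(.*?)(?=g\s+h)}{<found-you>$1</found-you>}gs;
    return $text;
}

my $sample = "a b c  d e f   \ng h i j\n";
print tag_between(normalize_text($sample));
```

Trimming first means the match does not have to care about stray
spaces at line ends, only about whitespace between tokens.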
Adding an attribute showing which pattern put a tag in place can
considerably aid debugging the process. It also helps to be consistent
in your markup, e.g. *always* use double quotes for attribute values.
A simple definition of the "element" function follows - I have tried to
avoid "clever" Perl, and I have left a couple of items in place that
help debugging. For production it would probably also handle quoting
special characters (& < > in content) as well as (already done) " in
attribute values.
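
As a rough sketch of what the production version might look like,
here is a variant that escapes & < > and " and builds the attribute
list with join and map. The names escape_xml and safe_element are
hypothetical, and sorting the attribute names is an assumption made
only so that the output is repeatable.

```perl
use strict;
use warnings;

# Hypothetical helper: escape characters that are special in XML.
sub escape_xml {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;   # must come first, or the entities below get re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical variant of element() that escapes content and attribute
# values. Attributes are emitted in sorted order (an assumption).
sub safe_element {
    my ($name, $content, %attributes) = @_;
    my $atts = join "",
        map { sprintf ' %s="%s"', $_, escape_xml($attributes{$_}) }
        sort keys %attributes;
    return "<$name$atts>" . escape_xml($content) . "</$name>";
}

print safe_element("found-you", " R&D <draft> ",
                   "rule" => "31", "note" => 'say "hi"'), "\n";
```

Note that this escapes the matched text, so it is only appropriate
when the input is plain text rather than existing markup.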
It's relatively straightforward using this approach to get files that
can be processed further with XML tools, although even then I sometimes
use Perl, e.g. because of its more powerful regular expressions, or
because I can more easily check for filenames...
You could have a separate file of patterns that are loaded and matched
against. On Linux, run the command, perldoc perlre, for some
documentation.
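
One possible shape for such a patterns file, purely as an
illustration: one rule per line, tab-separated, giving a rule id, an
element name, the text before and the text after. This format and the
names load_rules and apply_rules are assumptions, not an established
convention. The sketch reads the rules from an in-memory filehandle
so that it is self-contained; a real script would open a file instead.

```perl
use strict;
use warnings;

# Hypothetical rule format, one rule per line, tab-separated:
#   rule-id <TAB> element-name <TAB> text-before <TAB> text-after
sub load_rules {
    my ($fh) = @_;
    my @rules;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*(#|$)/;    # skip comments and blank lines
        my ($id, $name, $before, $after) = split /\t/, $line;
        push @rules, { id => $id, name => $name,
                       before => $before, after => $after };
    }
    return @rules;
}

# Apply every rule in order; quotemeta makes the markers literal text.
sub apply_rules {
    my ($text, @rules) = @_;
    foreach my $r (@rules) {
        my ($id, $name) = ($r->{id}, $r->{name});
        my $before = quotemeta $r->{before};
        my $after  = quotemeta $r->{after};
        $text =~ s{$before\K(.*?)(?=$after)}{<$name rule="$id">$1</$name>}gs;
    }
    return $text;
}

my $rules_text = "# id\tname\tbefore\tafter\n31\tfound-you\tc d\tg h\n";
open my $fh, '<', \$rules_text or die "cannot open in-memory rules: $!";
my @rules = load_rules($fh);
close $fh;

print apply_rules("a b c d e f g h i j\n", @rules);
```

Recording the rule id in an attribute keeps the debugging aid
mentioned earlier, since each tag shows which line of the rules file
produced it.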
Liam
#! /usr/bin/perl -w
use warnings;
use strict;

# Quote an attribute value; the name is passed only to aid debugging.
sub quotedattvalue($$)
{
    my ($name, $value) = @_;
    # print STDERR "q $name, $value\n";
    $value =~ s/"/&quot;/g; # so we can safely use quotes
    return '"' . $value . '"';
}

sub element($$;%)
{
    my ($name, $content, %attributes) = @_;

    # make a list of att="value" pairs, each with a leading space:
    # (could use join and map to do this too more succinctly,
    # see perldoc -f map)
    # Hash order is not guaranteed; use "sort keys" for repeatable output.
    my $atts = "";
    if (%attributes) {
        foreach (keys %attributes) {
            $atts .= " " .
                $_ . '=' . quotedattvalue($_, $attributes{$_})
            ;
        }
    }
    return "<${name}${atts}>${content}</${name}>";
}

my $text;
{
    local $/; # slurp mode: read the whole input at once
    $text = <>;
}

$text =~ s{c d\K e f (?=g h)}{
    element(
        "found-you",
        $&,
        "rule" => "31",
        "before" => "c d"
    )
}gme;

print $text;
# end
--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org