Aw: Re: [xsl] Using 'collection'

Mark --

Just to make sure I'm understanding the problem right, you want to
extract two particular elements (<foo> and <bar>) from a large number
of large files. If I've got the problem wrong, you can stop reading
here :-)


First, nothing I say is meant to suggest that the previous suggestions
(XQuery against an XML database, or collection() with a Saxon
extension or streaming to avoid gobbling memory) are bad in any way.
They may well be better solutions than the following. But, especially
if this is a one-time extraction, you may find a command line XPath
tool (or XMLsh) very helpful.

XMLsh is a pretty complete XML processing environment that works like
a typical unix shell. I have not used it much, but I know it includes
the capability needed here. See http://www.xmlsh.org/ if interested.

There are a variety of commandline XPath utilities available that run
in your normal shell (e.g., bash). My favorite is `xmlstarlet`
(invoked with `xml` on some systems), so I'll use it as an example.
The examples below are run inside the bash shell.

  $ cd /path/to/dir/with/8000/xml/files/
  $ xmlstarlet sel -t -m "//foo|//bar" -c "." -n *.xml > all-foo-and-bar.txt

That command says "run an XSLT program that has a template (-t) that
matches all <foo> and <bar> (-m) and, for each, spit out a copy of
the element you matched (-c) followed by a newline (-n)". Notice the
output file is ".txt". That's because the output is not XML: it has
multiple elements at the top-most level, and thus is not well-formed.
If you just add a wrapper element by hand, you get XML. It is easy to
get xmlstarlet to wrap the <foo>s and <bar>s from a given file with an
element, even one that gives you the filename:

  $ xmlstarlet sel -t -e file -a fn -f -b \
               -m "//foo|//bar" -c "." -n *.xml > afab.txt

which adds "start with an element <file> (-e) that has an attribute
@fn (-a) that has a value of the current input file's path (-f)".
(The '-b' is a break that says "this is the end of the attribute
definition".) If xmlstarlet can also add a single wrapper element
around all the output, though, I don't know how.
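
One workaround outside xmlstarlet is to let the shell print the
wrapper. The following is only a sketch: `wrap_in_root` is a
hypothetical helper name, not part of xmlstarlet, and the result is
well-formed only if the piped-in text contains no XML declaration or
other stray top-level text.

```shell
# wrap_in_root NAME
# Hypothetical helper (not an xmlstarlet feature): copies stdin to
# stdout, surrounded by <NAME> and </NAME> on their own lines.
wrap_in_root() {
  printf '<%s>\n' "$1"
  cat
  printf '</%s>\n' "$1"
}
```

It would be used at the end of a pipeline, e.g.
`xmlstarlet sel -t -m "//foo|//bar" -c "." -n *.xml | wrap_in_root extract > afab.xml`,
where "extract" is just an illustrative element name.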

The program also has namespace support:

  $ xmlstarlet sel -N me=http://www.example.edu/SB/ns \
               -N you=http://www.example.org/MW/ns \
               -t -m "//me:foo|//you:bar" -c "." -n *.xml > afab.txt

But (AFAIK), there is no default namespace: an unprefixed name in the
XPath matches only elements in no namespace. (I.e., you're in XSLT 1.0
land, here. Which is, in fact, the case -- I think xmlstarlet just
converts the commandline into a small XSLT 1.0 program and runs it.)
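
To look at that generated stylesheet, the `sel` subcommand has a
-C (--comp) option that prints the XSLT instead of running it. A
small sketch, assuming your xmlstarlet build has that option;
`show_generated_xslt` is a hypothetical wrapper name:

```shell
# show_generated_xslt TEMPLATE-OPTIONS...
# Hypothetical wrapper: asks "xmlstarlet sel" to print the XSLT it
# would build from the given options (-C), rather than executing it.
show_generated_xslt() {
  if ! command -v xmlstarlet >/dev/null 2>&1; then
    echo "xmlstarlet not found on PATH" >&2
    return 127
  fi
  xmlstarlet sel -C "$@"
}
```

For example, `show_generated_xslt -t -m "//foo|//bar" -c "." -n` should
print the stylesheet behind the first command above; no input files
are needed for that.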

And, of course, you have full XPath 1.0 power in there. So if a <bar>
might be inside <foo>, and you don't want duplicates:

  $ xmlstarlet sel -t -m "//foo|//bar[not(ancestor::foo)]" \
               -c "." -n *.xml > afab.txt

And, if instead of getting a copy you just want the ID followed by a
colon, a space, and the text value:

  $ xmlstarlet sel -t -m "//foo|//bar" \
               -v "@xml:id" -o ": " -v "normalize-space(.)" \
               -n *.xml > afab.txt

You get the idea. Just the way I used to use Perl on the commandline
constantly for throw-away one-liners to manipulate plain text (and
still do, occasionally) I can use an XPath commandline tool to
manipulate XML.

HTH.

P.S. The output usually has namespace declarations on every element,
     which I often don't want. Thus I often pipe the output through
     | perl -pe 's; xmlns(:[A-Za-z0-9._-]+)?=[^ \t\n\r>]+;;g;'
     In fact I do that so often, I have that perl step aliased to the
     simple-to-write "nons" in my .bashrc file.
--~----------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
--~--