John Macdonald wrote:
> So, it looks like my memory is faulty - I had thought that there was
> at least one example that had text between a pair of tags <xx> and
> </xx> where the meaning of xx was such that the text should *not* be
> considered for matching purposes. Unless the third line could be
> written as:
> Perl<footnote> Programming Perl, Wall et al., isbn="1-56592-149-6"
> </footnote> is terrific
No, I think your memory is correct, since I remember it too. :-)
It may have been an example generated on the fly by Tim that wasn't
on a slide. I also remember that the gist of this part of the
discussion was trying to highlight that there were some basic
issues in XML parsing that the XML community isn't too clear
about yet. Like these.
> I have to retract my objection. (But if there *can* be such
> non-text, it is a problem.)
I think your objection stands, since no one has figured out how to
concisely describe the difference between
Perl <footnote>...</footnote> is terrific
and
perl <footnote>...</footnote> is <emph>not</emph> terrific
The problem is that you, the reader, attach different meanings to
<footnote>, which you want to ignore, and <emph>, which you want
to retain. How do you tell the parser/RE engine that you want
to ignore entire blocks of text based on the metadata which
describes them, yet keep other blocks of text based on theirs?
Furthermore, how does anyone propose to do that without at least some
knowledge of the tagset in use?
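To make the problem concrete, here is a rough sketch of doing it by
hand. match_ignoring() is a made-up helper, not an existing module,
and it is deliberately naive: it assumes ignored elements never nest
inside themselves and that no attribute value contains '>'. Note that
the caller still has to name the tags to drop, which is exactly the
"knowledge of the tagset" issue.

```perl
use strict;
use warnings;

# Hypothetical helper: match $re against $text after dropping every
# element named in @$ignore (tags and content) and unwrapping all
# other tags (markup removed, content kept).
sub match_ignoring {
    my ($text, $re, $ignore) = @_;
    for my $tag (@$ignore) {
        # Drop <tag ...>...</tag> entirely, e.g. a footnote.
        $text =~ s{<\Q$tag\E\b[^>]*>.*?</\Q$tag\E\s*>}{}gis;
    }
    $text =~ s{<[^>]+>}{}g;    # unwrap what is left, e.g. <emph>
    $text =~ s{\s+}{ }g;       # collapse whitespace left behind
    return $text =~ $re ? 1 : 0;
}

my $re     = qr/perl is terrific/i;
my @ignore = ('footnote');

for my $text (
    'Perl <footnote>...</footnote> is terrific',
    'perl <footnote>...</footnote> is <emph>not</emph> terrific',
) {
    my $verdict = match_ignoring($text, $re, \@ignore) ? 'match' : 'no match';
    print "$verdict: $text\n";
}
```

The first string matches because the footnote disappears; the second
does not because the content of <emph> is kept.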
Turning on metadata processing so that processing instructions (PIs)
and comments can be ignored isn't simple, either. When deleting blocks
of content
that contain comments, do the comments get deleted too? Do they get
marked as dirty and swept up/kept if desired? I could easily see
a case where I'd want to delete blocks of XML/HTML (including markup)
and keep the comments, so that I can later position replacement text
(and markup) relative to those comments, and then delete the
comments later.
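A minimal sketch of that delete-but-keep-comments case, under the
same naive assumptions (no nested same-named elements, and no
"<tag" text hiding inside a comment). The helper names and the
"slot:1" comment are invented for illustration.

```perl
use strict;
use warnings;

# Return only the comments found in a chunk of markup, concatenated.
sub comments_in {
    my ($chunk) = @_;
    return join '', $chunk =~ /(<!--.*?-->)/gs;
}

# Hypothetical helper: delete every <$tag>...</$tag> block but leave
# the comments that were inside it in place, as anchors for later.
sub delete_block_keep_comments {
    my ($text, $tag) = @_;
    $text =~ s{<\Q$tag\E\b[^>]*>(.*?)</\Q$tag\E\s*>}{comments_in($1)}gise;
    return $text;
}

my $doc      = '<p>Intro</p><aside>old <!-- slot:1 --> text</aside><p>End</p>';
my $stripped = delete_block_keep_comments($doc, 'aside');

# Later: position the replacement relative to the comment, which also
# removes the comment itself.
(my $final = $stripped) =~ s{<!-- slot:1 -->}{<aside>new text</aside>};

print "$stripped\n$final\n";
```

Whether the comments should survive by default is still a policy
question; this only shows that the "keep" variant is expressible.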
It sounds like it might be time for a secondary RE engine that
handles metadata, but has LOTS of knobs and dials to twiddle
each of these both/neither/either kinds of features at the
programmer's request. Perhaps metadata-aware REs must be compiled
into a scalar and then have their behaviors tweaked after the
pattern has been compiled. That way, you could say:
...
$re = qr/perl is terrific/i;
$re->ignore_blocks(qw(footnote author inventor));
## metadata rules apply; 'footnote' is case insensitive.
$re->case_sensitive_tags(); ## override default behavior
$re->keep_subblocks();
...
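A compiled qr// pattern has no such methods today, so the closest
thing that runs is a wrapper object around it. The sketch below
assumes a made-up package name (MetaRE) and a made-up matches()
method, and implements only two of the knobs; keep_subblocks() is
left out because its exact semantics haven't been pinned down yet.

```perl
use strict;
use warnings;

package MetaRE;    # hypothetical name, not an existing CPAN module

sub new {
    my ($class, $re) = @_;
    return bless { re => $re, ignore => [], tag_case => 0 }, $class;
}

# Knobs are set after compilation and return $self for chaining.
sub ignore_blocks       { my $self = shift; push @{ $self->{ignore} }, @_; return $self }
sub case_sensitive_tags { my $self = shift; $self->{tag_case} = 1; return $self }

sub matches {
    my ($self, $text) = @_;
    for my $tag (@{ $self->{ignore} }) {
        # Tag names are matched case-insensitively unless overridden.
        my $block = $self->{tag_case}
            ? qr{<\Q$tag\E\b[^>]*>.*?</\Q$tag\E\s*>}s
            : qr{<\Q$tag\E\b[^>]*>.*?</\Q$tag\E\s*>}is;
        $text =~ s/$block//g;      # drop ignored blocks entirely
    }
    $text =~ s/<[^>]+>//g;         # unwrap the remaining markup
    $text =~ s/\s+/ /g;            # collapse leftover whitespace
    return $text =~ $self->{re} ? 1 : 0;
}

package main;

my $re = MetaRE->new(qr/perl is terrific/i);
$re->ignore_blocks(qw(footnote author inventor));

my $text   = 'Perl <FOOTNOTE>Programming Perl, Wall et al.</FOOTNOTE> is terrific';
my $before = $re->matches($text);    # <FOOTNOTE> counts as footnote
$re->case_sensitive_tags();
my $after  = $re->matches($text);    # now <FOOTNOTE> is kept as text

print "default: $before, case-sensitive tags: $after\n";
```

Keeping the knobs on the object rather than in the pattern syntax
means the same qr// can be reused with different metadata rules.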
-- Adam