I think your objection stands, since no one has figured out how to
concisely describe the difference between
Perl <footnote>...</footnote> is terrific
and
perl <footnote>...</footnote> is <emph>not</emph> terrific
I wrote this a while ago:
I think Tim Bray mentioned that the XML working group was trying to come
up with a standard which would apply some sort of semantics to a DTD.
No matter what, I think that this semantic information would have to be
known on a per-scalar basis, not kept in the regexp. In terms of
matching against, say, /perl is terrific/, I can think of at least the
following meanings for each tag: Ignore tag, Ignore tag & contents of
tag, Require tag. For instance, if the string is "perl is
<emph>not</emph> terrific" you would want the <emph> tag to be ignored,
but not the contents. If the string is "perl<note>well, Perl,
actually</note> is terrific" you would want the <note> tag and its
contents to be ignored. I can't think of a good example for requiring
the tag to be explicitly matched, but it seems necessary for
orthogonality?! Ignoring the partial-document problem for a minute, I
could envision this working via magic on the scalar. Say that the
standard which indicates the semantic meaning of each tag is a Document
Semantics Definition (DSD):
my $dsd = DSD->new('foo.dsd'); # Indicates that <emph> tag is ignored
my $xml = "perl is <emph>terrific</emph>";
$dsd->apply($xml);
$xml =~ /perl is terrific/; # Matches!
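To make the three tag meanings a bit more concrete, here is a rough sketch of what an apply step might boil down to. Everything in it is an assumption for illustration: the %policy hash, the policy names, and apply_policies() are made up, there is no real DSD module behind it, and it returns a stripped copy of the string rather than putting magic on the scalar. It also only copes with simple, non-nested tags without attributes.

```perl
use strict;
use warnings;

# Hypothetical per-tag policies; a real DSD would presumably load these
# from a file such as foo.dsd.
my %policy = (
    emph => 'ignore_tag',       # drop the markup, keep the text inside
    note => 'ignore_contents',  # drop the markup and the text inside
    code => 'require',          # leave the tag in the string untouched
);

# Return a copy of $xml with the policies applied.
# Assumes simple, non-nested tags without attributes.
sub apply_policies {
    my ($xml, $policy) = @_;
    for my $tag (sort keys %$policy) {
        my $p = $policy->{$tag};
        if ($p eq 'ignore_contents') {
            # Remove the element together with everything inside it.
            $xml =~ s{<\Q$tag\E>.*?</\Q$tag\E>}{}gs;
        }
        elsif ($p eq 'ignore_tag') {
            # Remove only the opening and closing tags.
            $xml =~ s{</?\Q$tag\E>}{}g;
        }
        # 'require': nothing to do; the tag stays and must be matched literally.
    }
    return $xml;
}

my $emph = apply_policies('perl is <emph>not</emph> terrific', \%policy);
my $note = apply_policies('perl<note>well, Perl, actually</note> is terrific', \%policy);

print "emph: ", ($emph =~ /perl is terrific/ ? "match" : "no match"), "\n";
print "note: ", ($note =~ /perl is terrific/ ? "match" : "no match"), "\n";
```

With these policies the <emph> string becomes "perl is not terrific" and does not match /perl is terrific/, while the <note> string becomes "perl is terrific" and does, which is the behaviour described above for the two examples.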
Ducking the rocks,
Benjamin Holzman