Re: In-Band Information Considered Harmful

John Macdonald wrote:

So, it looks like my memory is faulty - I had thought that there was
at least one example that had text between a pair <xx> and </xx>
where the meaning of xx was such that the text should *not* be
considered for matching purposes.  Unless the third line could be
written as:

    Perl<footnote> Programming Perl, Wall et al., isbn="1-56592-149-6"
    </footnote> is terrific


No, I think your memory is correct, since I remember it too.  :-)

It may have been an example generated on the fly by Tim that wasn't 
on a slide.  I also remember that the gist of this part of the 
discussion was trying to highlight that there were some basic 
issues in XML parsing that the XML community isn't too clear
about yet.  Like these.

I have to retract my objection.  (But if there *can* be such
non-text, it is a problem.)


I think your objection stands, since no one has figured out how to 
concisely describe the difference between
        Perl <footnote>...</footnote> is terrific
and
        perl <footnote>...</footnote> is <emph>not</emph> terrific

The problem is that you, the reader, place different meanings on
<footnote>, which you want to ignore, and <emph>, which you want
to retain.  How do you tell the parser/RE engine that you want
to ignore entire blocks of text based on the metadata that
describes them, yet keep other blocks of text based on their metadata?

Furthermore, how does anyone propose to do that without at least some
knowledge of the tagset in use?  
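
To make the distinction concrete, here is a minimal sketch of one way
to do it by hand today: flatten the text before matching, using an
explicit list of tags to ignore.  The helper name flatten_for_match and
the sample strings are made up for illustration, and the regexes assume
well-formed, non-nested blocks with no '>' inside attribute values.
Note that the ignore list is exactly the tagset knowledge in question.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten marked-up text before matching.
# Blocks whose tag is listed in @$ignore are dropped whole; every
# other tag is stripped but its content is kept.
sub flatten_for_match {
    my ($text, $ignore) = @_;
    for my $tag (@$ignore) {
        # Drop <tag ...>...</tag>, content included.
        $text =~ s{<\Q$tag\E\b[^>]*>.*?</\Q$tag\E\s*>}{}gs;
    }
    $text =~ s{<[^>]+>}{}g;     # unwrap remaining tags, keep content
    $text =~ s{\s+}{ }g;        # normalise the whitespace left behind
    return $text;
}

my @ignore = qw(footnote);      # this list IS the tagset knowledge

my $plain = flatten_for_match(
    'Perl <footnote>Programming Perl, Wall et al.</footnote> is terrific',
    \@ignore);
my $negated = flatten_for_match(
    'Perl <footnote>Programming Perl, Wall et al.</footnote> is <emph>not</emph> terrific',
    \@ignore);

print "plain matches\n"   if $plain   =~ /perl is terrific/i;
print "negated matches\n" if $negated =~ /perl is terrific/i;
```

Only the first string matches: the footnote vanishes from both, but the
content of <emph> survives in the second and breaks the match.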

Turning on metadata processing such that PIs and comments can be 
ignored isn't simple, either. When deleting blocks of content 
that contain comments, do the comments get deleted too?  Do they get 
marked as dirty and swept up/kept if desired?  I could easily see 
a case where I'd want to delete blocks of XML/HTML (including markup) 
and keep the comments, so that I can later position replacement text
(and markup) relative to those comments, and then delete the 
comments later.
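
A rough sketch of that delete-but-keep-comments case, done with plain
substitutions.  The helper names, the 'div' block and the 'slot-1'
comment are all invented for the example, and it assumes the deleted
block does not nest inside itself.

```perl
use strict;
use warnings;

# Return only the comments found in a chunk of markup.
sub comments_in {
    my ($body) = @_;
    return join '', $body =~ /(<!--.*?-->)/gs;
}

# Hypothetical helper: delete every <$tag>...</$tag> block, markup
# included, but leave behind any comments that were inside it.
sub delete_block_keep_comments {
    my ($text, $tag) = @_;
    $text =~ s{<\Q$tag\E\b[^>]*>(.*?)</\Q$tag\E\s*>}{ comments_in($1) }gse;
    return $text;
}

my $doc  = '<p>Intro</p><div>old <b>text</b><!-- slot-1 --> more</div><p>End</p>';
my $kept = delete_block_keep_comments($doc, 'div');

# Later: position the replacement relative to the comment, which
# deletes the comment in the same step.
(my $out = $kept) =~ s{<!-- slot-1 -->}{<div>new text</div>};

print "$kept\n$out\n";
```

Whether the comment should survive the deletion at all is the policy
question; here it is hard-wired, which is what a knob would replace.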

It sounds like it might be time for a secondary RE engine that 
handles metadata, but has LOTS of knobs and dials to twiddle 
each of these both/neither/either kinds of features at the 
programmer's request.  Perhaps metadata-aware REs must be compiled
into a scalar and then have their behaviors tweaked after the 
pattern has been compiled.  That way, you could say:
        ...
        $re = qr/perl is terrific/i;
        $re->ignore_blocks(qw(footnote author inventor));  
        ## metadata rules apply; 'footnote' is case insensitive.
        $re->case_sensitive_tags();     ## override default behavior
        $re->keep_subblocks();
        ...
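
None of those methods exist on a compiled regex today.  As a very rough
approximation, the knobs could live in a wrapper object around the
qr// value.  MetaRE below is a hypothetical class, not an existing
module; it sketches only ignore_blocks and case_sensitive_tags, and
leaves out keep_subblocks because its semantics aren't pinned down
above.

```perl
use strict;
use warnings;

# MetaRE is a hypothetical wrapper: a compiled pattern plus the
# "knobs" that say how markup is treated when matching.
package MetaRE;

sub new {
    my ($class, $re) = @_;
    return bless { re => $re, ignore => [], tag_case => 0 }, $class;
}

# Knob: names of blocks whose content is skipped when matching.
sub ignore_blocks {
    my $self = shift;
    push @{ $self->{ignore} }, @_;
    return $self;
}

# Knob: tag names match case-insensitively unless this is called.
sub case_sensitive_tags {
    my $self = shift;
    $self->{tag_case} = 1;
    return $self;
}

sub matches {
    my ($self, $text) = @_;
    for my $tag (@{ $self->{ignore} }) {
        my $block = $self->{tag_case}
            ? qr{<\Q$tag\E\b[^>]*>.*?</\Q$tag\E\s*>}s
            : qr{<\Q$tag\E\b[^>]*>.*?</\Q$tag\E\s*>}si;
        $text =~ s/$block//g;           # drop ignored blocks whole
    }
    $text =~ s/<[^>]+>//g;              # unwrap every other tag
    $text =~ s/\s+/ /g;
    return $text =~ $self->{re} ? 1 : 0;
}

package main;

my $re = MetaRE->new(qr/perl is terrific/i);
$re->ignore_blocks(qw(footnote author inventor));

my $sample = 'Perl <FOOTNOTE>Programming Perl</FOOTNOTE> is terrific';
my $loose  = $re->matches($sample);     # tag case ignored by default
$re->case_sensitive_tags();             # override default behavior
my $strict = $re->matches($sample);     # <FOOTNOTE> is no longer 'footnote'

print "loose=$loose strict=$strict\n";
```

The point of the wrapper is only that the pattern stays untouched
while the markup policy changes around it after compilation.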

-- Adam