perl-unicode

Re: In-Band Information Considered Harmful

1998-10-23 19:47:39
Adam Turoff writes:
The problem is that you the reader place different meaning on
<footnote>, which you want to ignore, and <emph>, which you want
to retain.  How do you tell the parser/re engine that you want
to ignore entire blocks of text based on the metadata which
describes them, yet keep other blocks of text based on its metadata?

That is easy.  This is not going to be a Perl problem.  It is an
input filter problem.  The input filter can convert a "group" into

      a) either markup which sits *between* chars, and is thus (mostly)
         invisible to Perl's text-handling primitives;

      b) or markup "covering" a substring of a string.

Say, a footnote may be converted to either kind of markup, while a
comment is better converted to "a" (but see below).
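
The two kinds of markup can be illustrated with a minimal sketch of
such an input filter.  Everything here is an assumption made for
illustration: the name import_text, the hash layout of the markers and
spans, and the toy <tag>...</tag> input syntax are hypothetical and are
not taken from EText or from any existing module.

```perl
use strict;
use warnings;

# Hypothetical input filter.  Tags named in %$between become
# position-only markers that sit *between* chars (their content is kept
# out of the string); every other tag becomes a span "covering" a
# substring.  Nested tags are not handled, and parsing simply stops at
# malformed input.
sub import_text {
    my ($src, $between) = @_;
    my $text = '';
    my (@markers, @spans);
    while ($src =~ /\G(?:<(\w+)>(.*?)<\/\1>|([^<]+))/gs) {
        if (defined $3) {
            $text .= $3;                       # plain text run
        }
        elsif ($between->{$1}) {
            push @markers,
              { pos => length($text), type => $1, content => $2 };
        }
        else {
            push @spans,
              { start => length($text), len => length($2), type => $1 };
            $text .= $2;
        }
    }
    return ($text, \@markers, \@spans);
}

my ($text, $markers, $spans) = import_text(
    'perl is <emph>terrific</emph><footnote>citation needed</footnote>!',
    { footnote => 1 },
);
print "$text\n";    # prints "perl is terrific!"
```

With this choice the footnote never enters the string, so an ordinary
pattern sees only the text, while the emphasis is remembered as a span
over characters 8..15.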

Furthermore, how does anyone propose to do that without at least some
knowledge of the tagset in use?  

After import you do not need any such knowledge, unless you want to
translate between two forms of, say, a footnote.  But again, this will
be done not by Perl, but by a user subroutine.

Turning on metadata processing such that PIs and comments can be
ignored isn't simple, either.  When deleting blocks of content
that contain comments, do the comments get deleted too?

In EText this is the difference between between-chars markup (which
survives deletion) and zero-length "covering" markup (which goes
away).
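
A rough sketch of that deletion rule follows.  The data layout and the
name delete_range are hypothetical, chosen only to illustrate the
behaviour described above; this is not EText's actual implementation.

```perl
use strict;
use warnings;

# Delete $len chars starting at $start.  Between-chars markers inside
# the deleted range collapse to $start and survive; covering spans are
# clipped, and any span left with zero length is dropped.
sub delete_range {
    my ($doc, $start, $len) = @_;
    my $end = $start + $len;
    substr($doc->{text}, $start, $len, '');

    # Map an offset in the old string to an offset in the new one.
    my $shift = sub {
        my ($p) = @_;
        return $p <= $start ? $p
             : $p >= $end   ? $p - $len
             :                $start;
    };

    $_->{pos} = $shift->($_->{pos}) for @{ $doc->{markers} };

    my @kept;
    for my $s (@{ $doc->{spans} }) {
        my $from = $shift->($s->{start});
        my $to   = $shift->($s->{start} + $s->{len});
        push @kept, { %$s, start => $from, len => $to - $from }
          if $to > $from;
    }
    $doc->{spans} = \@kept;
    return $doc;
}

my $doc = {
    text    => 'perl is terrific!',
    markers => [ { pos => 12, type => 'comment'  },
                 { pos => 16, type => 'footnote' } ],
    spans   => [ { start => 8, len => 8, type => 'emph' } ],
};
delete_range($doc, 8, 8);    # delete "terrific"
```

After the call the text is "perl is !", both markers sit at offset 8,
and the emphasis span, now empty, is gone.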

It sounds like it might be time for a secondary RE engine that 
handles metadata, but has LOTS of knobs and dials to twiddle 
each of these both/neither/either kind of features at the 
programmer's request.  

I do not think so.  A multi-argument form, \M{marker,foo=xxx} (which
matches only "between-char" markers, and only those whose associated
hash has a field 'foo' equal to 'xxx'), should cover most needs.
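
Since \M is only a proposal, the selection it would perform can be
approximated outside the RE engine.  The sketch below assumes markers
are stored as hashes with 'pos' and 'type' fields; marker_positions and
the field names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for \M{type,field=value,...}: return the
# offsets of between-chars markers of the given type whose hash has
# every requested field set to the requested value.
sub marker_positions {
    my ($markers, $type, %want) = @_;
    my @hits;
  MARKER:
    for my $m (@$markers) {
        next unless $m->{type} eq $type;
        for my $field (keys %want) {
            next MARKER
              unless defined($m->{$field}) && $m->{$field} eq $want{$field};
        }
        push @hits, $m->{pos};
    }
    return @hits;
}

my $markers = [
    { pos => 4,  type => 'footnote', lang => 'fr' },
    { pos => 9,  type => 'comment' },
    { pos => 16, type => 'footnote', lang => 'en' },
];
my @hits = marker_positions($markers, 'footnote', lang => 'en');
print "@hits\n";    # prints "16"
```

A real \M escape would perform this test at the current match position
inside the engine; here the caller would have to compare pos() against
the returned offsets.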

Then real knowledge of the problem domain is needed only by the
import and output filters and, possibly, by modifiers of the format.
These modifiers will use RExen with appropriate \M escapes.

Perhaps metadata-aware REs must be compiled
into a scalar and then have their behaviors tweaked after the 
pattern has been compiled.  That way, you could say:
      ...
      $re = qr/perl is teriffic/i;
      $re->ignore_blocks(qw(footnote author inventor));  
      ## metadata rules apply; 'footnote' is case insensitive.
      $re->case_sensitive_tags();     ## override default behavior
      $re->keep_subblocks();

No need to.  With both my proposal and Chip's (as I have explained many
times, there is absolutely no semantic difference between them),
/perl is teriffic/i should be able to match exactly what the
designer of the filter intends.

Ilya