perl-unicode

Re: In-Band Information Considered Harmful

1998-10-26 05:16:11
In a block off the old Chip:

> I think instead we'd need new metadata escapes in the RE language.
> Let's call them \m{X} to require metadata tag X, and \M{X} to forbid
> tag X.  e.g.:
>
>    /\m{italic}\m{bold}Yes!/

I think a better solution would be to add a metadata-matching modifier
('/t' for 'tags' was one possibility that came to mind) that would allow
the familiar (SG|HT|X)ML tags to be overloaded as metadata specifications.
Anything in <...> gets interpreted as a match required on metadata for the
text segment.  Real '<'s and '>'s would need to be escaped.

For example (off the top of my head):

    <foo>        # must be in a <foo>...</foo>
    <?foo>       # may be in a <foo>...</foo>
    <!foo>       # must not be in a <foo>...</foo>
    <-foo>       # end of requirements for <[?!]?foo> ('-foo' isn't
                 # very good, but '/foo' is problematic in regexen, so
                 # I'll use this for now)
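
To make the four markers concrete, here is a rough sketch of how they might be
recognised in today's Perl. The function name parse_tag_spec and the mode
names (must, may, must_not, end) are invented for illustration; nothing here
is an existing Perl feature, and '*' is accepted only because a wildcard
marker seems useful.

```perl
use strict;
use warnings;

# parse_tag_spec('<?bold|italic>') returns a hash ref describing the
# marker: mode is one of must / may / must_not / end, and names is the
# list of tag names.  Returns undef if the string is not a tag marker.
# (Hypothetical helper, not part of any proposal.)
sub parse_tag_spec {
    my ($spec) = @_;
    my ($flag, $names) = $spec =~ /\A<([?!-]?)([\w*]+(?:\|[\w*]+)*)>\z/
        or return;
    my %mode = ('' => 'must', '?' => 'may', '!' => 'must_not', '-' => 'end');
    return { mode => $mode{$flag}, names => [ split /\|/, $names ] };
}

for my $spec ('<foo>', '<?foo>', '<!foo>', '<-foo>', '<foo|bar>') {
    my $parsed = parse_tag_spec($spec);
    print "$spec => $parsed->{mode}: @{ $parsed->{names} }\n";
}
```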

Chip's example could thus be written as:

    /<italic><bold>Yes!/t     # require italic and bold
    /<?italic><bold>Yes!/t    # require bold, italic optional
    /<!italic><?bold>Yes!/t   # optionally bold, not italic
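
The intended semantics of those three patterns can be approximated in plain
Perl, with no /t modifier involved. This is only a sketch under assumptions:
text is held as a list of segments, each carrying a hash of metadata tags, and
tags that are not mentioned at all are allowed. The segment layout and the
name tags_ok are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical representation: each segment is a chunk of text plus the
# set of metadata tags in force for it.
my @segments = (
    { text => 'Yes!', tags => { italic => 1, bold => 1 } },
    { text => 'Yes!', tags => { bold => 1 } },
    { text => 'Yes!', tags => {} },
);

# tags_ok($segment, $spec): true if the segment carries every tag in
# $spec->{must} and none of the tags in $spec->{must_not}.  Tags listed
# under "may" are accepted either way, so they need no check here.
sub tags_ok {
    my ($seg, $spec) = @_;
    for my $tag (@{ $spec->{must} || [] }) {
        return 0 unless $seg->{tags}{$tag};
    }
    for my $tag (@{ $spec->{must_not} || [] }) {
        return 0 if $seg->{tags}{$tag};
    }
    return 1;
}

# Rough equivalents of the three patterns above.
my %spec = (
    'italic bold'   => { must => [qw(italic bold)] },
    '?italic bold'  => { must => ['bold'], may => ['italic'] },
    '!italic ?bold' => { must_not => ['italic'], may => ['bold'] },
);

for my $name (sort keys %spec) {
    my @hits = grep {
        $segments[$_]{text} =~ /Yes!/ && tags_ok($segments[$_], $spec{$name})
    } 0 .. $#segments;
    print "$name: segments @hits\n";
}
```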

Tags could also be combined for brevity:

    <foo|bar>...<-foo>...<-bar>

Other examples:

    /<ul><?*>.*<-ul>/t;           # match anything as long as it's inside
                                  # <ul>..</ul>

    /<quote>This is <?bold|italic>simple<-bold|italic><-quote>, he said/t;
                                  # "This is <bold>simple</bold>", he said
                                  # "simple" in bold, italic, both or none

    /<!head><title>.*<-title>/t;  # match <title>...</title> but not
                                  # in the <head>...</head> section

The advantage (assuming we could iron out the details of the syntax) would
be that the pattern itself would be visually representative of the intended
matching text.

    "<head>blah, blah, blah</head>" =~ /<head>.*/t;
vs
    "<head>blah, blah, blah</head>" =~ /\m{head}.*/;

It works exceptionally well if you use "</head>" instead of "<-head>",
as long as you don't mind picking different regex quote characters:

    m[<head>.*</head>]t

These requirements could also be specified via method calls to compiled
regexen:

    my $regex = qr/This is simple/;
    $regex->mandatory();
    $regex->optional( qw(bold italic emph quote a img) );
    $regex->prohibited( qw(untruth) );
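
Objects returned by qr// have no such methods today, so the closest thing that
runs is a small wrapper class. The following is a sketch only: the class name
TaggedRegex, the matches method and the segment layout are all assumptions,
and the optional list is stored but not enforced, on the reading that
unlisted tags are harmless.

```perl
use strict;
use warnings;

# Hypothetical wrapper: pairs a compiled regex with tag requirements.
# Nothing like this exists in Perl itself.
package TaggedRegex;

sub new {
    my ($class, $re) = @_;
    return bless {
        re => $re, mandatory => [], optional => [], prohibited => [],
    }, $class;
}

# Each setter replaces its tag list and returns the object for chaining.
sub mandatory  { my $self = shift; $self->{mandatory}  = [@_]; return $self }
sub optional   { my $self = shift; $self->{optional}   = [@_]; return $self }
sub prohibited { my $self = shift; $self->{prohibited} = [@_]; return $self }

# A segment is { text => ..., tags => { name => 1, ... } }.
# It matches if all mandatory tags are present, no prohibited tag is
# present, and the text matches the wrapped regex.
sub matches {
    my ($self, $seg) = @_;
    my $tags = $seg->{tags} || {};
    return 0 if grep { !$tags->{$_} } @{ $self->{mandatory} };
    return 0 if grep {  $tags->{$_} } @{ $self->{prohibited} };
    return $seg->{text} =~ $self->{re} ? 1 : 0;
}

package main;

my $regex = TaggedRegex->new(qr/This is simple/);
$regex->mandatory();
$regex->optional( qw(bold italic emph quote a img) );
$regex->prohibited( qw(untruth) );

my $ok  = { text => 'This is simple', tags => { bold => 1 } };
my $bad = { text => 'This is simple', tags => { untruth => 1 } };
print "ok matches\n"      if  $regex->matches($ok);
print "bad is rejected\n" if !$regex->matches($bad);
```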

The only thing I don't see as obvious in this scheme is how to access
the additional information associated with a tag when matching.
/\m{a}text/ for anchored /text/ is fine, but once you've found it, how
do you access the anchor HREF -- perhaps because you're only looking
for HREFs to perl.org?

Perhaps something like:

    my $perlorg = qr/\.perl\.org/;

    /<a href=~$perlorg>.*<-a>/t;

which would return the content of any "<a href=URL>...</a>" where URL matches
the perl.org regex supplied.  Attribute matching could be achieved via
embedded regexen, callbacks, etc.
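
As a rough illustration of attribute matching with an embedded regex, the
same idea can be mocked up in plain Perl. The assumptions are loud ones: each
segment stores its tags as a hash from tag name to attribute hash, and the
helper content_where is invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical segment layout: tag name => hash of attributes.
my @segments = (
    { text => 'Perl home',  tags => { a => { href => 'http://www.perl.org/' } } },
    { text => 'elsewhere',  tags => { a => { href => 'http://www.example.com/' } } },
    { text => 'plain text', tags => {} },
);

my $perlorg = qr/\.perl\.org/;

# Return the text of every segment that carries tag $tag with an
# attribute $attr whose value matches $re.
sub content_where {
    my ($tag, $attr, $re, @segs) = @_;
    return map  { $_->{text} }
           grep {
               my $attrs = $_->{tags}{$tag};
               $attrs && defined $attrs->{$attr} && $attrs->{$attr} =~ $re;
           } @segs;
}

print "$_\n" for content_where('a', 'href', $perlorg, @segments);
```

A callback version would take a code ref in place of $re and call it with the
attribute value.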


A



-- 
Andy Wardley <abw(_at_)kfs(_dot_)org>   Signature regenerating.  Please remain seated.
     <abw(_at_)cre(_dot_)canon(_dot_)co(_dot_)uk>   For a good time: http://www.kfs.org/~abw/