In a block off the old Chip:
> I think instead we'd need new metadata escapes in the RE language.
> Let's call them \m{X} to require metadata tag X, and \M{X} to forbid
> tag X.  e.g.:
>
>     /\m{italic}\m{bold}Yes!/

I think a better solution would be to add a metadata-matching modifier
('/t' for 'tags' was one possibility that came to mind) that would allow
the familiar (SG|HT|X)ML tags to be overloaded as metadata specifications.
Anything in <...> gets interpreted as a match required on metadata for the
text segment. Real '<'s and '>'s would need to be escaped.
For example (off the top of my head):

    <foo>       # must be in a <foo>...</foo>
    <?foo>      # may be in a <foo>...</foo>
    <!foo>      # must not be in a <foo>...</foo>
    <-foo>      # end of requirements for <[?!]?foo> ('-foo' isn't
                # very good, but '/foo' is problematic in regexen, so
                # I'll use this for now)

Chip's example could thus be written as:

    /<italic><bold>Yes!/t       # require italic and bold
    /<?italic><bold>Yes!/t      # require bold, italic optional
    /<!italic><?bold>Yes!/t     # optionally bold, not italic

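To pin down what those three patterns would mean, here is a rough sketch in today's Perl. It is only a model, not an implementation of the proposed /t modifier: the segment structure, the match_segment() function and its require/optional/forbid keys are all made up for illustration. It also assumes that a tag which is not listed at all causes the match to fail unless '*' is given as optional, which is one possible reading of the syntax and not something settled here.

```perl
use strict;
use warnings;

# Hypothetical model: each segment carries its text and the set of
# metadata tags in force for that text.
my @segments = (
    { text => 'Yes!', tags => { italic => 1, bold => 1 } },
    { text => 'Yes!', tags => { bold => 1 } },
    { text => 'Yes!', tags => {} },
    { text => 'Yes!', tags => { bold => 1, underline => 1 } },
);

# True if the text matches $re, every required tag is present, no
# forbidden tag is present, and every other tag on the segment is
# listed as optional (or optional contains '*').
sub match_segment {
    my ($seg, $re, %spec) = @_;
    my %require  = map { $_ => 1 } @{ $spec{require}  || [] };
    my %optional = map { $_ => 1 } @{ $spec{optional} || [] };
    my %forbid   = map { $_ => 1 } @{ $spec{forbid}   || [] };

    return 0 unless $seg->{text} =~ $re;
    for my $tag (keys %require) {
        return 0 unless $seg->{tags}{$tag};
    }
    for my $tag (keys %{ $seg->{tags} }) {
        return 0 if $forbid{$tag};
        next if $require{$tag} || $optional{$tag} || $optional{'*'};
        return 0;    # unlisted tag: rejected (assumption of this sketch)
    }
    return 1;
}

# /<italic><bold>Yes!/t   : require italic and bold
my @both = grep { match_segment($_, qr/Yes!/, require => [qw(italic bold)]) } @segments;

# /<?italic><bold>Yes!/t  : require bold, italic optional
my @bold = grep { match_segment($_, qr/Yes!/, require => ['bold'], optional => ['italic']) } @segments;

# /<!italic><?bold>Yes!/t : optionally bold, not italic
my @not_italic = grep { match_segment($_, qr/Yes!/, forbid => ['italic'], optional => ['bold']) } @segments;

# /<bold><?*>Yes!/t       : require bold, anything else allowed
my @any_bold = grep { match_segment($_, qr/Yes!/, require => ['bold'], optional => ['*']) } @segments;

printf "%d %d %d %d\n", scalar @both, scalar @bold, scalar @not_italic, scalar @any_bold;
# prints "1 2 2 3"
```

The fourth segment (bold plus underline) is the one that shows the assumption at work: it only matches when '*' is allowed.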
Tags could also be combined for brevity:

    <foo|bar>...<-foo>...<-bar>

Other examples:

    /<ul><?*>.*<-ul>/t;     # match anything as long as it's inside
                            # <ul>..</ul>

    /<quote>This is <?bold|italic>simple<-bold|italic><-quote>, he said/t;
                            # "This is <bold>simple</bold>", he said
                            # "simple" in bold, italic, both or none

    /<!head><title>.*<-title>/t;    # match <title>...</title> but not
                                    # in the <head>...</head> section

The advantage (assuming we could iron out the details of the syntax) would
be that the pattern itself would be visually representative of the intended
matching text.

    "<head>blah, blah, blah</head>" =~ /<head>.*/t;

vs

    "<head>blah, blah, blah</head>" =~ /\m{head}.*/;

It works exceptionally well if you use "</head>" instead of "<-head>",
as long as you don't mind picking different regex quote characters:

    m[<head>.*</head>]t

These requirements could also be specified via method calls to compiled
regexen:

    my $regex = qr/This is simple/;
    $regex->mandatory();
    $regex->optional( qw(bold italic emph quote a img) );
    $regex->prohibited( qw(untruth) );

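Compiled regexen have no such methods today, so the nearest thing that actually runs is a small wrapper object. The following is a sketch only: the TaggedRegex class and its matches() method are hypothetical names, the tags are passed in by hand as a hash ref, and (as an assumption) any tag that is neither mandatory nor optional is rejected.

```perl
use strict;
use warnings;

# Hypothetical wrapper class; nothing here exists in Perl itself.
package TaggedRegex;

sub new {
    my ($class, $re) = @_;
    return bless {
        re         => $re,
        mandatory  => [],
        optional   => [],
        prohibited => [],
    }, $class;
}

# Each setter stores its tag list and returns the object.
sub mandatory  { my $self = shift; $self->{mandatory}  = [@_]; return $self }
sub optional   { my $self = shift; $self->{optional}   = [@_]; return $self }
sub prohibited { my $self = shift; $self->{prohibited} = [@_]; return $self }

# Match some text together with a hash ref of the tags in force for it.
sub matches {
    my ($self, $text, $tags) = @_;
    return 0 unless $text =~ $self->{re};

    for my $tag (@{ $self->{mandatory} })  { return 0 unless $tags->{$tag} }
    for my $tag (@{ $self->{prohibited} }) { return 0 if $tags->{$tag} }

    # Assumption: tags that are neither mandatory nor optional are rejected.
    my %allowed = map { $_ => 1 } @{ $self->{mandatory} }, @{ $self->{optional} };
    for my $tag (keys %$tags) { return 0 unless $allowed{$tag} }

    return 1;
}

package main;

my $regex = TaggedRegex->new(qr/This is simple/);
$regex->mandatory();
$regex->optional( qw(bold italic emph quote a img) );
$regex->prohibited( qw(untruth) );

print $regex->matches('This is simple', { bold => 1 })    ? "match\n" : "no match\n";
print $regex->matches('This is simple', { untruth => 1 }) ? "match\n" : "no match\n";
# prints "match" then "no match"
```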
The only thing I don't see as obvious in this scheme is how to access
the additional information associated with a tag when matching.
/\m{a}text/ for anchored /text/ is fine, but once you've found it, how
do you access the anchor HREF -- perhaps because you're only looking
for HREFs to perl.org?
Perhaps something like:

    my $perlorg = qr/\.perl\.org/;
    /<a href=~$perlorg>.*<-a>/t;

which returns the content of any "<a href=URL>...</a>" where URL matches
the perl.org regex supplied. Attribute matching could be achieved via
embedded regexen, callbacks, etc.
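
For comparison, this is roughly what the same job looks like with plain regexen today, treating the markup as literal text and not as metadata. It is a sketch under stated assumptions: the sample $html string is invented, the href is assumed to be double-quoted and the only attribute, and real markup would want a proper parser.

```perl
use strict;
use warnings;

my $perlorg = qr/\.perl\.org/;

# Sample input, invented for illustration.
my $html = '<a href="http://www.perl.org/">Perl</a> and '
         . '<a href="http://example.com/">elsewhere</a>';

my @content;

# Capture each anchor's href value and its content, then keep the
# content only when the href matches $perlorg.
while ($html =~ m{<a\s+href="([^"]*)"\s*>(.*?)</a>}gis) {
    my ($href, $text) = ($1, $2);
    push @content, $text if $href =~ $perlorg;
}

print "$_\n" for @content;
# prints "Perl"
```

The two-step match (find the tag, then test the attribute) is exactly the part that "href=~$perlorg" inside the tag would fold into the pattern itself.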
A
--
Andy Wardley <abw(_at_)kfs(_dot_)org>   Signature regenerating.  Please remain seated.
     <abw(_at_)cre(_dot_)canon(_dot_)co(_dot_)uk>   For a good time: http://www.kfs.org/~abw/