perl-unicode

Re: In-Band Information Considered Harmful

1998-10-23 14:23:30
Chip writes:

I think instead we'd need new metadata escapes in the RE language.
Let's call them \m{X} to require metadata tag X, and \M{X} to forbid
tag X.  e.g.:

 /\m{italic}\m{bold}Yes!/

You'd also need to specify which meta-data layer you're combining
with the plaintext layer, assuming you're sticking with the one-or-more-
of-many-possible meta-data-layer model (whew!).

Note that those codes impose conditions on the following text; they do not
represent embedded codes (a la Ilya or WordPerfect).  [...]

Yeah, this is a great thing.  The unimportance of ordering is a big bonus.

The only thing I don't see as obvious in this scheme is how to access
the additional information associated with a tag when matching.
/\m{a}text/ for anchored /text/ is fine, but once you've found it, how
do you access the anchor HREF -- perhaps because you're only looking
for HREFs to perl.org?  It's possible that we won't be able to express
all that in the RE engine per se, and that we'll have to escape via
(?{}) and use the Perl language primitives.

How about:
 ($url, $plaintext) = /(\m{anchor})(text)/;
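
The question can be pictured in today's Perl without any new RE syntax. What follows is only a sketch: it assumes a meta-data layer is a list of [start, end, tag, attributes] spans kept beside the plaintext. The `@layer` layout, the `tagged_matches` helper and the sample URLs are all made up for illustration, and `\m{}` itself does not exist. The HREF filter runs in ordinary Perl after the match, which is roughly what escaping through (?{}) would amount to.

```perl
use strict;
use warnings;

# Hypothetical layout: plaintext plus a separate layer of tagged spans.
my $text  = "Get Perl from the Perl home page or from a mirror site.";
my @layer = (
    # [ start, end (exclusive), tag, attributes ]
    [ 18, 32, 'a', { href => 'http://www.perl.org/' } ],
    [ 43, 54, 'a', { href => 'http://mirror.example.com/' } ],
);

# Return one record per match of $re that lies inside a span tagged $tag,
# carrying that span's attributes along with the matched text.
sub tagged_matches {
    my ($text, $layer, $tag, $re) = @_;
    my @found;
    while ($text =~ /$re/g) {
        my ($from, $to) = ($-[0], $+[0]);
        for my $span (@$layer) {
            my ($start, $end, $name, $attr) = @$span;
            next unless $name eq $tag && $start <= $from && $to <= $end;
            push @found, { text => substr($text, $from, $to - $from), %$attr };
        }
    }
    return @found;
}

# Roughly /\m{a}Perl/, then keep only the anchors that point at perl.org.
my @hits = grep { $_->{href} =~ m{^http://www\.perl\.org/} }
           tagged_matches($text, \@layer, 'a', qr/Perl/);
print "$_->{text} -> $_->{href}\n" for @hits;
```

The first "Perl" in the sample sits outside every span, so only the anchored one is reported.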

2.  "How do you know a meta-data layer is appropriate to, and
synchronized with, a given piece of content?"

How about if meta-data layers are not themselves unadorned 
monotonic bytestreams, but instead something like regular
expressions?

I appreciate your intent here, but I have a hard time imagining an
implementation that's robust enough to be trustworthy for
mission-critical purposes.

I lost the plot as soon as you said 'robust', and was in the ditch
at the time 'mission-critical' arrived on the scene.  Try it again?
In case the confusion is due to me, let me explain: regexp-style
meta-data layers use regexp syntax to avoid the anchoring
problem and the irrelevance problem.  They're as robust as
regular expressions currently are.  To wit, let's say I want to
adorn the phrase

"Santa Monica, California, is a sundrenched pleasure palace"

with my own links having to do with Santa Monica and palaces.
I'd be depressed if someone edited the text into Santa Clara,
or changed 'pleasure' to 'torture'.  Further, I'm not talking about
Santa Monica, New Zealand.

If I wanted to be definitively sure that my meta-data layer only
existed when the text was exactly as written, my meta-data layer
would be

/^(Santa Monica), California, is a sundrenched (pleasure) palace$/
with $1 being given a Santa Monica link, and $2 being given a
pleasure link.

This is about the same result as if you tried non-regexp methods
of anchoring meta-data to plaintext -- a fixed system which 
has to either invalidate itself on any error or make unreasonable
and occasionally incorrect guesses.
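
A minimal sketch of the strict case, assuming a layer is nothing more than a regexp plus one link per capture group. The `%layer` layout, the `apply_layer` name and the link targets are hypothetical. When the text matches, the capture offsets from `@-` and `@+` say where each link attaches; when the text has been edited, the layer yields nothing at all.

```perl
use strict;
use warnings;

# Hypothetical layer: one regexp, one link per capture group.
my %layer = (
    re    => qr/^(Santa Monica), California, is a sundrenched (pleasure) palace$/,
    links => [ 'santa-monica.html', 'pleasure.html' ],   # made-up targets
);

# Return [start, length, link] for each capture group, or nothing if the
# text no longer matches the layer.
sub apply_layer {
    my ($layer, $text) = @_;
    return unless $text =~ $layer->{re};
    my @starts = @-;   # copy the offsets before another match resets them
    my @ends   = @+;
    return map { [ $starts[$_], $ends[$_] - $starts[$_], $layer->{links}[$_ - 1] ] }
           1 .. $#starts;
}

my $original = "Santa Monica, California, is a sundrenched pleasure palace";
for my $span (apply_layer(\%layer, $original)) {
    my ($start, $length, $link) = @$span;
    print substr($original, $start, $length), " => $link\n";
}
```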

Let's suggest that I want to write a slightly more hardy meta-
data layer:

/(Santa Monica), California.*(pleasure) palace/

or even /(Santa Monica)/ && / (.*?) palace/
(made-up syntax to indicate non-specific ordering)
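
The non-specific ordering can be approximated today by treating the layer as a set of independent rules, each tried on its own. This is a sketch under that assumption; `@rules`, `apply_rules` and the link names are invented, and the second pattern is tightened to `(\w+) palace` so that it captures a single word.

```perl
use strict;
use warnings;

# Hypothetical layer: independent rules, each a regexp with one capture
# group and the link to hang on that capture.
my @rules = (
    [ qr/(Santa Monica), California/ => 'santa-monica.html' ],
    [ qr/(\w+) palace/               => 'palaces.html'      ],
);

# Try every rule separately, so the order of the phrases in the text does
# not matter.  Returns [start, length, link] spans sorted by position.
sub apply_rules {
    my ($text, @rules) = @_;
    my @spans;
    for my $rule (@rules) {
        my ($re, $link) = @$rule;
        next unless $text =~ $re;
        push @spans, [ $-[1], $+[1] - $-[1], $link ];
    }
    @spans = sort { $a->[0] <=> $b->[0] } @spans;
    return @spans;
}

my $edited = "A pleasure palace awaits in Santa Monica, California";
for my $span (apply_rules($edited, @rules)) {
    print substr($edited, $span->[0], $span->[1]), " => $span->[2]\n";
}
```

A text about Santa Monica, New Zealand would pick up neither link, since the first rule requires ", California" and there is no palace for the second.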

Here's a regexp-based meta-data layer that touches up all
e-mail addresses (incorrectly, but presented for illustration)

s/(.*) <(.*?)\@(.*?\.+)(com|edu|org)/link:"mailto:$2\@$3$4" $1/
# i.e. "F Gallo <fsg@blah.com>" becomes
# <a href="mailto:fsg@blah.com">F Gallo</a> in an HTML context
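
For anyone who wants to try the idea, here is a runnable variant that emits the HTML form directly. It is still only an illustration: the `link_addresses` name is made up, the address pattern is far too simple for real mail headers, and it rewrites at most one address per line.

```perl
use strict;
use warnings;

# Turn 'Name <user@host.tld>' into an HTML mailto link.  Lines without a
# recognisable address are returned unchanged.
sub link_addresses {
    my ($line) = @_;
    $line =~ s{([^<>]+?)\s*<([\w.+-]+)\@([\w.-]+\.(?:com|edu|org))>}
              {<a href="mailto:$2\@$3">$1</a>};
    return $line;
}

print link_addresses('F Gallo <fsg@blah.com>'), "\n";
```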

So in summary, a layer based on regexps is more agile in the
face of edits, can have a global effect without needing multiple
insert points, can be tied to content depending on context,
and can even modify the content in addition to its presentation.

 Xanadu had the luxury of being a
long-running server; Perl doesn't.

If you mean that you think this system has inefficiencies that
can be masked by having an extensive precompilation/study
phase, you may be right, but I'm not totally convinced yet that
they're of a scale that would render the system inadequate
without that phase.  After all, consider what we make do with
now to do the same kind of thing.

The problem would be importing and exporting layered text,

There would have to be excellent facilities for people writing
their own layered text filters.

Not to sound too in love with my incompletely thought out idea,
but that's an advantage of regexp-based meta-data layering or
of any system using current Perl features (text-as-associative-
array-of-streams is another idea) -- people are already familiar
with using the regexp engine to handle and analyze meta-data.

So it would be great if the perl builtins (<FILE>, print)
intuitively understood about meta-data and organized it
themselves.

If tied filehandles get more efficient -- and they'd better! --
then it'll be possible to do all that you suggest without making
changes to the behavior of the built-in operators.
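
A rough sketch of how that could look with the existing tie interface. The `LayeredHandle` class, its hash layout and the toy `<tag>` markup are assumptions made for illustration only. `<LAYERED>` hands back plain text while the tags are set aside in a side layer as [line, offset, tag] entries. Only scalar-context reads are handled.

```perl
use strict;
use warnings;

package LayeredHandle;   # hypothetical class name

sub TIEHANDLE {
    my ($class, $fh) = @_;
    return bless { fh => $fh, layer => [], line => 0 }, $class;
}

# Called for <FH> in scalar context: return the line without its tags and
# remember where each tag sat in the plain text.
sub READLINE {
    my ($self) = @_;
    my $raw = readline $self->{fh};
    return undef unless defined $raw;
    $self->{line}++;
    my $plain = '';
    for my $piece (split /(<\/?\w+>)/, $raw) {
        if ($piece =~ /^<(\/?\w+)>$/) {
            push @{ $self->{layer} }, [ $self->{line}, length $plain, $1 ];
        } else {
            $plain .= $piece;
        }
    }
    return $plain;
}

package main;

my $source = "Say <b>Yes!</b> now\n";
open my $in, '<', \$source or die "open: $!";
tie *LAYERED, 'LayeredHandle', $in;

my $line  = <LAYERED>;                  # plain text only
my $layer = tied(*LAYERED)->{layer};    # the tags, kept out of band
print $line;
print "line $_->[0], offset $_->[1]: $_->[2]\n" for @$layer;
```

The calling code keeps using the ordinary `<FILE>` operator; all knowledge of the markup lives in the tied class.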


Excellent point!  That's a much better way to do it.

On the same topic, could you add to the 'wanted' list hooks for
open(SPAM, "<http://whatever.spam.org")
and open(SPAM, ">http://whatever.spam.org");
?

F.