Re: In-Band Information Considered Harmful

Chip writes:

You'd also need to specify which meta-data layer you're combining
with the plaintext layer ...


I wonder if that belongs on the left side of =~ somehow.  After all, a
given combination of data and metadata may be the target of a search,
but it also may be flattened or modified or ...


Yeah, that's a better place for it than on the right side.  If only
Perl had first class objects.

Text =~ ...

The only thing I don't see as obvious in this scheme is how to access
the additional information associated with a tag when matching.
/\m{a}text/ for anchored /text/ is fine, but once you've found it, how
do you access the anchor HREF -- perhaps because you're only looking
for HREFs to perl.org?  It's possible that we won't be able to express
all that in the RE engine per se, and that we'll have to escape via
(?{}) and use the Perl language primitives.
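
As a rough sketch of that fallback in today's Perl: the \m{a} syntax above is hypothetical, so this assumes the anchor is still embedded as an ordinary <a> tag. It does the perl.org check in plain Perl after the match, rather than inside the engine via (?{}). The sample string and the name perl_org_hrefs are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input; the markup is embedded because no separate
# metadata layer exists in Perl today.
my $html = 'See <a href="http://www.perl.org/">text</a> and '
         . '<a href="http://example.com/">text</a>.';

# Return the HREFs of anchors whose body is exactly "text" and whose
# host is perl.org or a subdomain of it.
sub perl_org_hrefs {
    my ($doc) = @_;
    my @hits;
    while ($doc =~ m{<a\s+href="([^"]*)"[^>]*>text</a>}g) {
        my $href = $1;    # save before the next match clobbers $1
        push @hits, $href
            if $href =~ m{^https?://(?:[\w-]+\.)*perl\.org(?:[:/]|$)}i;
    }
    return @hits;
}

print "$_\n" for perl_org_hrefs($html);
```

The first pattern finds the tag; everything about the HREF is then ordinary Perl, which is the part the RE engine alone may not be able to express.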


how about
 ($url, $plaintext) =~ /(\m{anchor})(text)/;


Could you unpack that for me?  I don't get your meaning.


A 'captured' part of a regular expression that contains metadata passes
the metadata as-is, so $url contains a reference to the HREF.

I lost the plot as soon as you said 'robust', and was in the ditch
at the time 'mission-critical' arrived on the scene.  Try it again?
In case the confusion is due to me, let me explain: regexp-style
meta-data layers use regexp syntax to avoid the anchoring
problem and the irrelevance problem.  They're as robust as
regular expressions currently are.


Yes, exactly.  A regex that fails in the middle leaves you without any
recourse for the remainder of the text.  So changing something in the
middle of the string may entirely destroy the ability to reattach
metadata from that point forward.  I don't consider that kind of
fragility acceptable.


Oh, I get it, and I see what you mean.  But doesn't a regexp at least
have a fighting chance at succeeding for the remainder of the text,
where purely positional meta-data arrangements almost certainly
do not?  The only difference between a regexp system and a
positional system is that the regexp can anchor flexibly.

One can also imagine a layer comprising a set of many regular
expressions, rather than just one (my examples have been lame and
unfortunate here).  So if you wanted to attach a link to all occurrences
of "spam", now and after all future edits, one can fathom an independent
regexp that works on /spam/, along with others.  Conditional matches
also go a long way to eliminating fragility (failures not being
showstoppers).
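
A minimal sketch of such a layer, using only the regexps Perl has now. The rule list, the metadata hashes and the name attach are all assumptions made for illustration, not a proposed interface.

```perl
use strict;
use warnings;

# Hypothetical layer: each rule pairs a pattern with the metadata to attach.
my @layer = (
    { re => qr/spam/,       meta => { href => 'http://whatever.spam.org/' } },
    { re => qr/\bspumco\b/, meta => { note => 'hypothetical annotation' } },
);

# Apply every rule to the text independently and return the attachments
# as [start, end, meta] triples sorted by start offset.  A rule that
# matches nothing contributes nothing; it does not stop the other rules.
sub attach {
    my ($text, @rules) = @_;
    my @spans;
    for my $rule (@rules) {
        my $re = $rule->{re};
        while ($text =~ /$re/g) {
            push @spans, [ $-[0], $+[0], $rule->{meta} ];
        }
    }
    return sort { $a->[0] <=> $b->[0] } @spans;
}

my $before = 'spam and eggs, then more spam';
my $after  = 'fresh ' . $before;    # an edit made without telling the layer

for my $doc ($before, $after) {
    print join(' ', map { join('-', @$_[0, 1]) } attach($doc, @layer)), "\n";
}
```

The /spam/ rule finds both occurrences before and after the edit, and the /spumco/ rule simply stays silent.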

Furthermore, to whatever extent you embed the content _into_ the
metadata, you have recreated an embedded-code representation but
decided to call it 'metadata'.  Uh uh.


Mmmm...any system of linking the meta-data to the text is going to
require embedded positioning of some sort, whether that's numerical
(after character 95, say) or content-relative (after the word "spumco").
Both require an unclean knowledge of the text.  You're right that a
content-relative representation appears to have a bit more knowledge;
I'm not sure that's automatically a bad thing.
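
To make the two kinds of positioning concrete, here is a small sketch under made-up assumptions (the sample sentence, the edit, and the helper name anchor_after_word are all hypothetical). It records a numerical anchor once, edits the text upstream of it, and compares the stale number with a freshly computed content-relative anchor.

```perl
use strict;
use warnings;

# Content-relative anchor: offset just past the first occurrence of $word,
# or undef if the word is no longer present.
sub anchor_after_word {
    my ($text, $word) = @_;
    return $text =~ /\Q$word\E/ ? $+[0] : undef;
}

my $text   = 'The studio spumco made cartoons.';
my $stored = anchor_after_word($text, 'spumco');    # numerical anchor, recorded once

my $edited = 'Reportedly, ' . $text;                # an edit upstream of the anchor

# The stored number still indexes into the string, but the six characters
# before it are no longer "spumco".
my $stale_ok = substr($edited, $stored - length('spumco'), length('spumco'))
               eq 'spumco';
my $fresh    = anchor_after_word($edited, 'spumco');

printf "stale offset %d still valid: %s; recomputed offset: %d\n",
    $stored, ($stale_ok ? 'yes' : 'no'), $fresh;
```

Either way the anchor encodes knowledge of the text; the content-relative one just has a chance of surviving the edit.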

The only representations I feel comfortable about manipulating are
in-memory multi-dimensional representations (a la Emacs buffers) where
all changes can be propagated immediately, and flattened representations
(a la XML) in which the metadata go wherever the data go and there is
no chance of their getting out of sync.


Aha!  I understand your idea much more fully now.  My entire bit about
regexps was based on the idea that meta-data and text would be
separate, possibly so separate that they might be on different machines,
or kept by different entities.  Once you assume a tight coupling ("...can
be propagated immediately..." and "...no chance of their getting out of
sync...") you auto-solve a lot of problems I was trying to fix.  You also
give up the idea of multiple instances of meta-data for one plaintext,
though.

 Xanadu had the luxury of being a
long-running server; Perl doesn't.


If you mean that you think this system has inefficiencies


No, I am concerned with the fragility of the regex approach.  Xanadu
could basically make up a multi-dimensional representation and
maintain it in perpetuity.  Perl, being a language used for transient
glue programs, does not have this luxury.  So the full glory of
separate text and metadata may be unachievable for Perl the language,
since we do not have the ability to ensure that text and metadata
remain in sync forever unless we write them out together (flat).


Yes.  If you can live without separate meta-data and plaintext, and
also without most uses of multiple instances of meta-data for
each plaintext, then the in-memory multidimensional tree is clearly
the goer.  It's only when you swing for the "separateness" fence
that you might try to optimize the "synchronize after unsynchronized
arbitrary edits" case.

On the same topic, could you add to the 'wanted' list hooks for
open(SPAM, "<http://whatever.spam.org");
and open(SPAM, ">http://whatever.spam.org");
?
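
Until such hooks exist, the read side can be approximated in user code. This is only a sketch under assumptions: open_url and http_get are hypothetical names, the whole body is fetched up front rather than streamed, and the write side is not attempted. HTTP::Tiny is loaded lazily so a caller can pass in a different fetcher.

```perl
use strict;
use warnings;

# Default fetcher.  HTTP::Tiny has shipped with Perl since 5.14; it is
# loaded lazily so callers who supply their own fetcher never need it.
sub http_get {
    my ($url) = @_;
    require HTTP::Tiny;
    my $res = HTTP::Tiny->new->get($url);
    die "GET $url failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    return $res->{content};
}

# open_url is a hypothetical helper, not a built-in hook: it fetches the
# whole body and returns a read-only in-memory filehandle over it.
sub open_url {
    my ($url, $fetch) = @_;
    $fetch ||= \&http_get;
    my $body = $fetch->($url);
    open(my $fh, '<', \$body)
        or die "cannot open in-memory handle: $!";
    return $fh;
}
```

Usage would look like `my $fh = open_url('http://whatever.spam.org'); while (<$fh>) { ... }`, which is close to what the requested open() hook would give for reading.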


I think it's about time for those.


Hosannas, etc.

F.