perl-unicode

Re: In-Band Information Considered Harmful

1998-10-23 11:16:22
Our favorite Ilya writes:
Chip Salzenberg writes:
I have a Tk widget with tags, and want to search for bold letter X
which follows non-bold one with Perl regexp.  How do you propose to do
it with non-sequential data?

Chip's/Ted's/Etc.'s idea of meta-data layering is a pleasurable one,
because it returns us to the good old days of information-as-a-stream-
of-unadorned-and-hence-extremely-manipulable-chars (Unix c. 1976)
while making room for the new requirements of markup.  As Ilya
points out, though, the separation naturally means there's a seam
between the content and its meta-data.

My guess is that meta-data layering is worth it, if only to avoid the
horrific HTML death spiral.  You'll note that it is now completely
impossible to meaningfully peruse a "well-designed" (e.g., full of
tables and frames for pixel-perfect brochure-like reproduction)
web site in a programmatic fashion without implementing an
HTML parser.

So assuming that layering is worth it, I propose the following
solutions to the following problems:

1.  "What about searching for a combination of meta-data and data?"

If you buy into the information-as-unadorned-chars and meta-data-
as-optional-and-nonintegral-layers-of-markup philosophy, then you
find yourself rarely attempting to use meta-data as part of a search.
But it would be possible to implement another regexp flag that
flattens the data and the meta-data layers into a single bytestream.
This would be at additional cost, but it's necessary anyway if you
have multiple possible meta-data layers.  For instance:

/Chip is back <bold>hey la</bold>/M:"http://chip.com/markup1.mar";
/Chip is back <bold>hey la</bold>/M:$localmarkup;
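
A rough sketch of the flattening idea, done in plain Perl rather than as
a new regexp flag. The flatten helper and the [start, end, tag] span
format are made up for illustration; the inline tags just borrow the
<bold> notation from the example above. It assumes the text itself holds
no literal '<' or '>' and that bold spans do not overlap.

```perl
use strict;
use warnings;

# Hypothetical helper: merge plain text and one out-of-band layer into a
# single tagged string. A layer is a list of [start, end, tag] spans with
# an exclusive end offset (an assumed representation).
sub flatten {
    my ($text, @spans) = @_;
    my (%open, %close);    # offset => tags that open / close there
    for my $span (@spans) {
        my ($start, $end, $tag) = @$span;
        push    @{ $open{$start} }, "<$tag>";
        unshift @{ $close{$end} },  "</$tag>";   # inner spans close first
    }
    my $out = '';
    for my $pos (0 .. length($text)) {
        $out .= join '', @{ $close{$pos} || [] };
        $out .= join '', @{ $open{$pos}  || [] };
        $out .= substr($text, $pos, 1) if $pos < length($text);
    }
    return $out;
}

my $text  = 'aXbX';
my @layer = ([3, 4, 'bold']);           # only the second X is bold
my $flat  = flatten($text, @layer);

# The Tk question: a bold X that follows a non-bold character.
if ($flat =~ m{[^>]<bold>X}) {
    print "bold X after a non-bold char in: $flat\n";
}
```

The search then runs over one ordinary string, so the cost of flattening
is paid only by the searches that actually ask about meta-data.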

2.  "How do you know a meta-data layer is appropriate to, and
synchronized with, a given piece of content?"

How about if meta-data layers are not themselves unadorned 
monotonic bytestreams, but instead something like regular
expressions?  For instance:

plaintext: This is a test of the emergency broadcast system.

layer: /^(This).*(test).*(broad.*?t).*$/(1=bold, 2=italic, 3=link:spam.html)

In this manner, edits to the text that don't disrupt the integrity of
the layer (for instance, inserting the word 'real' before 'emergency')
don't invalidate or corrupt the layer.  Since it's got all the power of
a regular expression, the layer creator would have the ability to
make her layer arbitrarily immune or picky.
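
One way this could look in today's Perl, as a sketch only: the
apply_layer helper and the span format are hypothetical, and the layer
syntax is split into a qr// pattern plus a list of attributes. The
pattern ends in .*$ and uses a non-greedy .*? so that the third group
captures just "broadcast". Offsets come from the builtin @- and @+
arrays.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a regexp "layer" to a text and return one
# [start, end, attribute] span per capture group, or an empty list when
# the layer no longer matches the text.
sub apply_layer {
    my ($text, $pattern, @attrs) = @_;
    return unless $text =~ $pattern;
    my @starts = @-;    # start offsets of the match and of each group
    my @ends   = @+;    # end offsets, same indexing
    my @spans;
    for my $i (1 .. scalar @attrs) {
        next unless defined $starts[$i];
        push @spans, [ $starts[$i], $ends[$i], $attrs[$i - 1] ];
    }
    return @spans;
}

my $layer = qr/^(This).*(test).*(broad.*?t).*$/;
my @attrs = ('bold', 'italic', 'link:spam.html');

for my $text (
    'This is a test of the emergency broadcast system.',
    'This is a test of the real emergency broadcast system.',
) {
    for my $span (apply_layer($text, $layer, @attrs)) {
        my ($start, $end, $attr) = @$span;
        printf "%-16s %s\n", $attr, substr($text, $start, $end - $start);
    }
}
```

Both texts yield the same three attributed words; only the offsets move.
A text that drops "test" gets no spans at all, which is the layer
noticing it is out of sync.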

3.  "How do you make this as invisible as possible?"

In one sense this would make life a lot easier for Perl users,
because there's currently no notion of meta-data at all in Perl,
so people have to roll it themselves in regexps, usually with
HTML.  Making it so that regexps never or rarely had meta-data
would make writing regexps a lot easier in the majority case.

The problem would be importing and exporting layered text,
I think; text can come in as flat ASCII versions of a layered
text (e.g., HTML), as unadorned ASCII (e.g., /etc/passwd),
as adorned non-flat ASCII (e.g. PDF), or as something
unspeakable (e.g., RTF).  And it wants to go out in any of
those formats, as well.

Warning: my tongue is about to speke heresees most vile.

So it would be great if the perl builtins (<FILE>, print)
intuitively understood about meta-data and organized it
themselves.  Here are some lame and contradictory examples
from which you can conceivably infer what I'm talking about.

($textline, $metadataline) = <HTMLFILE>;
$textline = <TEXTFILE>;
$textline = <PDFFILE.TEXT>;
$metadataline = <PDFFILE.METADATA>;

print PDFFILE.BOTH("textline", "metadataline");

I haven't thought this through much, but the alternative
(users have advanced knowledge of meta-data and text
and do their own organization) is pretty grotesque.
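
Until the builtins learn such tricks, the import side can be faked with
an ordinary function. This is a minimal sketch: split_layers is a
made-up name, the [start, end, tag] span format is an assumption, and it
copes only with simple, properly nested <tag>...</tag> pairs without
attributes. Real HTML still needs a real parser.

```perl
use strict;
use warnings;

# Hypothetical reader: split one flat, tagged line into plain text plus a
# reference to a list of [start, end, tag] spans (end offset exclusive).
sub split_layers {
    my ($line) = @_;
    my $text = '';
    my (@open, @spans);
    # The capturing group makes split return the tags as well as the text.
    for my $piece (split /(<\/?\w+>)/, $line) {
        if ($piece =~ m{^<(/?)(\w+)>$}) {
            my ($closing, $tag) = ($1, $2);
            if ($closing) {
                my $opened = pop @open;
                # Ignore closers that do not match the innermost opener.
                next unless $opened && $opened->[0] eq $tag;
                push @spans, [ $opened->[1], length($text), $tag ];
            }
            else {
                push @open, [ $tag, length($text) ];
            }
        }
        else {
            $text .= $piece;
        }
    }
    return ($text, \@spans);
}

my ($textline, $metadataline) =
    split_layers('Chip is back <bold>hey la</bold>');
print "$textline\n";
print join(',', @$_), "\n" for @$metadataline;
```

The text comes back unadorned and ready for ordinary regexps, and the
spans travel alongside it for whoever cares about them.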

4.  "What about speed?"

In the large majority case, a system with regexp-like meta-
data-layers and a stream of unadorned chars would be faster
to handle than the current system (because the code writer
often has to keep in mind the possibility that her plaintext is
not completely plained yet).  Matching against meta-data, which is
a brand new concept, would be only slightly slower than
the current system (because in the current system, you have to
match against a regexp anyway).  Perhaps making sure that
layers were always studied and a flatten command existed for
the speed demons would help.

Tk widget stores metadata separately.  The question is how to
seamlessly apply Perl text-handling abilities to these data.  My
conviction (after spending *a lot* of time alone with my brain and
this question) is to use inband data, and modify Perl to handle these
data transparently.

This is a lot easier to implement, and it saves you the trouble
of having to keep two disparate things around.  However, you
tightly bind one version of the meta-data to the text, and give up
the very nifty multiple-meta-datas-for-one-text idea.  Further,
you lock yourself in to one particular meta-data representation.
That might itself be a good thing, but you'd have to be sure you
designed it to work sufficiently well with all of the file formats
that will ever exist -- tricky, but possible.  The separated-meta-data
approach permits you to come up with entirely novel and bizarre
meta-data formats as they emerge.

This is how Emacs implements its markup.  It is a binary tree which
contains attributes-boundaries in the order they appear in the
buffer.
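
To make the boundary idea concrete, here is a small sketch. It swaps
the binary tree for a sorted array searched by bisection, which gives
the same lookup but is not how Emacs itself is written. The data layout
and the attrs_at name are hypothetical, and the first boundary is
assumed to sit at offset 0.

```perl
use strict;
use warnings;

# Boundaries in buffer order: each entry names the attributes that apply
# from its offset up to the next boundary (assumed layout).
my @boundaries = (
    [ 0,  [] ],
    [ 13, ['bold'] ],
    [ 19, [] ],
);

# Bisect for the last boundary at or before $pos and return its attributes.
sub attrs_at {
    my ($pos, $bounds) = @_;
    my ($lo, $hi) = (0, $#$bounds);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        if ($bounds->[$mid][0] <= $pos) { $lo = $mid }
        else                            { $hi = $mid - 1 }
    }
    return @{ $bounds->[$lo][1] };
}

for my $pos (5, 15, 25) {
    my @attrs = attrs_at($pos, \@boundaries);
    printf "offset %2d: %s\n", $pos, @attrs ? join(',', @attrs) : '(plain)';
}
```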

However, regular expressions do not map well to this picture.  Emacs
has 3 different notions of search: by REx, by syntax (find matching
paren etc.) and by text attributes.  There is no simple way to combine
them.

Now Perl RExen are very close to allowing you to combine syntax and REx
in your "searches"/matches.  I want to have all three seamlessly merged.
The paradigm of using RExen for text-processing is too powerful to
be satisfied by half-measures.

I'm sure regexps can be made to handle in-band and out-of-band
meta-data fairly seamlessly; the failure of Emacs to do it does not
mean it can't be done.  I think everyone agrees that the merge needs
to be seamless, if Topaz is to have an opinion about meta-data at all.

Oh, BTW, just in case you're not confused enough, layers also have
the valuable property of being able to be used in serial, as well as
in parallel (several layers applied one after another to one text, in
addition to several alternative layers possible for one text), and
further have the valuable property of being able to selectively remove
data from the text (which is more difficult with in-band markup).
Child-safety layers, anyone?  s/python/<censored>/g;
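
A toy version of layers used in serial, assuming (purely for
illustration) that a layer is a code ref taking a string and returning a
new one. The render helper and the layer names are made up. The stored
text is never modified; layers act only on the copy handed out.

```perl
use strict;
use warnings;

# Hypothetical: run a text through a list of layers, one after another.
sub render {
    my ($text, @layers) = @_;
    $text = $_->($text) for @layers;
    return $text;
}

# A removing layer and a purely cosmetic one.
my $child_safety = sub { (my $t = $_[0]) =~ s/python/<censored>/g; $t };
my $shout        = sub { uc $_[0] };

my $stored = 'perl and python';
print render($stored, $child_safety), "\n";           # one layer
print render($stored, $child_safety, $shout), "\n";   # two layers in serial
print "$stored\n";                                    # original untouched
```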

F.