perl-unicode

Re: In-Band Information Considered Harmful

1998-10-26 00:37:37
On Sun, 25 Oct 1998, Ilya Zakharevich wrote:

Dan Sugalski writes:
Assume four separate, completely unrelated streams of meta-data:

* Language (French, English, Latin, Perl)
* Display properties (color, font, size, orientation)
* HTML Markup
* Glossary links

If our source string is:

A simple perl statement looks something like print $foo, "\n". Easy, isn't
it?

Fully marked up in-band it looks like:

<LANGUAGE=English><P><COLOR=black><BGCOLOR=white><FONT=Albertus><FONTSIZE=12>
<FONTUNIT=point>A simple <A
HREF="http://www.perl.com"><GLOSSARY=perl><I>perl</I></A></GLOSSARY>
<GLOSSARY=statement>statement</GLOSSARY> looks something like
</FONT></LANGUAGE><LANGUAGE=perl><FONT=Courier><CODE><GLOSSARY=Print
SUBGLOSSARY=perl>print
<FONTSTYLE=italic><GLOSSARY=Scalar variable
SUBGLOSSARY=perl>$foo</FONTSTYLE></GLOSSARY>,
<FONTMOD=notypographerquote>"</FONTMOD>\n<FONTMOD=notypographerquote>"</FONTMOD>
</FONT></LANGUAGE><LANGUAGE=English><FONT=Albertus>.
Easy, isn't it?

Now, extract the text and HTML markup. A task made more difficult by the
fact that the <FONT> tags are actually display property metadata, not HTML
metadata.

I (and I hope Chip) are not concerned by the extraction step.  It is a
job for a module, not for the core.

I hope Chip *is* interested in the extraction step. And the creation step. 
You can't deal with the data in a vacuum. That's not the only issue,
though. I'll get to that in a bit.

Going completely inband means that either you're limited to a single
metadata stream, you get ambiguous results *and* your program (or perl)
needs complete information about the metadata (if you want just the HTML
metadata from a mixed stream, you need to know all the HTML metadata), or
you need to have all the metadata creators coordinate their development
efforts so there's no overlap.

Did not understand a word.  So you have several types of metadata,
some of them have (type => 'font'), some have (type => 'language') in
the associated hash.  Now what?

If the data and metadata are mixed, and you have several types of metadata,
you have the potential for metadata collisions.

In the example I gave, there's a <FONT> metaelement in the HTML set, and
one in the display set. The <FONT> metaelement appears in a data
stream that contains both HTML metadata and display metadata. Which set
does it belong in?

Unless you're proposing some sort of registry for metaelements to avoid
clashes. You're not, I hope--that opens up a huge mess.
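
The collision can be sketched in a few lines. This is only an illustration: the two tag vocabularies and the streams_claiming() helper are made up for the example, not taken from any real module.

```perl
use strict;
use warnings;

# Hypothetical tag vocabularies for two independent metadata streams.
my %vocabulary = (
    HTML    => { map { $_ => 1 } qw(P A I CODE FONT) },
    display => { map { $_ => 1 } qw(COLOR BGCOLOR FONT FONTSIZE FONTUNIT) },
);

# Return the names of all streams whose vocabulary contains the tag.
sub streams_claiming {
    my ($tag) = @_;
    my @streams = sort grep { $vocabulary{$_}{$tag} } keys %vocabulary;
    return @streams;
}

my $inband = '<P><COLOR=black><FONT=Albertus>A simple statement</FONT></P>';

# Look at each opening tag and report which streams could own it.
while ($inband =~ /<(\w+)/g) {
    my $tag    = $1;
    my @owners = streams_claiming($tag);
    printf "%-6s claimed by: %s%s\n", $tag, join(', ', @owners),
        @owners > 1 ? ' (ambiguous)' : '';
}
```

P and COLOR each have exactly one owner, but FONT is claimed by both sets, and nothing in the in-band string says which one wrote it.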

OTOH, multiple out-band metadata streams mean that you can (with some
extended syntax, presumably) deal with only the streams that you're
interested in, don't need intimate knowledge of the metadata
definitions, and have far less text to chew through if you don't want to
know about the metadata at all.

Have no idea what you are talking about again.  Do you mean a REx escape
like \M{type=>font,style=>italic} or what?

Something like that, more or less. The actual syntax isn't that important.
(Though I'd go for \M{stream=>HTML,type=>font,color=>green} if forced to
pick something out of the air)

Having multiple streams of metadata for your data makes it easier to
manipulate the data based on some of those streams without having to deal
with (or even know any of the details about) all of them.

In the example I gave, I don't even want to begin to think of the code
required to extract the text and HTML markup only from an in-band data
stream.

  s/\M{^type==html}//g

To do that means that the RE engine needs to know about all the
metaelements that fall in the html class. You're really not proposing that
perl embeds intimate knowledge of HTML's markup into it, are you? And if
so, are you also proposing a mechanism such that perl can fetch and decode
the latest DTD so it's not left behind by new versions of HTML? And, of
course, what about all the *other* types of metadata? Shall we embed
those, too? Or just declare them as second-class metadata?

Out-band could look as simple as:

    $html_only = extract($source_string, "HTML");

I have no slightest idea why do you think that extract() cannot do it
with inline data!  Again, you mix semantic and implementation.

It can. But to extract one metadata stream from all the others requires
either the streams be separate (which I'm proposing), or for the extractor
to know an awful lot about the details of the metadata.
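
A minimal sketch of what separate streams could look like, assuming a plain hash stands in for the annotated string. The layout (a text field plus named streams of start/length records) and this extract() are hypothetical, chosen only to show that the extractor never inspects the metadata it keeps or drops.

```perl
use strict;
use warnings;

# Hypothetical out-of-band layout: one text string, plus one list of
# annotations per metadata stream. Offsets index into the text.
my %doc = (
    text    => 'A simple perl statement',
    streams => {
        HTML     => [ { start => 9, length => 4,  tag   => 'I' } ],
        display  => [ { start => 0, length => 23, font  => 'Albertus' } ],
        glossary => [ { start => 9, length => 4,  entry => 'perl' } ],
    },
);

# Keep the text and only the named streams. The contents of each stream
# are passed through untouched, so no knowledge of HTML is needed here.
sub extract {
    my ($doc, @wanted) = @_;
    my %keep;
    for my $name (@wanted) {
        next unless exists $doc->{streams}{$name};
        $keep{$name} = $doc->{streams}{$name};
    }
    return { text => $doc->{text}, streams => \%keep };
}

my $html_only = extract(\%doc, 'HTML');
print join(', ', sort keys %{ $html_only->{streams} }), "\n";
```

Selection is by stream name alone, so a new kind of HTML annotation added later would come along without any change to extract().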

And yes, I realize that functionality could be built into modules for
in-band data, but then you need to keep it up-to-date with the latest
metadata versions or risk not getting everything, and spend a *lot* of CPU
time chewing through what's essentially meaningless data anyway. In our
case more than half the data is not text or HTML, but it's still got to be
processed to extract the text and HTML, *plus* there's more overhead
involved in checking to see which tags are HTML and which aren't. (And
it'll miss all those new HTML 6.2 things that shouldn't ever exist but
inevitably will) 
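
For comparison, a sketch of the in-band module approach, assuming the only way to recognise HTML tags is a hard-coded list. The %known_html list, extract_html_inband(), and the <BLINK2> tag (standing in for some future HTML element) are all invented for the example.

```perl
use strict;
use warnings;

# Hypothetical whitelist of HTML tag names the module knows today.
my %known_html = map { $_ => 1 } qw(P A I CODE);

# Keep a tag only if its name is on the whitelist; drop every other tag.
# Every tag in the string has to be matched and classified, wanted or not.
sub extract_html_inband {
    my ($string) = @_;
    $string =~ s{(</?(\w+)[^>]*>)}{ $known_html{uc $2} ? $1 : '' }ge;
    return $string;
}

my $inband = '<COLOR=black><P>Hello <BLINK2>world</BLINK2></P>';
print extract_html_inband($inband), "\n";   # prints <P>Hello world</P>
```

The display tag is stripped as intended, but so is <BLINK2>: the whitelist has no way to tell a new HTML element from foreign metadata until someone updates it.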

I do not see why do you think one implementation is going to be better
than another.

I probably wasn't clear.

I'm *not* proposing one data stream and one metadata stream. I'm proposing
one data stream and *multiple* metadata streams. One stream per metadata
type attached to the data.

Separating the data and metadata is a big efficiency win in many cases.
Since my example wasn't that great, let's take another.

Take some of the web pages out there. 1K of text, 30K of Javascript code. 
Slurp that into a scalar. With the in-band method, you've got a 31K
scalar. Do a simple s/\bteh\b/the/; (to correct one of my common typos)
and you've scanned through 31K, 30K of it absolutely meaningless to the RE
engine. That's 30K of data the RE *had* to run through *for no reason at
all*. It can't be skipped--who knows when the metadata ends? But it's a
waste of time. (For a more degenerate case, try it on your average Word
document--800K of document for 2K of text. But that's more an argument for
shooting Word, so we won't use it)

Metadata that's in separate streams is data the RE engine (and all the
other bits of perl) doesn't need to touch when it's dealing with the
normal data. And I expect perl will be dealing with the normal data far
more than the metadata for an awfully long time.
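
A rough sketch of the size difference, assuming roughly 1K of text and 30K of script. The hash layout is hypothetical, and the lengths are only a stand-in for how much input the RE engine is handed in each case.

```perl
use strict;
use warnings;

# Roughly 1K of text containing the typo, and 30K of filler "script".
my $text   = 'some teh text. ' x 64;
my $script = 'x' x (30 * 1024);

# In-band: script and text share one scalar.
my $inband = "<SCRIPT>$script</SCRIPT>$text";

# Out-of-band (hypothetical layout): the script lives in its own stream.
my %outband = (
    text    => $text,
    streams => { script => [ { start => 0, code => $script } ] },
);

# The same substitution, applied to each representation.
(my $fixed_inband = $inband) =~ s/\bteh\b/the/g;
$outband{text} =~ s/\bteh\b/the/g;

printf "in-band input:     %d characters\n", length $inband;
printf "out-of-band input: %d characters\n", length $outband{text};
```

The replacement here has the same length as the typo, so stored offsets stay valid. An edit that changes the length of the text would also have to shift the offsets in any stream that records positions, which is the price of keeping the streams separate.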

                                        Dan