perl-unicode

Re: In-Band Information Considered Harmful

1998-10-25 21:04:27
On Sun, 25 Oct 1998, Ilya Zakharevich wrote:

> Dan Sugalski writes:
> > With in-band meta-data, how do you (the collective you, not any particular
> > you) plan on handling multiple, overlapping, unrelated sets of metadata?
> > It seems like it'd get awfully messy with a single data stream, which is
> > what we get with in-band metadata.
>
> Like
>
>      "<b>foo <i>bar</b> baz</i>"
>
> ?  What is the problem you have in mind?  Since metadata is *easily*
> distinguishable from data, there is no need to require proper nesting
> even with in-band implementation.

Well, I was thinking of something along these lines.

Assume four separate, completely unrelated streams of meta-data:

* Language (French, English, Latin, Perl)
* Display properties (color, font, size, orientation)
* HTML Markup
* Glossary links

If our source string is:

A simple perl statement looks something like print $foo, "\n". Easy, isn't
it?

Fully marked up in-band it looks like:

<LANGUAGE=English><P><COLOR=black><BGCOLOR=white><FONT=Albertus><FONTSIZE=12>
<FONTUNIT=point>A simple <A
HREF="http://www.perl.com"><GLOSSARY=perl><I>perl</I></A></GLOSSARY>
<GLOSSARY=statement>statement</GLOSSARY> looks something like
</FONT></LANGUAGE><LANGUAGE=perl><FONT=Courier><CODE><GLOSSARY=Print
SUBGLOSSARY=perl>print
<FONTSTYLE=italic><GLOSSARY=Scalar variable
SUBGLOSSARY=perl>$foo</FONTSTYLE></GLOSSARY>,
<FONTMOD=notypographerquote>"</FONTMOD>\n<FONTMOD=notypographerquote>"</FONTMOD>
</FONT></LANGUAGE><LANGUAGE=English><FONT=Albertus>.
Easy, isn't it?

Now, extract the text and HTML markup, a task made more difficult by the
fact that the <FONT> tags are actually display property metadata, not HTML
metadata.

Going completely in-band means one of three things: you're limited to a
single metadata stream; or you get ambiguous results *and* your program (or
perl) needs complete information about the metadata (if you want just the
HTML metadata from a mixed stream, you need to know all the HTML metadata);
or all the metadata creators have to coordinate their development efforts
so there's no overlap.
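
To make the second case concrete, here is a minimal sketch (in Python,
purely for illustration) of pulling just the HTML tags out of an in-band
string like the one above. The names KNOWN_HTML_TAGS, TAG_RE and
strip_non_html are hypothetical, and the tag grammar is a simplifying
assumption. The point is only that the filter has to carry its own list of
HTML tag names, and that the list can be both ambiguous (FONT) and
incomplete (anything newer than the list):

```python
import re

# Assumption: the filter must be told every HTML tag name up front.
KNOWN_HTML_TAGS = {"P", "A", "I", "CODE"}

# Simplified tag grammar: <NAME...> or </NAME...>, name made of letters only.
TAG_RE = re.compile(r"<(/?)([A-Za-z]+)([^>]*)>")


def strip_non_html(marked_up, known_tags=KNOWN_HTML_TAGS):
    """Keep text and tags named in known_tags; drop every other tag."""
    def keep_or_drop(match):
        name = match.group(2).upper()
        return match.group(0) if name in known_tags else ""
    return TAG_RE.sub(keep_or_drop, marked_up)


sample = ("<LANGUAGE=English><P><FONT=Albertus>A simple <I>perl</I> "
          "<GLOSSARY=statement>statement</GLOSSARY></FONT></LANGUAGE>")

# Works only because the whitelist happens to be right for this input.
print(strip_non_html(sample))
# Treat FONT as HTML and the display-property metadata leaks through.
print(strip_non_html(sample, KNOWN_HTML_TAGS | {"FONT"}))
# A tag the whitelist has never heard of is silently dropped.
print(strip_non_html("<NEWTAG>x</NEWTAG>"))
```

Every character of every stream still goes through the regex, whether or
not the caller cares about that stream.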

OTOH, multiple out-band metadata streams mean that you can (with some
extended syntax, presumably) deal with only the streams that you're
interested in, don't need intimate knowledge of the metadata definitions,
and have far less text to chew through if you don't want to know about the
metadata at all.
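
As a rough sketch of what "multiple out-band streams" could mean in
practice (Python for illustration only; Span, AnnotatedText, annotate and
at are hypothetical names, and character offsets are just one possible way
to tie metadata to text): the text stays untouched, and each stream is a
separately named list of spans that may overlap spans in other streams.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    start: int  # offset of the first character covered
    end: int    # offset one past the last character covered
    value: str  # payload, meaningful only to the stream that owns it


@dataclass
class AnnotatedText:
    text: str
    streams: dict = field(default_factory=dict)  # stream name -> list of Span

    def annotate(self, stream, start, end, value):
        """Attach a span to one stream; the text and other streams are untouched."""
        self.streams.setdefault(stream, []).append(Span(start, end, value))

    def at(self, stream, offset):
        """Values from one named stream that cover the given character offset."""
        return [s.value for s in self.streams.get(stream, [])
                if s.start <= offset < s.end]


text = 'A simple perl statement looks something like print $foo, "\\n". Easy, isn\'t it?'
doc = AnnotatedText(text)

code_start = text.index("print")
code_end = text.index(".", code_start)  # the full stop that ends the sentence
word = text.index("perl")

doc.annotate("language", 0, code_start, "English")
doc.annotate("language", code_start, code_end, "Perl")
doc.annotate("language", code_end, len(text), "English")
doc.annotate("display", code_start, code_end, "FONT=Courier")
doc.annotate("glossary", word, word + 4, "perl")  # overlaps the English span

print(doc.at("language", code_start))  # ['Perl']
print(doc.at("glossary", word))        # ['perl']
print(doc.text == text)                # True
```

Code that only wants the plain text reads doc.text and never looks at a
single span; code that wants one stream never has to parse the others.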

In the example I gave, I don't even want to begin to think of the code
required to extract the text and HTML markup only from an in-band data
stream. Out-band could look as simple as:

        $html_only = extract($source_string, "HTML");
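
For what it's worth, here is one way such an extract() might work over
out-band streams, as a hedged Python sketch rather than anything perl
actually provides. It assumes each stream is a list of (start, end, tag)
tuples over character offsets, that spans within a single stream are
non-empty and properly nested, and that a closing tag is just the first
word of the opening one; all of these are assumptions made for the example.

```python
def extract(text, streams, name):
    """Render text with only the named stream's spans as inline tags."""
    events = []
    for start, end, tag in streams.get(name, []):
        closer = tag.split()[0]  # "A HREF=..." closes as </A>
        # Sort key: offset, closes (0) before opens (1), then nesting order
        # (outer spans open first, inner spans close first).
        events.append((start, 1, -end, f"<{tag}>"))
        events.append((end, 0, -start, f"</{closer}>"))

    out, pos = [], 0
    for offset, _kind, _tie, markup in sorted(events):
        out.append(text[pos:offset])
        out.append(markup)
        pos = offset
    out.append(text[pos:])
    return "".join(out)


text = "A simple perl statement"
perl = text.index("perl")
stmt = text.index("statement")

streams = {
    "HTML": [(0, len(text), "P"), (perl, perl + 4, "I")],
    "glossary": [(perl, perl + 4, "GLOSSARY=perl"),
                 (stmt, stmt + 9, "GLOSSARY=statement")],
    "display": [(0, len(text), "FONT=Albertus")],
}

print(extract(text, streams, "HTML"))
# <P>A simple <I>perl</I> statement</P>
```

Note that extract() never inspects the glossary or display streams and
needs no list of which tag names count as HTML; asking for a stream that
doesn't exist just hands back the plain text.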

And yes, I realize that functionality could be built into modules for
in-band data, but then you need to keep it up-to-date with the latest
metadata versions or risk not getting everything, and spend a *lot* of CPU
time chewing through what's essentially meaningless data anyway. In our
case more than half the data is not text or HTML, but it's still got to be
processed to extract the text and HTML, *plus* there's more overhead
involved in checking to see which tags are HTML and which aren't. (And
it'll miss all those new HTML 6.2 things that shouldn't ever exist but
inevitably will.)

                                        Dan