perl-unicode

Re: In-Band Information Considered Harmful

1998-10-26 00:37:31
Dan Sugalski writes:
Assume four separate, completely unrelated streams of meta-data:

* Language (French, English, Latin, Perl)
* Display properties (color, font, size, orientation)
* HTML Markup
* Glossary links

If our source string is:

A simple perl statement looks something like print $foo, "\n". Easy, isn't
it?

Fully marked up in-band it looks like:

<LANGUAGE=English><P><COLOR=black><BGCOLOR=white><FONT=Albertus><FONTSIZE=12>
<FONTUNIT=point>A simple <A
HREF="http://www.perl.com"><GLOSSARY=perl><I>perl</I></A></GLOSSARY>
<GLOSSARY=statement>statement</GLOSSARY> looks something like
</FONT></LANGUAGE><LANGUAGE=perl><FONT=Courier><CODE><GLOSSARY=Print
SUBGLOSSARY=perl>print
<FONTSTYLE=italic><GLOSSARY=Scalar variable
SUBGLOSSARY=perl>$foo</FONTSTYLE></GLOSSARY>,
<FONTMOD=notypographerquote>"</FONTMOD>\n<FONTMOD=notypographerquote>"</FONTMOD>
</FONT></LANGUAGE><LANGUAGE=English><FONT=Albertus>.
Easy, isn't it?

Now extract the text and HTML markup, a task made more difficult by the
fact that the <FONT> tags are actually display property metadata, not HTML
metadata.

I (and, I hope, Chip) am not concerned with the extraction step.  It is a
job for a module, not for the core.

Going completely inband means that either you're limited to a single
metadata stream, you get ambiguous results *and* your program (or perl)
needs complete information about the metadata (if you want just the HTML
metadata from a mixed stream, you need to know all the HTML metadata), or
you need to have all the metadata creators coordinate their development
efforts so there's no overlap.

I did not understand a word of this.  So you have several types of
metadata: some have (type => 'font') and some have (type => 'language')
in the associated hash.  Now what?

OTOH, multiple out-band metadata streams mean that you can (with some
extended syntax, presumably) deal with only the streams that you're
interested in, not have to have intimate knowledge of the metadata
definitions, and have far less text to chew through if you don't want to
know about the metadata at all.

Again, I have no idea what you are talking about.  Do you mean a REx
escape like \M{type=>font,style=>italic}, or what?
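
One possible reading of the "multiple out-band streams" idea, as a minimal
sketch only: the text is stored once and each metadata stream is kept as a
separate list of [offset, value] pairs.  The layout, the %doc structure and
the with_stream() helper are all assumptions made for illustration; nothing
like them has been specified in this thread.

```perl
use strict;
use warnings;

# Hypothetical out-band layout: plain text plus one list of
# [offset, value] pairs per metadata stream.
my %doc = (
    text    => 'A simple perl statement',
    streams => {
        html     => [ [ 9, '<I>' ], [ 13, '</I>' ] ],
        language => [ [ 0, 'English' ] ],
        font     => [ [ 0, 'Albertus' ] ],
    },
);

# Merge a single named stream back into the text, ignoring all others.
sub with_stream {
    my ($doc, $name) = @_;
    my $text  = $doc->{text};
    my @marks = @{ $doc->{streams}{$name} || [] };

    # Insert from the highest offset down so earlier offsets stay valid.
    for my $mark (sort { $b->[0] <=> $a->[0] } @marks) {
        substr($text, $mark->[0], 0, $mark->[1]);
    }
    return $text;
}

print with_stream(\%doc, 'html'), "\n";
# prints: A simple <I>perl</I> statement
```

In this sketch a caller who wants no metadata at all just reads
$doc{text}, and a caller who wants one stream never looks at the others.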


In the example I gave, I don't even want to begin to think of the code
required to extract the text and HTML markup only from an in-band data
stream.

  s/\M{^type==html}//g

Out-band could look as simple as:

      $html_only = extract($source_string, "HTML");

I have not the slightest idea why you think that extract() cannot do it
with inline data!  Again, you are mixing up semantics and implementation.
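
To illustrate the point, here is a rough sketch of an extract() with the
same calling convention as the call above, but working on inline data.  It
assumes (purely for illustration) that the source is an array reference in
which plain scalars are text and hash references are metadata markers
carrying a `type` key; neither this representation nor this extract() is an
existing or proposed interface.

```perl
use strict;
use warnings;

# Hypothetical inline representation: scalars are text, hash refs are
# metadata markers embedded in the same sequence.
my $source_string = [
    { type => 'font',     value => '<FONT=Albertus>' },
    'A simple ',
    { type => 'html',     value => '<I>' },
    'perl',
    { type => 'html',     value => '</I>' },
    { type => 'glossary', value => '<GLOSSARY=statement>' },
    ' statement',
];

# Keep all text, plus only the markers whose type matches $want.
# The caller needs no knowledge of what the other marker types mean.
sub extract {
    my ($source, $want) = @_;
    my $out = '';
    for my $item (@$source) {
        if (ref($item) eq 'HASH') {
            $out .= $item->{value} if lc($item->{type}) eq lc($want);
        }
        else {
            $out .= $item;
        }
    }
    return $out;
}

my $html_only = extract($source_string, "HTML");
print "$html_only\n";
# prints: A simple <I>perl</I> statement
```

The caller sees the same interface either way; whether the markers live
inside the string or beside it is hidden behind extract().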

And yes, I realize that functionality could be built into modules for
in-band data, but then you need to keep it up-to-date with the latest
metadata versions or risk not getting everything, and spend a *lot* of CPU
time chewing through what's essentially meaningless data anyway. In our
case more than half the data is not text or HTML, but it's still got to be
processed to extract the text and HTML, *plus* there's more overhead
involved in checking to see which tags are HTML and which aren't. (And
it'll miss all those new HTML 6.2 things that shouldn't ever exist but
inevitably will.)

I do not see why you think one implementation is going to be better
than another.

Ilya