Re: In-Band Information Considered Harmful

Dan Sugalski writes:

Going completely inband means that either you're limited to a single
metadata stream, you get ambiguous results *and* your program (or perl)
needs complete information about the metadata (if you want just the HTML
metadata from a mixed stream, you need to know all the HTML metadata), or
you need to have all the metadata creators coordinate their development
efforts so there's no overlap.


Did not understand a word.  So you have several types of metadata,
some of them have (type => 'font'), some have (type => 'language') in
the associated hash.  Now what?


If the data and metadata are mixed, and you have several types of metadata,
you have the potential for metadata collisions.

In the example I gave, there's a <FONT> metaelement in the HTML set, and
one in the markup set. The <FONT> metaelement appears in the data
stream that contains HTML metadata and display metadata. Which set does it
belong in?


I explained that I do not care about importation from a foreign
format.  The import-translator should take care of this according to
the rules of the import format, and hints about what you are going to
do with the output.

Having multiple streams of metadata for your data makes it easier to
manipulate the data based on some of those streams without having to deal
with (or even know any of the details about) all of them.


Words words words...

In the example I gave, I don't even want to begin to think of the code
required to extract the text and HTML markup only from an in-band data
stream.


  s/\M{^type==html}//g


To do that means that the RE engine needs to know about all the
metaelements that fall in the html class.


What for?  It sees that an element is marked as not having type =>
'html', it throws it out.
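
A minimal sketch of that filtering step, assuming a hypothetical in-band
representation in which a string is a list of elements and the import
routine has already put a 'type' key on every metadata element.  The
names @stream and keep_type are made up for illustration; nothing like
them is proposed in this thread.

```perl
use strict;
use warnings;

# Hypothetical in-band stream: a plain character is a one-char string,
# a metadata element is a hash reference carrying at least a 'type' key.
my @stream = (
    'H', 'i',
    { type => 'html', tag  => 'B' },
    { type => 'font', name => 'Courier' },
    '!',
    { type => 'html', tag  => '/B' },
);

# Keep plain characters and elements of the requested type.  Every
# other metadata element is dropped by looking at its 'type' key only.
sub keep_type {
    my ($type, @elements) = @_;
    return grep { !ref($_) || $_->{type} eq $type } @elements;
}

my @html_only = keep_type('html', @stream);
```

The filter never inspects 'tag' or 'name', which is the point being
argued: it needs the type key and nothing else.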

Out-band could look as simple as:

  $html_only = extract($source_string, "HTML");


I have not the slightest idea why you think that extract() cannot do it
with inline data!  Again, you mix semantics and implementation.


It can. But to extract one metadata stream from all the others requires
either that the streams be separate (which I'm proposing), or that the
extractor know an awful lot about the details of the metadata.


Nope.  The import routine should mark elements with appropriate keys;
after this the core does not give a damn about anything else.

I'm *not* proposing one data stream and one metadata stream. I'm proposing
one data stream and *multiple* metadata streams. One stream per metadata
type attached to the data.

Separating the data and metadata is a big efficiency win in many cases.
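
A rough sketch of what "one data stream and multiple metadata streams"
could look like, assuming each stream is a list of [offset, payload]
records kept beside the text.  The layout of %source and this
particular extract() are assumptions made for illustration, not an
existing interface.

```perl
use strict;
use warnings;

# Hypothetical out-of-band layout: the text is stored once, and each
# metadata type has its own list of [offset, payload] records.
my %source = (
    data    => 'Hi!',
    streams => {
        HTML    => [ [ 2, '<B>' ], [ 3, '</B>' ] ],
        display => [ [ 0, 'font=Courier' ] ],
    },
);

# Illustrative stand-in for extract(): returns the data plus the one
# requested stream, without reading any of the other streams.
sub extract {
    my ($src, $type) = @_;
    return {
        data    => $src->{data},
        streams => { $type => $src->{streams}{$type} || [] },
    };
}

my $html_only = extract(\%source, 'HTML');
```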


Words words words...

Since my example wasn't that great, let's take another.

Take some of the web pages out there. 1K of text, 30K of Javascript code. 
Slurp that into a scalar. With the in-band method, you've got a 31K
scalar.


Why?  It is 1K of "text" chars, + one char which has the javascript
code "inside the corresponding bitfield".

Do a simple s/\bteh\b/the/; (to correct one of my common typos)
and you've scanned through 31K, 30K of it absolutely meaningless to the RE
engine.


Why do you need to scan through 1 char (even *if* it takes 30K *bytes*
for storage)? (*)
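
A sketch of the "one char" idea, assuming (as in the footnote) that
character values from 2**32 upward index a global metadata table.
META_BASE, @META_TABLE, plain() and metadata_for() are hypothetical
names, and a list of integers stands in for the real string storage.

```perl
use strict;
use warnings;

use constant META_BASE => 2**32;   # assumed first "metadata" character value

# Hypothetical global table: the 30K of Javascript is stored here once.
my @META_TABLE = (
    { type => 'javascript', code => 'x' x 30_000 },
);

# Turn ordinary text into a list of character values.
sub plain { return map { ord($_) } split //, $_[0] }

# The string: ordinary characters plus ONE metadata character.
my @chars = ( plain('fix teh '), META_BASE, plain('typo') );

# Resolve a character value to its metadata record (undef if plain).
sub metadata_for {
    my ($code) = @_;
    return $code >= META_BASE ? $META_TABLE[ $code - META_BASE ] : undef;
}

my $n_chars = scalar @chars;
```

However large the table entry is, a scan over @chars steps across it
as a single element.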

That's 30K of data the RE *had* to run through *for no reason at
all*. It can't be skipped--who knows when the metadata ends?


Everybody knows.  That's utf* - the length is known in advance.

Ilya

(*) With the implementation I have in mind every char ("pure" one and
    "metadata" one) takes at most 7 bytes.  The chars with numeric
    value above 1<<32 (approx 15*(1<<32) of them) encode offsets into
    a global table of metadata.  Probably it is wise to take some bits
    off to denote insertion-deletion properties.
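
A sketch of why the length is "known in advance", assuming a
UTF-8-style encoding extended so that lead byte 0xFE introduces a
7-byte sequence.  That extension and the names seq_len and
char_offsets are assumptions for illustration.

```perl
use strict;
use warnings;

# Sequence length implied by a lead byte.  The 0xFE => 7 case is the
# assumed extension; continuation bytes and 0xFF return 0 (invalid).
sub seq_len {
    my ($lead) = @_;
    return 1 if $lead < 0x80;
    return 0 if $lead < 0xC0;   # continuation byte, not a lead byte
    return 2 if $lead < 0xE0;
    return 3 if $lead < 0xF0;
    return 4 if $lead < 0xF8;
    return 5 if $lead < 0xFC;
    return 6 if $lead < 0xFE;
    return 7 if $lead == 0xFE;
    return 0;
}

# Hop from lead byte to lead byte, recording where each char starts.
sub char_offsets {
    my ($bytes) = @_;
    my @offsets;
    my $pos = 0;
    while ($pos < length $bytes) {
        my $len = seq_len(ord substr($bytes, $pos, 1));
        die "bad lead byte at offset $pos\n" unless $len;
        push @offsets, $pos;
        $pos += $len;
    }
    return @offsets;
}

# 'a', one 7-byte "metadata" char, then 'b'.
my $bytes   = 'a' . "\xFE" . ("\x80" x 6) . 'b';
my @offsets = char_offsets($bytes);   # (0, 1, 8)
```

The scanner reads one byte of the metadata char and jumps over the
remaining six without looking at them.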