perl-unicode

Re: In-Band Information Considered Harmful

1998-10-23 15:40:24
According to Ilya Zakharevich:
Chip Salzenberg writes:
Start with a string, part of which has the "bold" attribute --
something that might be written in HTML as "<b>hello</b> there".  When
working code-based and extracting the 'll' (perhaps with substr), I
would have to be aware of the state (bold) at the point of extraction
so I could know to extract "<b>ll</b>".

This is a very interesting question

Isn't it, though.  :-)  I think it's the fatal flaw of code-based metadata.

The time penalty for that state awareness could be considerable,
certainly O(N).  And, worse, it would require Perl to know which
codes nest, and how.
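
To make the cost concrete, here is a minimal sketch of substr over
code-based (inline) metadata.  It is in Python rather than Perl, and the
function name inline_substr, the tag syntax, and the assumption that every
tag is a well-formed nesting region are all illustrative, not anything
proposed in this thread.  Note that it has to scan from the start of the
string to learn which codes are open, and that it only works because the
nesting rule is hard-wired in.

```python
import re
from itertools import groupby

# Hypothetical inline syntax: <name> opens a region, </name> closes it.
TAG = re.compile(r"<(/?)([a-z]+)>")

def inline_substr(markup, start, length):
    """Return `length` visible characters from visible offset `start`,
    re-wrapped in whatever tags were open around them."""
    open_tags = []   # tags open at the current scan position
    chars = []       # (visible character, tuple of tags open around it)
    i = 0
    while i < len(markup):          # O(N) scan just to recover the state
        m = TAG.match(markup, i)
        if m:
            if m.group(1):          # closing tag; assumes well-formed input
                open_tags.remove(m.group(2))
            else:
                open_tags.append(m.group(2))
            i = m.end()
        else:
            chars.append((markup[i], tuple(open_tags)))
            i += 1
    out = []
    picked = chars[start:start + length]
    # Group adjacent characters that share the same open tags and re-wrap.
    for tags, group in groupby(picked, key=lambda p: p[1]):
        text = "".join(c for c, _ in group)
        opening = "".join(f"<{t}>" for t in tags)
        closing = "".join(f"</{t}>" for t in reversed(tags))
        out.append(opening + text + closing)
    return "".join(out)
```

A code that marks a spot instead of a region (like <p>) would need a
different rule inside that loop, which is the semantic knowledge at issue.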

That is not important, since it is O(N) to find 'll' anyway.

It is important, depending on what you're doing.  You shouldn't assume
that a particular pattern of access is the only one that matters.

In contrast, working frame-based, I only need walk the attribute tree,
find the attributes that apply to the given characters, and copy them.
That's O(log N) or so -- certainly better than O(N).  More
significantly, it requires *no* knowledge of metadata semantics.
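
A rough sketch of the frame-based alternative, again in Python.  The class
name FramedString and the (start, end, attrs) run layout are assumptions
made for illustration, and a sorted list searched with bisect stands in
for the attribute tree.  The lookup is a binary search, and the attribute
values are copied without being interpreted.

```python
from bisect import bisect_right

class FramedString:
    """Plain text plus out-of-band attribute runs (hypothetical layout)."""

    def __init__(self, text, runs):
        # runs: sorted, non-overlapping (start, end, attrs), end exclusive
        self.text = text
        self.runs = runs
        self.starts = [r[0] for r in runs]

    def substr(self, start, length):
        end = start + length
        # Binary search for the last run beginning at or before `start`.
        i = max(bisect_right(self.starts, start) - 1, 0)
        new_runs = []
        while i < len(self.runs) and self.runs[i][0] < end:
            s, e, attrs = self.runs[i]
            lo, hi = max(s, start), min(e, end)
            if lo < hi:
                # Clip the run to the window and rebase it; attrs is opaque.
                new_runs.append((lo - start, hi - start, attrs))
            i += 1
        return FramedString(self.text[start:end], new_runs)
```

For "hello there" with a bold run over characters 0..4, substr(2, 2)
yields the text "ll" with one bold run covering both characters.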

Same for inline data.  There is absolutely no difference between the
semantics of having metadata inline or separate.

I disagree.  Before you try to make this assertion again, please
explain how Perl would properly handle the 'll' case with code-based
metadata.  Be sure to allow for the various kinds of metadata nesting
behavior: <b> doesn't nest, <li> nests, and <p> marks a spot instead
of a region.  And Perl's RE and other character-processing engines
need to know this to handle them properly in the 'll' case.

While you're at it, don't you care that code-based metadata might match
/<b><i>yow/ but fail /<i><b>yow/, whereas frame-based metadata does not
suffer such a paradox?
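
The ordering problem is easy to reproduce.  The sketch below (Python; the
helper find_with_attrs and the per-character attribute sets are
hypothetical, chosen only to keep the example short) matches the same
bold-italic "yow" both ways.

```python
import re

# Code-based: the order in which the codes were written leaks into matching.
inline = "<b><i>yow</i></b>"
matches_bi = re.search(r"<b><i>yow", inline) is not None
matches_ib = re.search(r"<i><b>yow", inline) is not None

# Frame-based: plain text plus one attribute set per character.
text = "yow"
attrs = [frozenset({"bold", "italic"})] * len(text)

def find_with_attrs(text, attrs, pattern, wanted):
    """Return the span of the first match whose characters all carry
    every attribute in `wanted`, or None if there is no such match."""
    for m in re.finditer(pattern, text):
        if all(wanted <= attrs[i] for i in range(m.start(), m.end())):
            return m.span()
    return None
```

Here matches_bi is true and matches_ib is false, while find_with_attrs
gives the same answer whether the wanted attributes are listed as
bold, italic or as italic, bold, since a set has no order.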

This is how Emacs implements its markup.  It is a binary tree which
contains attribute boundaries in the order they appear in the buffer.
Sounds like an excellent first implementation.
Until you want to modify the string.

... when you need to walk the tree and modify some position markers, an
extremely fast operation.  No problem whatsoever.
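
As an illustration of what "modify some position markers" could mean,
here is a small Python sketch.  It uses a flat list of runs rather than a
tree, and the function name insert_text and the rule that text inserted
strictly inside a run inherits that run's attributes are assumptions, not
a description of what Emacs does.

```python
def insert_text(text, runs, pos, new):
    """Insert `new` at offset `pos` and slide the run boundaries.

    runs: list of (start, end, attrs) with end exclusive.
    """
    n = len(new)
    shifted = []
    for s, e, attrs in runs:
        if s >= pos:
            s += n      # run starts at or after the insertion: slide it
        if e > pos:
            e += n      # run ends after the insertion: slide or grow it
        shifted.append((s, e, attrs))
    return text[:pos] + new + text[pos:], shifted
```

With a flat list this touches every run; a tree that stores offsets
relative to the parent node could presumably confine the update to one
root-to-leaf path, which would be the fast case described above.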

substr() should be able to quickly return whatever is reasonable.
Yes, exactly my point.
We need to balance substr-manipulation and RExen.

Indeed.
-- 
Chip Salzenberg      - a.k.a. -      <chip(_at_)perlsupport(_dot_)com>
 "... under cover of afternoon in the biggest car in the county?!" //MST3K