Re: In-Band Information Considered Harmful

Chip Salzenberg writes:

In contrast, working frame-based, I only need walk the attribute tree,
find the attributes that apply to the given characters, and copy them.
That's O(log N) or so -- certainly better than O(N).  More
significantly, it requires *no* knowledge of metadata semantics.
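
A minimal sketch of the frame-based copy described above, under some
assumptions: the class FramedString and its methods are hypothetical,
and a sorted list of runs searched with bisect stands in for the
attribute tree. The point is only that substr copies whichever
attribute sets overlap the range without interpreting them.

```python
from bisect import bisect_right


class FramedString:
    """Hypothetical frame-based string: text plus sorted attribute runs."""

    def __init__(self, text, runs):
        # runs: (start_offset, attribute_set) pairs; the first must start at 0.
        self.text = text
        self.runs = sorted(runs, key=lambda r: r[0])
        self.starts = [s for s, _ in self.runs]

    def attrs_at(self, pos):
        # Binary search for the run covering pos: O(log N).
        return self.runs[bisect_right(self.starts, pos) - 1][1]

    def substr(self, start, end):
        # Copy the characters and every run overlapping [start, end).
        # No attribute is ever inspected, only carried along.
        first = bisect_right(self.starts, start) - 1
        out = []
        for s, attrs in self.runs[first:]:
            if s >= end:
                break
            out.append((max(s, start) - start, attrs))
        return FramedString(self.text[start:end], out)


# "well" is bold (offsets 2..5); take the 'll' out of it.
w = FramedString(
    "a well b",
    [(0, frozenset()), (2, frozenset({"b"})), (6, frozenset())],
)
ll = w.substr(4, 6)
```

Here ll.text is "ll" and ll.attrs_at(0) is still the bold set, with no
knowledge of what "b" means.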


The same holds for inline data.  There is absolutely no difference
between the semantics of having metadata inline and having it separate.


I disagree.  Before you try to make this assertion again, please
explain how Perl would properly handle the 'll' case with code-based
metadata.  Be sure to allow for the various kinds of metadata nesting
behavior: <b> doesn't nest, <li> nests, and <p> marks a spot instead
of a region.  And Perl's RE and other character-processing engines
need to know this to handle them properly in the 'll' case.


Who cares how it is implemented?  We are discussing *semantics* here.

While you're at it, don't you care that code-based metadata might match
/<b><i>yow/ but fail /<i><b>yow/, whereas frame-based metadata does not
suffer such a paradox?
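
The ordering issue can be sketched as follows. This is an illustration
only: framed_match is a hypothetical helper, and Python's re module
stands in for Perl's RE engine. With in-band tags the tag order is part
of the string; with frame-based metadata the attributes on a span form
a set, so the order in which they are asked for cannot matter.

```python
import re

# In-band: the tags are ordinary characters, so their order matters.
inband = "<b><i>yow</i></b>"
hit_bi = re.search(r"<b><i>yow", inband)   # matches
hit_ib = re.search(r"<i><b>yow", inband)   # does not match


def framed_match(text, attrs, pattern, wanted):
    """Hypothetical frame-based match: pattern on text, attributes as a set."""
    return re.search(pattern, text) is not None and set(wanted) <= set(attrs)


# Frame-based: the same span carries the set {"b", "i"}.
ok_bi = framed_match("yow", {"b", "i"}, r"yow", ["b", "i"])
ok_ib = framed_match("yow", {"b", "i"}, r"yow", ["i", "b"])
```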


I addressed this already (lookahead).  This does not directly address 

  match bold-ll in <b>well</b>

though.

This is how Emacs implements its markup.  It is a binary tree which
contains the attribute boundaries in the order they appear in the buffer.

Sounds like an excellent first implementation.

Until you want to modify the string.


... when you need to walk the tree and modify some position markers, an
extremely fast operation.  No problem whatsoever.
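
A rough sketch of that marker update, assuming the boundaries are kept
as a sorted list of offsets (insert_text is a hypothetical helper, not
Emacs code). A flat list makes the shift O(N); a tree that stores
offsets relative to the parent node could presumably do the same
adjustment along a single root-to-leaf path.

```python
from bisect import bisect_right


def insert_text(text, boundaries, pos, new):
    """Insert `new` at `pos` and shift every boundary located after `pos`.

    A boundary sitting exactly at `pos` stays where it is; that is a
    policy choice made for this sketch.
    """
    k = bisect_right(boundaries, pos)
    shifted = boundaries[:k] + [b + len(new) for b in boundaries[k:]]
    return text[:pos] + new + text[pos:], shifted


# Bold starts at offset 2 and ends at offset 6 ("well").
text, marks = insert_text("a well b", [2, 6], 4, "XX")
```

After the call, text is "a weXXll b" and marks is [2, 8]: the bold
region has grown to cover the inserted characters, and nothing had to
know what the markers stand for.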


Yes, Emacs implements this.  But this still needs to be implemented
;-).  Again, though, this is not a question of semantics.  I see now
that the other branches of this discussion have switched to semantics.

Good.  We need to discuss semantics first.

We need to balance substr-manipulation and RExen.


Indeed.


The btree approach is acceptable in the sense that *any* operation
suffers only a small constant-factor slowdown (say, x20).  The in-band
approach is good in that *many* operations suffer almost no slowdown
at all.

Remember the Unicode discussion?  I was supporting the utf8 approach
(as opposed to constant-width wide chars).  Larry noted that the
difference between Perl and C is that Perl emphasizes REx operations
over the "offset" ones.  Offset operations (such as substr) benefit
from the wide-char approach; REx operations benefit from the utf8
approach.
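
To make the offset side of that trade-off concrete, here is a small
sketch (the helper names are made up for illustration). Finding the
n-th character in UTF-8 data means scanning for character-start bytes,
while with constant-width characters the offset is a multiplication.

```python
def utf8_char_offset(buf: bytes, n: int) -> int:
    """Byte offset of the n-th character in UTF-8 data: an O(N) scan."""
    count = 0
    for i, byte in enumerate(buf):
        # Continuation bytes look like 10xxxxxx; anything else starts a char.
        if (byte & 0xC0) != 0x80:
            if count == n:
                return i
            count += 1
    raise IndexError(n)


def wide_char_offset(n: int, width: int = 4) -> int:
    """Byte offset of the n-th character with constant-width chars: O(1)."""
    return n * width


s = "na\u00efve"                 # the third character takes two bytes in UTF-8
u8 = s.encode("utf-8")
u32 = s.encode("utf-32-le")

off8 = utf8_char_offset(u8, 3)   # has to step over the two-byte character
off32 = wide_char_offset(3)
```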

We may reach a point in the discussion of the semantics where the
preferable semantics prohibits in-band data.  We may reach a point
where the inherently large O(1) constant of the out-of-band approach
is prohibitive.  Let us discuss semantics/how-to-use-the-stuff first.

Ilya