perl-unicode

Re: In-Band Information Considered Harmful

1998-10-23 15:21:56
Chip Salzenberg writes:
Start with a string, part of which has the "bold" attribute --
something that might be written in HTML as "<b>hello</b> there".  When
working code-based, when extracting the 'll' (perhaps with substr), I
would have to be aware of the state (bold) at the point of extraction
so I could know to extract "<b>ll</b>".

This is a very interesting question: how to cut-and-paste a piece of
enhanced text.  The current solution in the EText Tk widget is to cut "ll"
out of "<b>hello</b> there".  However, if you extract /llo/, you will
get "<b>llo</b>".  In other words: "tags" (hints which apply to
substrings of the text) are extracted only if the boundary of the tag
is hit.
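
To make the rule concrete, here is a minimal Perl sketch of "extract a tag only if its boundary is hit".  It is my reading of the behaviour described above, not EText's actual code: the run layout ([start, end, name], end exclusive) and the name extract_if_boundary_hit are made up for illustration, and it assumes the tags are sorted and do not overlap.

```perl
use strict;
use warnings;

# Assumed model, not EText's actual data structure: plain text plus
# sorted, non-overlapping tag runs [start, end, name], end exclusive.
my $text = 'hello there';
my @tags = ( [ 0, 5, 'b' ] );

# Extract $len characters at $off.  A tag is kept only when the selection
# touches one of its boundaries; the result is rendered HTML-style.
sub extract_if_boundary_hit {
    my ($off, $len) = @_;
    my $end = $off + $len;
    my $out = substr($text, $off, $len);
    # Walk the tags right to left so earlier insert positions stay valid.
    for my $tag (reverse @tags) {
        my ($rs, $re, $name) = @$tag;
        my $hits_start = $rs >= $off && $rs <  $end;
        my $hits_end   = $re >  $off && $re <= $end;
        next unless $hits_start || $hits_end;
        # Clip the tag to the selection and rebase it to offset 0.
        my $from = ($rs > $off ? $rs : $off) - $off;
        my $to   = ($re < $end ? $re : $end) - $off;
        substr($out, $to,   0, "</$name>");
        substr($out, $from, 0, "<$name>");
    }
    return $out;
}

print extract_if_boundary_hit(2, 2), "\n";   # ll
print extract_if_boundary_hit(2, 3), "\n";   # <b>llo</b>
```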

This may be reasonable, even perfect (I do not know...), for interactive
use, but for programmatic use we need something better.

The time penalty for that state awareness could be considerable,
certainly O(N).  And, worse, it would require Perl to know which
codes nest, and how.

That is not important, since it is O(N) to find 'll' anyway.

In contrast, working frame-based, I only need walk the attribute tree,
find the attributes that apply to the given characters, and copy them.
That's O(log N) or so -- certainly better than O(N).  More
significantly, it requires *no* knowledge of metadata semantics.
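
A rough Perl sketch of that frame-based lookup, for illustration only.  It swaps the attribute tree for a sorted array of runs searched by bisection, which gives the same O(log N) lookup plus the number of runs that overlap the range.  The hash layout and the names first_run_after and attr_substr are hypothetical; nothing like them exists in Perl.

```perl
use strict;
use warnings;

# Hypothetical out-of-band representation: plain text plus a sorted list
# of non-overlapping attribute runs [start, end, name], end exclusive.
my %str = (
    text => 'hello there',
    runs => [ [ 0, 5, 'bold' ] ],
);

# Binary search: index of the first run whose end lies beyond $pos.
sub first_run_after {
    my ($runs, $pos) = @_;
    my ($lo, $hi) = (0, scalar @$runs);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($runs->[$mid][1] <= $pos) { $lo = $mid + 1 }
        else                          { $hi = $mid }
    }
    return $lo;
}

# substr-like extraction that copies the attributes covering the range,
# clipped and rebased to it, without interpreting what "bold" means.
sub attr_substr {
    my ($s, $off, $len) = @_;
    my $end  = $off + $len;
    my $runs = $s->{runs};
    my @copied;
    for (my $i = first_run_after($runs, $off); $i < @$runs; $i++) {
        my ($rs, $re, $name) = @{ $runs->[$i] };
        last if $rs >= $end;                 # past the requested range
        my $from = $rs > $off ? $rs : $off;
        my $to   = $re < $end ? $re : $end;
        push @copied, [ $from - $off, $to - $off, $name ];
    }
    return { text => substr($s->{text}, $off, $len), runs => \@copied };
}

my $ll = attr_substr(\%str, 2, 2);
printf "%s [%s]\n", $ll->{text}, join(',', @{ $ll->{runs}[0] });  # ll [0,2,bold]
```

Note that attr_substr never looks at the attribute name: it only compares offsets, which is the "no knowledge of metadata semantics" part.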

Same for inline data.  There is absolutely no difference between the
semantics of having metadata inline and having it separate.  We need
more shallow arguments than the semantic ones.

This is how Emacs implements its markup.  It is a binary tree which
contains attribute boundaries in the order they appear in the buffer.

Sounds like an excellent first implementation.

Until you want to modify the string.
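
One way to read this objection, sketched in Perl under the assumption that boundaries are stored as absolute offsets (the run layout and insert_text are hypothetical, and this says nothing about how Emacs itself stores them): every insertion has to shift all boundaries after the insertion point, so a modification costs O(number of boundaries) even when the lookup was cheap.

```perl
use strict;
use warnings;

# Hypothetical layout: text plus attribute runs [start, end, name] stored
# as absolute offsets, end exclusive.
my $text = 'hello there';
my @runs = ( [ 0, 5, 'bold' ], [ 6, 11, 'italic' ] );

# Insert $new at $pos and fix up every boundary at or after that point.
sub insert_text {
    my ($pos, $new) = @_;
    substr($text, $pos, 0, $new);
    my $n = length $new;
    for my $run (@runs) {
        $run->[0] += $n if $run->[0] >= $pos;   # run starts at or after $pos
        $run->[1] += $n if $run->[1] >  $pos;   # run ends after $pos
    }
}

insert_text(0, '>> ');
print "$text\n";                                        # >> hello there
print join(' ', map { "$_->[0]-$_->[1]" } @runs), "\n"; # 3-8 9-14
```

Storing relative offsets, or keeping a gap as editors do, would reduce this cost, at the price of a more complicated lookup.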

substr() should be able to quickly return whatever is reasonable.

Yes, exactly my point.

We need to balance substr-manipulation and RExen.

Ilya