perl-unicode

Re: In-Band Information Considered Harmful

1998-10-23 12:40:35
According to Ilya Zakharevich:
Chip Salzenberg writes:
In a code-based scheme, metadata must be handled sequentially because
they _are_ sequential (along with the content).  In a frame-based
scheme, metadata do not need to have a sequence artificially imposed ...

I have a Tk widget with tags, and want to search for bold letter X
which follows non-bold one with Perl regexp.  How do you propose to do
it with non-sequential data?

I intend for the semantics of searching to be equivalent in power to
what you've been describing -- in other words, it will be possible to
do all the searches you've used as examples.
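
To make the claim concrete, here is a minimal sketch of the "bold X after a non-bold character" search over out-of-band metadata. It is written in Python purely for illustration; the span list and the helper names `has_attr` and `find_bold_x_after_nonbold` are hypothetical and are not part of Perl, Tk, or any proposal in this thread. The regexp finds candidate positions in the plain text, and the attribute lookup filters them.

```python
import re

# Hypothetical frame-based representation: plain text plus a list of
# (start, end, attribute) spans, with `end` exclusive.

def has_attr(spans, pos, attr):
    """True if the character at offset `pos` carries `attr`."""
    return any(s <= pos < e and a == attr for s, e, a in spans)

def find_bold_x_after_nonbold(text, spans):
    """Offsets of every 'X' that is bold while the preceding character is not."""
    hits = []
    # The lookbehind only requires that some character precedes the X.
    for m in re.finditer(r"(?<=.)X", text, flags=re.DOTALL):
        i = m.start()
        if has_attr(spans, i, "bold") and not has_attr(spans, i - 1, "bold"):
            hits.append(i)
    return hits

# "xX" is bold, the space is not, the second "X" is bold, "y" is not.
spans = [(0, 2, "bold"), (3, 4, "bold")]
hits = find_bold_x_after_nonbold("xX Xy", spans)
```

The linear scan in `has_attr` is only for brevity; a real attribute tree would answer the same question faster.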

Patterns that include metadata queries can be slower with frame-based
metadata, since searching an attribute tree has a non-zero cost.
However, frame-based metadata searches require no maintenance of
state; likewise extractions and insertions.  This is, IMO, the
overriding concern.  Consider:

Start with a string, part of which has the "bold" attribute --
something that might be written in HTML as "<b>hello</b> there".  Working
code-based, to extract the 'll' (perhaps with substr), I would have to
be aware of the state (bold) at the point of extraction so I could know
to extract "<b>ll</b>".  The time penalty for that state awareness
could be considerable, certainly O(N).  And, worse, it would require
Perl to know which codes nest, and how.
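
A rough illustration of that cost, in Python for brevity. The function name `substr_code_based` and the simplified tag syntax (well-nested `<name>` / `</name>` pairs, no attributes) are assumptions made for this sketch only. Note that the scan has to start at the beginning of the string and has to keep a stack, i.e. it must know that the codes nest.

```python
import re

TAG = re.compile(r"</?(\w+)>")   # simplified in-band code: <name> or </name>

def substr_code_based(marked, start, length):
    """Extract `length` content characters from content offset `start`,
    re-wrapping them in whatever tags are in effect there."""
    stack = []            # tags open at the current scan position
    out = []
    pos = 0               # content characters consumed so far
    i = 0
    end = start + length
    while i < len(marked) and pos < end:
        m = TAG.match(marked, i)
        if m:
            if m.group(0).startswith("</"):
                stack.pop()                  # assumes well-nested input
            else:
                stack.append(m.group(1))
            if pos > start:                  # tag lies inside the extracted range
                out.append(m.group(0))
            i = m.end()
            continue
        if pos == start:
            # Entering the range: re-open every tag currently in effect.
            out.extend(f"<{t}>" for t in stack)
        if pos >= start:
            out.append(marked[i])
        pos += 1
        i += 1
    if pos > start:
        # Leaving the range: close whatever is still open, innermost first.
        out.extend(f"</{t}>" for t in reversed(stack))
    return "".join(out)

piece = substr_code_based("<b>hello</b> there", 2, 2)
```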

In contrast, working frame-based, I only need walk the attribute tree,
find the attributes that apply to the given characters, and copy them.
That's O(log N) or so -- certainly better than O(N).  More
significantly, it requires *no* knowledge of metadata semantics.
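
For comparison, a minimal sketch of the frame-based extraction, again in Python and again under stated assumptions: the class `FramedString` and its sorted list of attribute runs are hypothetical stand-ins for the attribute tree, and the sketch assumes 0 <= start < len(text) and length > 0. Finding the first relevant run is a binary search; the attribute values are copied as opaque objects and never interpreted.

```python
from bisect import bisect_right

class FramedString:
    """Text plus out-of-band attribute runs (hypothetical structure)."""

    def __init__(self, text, runs):
        # runs: sorted list of (start_offset, attrs). Each run extends to
        # the start of the next one; the first run must start at offset 0.
        self.text = text
        self.starts = [s for s, _ in runs]
        self.attrs = [a for _, a in runs]

    def substr(self, start, length):
        end = min(start + length, len(self.text))
        # Binary search for the run containing `start`: O(log N).
        k = bisect_right(self.starts, start) - 1
        new_runs = []
        # Copy each run overlapping [start, end), shifted to the new origin.
        while k < len(self.starts) and self.starts[k] < end:
            new_runs.append((max(self.starts[k], start) - start, self.attrs[k]))
            k += 1
        return FramedString(self.text[start:end], new_runs)

s = FramedString("hello there", [(0, frozenset({"bold"})), (5, frozenset())])
p = s.substr(2, 2)    # the 'll', still bold
q = s.substr(3, 5)    # "lo th": two bold characters, then three plain ones
```

No scan from the start of the string is needed, and nothing here knows what "bold" means or whether attributes nest.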

As far as I'm concerned, frame-based metadata is clearly superior.

[Xanadu:] each person can create his own farm of hyperlinks -- content need
not have all of its hyperlinks included at creation time; rather, hyperlinks
are added on by people who discover/decide where it would be a good idea to
link things.

This is how Emacs implements its markup.  It is a binary tree which
contains attributes-boundaries in the order they appear in the buffer.

Sounds like an excellent first implementation.

However, regular expressions do not map well to this picture.  Emacs
has 3 different notions of search: by REx, by syntax (find matching
paren etc.) and by text attributes.  There is no simple way to
combine them.

The fact that Emacs doesn't integrate them well doesn't tell me
anything about whether it's possible to do so in Perl.

substr() should be able to quickly return whatever is reasonable.

Yes, exactly my point.
-- 
Chip Salzenberg      - a.k.a. -      <chip(_at_)perlsupport(_dot_)com>
 "... under cover of afternoon in the biggest car in the county?!" //MST3K