perl-unicode

Re: In-Band Information Considered Harmful

1998-10-23 15:37:17
Ilya Zakharevich wrote:
|| Chip Salzenberg writes:
|| > Start with a string, part of which has the "bold" attribute --
|| > something that might be written in HTML as "<b>hello</b> there".  When
|| > working code-based, when extracting the 'll' (perhaps with substr), I
|| > would have to be aware of the state (bold) at the point of extraction
|| > so I could know to extract "<b>ll</b>".
|| 
|| This is a very interesting question: how to cut-and-paste a piece of
|| enhanced text.  The current solution of EText Tk-widget is to cut "ll"
|| out of "<b>hello</b> there".  However, if you extract /llo/, you will
|| get "<b>llo</b>".  In other words: "tags" (hints which apply to
|| substrings of the text) are extracted only if the boundary of the tag
|| is hit.

An interesting question indeed.

Whether an attribute still applies to a piece cut out of the middle
depends on the sort of attribute.  If you extract "ll" out of
<URL>http://perl.com/foo/ll</URL>, it is not appropriate to retain
the URL attribute.  XML has many attributes that imply the data has a
specific structure; for those, it only makes sense to retain an
attribute when *both* of its boundaries are included.

For something like <b>, it makes more sense to retain the attribute
even if the data comes from the middle of the range, since it applies
individually to each component.  Even there you will often not want
the attributes carried along, depending upon your purpose in copying.
If you copy a filename from one place into a command to execute, you
don't really want to retain the bold attribute.  With an out-of-band
mechanism it hardly matters if the attribute does get copied; with
in-band storage the copied attribute might be a nuisance.
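
The distinction between the two kinds of attribute can be sketched with
out-of-band spans.  This is only an illustration, in Python rather than
Perl; the Span type, the extract function and the ATOMIC / PER_CHAR
policy sets are hypothetical names, not part of any existing module.

```python
from dataclasses import dataclass

# Hypothetical policy table (an assumption for this sketch):
# "atomic" attributes describe the span as a whole, "per-character"
# attributes apply to each character individually.
ATOMIC = {"url"}
PER_CHAR = {"b"}

@dataclass(frozen=True)
class Span:
    name: str
    start: int  # inclusive offset into the plain text
    end: int    # exclusive offset into the plain text

def extract(text, spans, lo, hi):
    """Return text[lo:hi] and the spans that survive the cut,
    re-based so offsets are relative to the extracted piece."""
    kept = []
    for s in spans:
        if s.name in ATOMIC:
            # Keep only when both boundaries fall inside the cut.
            if lo <= s.start and s.end <= hi:
                kept.append(Span(s.name, s.start - lo, s.end - lo))
        elif s.name in PER_CHAR:
            # Keep whatever part of the span overlaps the cut.
            a, b = max(s.start, lo), min(s.end, hi)
            if a < b:
                kept.append(Span(s.name, a - lo, b - lo))
        # Attributes in neither set are dropped in this sketch.
    return text[lo:hi], kept
```

Cutting "ll" out of a bold "hello" keeps a bold span over the two
extracted characters, while cutting "ll" out of the URL drops the URL
span.  A caller who wants plain text, as in the filename case, can
simply ignore the returned spans.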

|| > In contrast, working frame-based, I only need walk the attribute tree,
|| > find the attributes that apply to the given characters, and copy them.
|| > That's O(log N) or so -- certainly better than O(N).  More
|| > significantly, it requires *no* knowledge of metadata semantics.
|| 
|| Same for inline data.  There is absolutely no difference between
|| semantic of having metadata inline or separate.  We need more shallow
|| arguments than the semantic ones.

<HTML>  ... 100k bytes later ... <b>Hello</b>  ... </HTML>

Retaining the enclosing inline attributes does require more effort
(here, scanning back across 100k bytes to learn that <HTML> is still
open), unless you've already built an out-of-line structure that
records them.
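
To make the cost concrete, here is a rough sketch of finding the
enclosing inline tags at a given offset.  It is in Python, for
illustration only; enclosing_tags is a hypothetical helper, and it
assumes well-nested tags with no attributes, which real HTML does not
guarantee.

```python
import re

# Matches simple tags such as <b> or </b>; tags with attributes are
# deliberately not handled in this sketch.
TAG = re.compile(r"<(/?)([A-Za-z]+)>")

def enclosing_tags(markup, pos):
    """Names of the tags open at offset pos, outermost first.

    The scan starts at the beginning of the markup, so the work
    grows with the amount of text before pos.
    """
    stack = []
    for m in TAG.finditer(markup):
        if m.start() >= pos:
            break
        if m.group(1):
            stack.pop()               # closing tag ends the innermost one
        else:
            stack.append(m.group(2))  # opening tag
    return stack
```

For the example above, asking about a character inside "Hello" means
walking over everything from <HTML> onward before the answer
(HTML, then b) is known.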

-- 
objects:                                    | John Macdonald
    Think of them as data with an attitude. |   jmm(_at_)elegant(_dot_)com