perl-unicode

Re: In-Band Information Considered Harmful

1998-10-24 00:50:02
Chip Salzenberg writes:
      a) mark char or a sequence of chars (e.g. Bold)

An Emacsish system handles these OK, as I think you'll agree.  But I'd
break them down into categories (a1) marking chars [a la mass nouns]
and (a2) marking entire substrings as units [a la count nouns], since
those two subtypes are handled differently when extracting and merging.

No.  The second category of yours is a particular case of the category
"c" of mine - the case of "c" where the substring coincides with the
whole string.

I disagree that your case 'c' is relevant to this example.  What I had
in mind was the <URL>...</URL> tag.  That's not (as you describe for
'c') an expression of a relationship of the substring to the larger
string.  

It is.  But apparently my short description was not clear enough.
This metadatum describes a special kind of markup, where the "textual"
part (which we want to be "searchable") is organized in some
"structure" (see below, it was this "structure" thing which I tried to
shortcut to "relation of substring to the whole string").

As implemented in eText, the "structure" is a tree with leaves
carrying strings of textual data (and as usual, an arbitrary hash
associated to the whole structure).

In your example of URL the tree degenerates into one leaf with the
text of URL in it, and no additional associated data.  For the table
the tree branches into (say) rows, and rows branch into cells, so the
tree is two levels deep (and the additional data may describe details
of formatting).  For a fraction the tree has two leaves, the numerator
and the denominator.  For an NB margin mark the tree is empty, just
the presence of this structure makes the mark jump into existence,
there is no need to associate any "string contents" to this mark.
Same for a comment.

  lynx ftp://ftp.math.ohio-state.edu/pub/users/ilya/etext/etext.html

for details of "blocks" of eText.
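
To make the tree concrete, here is a minimal Python sketch of such a
structure.  It is an illustration only, not eText's actual code: the
class name Block, its fields, and the example URL are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Block:
    """A structured-markup node (hypothetical name, not eText's API)."""
    kind: str
    # Children are either text leaves (str) or nested blocks.
    children: List[Union[str, "Block"]] = field(default_factory=list)
    # The arbitrary hash associated with the whole structure.
    data: Dict[str, str] = field(default_factory=dict)

    def leaves(self) -> List[str]:
        """Collect the searchable text leaves, left to right."""
        out: List[str] = []
        for child in self.children:
            if isinstance(child, str):
                out.append(child)
            else:
                out.extend(child.leaves())
        return out


# URL: the tree degenerates into a single leaf, no extra data.
url = Block("url", ["http://example.org/"])
# Fraction: two leaves, numerator and denominator.
fraction = Block("fraction", ["1", "2"])
# Table: rows branch into cells; formatting details live in the hash.
table = Block("table", [
    Block("row", ["a", "b"]),
    Block("row", ["c", "d"]),
], data={"border": "1"})
# NB margin mark: empty tree, its presence alone is the mark.
nb_mark = Block("nb")

print(table.leaves())
```

The text stays searchable because every string lives in a leaf, while
the relationship between the leaves is carried by the tree itself.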

In several years of trying I did not see any markup which would not
map into one of the three I described.

I stand by the bifurcation of (a1) and (a2).

See above.

      b) mark a boundary between chars (e.g. Footnotes)

I'd intended this to be countable metadata (category a2) attached to a
given position but with a length of zero.  But maybe that's not enough. 

It is not.  The behaviour wrt text insertion is different.  See below.

Your below text did not help me.  Please elaborate on the insertion
behavior difference.

In fact there is no insertion difference between my NB-mark example
for "c" and "b"-type markup.  There is a deletion difference: "mark"
markup (sic) survives deletion of the (enclosing) substring.  Embedded
"empty" markup is deleted with the substring it is contained within.

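The deletion difference can be sketched in Python as follows.  This is
a toy model under stated assumptions, not eText's implementation: both
kinds of markup are reduced to bare positions, the function name is
hypothetical, and "contained within" is taken to mean strictly inside
the deleted span.

```python
def delete_range(text, boundary_marks, embedded_blocks, start, end):
    """Delete text[start:end] and update both kinds of zero-width markup.

    boundary_marks:  positions of "b"-type marks (e.g. footnotes).
    embedded_blocks: positions of empty "c"-type blocks (e.g. an NB mark).
    """
    removed = end - start

    def shift(pos):
        # Positions at or after the span end move left;
        # positions inside the span collapse to its start.
        if pos >= end:
            return pos - removed
        return min(pos, start)

    # "b"-type marks survive the deletion of the enclosing substring.
    new_marks = [shift(p) for p in boundary_marks]
    # Empty "c"-type blocks strictly inside the span are deleted with it.
    new_blocks = [shift(p) for p in embedded_blocks if not (start < p < end)]
    return text[:start] + text[end:], new_marks, new_blocks


# One mark and one empty block at position 4, another pair at position 7.
new_text, marks, blocks = delete_range("abcdefgh", [4, 7], [4, 7], 2, 6)
print(new_text, marks, blocks)
```

After deleting "cdef", the boundary mark that sat at position 4 is still
there (moved to position 2), while the embedded empty block at position 4
is gone.
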
Nope.  You cannot insert anything *between* numerator and
denominator.  You can either insert something in numerator, or
denominator.

Your numerator/denominator example is a non sequitur, as best I can
tell.  How is it relevant?  Please elaborate.

Eh?  There are some rules of consistency of markup.  One should define
what any "editing" operation does to markup.  You cannot just
mark numerator and denominator by different markups - any editing
operation should keep them adjacent.  This creates the relationship
between them which is referred to as "structure" above (and is a tree in
the implementation of eText).

Same with cells of a table.  You can assign an attribute "cell_5_4" to
a substring, and an attribute "cell_5_5" to an adjacent substring, but
this will break if you insert anything at the boundary of the cells.
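
A small Python sketch of that breakage, next to the tree alternative.
Everything here is hypothetical (function names, the span
representation, and the boundary policy, under which a span ending at
the insertion point does not grow); it is meant only to show the shape
of the problem.

```python
def insert_flat(text, spans, pos, new):
    """Insert `new` at `pos` in flat markup; spans are (start, end, attr).

    Assumed policy: a span that ends at `pos` does not grow, and a span
    that starts at `pos` shifts right.
    """
    n = len(new)
    out = []
    for start, end, attr in spans:
        if start >= pos:
            out.append((start + n, end + n, attr))   # wholly after: shift
        elif end > pos:
            out.append((start, end + n, attr))       # pos inside: grow
        else:
            out.append((start, end, attr))           # wholly before: keep
    return text[:pos] + new + text[pos:], out


def insert_in_leaf(cells, index, offset, new):
    """Insert `new` into one leaf of a row; there is no 'between leaves'."""
    cells = list(cells)
    cells[index] = cells[index][:offset] + new + cells[index][offset:]
    return cells


# Flat attributes: inserting at the cell boundary leaves a gap at 3..5
# that belongs to neither cell, so the cells are no longer adjacent.
text, spans = insert_flat(
    "foobar", [(0, 3, "cell_5_4"), (3, 6, "cell_5_5")], 3, "XY")
print(text, spans)

# Tree: the insertion must name a leaf, so the cells stay adjacent.
print(insert_in_leaf(["foo", "bar"], 0, 3, "XY"))
```

Choosing a different boundary policy for the flat spans (say, letting
the left span grow) hides the gap but still has to be decided per
attribute; in the tree the question does not arise.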

The ability to commit semantic errors with a given markup scheme is
not interesting.  Anything expressive enough to be useful will have
that characteristic.

Maybe there are abstract examples which would support your maximalist
point of view, but with many real-life examples it is possible *and
easy* to keep markup consistent no matter what.

Ilya