perl-unicode

Re: In-Band Information Considered Harmful

1998-10-23 22:06:38
According to Ilya Zakharevich:
Chip Salzenberg writes:
Because Perl should not have to know the details and quirks of HTML,
XML, and all other encoding schemes.  It's a layering issue.

This is why we need an abstraction to which all the king's men map -
with as little loss of information as possible.

Could you please explain just what you have in mind?  There may be
holes in your answer, but I have no idea what kind of mapping you have
in mind.

      a) mark char or a sequence of chars (e.g. Bold)

An Emacsish system handles these OK, as I think you'll agree.  But I'd
break them down into categories (a1) marking chars [a la mass nouns]
and (a2) marking entire substrings as units [a la count nouns], since
those two subtypes are handled differently when extracting and merging.

No.  The second category of yours is a particular case of the category
"c" of mine - the case of "c" where the substring coincides with the
whole string.

I disagree that your case 'c' is relevant to this example.  What I had
in mind was the <URL>...</URL> tag.  That's not (as you describe for
'c') an expression of a relationship of the substring to the larger
string.  It's a plain standalone attribute -- but it's an attribute
that makes sense only in application to the URL as a whole, not an
extracted substring from the URL.

I stand by the bifurcation of (a1) and (a2).
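
To make the (a1)/(a2) split concrete, here is a minimal sketch of how
extraction might treat the two kinds differently.  Everything in it
(the Span class, the extract function, the "unit" flag) is a
hypothetical illustration, not an existing Perl or Emacs interface:
mass-style metadata is clipped to the extracted range, while
unit-style metadata such as the URL tag survives only if the whole
marked substring is extracted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive offset
    end: int     # exclusive offset
    name: str
    unit: bool   # True for (a2) "count" metadata, False for (a1) "mass" metadata


def extract(text, spans, lo, hi):
    """Return text[lo:hi] plus the spans that survive the extraction."""
    kept = []
    for s in spans:
        if s.unit:
            # (a2): meaningful only as a whole, so keep it only if fully inside.
            if lo <= s.start and s.end <= hi:
                kept.append(Span(s.start - lo, s.end - lo, s.name, True))
        else:
            # (a1): clip to the extracted range; drop it if nothing remains.
            a, b = max(s.start, lo), min(s.end, hi)
            if a < b:
                kept.append(Span(a - lo, b - lo, s.name, False))
    return text[lo:hi], kept


text = "see http://example.com/x now"
spans = [Span(0, 28, "bold", False), Span(4, 24, "url", True)]

# Part of the URL: "bold" is clipped and kept, "url" is dropped.
partial, partial_spans = extract(text, spans, 4, 15)

# The whole URL: both attributes survive.
whole, whole_spans = extract(text, spans, 4, 24)
```

Merging would presumably need a matching rule (adjacent mass spans of
the same name can be joined, unit spans cannot), but that is left out
of the sketch.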

      b) mark a boundary between chars (e.g. Footnotes)

I'd intended this to be countable metadata (category a2) attached to a
given position but with a length of zero.  But maybe that's not enough. 

It is not.  The behaviour wrt text insertion is different.  See below.

Your text below did not help me.  Please elaborate on the difference
in insertion behavior.

In any case, it may be possible to do without this type entirely
(HTML does).

Emacs needs this (markers).

Yes, you're right.  I'm convinced -- metadata must be attachable to
either substrings or points (zero-length substrings).
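
As a rough sketch of what "substrings or points" could look like in
practice, the fragment below stores each piece of metadata as a
(start, end, name) tuple, with a point being the case start == end,
and adjusts offsets when text is inserted.  The function name and the
rule for an insertion exactly at a point or a span boundary are
assumptions chosen for illustration; a real system would have to pick
its own policy there.

```python
def shift_on_insert(spans, pos, length):
    """Adjust (start, end, name) spans after inserting `length` chars at `pos`."""
    out = []
    for start, end, name in spans:
        if start == end:
            # Point metadata: moves only if the insertion is strictly before it.
            # Whether a point exactly at `pos` should advance is a policy choice.
            new = start + length if pos < start else start
            out.append((new, new, name))
        else:
            # Substring metadata: shifts if the insertion is at or before its
            # start, grows if the insertion is strictly inside it.
            new_start = start + length if pos <= start else start
            new_end = end + length if pos < end else end
            out.append((new_start, new_end, name))
    return out


spans = [(0, 5, "bold"), (5, 5, "footnote"), (5, 10, "italic")]

# Insert 3 characters inside the bold run: it grows, everything after shifts.
inside = shift_on_insert(spans, 2, 3)

# Insert 3 characters at offset 5, the shared boundary: under this policy the
# new text belongs to neither run, and the footnote point stays in front of it.
boundary = shift_on_insert(spans, 5, 3)
```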

      c) mark a substring of text as having a special relationship to
     a bigger substring of text (e.g. Tables)

Countable metadata (category a2) can't merge, so multiple countable
metas that cover overlapping areas are an easy representation of
nested tables:

     +--------- outer table ----------+
     |                                |
     |        + inner table +         |
     v        v             v         v

Nope.  You cannot insert anything *between* numerator and
denominator.  You can either insert something in numerator, or
denominator.

Your numerator/denominator example is a non sequitur, as best I can
tell.  How is it relevant?  Please elaborate.

Same with cells of a table.  You can assign an attribute "cell_5_4" to
a substring, and an attribute "cell_5_5" to an adjacent substring, but
this will break if you insert anything at the boundary of the cells.

The ability to commit semantic errors with a given markup scheme is
not interesting.  Anything expressive enough to be useful will have
that characteristic.

(Also, I was thinking of attribute "cell", no elaboration.  And
attribute "row" to cover all the cells.  But that's just a detail
of usage, not of technology.)
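
For what it's worth, the boundary problem described above can be shown
in a few lines.  This is only a sketch under an assumed policy (an
insertion at a shared boundary is absorbed by the enclosing "row" but
by neither adjacent "cell"); the function name and the offsets are
made up for the example.

```python
def uncovered(row, cells):
    """Return the offsets inside `row` that no cell span covers."""
    covered = set()
    for start, end in cells:
        covered.update(range(start, end))
    return [i for i in range(*row) if i not in covered]


# Two adjacent cells that tile a ten-character row exactly.
row_before = (0, 10)
cells_before = [(0, 5), (5, 10)]
gap_before = uncovered(row_before, cells_before)

# After inserting three characters at offset 5, assuming the row grows but
# neither cell absorbs the new text, the row is no longer fully tiled.
row_after = (0, 13)
cells_after = [(0, 5), (8, 13)]
gap_after = uncovered(row_after, cells_after)
```

A check like this is the kind of thing an application layer could run
after an edit; whether the string layer should prevent the gap in the
first place is exactly the point under dispute.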

I would guess that your oversimplified view on markup is related to
working with dead data.

You couldn't be more wrong.
-- 
Chip Salzenberg      - a.k.a. -      <chip(_at_)perlsupport(_dot_)com>
 "... under cover of afternoon in the biggest car in the county?!" //MST3K