perl-unicode

Re: In-Band Information Considered Harmful

1998-10-25 08:34:50
Chaim Frenkel <chaimf(_at_)pobox(_dot_)com> writes:
I don't follow you. If metadata and data are kept separate there wouldn't
be _any_ confusion as to what is text. 

True - by definition. But what is meta-data and which is data depends 
on the use you make of the data/meta-data composite. 

And how does Postscript, Tk get
into the picture? Rendering is after the underlying data is understood.

On the contrary rendering occurs without _needing_ to understand the data.
That is to say for rendering purposes the 
Bold/Italic/Newline/Paragraph/Underline 
etc. _are_ the data and the actual characters are mere parameters (meta-data)
to those, i.e. the sense of data/meta-data is flipped.

Which is not to say that splitting information into markup & content is 
not valuable (if near impossible), but rather that I will want to apply 
Perl operations to either.

In a rendering context I am far more likely to say "find bold word" (ignoring
which word it is) than I am to look for an occurrence of a particular word 
(ignoring its boldness).

Defining the content in terms of markup leaves would imply that no
text exists that does not participate in the metadata system somehow.

NI> In the limit :
NI> The byte 0x41 is markup, saying render upper-case variant of 1st letter
NI> of Latin alphabet. 

The point I am trying to make here is that you cannot neatly divide markup 
and content - and that this extends all the way down to the bits within
the bytes!  The 0x20 bit of ASCII letters is the upper/lower case markup bit.
From a content point of view 'The' and 'the' are more or less identical,
the only difference being that 'T' and 't' codings tell the rendering engine
to use a different glyph. In a sense you could say the upper-case-ness was 
implied (for English) by the preceding /\.\s+/ (i.e. period, then whitespace) markup.
Of course for German 0x20 is the "start-of-noun" bit ;-) which is meta-data
on another plane entirely!
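
To make the 0x20 point concrete, here is a minimal Perl sketch. The
helper names toggle_case_bit and strip_case_bit are made up for
illustration, and it assumes plain ASCII letters only (the trick does
not carry over to accented or non-Latin characters).

```perl
use strict;
use warnings;

# Flip bit 0x20 of every ASCII letter; every other character is left alone.
sub toggle_case_bit {
    my ($text) = @_;
    $text =~ s/([A-Za-z])/chr(ord($1) ^ 0x20)/ge;
    return $text;
}

# Clear bit 0x20 on lower-case ASCII letters, i.e. throw away the
# "case markup" and keep only the letter identity.
sub strip_case_bit {
    my ($text) = @_;
    $text =~ s/([a-z])/chr(ord($1) & 0xDF)/ge;
    return $text;
}

print toggle_case_bit("The"), "\n";
print strip_case_bit("The") eq strip_case_bit("the")
    ? "same content\n"
    : "different content\n";
```

Once the bit is masked off, 'The' and 'the' compare equal, which is the
"content point of view" above; the bit only ever told the renderer
which glyph to pick.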

The coding-as-markup is even more true in the case of Arabic
digits and Latin digits. (Since you have moved discussion to Unicode list...)
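
A rough sketch of the digit case, assuming "Arabic digits" here means
the Arabic-Indic digits U+0660..U+0669 and a perl recent enough to
have the /a modifier (5.14 or later). The helper names are invented
for the example.

```perl
use strict;
use warnings;

my $latin  = "3";          # U+0033 DIGIT THREE
my $arabic = "\x{0663}";   # U+0663 ARABIC-INDIC DIGIT THREE

# \d under Unicode rules accepts any decimal digit, whatever the script.
sub is_digit       { return $_[0] =~ /\A\d\z/  ? 1 : 0 }

# With /a, \d is restricted to the ASCII digits 0-9.
sub is_ascii_digit { return $_[0] =~ /\A\d\z/a ? 1 : 0 }

# The "content": offset of the code point from the zero of its own block.
sub digit_value {
    my ($ch) = @_;
    my $cp = ord($ch);
    return $cp - 0x30   if $cp >= 0x30   && $cp <= 0x39;
    return $cp - 0x0660 if $cp >= 0x0660 && $cp <= 0x0669;
    return undef;
}

printf "latin:  digit=%d ascii=%d value=%d\n",
    is_digit($latin), is_ascii_digit($latin), digit_value($latin);
printf "arabic: digit=%d ascii=%d value=%d\n",
    is_digit($arabic), is_ascii_digit($arabic), digit_value($arabic);
```

Both characters carry the same value; the choice of coding only says
which script's glyph to draw, and whether a match "sees" the
difference depends on which flavour of \d you asked for.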

And of course "\n" is definitely meta-data (markup) and not content at all ;-)
so /^/ is a markup-matching operation.
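
For what it is worth, a small sketch of that last point. It uses /^/
with the /m modifier so that ^ can match after embedded newlines;
line_starts is just an illustrative helper name.

```perl
use strict;
use warnings;

# Return the offsets at which /^/m matches. With /m, ^ matches at the
# start of the string and after each embedded "\n".
sub line_starts {
    my ($text) = @_;
    my @starts;
    while ($text =~ /^/mg) {
        push @starts, pos($text);
    }
    return @starts;
}

my $marked_up = "first line\nsecond line\nthird line";
(my $flat = $marked_up) =~ tr/\n/ /;   # same words, newline markup removed

print "with newlines:    @{[ line_starts($marked_up) ]}\n";
print "without newlines: @{[ line_starts($flat) ]}\n";
```

The words are identical in both strings; only the "\n" markup decides
where ^ is allowed to match.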

-- 
Nick Ing-Simmons