Chaim Frenkel <chaimf(_at_)pobox(_dot_)com> writes:
I don't follow you. If metadata and data are kept separate there wouldn't
be _any_ confusion as to what is text.
True - by definition. But what is meta-data and which is data depends
on the use you make of the data/meta-data composite.
And how does Postscript, Tk get
into the picture? Rendering is after the underlying data is understood.
On the contrary, rendering occurs without _needing_ to understand the data.
That is to say for rendering purposes the
Bold/Italic/Newline/Paragraph/Underline
etc. _are_ the data and the actual characters are mere parameters (meta-data)
to those, i.e. the sense of data/meta-data is flipped.
Which is not to say that splitting information into markup & content is
not valuable (if nearly impossible), but rather that I will want to apply
perl operations to either.
In a rendering context I am far more likely to say "find bold word" (ignoring
which word it is) than I am to look for an occurrence of a particular word
(ignoring its boldness).
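To make the two kinds of query concrete, here is a rough sketch in Python. The representation (a list of text/style runs) and the names find_styled and find_word are hypothetical, chosen only for illustration; nothing in this thread prescribes them.

```python
# Hypothetical representation: a document as a list of (text, styles) runs.
runs = [
    ("The", frozenset()),
    ("quick", frozenset({"bold"})),
    ("brown", frozenset({"italic"})),
    ("fox", frozenset({"bold", "underline"})),
]

def find_styled(runs, style):
    """Markup-driven query: words carrying `style`, whatever they say."""
    return [text for text, styles in runs if style in styles]

def find_word(runs, word):
    """Content-driven query: styles of each occurrence of `word`,
    whatever those styles happen to be."""
    return [styles for text, styles in runs if text == word]

bold_words = find_styled(runs, "bold")   # "find bold word"
fox_styles = find_word(runs, "fox")      # "find the word fox"
```

Both functions walk the same composite; which half counts as "data" depends only on which field the predicate tests.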
Defining the content in terms of markup leaves would imply that no
text exists that does not participate in the metadata system somehow.
NI> In the limit :
NI> The byte 0x41 is markup, saying render upper-case variant of 1st letter
NI> of Latin alphabet.
The point I am trying to make here is that you cannot neatly divide markup
and content - and that this extends all the way down to the bits within
the bytes! The 0x20 bit of ASCII letters is the upper/lower case markup bit.
From a content point of view 'The' and 'the' are more or less identical,
the only difference being that 'T' and 't' codings tell the rendering engine
to use a different glyph. In a sense you could say the upper-case-ness was
implied (for English) by the preceding /\.\s+/ (e.g. period space) markup.
Of course for German 0x20 is the "start-of-noun" bit ;-) which is meta-data
on another plane entirely!
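The case-bit argument can be sketched in Python, assuming plain ASCII text. The helper names strip_case_markup and reimply_case are made up for this example, and the regex is a Python adaptation of the /\.\s+/ idea with a start-of-string alternative added.

```python
import re

CASE_BIT = 0x20  # clear for 'A'..'Z', set for 'a'..'z' in ASCII

def strip_case_markup(text):
    """Set the 0x20 bit on ASCII upper-case letters, i.e. fold them to lower case."""
    return "".join(
        chr(ord(c) | CASE_BIT) if "A" <= c <= "Z" else c
        for c in text
    )

def reimply_case(text):
    """Flip the case bit on a lower-case letter at the start of the text
    or after period-plus-whitespace (the 'implied' English capital)."""
    return re.sub(
        r"(^|\.\s+)([a-z])",
        lambda m: m.group(1) + chr(ord(m.group(2)) ^ CASE_BIT),
        text,
    )

original = "The cat sat. The dog ran."
folded = strip_case_markup(original)
restored = reimply_case(folded)
```

For this sample the round trip restores the original, which is the sense in which the capital was redundant markup. It would not recover proper names or German nouns, where the bit carries information the punctuation does not.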
The coding-as-markup is even more true in the case of Arabic
digits and Latin digits. (Since you have moved discussion to the Unicode list...)
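As an illustration of the digit case, Python's standard unicodedata module reports the same decimal value for an ASCII digit and an Arabic-Indic digit, even though the code points differ. The choice of U+0663 as the sample is mine, not something from the thread.

```python
import unicodedata

latin_three = "3"        # U+0033 DIGIT THREE
arabic_three = "\u0663"  # U+0663 ARABIC-INDIC DIGIT THREE

# Same numeric content ...
same_value = unicodedata.decimal(latin_three) == unicodedata.decimal(arabic_three)

# ... different coding, which only tells the renderer which glyph to draw.
different_coding = latin_three != arabic_three
```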
And of course "\n" is definitely meta-data (markup) and not content at all ;-)
so /^/ is a markup-matching operation.
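A small Python sketch of that last remark: with re.MULTILINE (which plays the role of Perl's /m modifier), every position where ^ matches is either the start of the string or the position immediately after a "\n". The sample text is invented for the example.

```python
import re

text = "first line\nsecond line\nthird"

# Positions at which the zero-width pattern ^ matches in multiline mode.
line_starts = [m.start() for m in re.finditer(r"^", text, flags=re.MULTILINE)]

# Each match is defined by the preceding newline, not by any visible character.
follows_newline = all(i == 0 or text[i - 1] == "\n" for i in line_starts)
```

The pattern never consumes a character of "content"; it is located entirely by the newline markup around it.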
--
Nick Ing-Simmons