perl-unicode

Re: In-Band Information Considered Harmful

1998-10-25 10:18:16
"NI" == Nick Ing-Simmons <nick(_at_)ni-s(_dot_)u-net(_dot_)com> writes:

NI> Chaim Frenkel <chaimf(_at_)pobox(_dot_)com> writes:
I don't follow you. If metadata and data are kept separate there wouldn't
be _any_ confusion as to what is text. 

NI> True - by definition. But what is meta-data and which is data depends 
NI> on the use you make of the data/meta-data composite. 

I think we are in violent agreement. We are discussing where to draw the line.
Which will probably waver forever. 

And how do PostScript and Tk get
into the picture? Rendering comes after the underlying data is understood.

NI> On the contrary rendering occurs without _needing_ to understand the data.
NI> That is to say for rendering purposes the
NI> Bold/Italic/Newline/Paragraph/Underline etc. _are_ the data and the actual
NI> characters are mere parameters (meta-data) to those, i.e. the sense of
NI> data/meta-data is flipped.

Not really. What is happening is that the renderer _must_ work at multiple
levels. Here is the raw content of the Author. Here is the intention of
the Author. Here is the display device. Here are its qualities. Now make
the Author's intentions available to the Reader.

NI> Which is not to say that splitting information into markup & content is 
NI> not valuable (if near impossible), but rather that I will want to apply 
NI> perl operations to either.

NI> In a rendering context I am far more likely to say "find bold word"
NI> (ignoring which word it is) than I am to look for an occurrence of a
NI> particular word (ignoring its boldness).

I think a renderer would be more likely to ask "am I _looking_ at bold?"
than to search for bold. But that is still too low level. The renderer
should be looking at intentions. Bold/Italic is probably a lower level than
an Author should be working at. The actual display attributes should be
outside of the Author's content development.

Defining the content in terms of markup would imply that no
text exists that does not participate in the metadata system somehow.

NI> In the limit :
NI> The byte 0x41 is markup, saying render upper-case variant of 1st letter
NI> of Latin alphabet. 

No, that's a side-effect of an early hack by the code page designers. If
you're interested, I have an interesting book that discusses the development
of TELETYPE, ASCII, EBCDIC and why some of the decisions were made. Quite
a few were related to simplifying the circuits needed to implement them.

NI> The point I am trying to make here is that you cannot neatly divide markup 
NI> and content - and that this extends all the way down to the bits within
NI> the bytes!  The 0x20 bit of ASCII letters is the upper/lower case markup
NI> bit.
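
The bit Nick refers to can be seen with a few lines of perl. This is only a
minimal sketch: flip_case_bit is a hypothetical helper name, not something
from the thread, and it deliberately handles plain ASCII letters only.

```perl
use strict;
use warnings;

# Hypothetical helper: toggle the 0x20 bit of one ASCII letter.
# Anything that is not a single ASCII letter is returned unchanged.
sub flip_case_bit {
    my ($char) = @_;
    return $char unless $char =~ /\A[A-Za-z]\z/;
    return chr(ord($char) ^ 0x20);
}

# 'A' is 0x41; setting the 0x20 bit gives 0x61, which is 'a'.
printf "0x%02X -> 0x%02X\n", ord('A'), ord(flip_case_bit('A'));
```

Outside ASCII the relation between upper and lower case is not a single bit,
so this says nothing about Unicode case mapping in general.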

Uppercaseness is not markup, it is content. The Author, writing in the
language, decides where and what needs uppercase. If it were markup then the
renderer would be able to do all the work. Just consider what would happen to
your name if it were markup and not content.

NI> From a content point of view 'The' and 'the' are more or less identical,
NI> the only difference being that 'T' and 't' codings tell the rendering engine
NI> to use a different glyph. In a sense you could say the upper-case-ness was 
NI> implied (for English) by the preceding /\.\s+/ (e.g. period space) markup.
NI> Of course for German 0x20 is the "start-of-noun" bit ;-) which is meta-data
NI> on another plane entirely!
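
To make the /\.\s+/ idea concrete, here is a rough sketch of a renderer that
treats case purely as markup: it throws the stored case away and re-derives it
from sentence boundaries. render_sentence_case is an invented name, and the
rule is a deliberate oversimplification of English.

```perl
use strict;
use warnings;

# Hypothetical renderer: discard the stored case, then capitalise the
# first letter of the text and any letter following /\.\s+/.
sub render_sentence_case {
    my ($text) = @_;
    my $out = lc $text;
    $out =~ s/\A([a-z])/\u$1/;
    $out =~ s/(\.\s+)([a-z])/$1\u$2/g;
    return $out;
}

print render_sentence_case("the cat sat. the dog ran."), "\n";
print render_sentence_case("I asked Nick. He agreed."), "\n";
```

The first line comes back properly capitalised, but in the second the name
"Nick" is rendered as "nick", which is the case the sentence rule cannot
recover from the surrounding text.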


NI> The coding-as-markup is even more true in the case of Arabic
NI> digits and Latin digits. (Since you have moved discussion to unicode list...)

NI> And of course "\n" is definitely meta-data (markup) and not content at all ;-)
NI> so /^/ is a markup-matching operation.

No it isn't. Consider HTML or perl: there '\n' is simply
whitespace. Linebreaks are controlled by the renderer under control of
the tags and display device.
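
Both readings can be put side by side in perl. This is a sketch only; the two
sub names are made up for illustration, and collapse_whitespace is just a
stand-in for what an HTML-like renderer does with whitespace, not a real
HTML implementation.

```perl
use strict;
use warnings;

# Nick's reading: count the line starts that /^/ finds under /m.
sub count_line_starts {
    my ($text) = @_;
    my @starts = $text =~ /^/mg;
    return scalar @starts;
}

# The reading above: fold every whitespace run, newlines included,
# into a single space, as an HTML-like renderer would before layout.
sub collapse_whitespace {
    my ($text) = @_;
    (my $flat = $text) =~ s/\s+/ /g;
    return $flat;
}

my $source = "one\ntwo\nthree";
print count_line_starts($source), "\n";
print count_line_starts(collapse_whitespace($source)), "\n";
```

The raw string has three positions where /^/m matches. Once the newlines have
been folded away as ordinary whitespace there is only one, and any further
line breaks are left to the renderer.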

<chaim>
-- 
Chaim Frenkel                                        Nonlinear Knowledge, Inc.
chaimf(_at_)pobox(_dot_)com                                            
+1-718-236-0183