perl-unicode

Re: In-Band Information Considered Harmful

1998-10-25 14:30:58
Chaim Frenkel <chaimf(_at_)pobox(_dot_)com> writes:

I think we are in violent agreement. 

We two may be. I am not so sure about other contributors to the thread.

We are discussing where to draw the line.
Which will probably waver forever. 

And how does Postscript, Tk get
into the picture? Rendering is after the underlying data is understood.

NI> On the contrary rendering occurs without _needing_ to understand the data.
NI> That is to say for rendering purposes the Bold/Italic/Newline/Paragraph/Underline
NI> etc. _are_ the data and the actual characters are mere parameters (meta-data)
NI> to those, i.e. the sense of data/meta-data is flipped.

Not really. What is happening is that the renderer _must_ work at multiple
levels. Here is the raw content of the Author. Here is the intention of
the Author. Here is the display device. Here are its qualities. Now make
the Author's intentions available to the Reader.

NI> Which is not to say that splitting information into markup & content is
NI> not valuable (if near impossible), but rather that I will want to apply
NI> perl operations to either.

NI> In a rendering context I am far more likely to say "find bold word" (ignoring
NI> which word it is) than I am to look for an occurrence of a particular word
NI> (ignoring its boldness).

I think a renderer would be more likely to ask "am I _looking_ at bold?" rather
than searching for bold.

Well "find the next word which is not in same font as I am currently using"
is something my PostScript renderer does do. As is find next paragraph,
find next item in this bulleted list. etc.
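
As a rough illustration of that kind of query (a minimal sketch in Python,
not the PostScript renderer itself): the text is assumed to be already split
into (word, font) runs, and the function name and font names below are
hypothetical. Note that the search never inspects the characters of a word,
only its font.

```python
def next_font_change(runs, start):
    """Return the index of the first run after `start` whose font differs
    from the font of runs[start], or None if the font never changes."""
    current_font = runs[start][1]
    for i in range(start + 1, len(runs)):
        # Only the font is compared; the word itself is ignored.
        if runs[i][1] != current_font:
            return i
    return None


# Hypothetical sample data: (word, font) pairs.
runs = [
    ("In-band", "Times-Roman"),
    ("information", "Times-Roman"),
    ("considered", "Times-Bold"),
    ("harmful", "Times-Roman"),
]
```

Here next_font_change(runs, 0) lands on "considered", the first word not set
in the font in use at the start.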

But that is still too low level. The renderer
should be looking at intentions. Bold/Italic is probably lower than an
Author should be working. 

Depends, SGML purists would agree, but many authors like to mess at 
this intermediate level.

The actual display attributes should be outside
of the Author's content development.

Depends how artistic they are at the time.


Defining the content in terms of markup leaves would imply that no
text exists that does not participate in the metadata system somehow.

NI> In the limit :
NI> The byte 0x41 is markup, saying render upper-case variant of 1st letter
NI> of Latin alphabet. 

No that's a side-effect of an early hack by the code page designers. 

Sure, but that does not make it untrue.


Uppercaseness is not markup, it is content. The Author in the language
decides where and what needs uppercase. If it were markup then the renderer
would be able to do all the work. 

I once had to write a 300 page manual where it did exactly that. The 
formatter (an odd roff dialect) case-folded all author text, then inserted
uppercase after /\.\s+/. There was special markup to avoid this behaviour
for things like "e.g. "
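
The behaviour is easy to reproduce. The following is a minimal sketch in
Python, not the roff dialect in question; the function name is hypothetical,
and only the two steps described above (fold everything, then uppercase after
/\.\s+/) are modelled.

```python
import re


def refold(text):
    """Case-fold all text, then uppercase any letter that follows a
    full stop and whitespace (the /\\.\\s+/ rule)."""
    folded = text.lower()
    # Group 1 is the full stop plus whitespace, group 2 the letter to raise.
    return re.sub(
        r"(\.\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        folded,
    )
```

With this rule, refold("Ask Nick. he knows, e.g. this.") returns
"ask nick. He knows, e.g. This.": the name loses its capital, and the word
after "e.g. " gains one, which is why the escape markup was needed.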

Just consider what would happen to your
name if it were markup and not content.

That was a problem with the system in question ;-)

NI> And of course "\n" is definitely meta-data (markup) and not content at all ;-)
NI> so /^/ is a markup-matching operation.

No it isn't. 

Note the ;-) above.
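
For what the joke rests on: in multi-line mode, ^ matches at the start of
the string and immediately after each "\n", so it is keyed to the line
separator and not to any visible character. A minimal sketch in Python (whose
re.MULTILINE flag plays the role of perl's /m); the function name is
hypothetical.

```python
import re


def line_starts(text):
    """Return the offsets at which ^ matches in multi-line mode."""
    return [m.start() for m in re.finditer(r"^", text, flags=re.MULTILINE)]
```

For "one\ntwo\nthree" the matches fall at offsets 0, 4 and 8, i.e. at the
start and right after each of the two newlines.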

Consider HTML 

HTML is not sane, it is a pragmatic hotch-potch.

or perl, '\n' is simply
whitespace. 

Which is markup - to divide input into tokens.

Linebreaks are controlled by the renderer under control of
the tags 

Why cannot I define "\n" as a "tag"? It is compact, natural for user
and can make tagged text more readable ...

and display device.
-- 
Nick Ing-Simmons