Re: (i18n 101) Re: data announcement

This is getting *very* UNIX-specific, and very email-irrelevant.  Can we 
please drop "ietf-822(_at_)dimacs(_dot_)rutgers(_dot_)edu" from the 
distribution unless someone wants to draw it quickly back to email issues 
that can be discussed across system platforms?

I found Donn's note interesting, partially because some of it is 
historically incorrect.
  ***begin small tirade***
Various interesting rewritings of history notwithstanding, once upon a 
time there was a Multics project.  Regardless of its other strengths and 
weaknesses, Multics was probably the most intensively "designed" and 
critically debated and reviewed general-purpose operating system in the
history of the computer field.  UNIX started out, after Bell Labs left 
the Multics project, as an attempt to build something Multics-like with 
much more modest goals for machine sizes (and costs) and service base.  
To understand the fundamental early design decisions behind UNIX, you
really have to see them against a Multics backdrop: some things 
accepting Multics ideas uncritically (with the qualification that the 
UNIX principals and their management had been involved in those 
decisions in the first place) and other things done in other ways 
because of "learning" from Multics or reacting to Multics decisions that 
people disagreed with (either in retrospect or because they got outvoted 
the first time).

The UNIX file naming model was largely inherited.  Someone would have to 
go ask the authors, but I'd suspect that not a lot of thought was given 
to it at UNIX-time.

Now one of the important doctrines about Multics, from 1964 until after 
the Bell Labs departure, was an assumption that it was a design for a 
utility and that "users" would never see the base operating system 
primitives or their manifestations; they would work in applications 
environments of one flavor or another.  And those applications 
environments would be built separately, by separate groups, and would 
not require "installation" into the system in the way that characterizes 
several of the systems that Donn mentions.  If that is your model, you
need a very flexible naming structure so that applications can layer 
arrangements on top of it that suit *their* purposes.  You also need to 
keep all sorts of assumptions about how applications will be structured
completely out of the operating system, because they will end up either 
constraining applications yet-to-be-thought-of or providing channels for 
security problems.
  In OSI language, it means you want extremely clean layering, with 
absolutely no assumptions in the lower layers about what the upper 
layers are going to look like or do.  Multics' layering in this regard 
is much cleaner than that of UNIX, where clean layering was a victim of 
efficiency constraints and of different assumptions and goals.
  ***end small tirade***

A long, long, time ago, I was principal architect of an applications 
system that sat on top of Multics.  It used what is being called
"data announcement" this week, and used it down to the level of highly 
self-describing files, class operators that could figure out what to do 
based on the file descriptions, and similar things.  The operating
system layers didn't know about that system at all, which meant that it
had to do its own name space management, maintaining downward-looking
windows for accessing "normal" (undescribed) files when necessary.

Although I agree that data announcement is becoming necessary, putting
it in the file is the wrong answer, because that requires that all programs
be modified* to reflect that, and that some programs dealing with binary
data as "images" will have a fair amount of extra trouble dealing with it.
(It's very nice to have data aligned at zero in a file.)

  Donn, I think this is backwards.  You have to bind the file 
description to the file somehow, but how you do it should be buried 
in a sufficiently deep layer that you don't care anymore whether "bound" 
is expressed by "in the same file at the beginning", "in the same file 
at the end", or "somewhere else".  If programs that don't know about 
self-describing files start working on them, they are, in general, going 
to get very confused and/or going to screw things up, since, for some 
purposes, you end up with stuff in the description that is real 
sensitive to the data-content of the file.  If you have programs, or an 
operating system, that think "zero" (or the first byte of a file) is 
special, and aren't able to virtualize that information by doing a bit 
of pointer-offset calculation, then you have an environment in which it 
is harder to do these sorts of things: your implementation choices are 
either somewhat constrained, or programming-life is going to be harder.
  But Erik's hypothesised fancy GUIs are really not going to care about 
old, pre-description, files.  They can't do anything much with them 
besides display icons with question marks (or pictures of turkeys) on 
them.
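
The "pointer-offset calculation" mentioned above can be illustrated with a 
minimal sketch.  It assumes the description occupies the first 
`header_len` bytes of the file; the class name `ContentView` and its 
parameters are hypothetical and are not taken from any system discussed 
in this thread.

```python
import io


class ContentView:
    """Show file content as if it began at offset zero.

    Assumption: the description occupies the first `header_len` bytes
    of `raw`, and the content follows it immediately.
    """

    def __init__(self, raw, header_len):
        self._raw = raw
        self._base = header_len
        raw.seek(header_len)  # position at the virtual offset zero

    def seek(self, offset):
        # Translate a content-relative offset into a physical one.
        if offset < 0:
            raise ValueError("offset must be non-negative")
        self._raw.seek(self._base + offset)
        return offset

    def tell(self):
        # Report the position relative to the start of the content.
        return self._raw.tell() - self._base

    def read(self, size=-1):
        return self._raw.read(size)
```

A program handed a `ContentView` never sees the description bytes, so it 
does not need to know whether the description is stored before the 
content, after it, or somewhere else.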

Additionally, simply extending the inode implies something approaching
omniscience on the part of the standardizers; what attributes are needed
and what are "temporary" needs, what are the future needs, and how much
room is "right" for expansion are all rather nasty problems to solve.

  Well, there is another way to do it, and that is to dump the inode as 
you know it (inadequate model for this sort of thing, as you imply) and 
replace it by a tuple that identifies the type of description (the 
secret for pulling "what attributes are needed..." out of the operating 
system) and the location of the description (the primitive for the 
file-content/file-description binding mentioned above).  The location could be a
file name (if one was very careful about who got hold of "rm" and what 
they did with it and had a higher-level "delete" abstraction).  It could 
also be, e.g., an offset to where the description sat at the *top* of 
the physical file if you liked doing "seeks" or the equivalent and 
wanted the content at zero.  Lots of possibilities, including having a 
description type code for "old inode" for when that is necessary.
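
As a rough illustration of such a tuple, here is a minimal sketch.  All 
names (`DescriptionRef`, `InFile`, `InOtherFile`, `load_description`) are 
hypothetical, and the sketch assumes a description stored in the content 
file runs from the given offset to the end of that file.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class InFile:
    """Description stored in the content file, starting at `offset`."""
    offset: int


@dataclass(frozen=True)
class InOtherFile:
    """Description stored in a separate file named by `path`."""
    path: str


@dataclass(frozen=True)
class DescriptionRef:
    """The (description type, description location) tuple.

    `type_code` is opaque at this level; only the applications that
    understand a given type interpret the description itself.
    A location of None stands for an "old inode" style file that has
    no description at all.
    """
    type_code: str
    location: Optional[Union[InFile, InOtherFile]]


def load_description(content_path, ref):
    """Return the raw description bytes, or None if there is none."""
    loc = ref.location
    if loc is None:
        return None
    if isinstance(loc, InOtherFile):
        with open(loc.path, "rb") as f:
            return f.read()
    # Assumption: the description runs from `offset` to end of file,
    # which leaves the content itself at offset zero.
    with open(content_path, "rb") as f:
        f.seek(loc.offset)
        return f.read()
```

The lower layer only follows the location; deciding which attributes 
exist is left to whoever defines the type code.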

The fact that UNIX doesn't have typed files in the sense of this discussion
is NOT an accident, and is NOT the consequence of "sloppy design", but rather
quite intentional and thought out.

  Yes.  In Multics.

(Too strongly) typed files are more trouble than they're worth!

  More to the point, operating systems should not try to construe type 
information into file names.  THAT leads to bad problems.  Applications 
may be in a different situation.

To me, what makes sense is to develop a convention where if there is typing
information on a file, it is somehow associated with the file by name
and location, in another ordinary file.

  Most of these conventions get you into trouble eventually, for the 
same reasons that file name => type conventions do.  The havoc that can 
be created by people running around removing things they don't 
understand is the tip of the iceberg, but illustrative.

Yes, I realize that this creates a bookkeeping mess (and thus is a
"research problem")

  Nothing serious, really.

but it doesn't 
violate the design of UNIX (which is "kiss")

  I'd say it is "operating system keeps its hands off" and I think one 
could debate how well UNIX succeeds in that.  As I suggest above, there 
were compromises...

and it doesn't change the interfaces.

  Probably does if it is useful, if you are talking about interfaces to 
the user.  At the file system primitive level, of course not.  These are 
applications problems.

(Example: for each file <foo.bar> if there exists a 
file .,<foo.bar> (leading . to make it invisible, "," to keep it pretty
unique) then that file contains typing information.

  Might work.  Think about other models, including having a description 
directory such that the description for "foo.bar" is always in 
./descr/foo.bar (or ./.,descr/foo.bar if your UNIX implementation will 
permit that).  Makes it easier to write "don't put anything there, don't 
remove anything from there" rules.  You still need a model to prevent 
removing content without removing description.
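
A minimal sketch of the description-directory model, assuming the 
directory is named "descr" as in the example above.  The helper names 
`description_path` and `delete` are hypothetical; `delete` stands in for 
the higher-level abstraction that removes content and description 
together.

```python
import os

DESCR_DIR = "descr"  # directory name taken from the ./descr/ example


def description_path(content_path):
    """Map <dir>/foo.bar to <dir>/descr/foo.bar."""
    directory, name = os.path.split(content_path)
    return os.path.join(directory, DESCR_DIR, name)


def delete(content_path):
    """Remove a file together with its description, if it has one."""
    descr = description_path(content_path)
    os.remove(content_path)
    if os.path.exists(descr):
        os.remove(descr)
```

This only covers callers that go through `delete`; anyone using a plain 
"rm" on the content can still leave an orphaned description behind.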

By using a text file
with "field name: value" type of format, it's infinitely extendable without
ever breaking existing programs (if they're written from the beginning to
ignore fields they don't understand).

  Our experience is that you need one additional layer of abstraction -- 
the description type information -- and, after that, you can use models 
that are both more simple and more efficient.
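
To make the two ideas concrete, here is a minimal sketch that combines a 
"field name: value" reader that ignores unknown fields with a dispatch on 
description type.  The type code "text-fields", the field names, and the 
function names are all hypothetical, chosen only for illustration.

```python
def parse_fields(text, known):
    """Parse "field name: value" lines, ignoring unknown fields."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line; skip it
        name = name.strip().lower()
        if name in known:
            fields[name] = value.strip()
    return fields


# One reader per description type.  Supporting a new type means adding
# an entry here, not changing the readers that already exist.
READERS = {
    "text-fields": lambda body: parse_fields(body, {"charset", "content"}),
}


def read_description(type_code, body):
    """Dispatch on the description type; None if the type is unknown."""
    reader = READERS.get(type_code)
    if reader is None:
        return None
    return reader(body)
```

With the type code as the first thing examined, a description of some 
other type can use a more compact format without affecting programs that 
only understand the text-field form.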

    --john