data announcement

Dear Folks,

The following is a proposed solution for the problem of "data
announcement" or "tagging data". That is, if we have a series of bits
(binary digits), how do we, and how do computers, know what it is?

At the moment, the IETF (Internet Engineering Task Force) is
discussing ways of extending the electronic mail formats being used in
the Internet. This message is an example of such a piece of email.
Those of you whose mail programs do not hide the headers will see
the email headers at the top of the message. The IETF is now
standardizing some extended headers that will allow absolutely
anything to be sent through email. Right now, we can only send ASCII
text. We can send executable binaries by "uuencoding" them into ASCII,
but often the receiver has to decode and extract the binary by hand.
The extended headers would allow automatic extraction. In short, the
new email headers will say what this sequence of bits is.

One of the beauties of the UNIX system is that *every* file is just a
series of bytes. There is no need to know about complex file types,
such as fixed-width records, variable-width records, etc. They are
just a series of bytes. The applications work out how to deal with the
series of bytes. Simple.

Unfortunately, this simplicity is also a drawback. We can only put so
much into an "inode" (a block of data associated with each file,
giving its size, owner, etc). It is, in general, difficult to extend
the inode without upsetting the whole filesystem. However, we would
like to extend the inode, since we have, over the years, come up with
quite a few ideas for extensions to the inode. For example, on the
Macintosh, all you need to do to get an application going is to double
click on a file. The corresponding application is started up by the
system, and the data file is fed to the application. On UNIX, this is
rather difficult to do, since we have no place to put the information
required to find the application, right?

Wrong. If all UNIX files start with a series of bytes that constitute
an ASCII header, much like this message's header, we can put all sorts
of information there. The end of the header is found by looking for
the first empty line. So we can extend the header many times and not
have to worry about fouling up any applications, since applications
can simply pick up the information that they need and then skip to the
end of the header. If something has been added to the header that the
user wishes the application to notice, she will just have to upgrade
her application. Simple.
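
Here is a minimal sketch of that parsing step in Python. It assumes
header lines of the form "Name: value" and a header that ends at the
first empty line. The function name split_header and the field names
in the example are hypothetical, chosen only for illustration.

```python
def split_header(data: bytes):
    """Split a file's bytes into (fields, body).

    Assumes an ASCII header of "Name: value" lines that ends at the
    first empty line. Fields the caller does not know are still
    returned, and the caller is free to ignore them.
    """
    # Accept LF or CRLF line endings; use whichever empty line comes first.
    candidates = [(data.find(sep), len(sep)) for sep in (b"\r\n\r\n", b"\n\n")]
    candidates = [(pos, n) for pos, n in candidates if pos != -1]
    if not candidates:
        raise ValueError("no empty line found, so the header never ends")
    pos, n = min(candidates)
    header, body = data[:pos], data[pos + n:]

    fields = {}
    for line in header.decode("ascii").splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        fields[name.strip().lower()] = value.strip()
    return fields, body


# An application picks out what it needs and skips the rest.
sample = b"Application: FrameMaker\nX-Future-Field: anything\n\n\x00\x01body"
fields, body = split_header(sample)
print(fields["application"])  # FrameMaker
```

The body is handed back untouched as bytes, so it can hold anything at
all; only the header is required to be ASCII.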

The crucial thing to notice here is that the new extended email
format is intended to allow a sender to transmit to a receiver
everything the receiver ever wanted to know about the data. Since this
sequence of bits, flowing across the network, encapsulates as much
useful information as we can imagine, this same series of bits is
eminently suitable for storing in a "file".

Now we need to lay down some rules that are universal and forever.
The header *must* always be in ASCII. Period. People on EBCDIC
machines may complain, but you can't convince me that IBMs are not
powerful enough to understand ASCII. After all, we have gateways that
convert between ASCII and EBCDIC for email between the Internet and
BITNET. Above all, the advantage of keeping the header in ASCII is
that even if you network-mount an EBCDIC filesystem from an
ASCII-based system, the latter at least has the *possibility* of
trying to interpret the file. This may involve EBCDIC->ASCII
conversions of the body of the "message". (Maybe.)

Some people may also comment that ASCII headers are biased towards
English-speaking populations. I don't think this will be a problem.
For one thing, although English is not a universal language, it is
rapidly becoming a universal *second* language. For another, we can
always have a second header that is in some other character encoding,
thus allowing non-English names, subject titles, etc.

Also, the ASCII header may be large (especially in comparison to small
pieces of data), but disk prices are coming down, and so are
transmission costs. This is not a very high price to pay for great
readability and great extensibility.

It is possible that, at some time in the future, a user will receive a
piece of email that she does not have any application programs for.
The advantage of this ASCII header scheme is that she can then take a
quick look at the header to find the name of the application. It
might say, "FrameMaker". She looks up this name in a central directory
(offered through a network service), and finds that she can purchase
this product at such-and-such an address.

One of the questions is: What do we put in these headers? If we double
click on a file, then the system needs to find the application. We
could put a path, e.g. /usr/bin, for UNIX systems. However, if we send
this file as a message across the network, "/usr/bin" may not mean
anything to a Macintosh. This does not mean that we *cannot* put this
in the header. The Macintosh can always ignore the "UNIX-Path:" header.
However, we may like to put this kind of information in a central
directory. For example, a UNIX host might map "FrameMaker" to
"/usr/bin" in one of the "/etc" administrative files. The
possibilities are endless.
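
To make the lookup a little more concrete, here is a rough Python
sketch. The table format (one "name path" pair per line, with "#"
comments), the function names, and the rule that application names
contain no spaces are all assumptions made for illustration; only the
"UNIX-Path:" field and the FrameMaker-to-/usr/bin mapping come from
the text above.

```python
def parse_app_table(text: str) -> dict:
    """Parse a hypothetical admin file: one 'ApplicationName path' per line."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or incomplete lines
        name, path = parts
        table[name.lower()] = path.strip()
    return table


def locate_application(fields: dict, table: dict, is_unix: bool = True):
    """Return a path for the file's application, or None if unknown.

    `fields` maps lower-cased header names to values. A 'unix-path'
    field is used only on UNIX hosts; any other system ignores it and
    falls back to its own local table.
    """
    if is_unix and "unix-path" in fields:
        return fields["unix-path"]
    app = fields.get("application")
    if app is None:
        return None
    return table.get(app.lower())


table = parse_app_table("# local applications\nFrameMaker /usr/bin\n")
print(locate_application({"application": "FrameMaker"}, table))  # /usr/bin
```

A host that has never heard of the application simply gets None back,
which is the point at which the user would consult the central
directory.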

There are also some things that we probably do *not* want to put in
the headers, e.g. the size of the data. As a message on the network,
the size is already established by a higher level protocol (or lower
level, depending on your point of view), in this case, SMTP. The
message is sent after the DATA<CR><LF> command and ends with
<CR><LF>.<CR><LF> (I think). As a file on disk, the size is known by
the filesystem (e.g. in UNIX, the inode's size field). The only "size"
header that we may want to include is the number of bits in the last
byte (which may be 7 bits or 8 bits or whatever, depending on the
protocol). If we specify the number of bits in the last byte, we can
get the granularity right down to the smallest unit (the bit).
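
As a small worked example in Python, suppose the filesystem reports
the body size in 8-bit bytes and a header field gives the number of
significant bits in the last byte. The name last_byte_bits is my
invention for this sketch, not a proposed field name.

```python
def total_bits(body_size_bytes: int, last_byte_bits: int = 8) -> int:
    """Exact length of the body in bits.

    All bytes except the last are assumed to be full 8-bit bytes; the
    last byte carries `last_byte_bits` significant bits.
    """
    if body_size_bytes < 0:
        raise ValueError("size cannot be negative")
    if body_size_bytes == 0:
        return 0
    if not 1 <= last_byte_bits <= 8:
        raise ValueError("last_byte_bits must be between 1 and 8")
    return (body_size_bytes - 1) * 8 + last_byte_bits


print(total_bits(3, 5))  # 21
```

With the default of 8, the result is just the ordinary byte count
times eight, so files that omit the field are unaffected.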

But what about migration? How on Earth do we get from here, to there?
Well, there is no real need to update *all* files at once. We can
solve this problem piece by piece. For example, if we have some
FrameMaker files, we shouldn't add the ASCII headers until we have
upgraded our copy of the FrameMaker application. If, after upgrading,
we still have some FrameMaker files *without* the header (e.g. on a
remote-mounted system), the application can always check for (the
absence of) the header, and behave intelligently. If some existing
files *already* contain an email-like header, the application can
check for new keywords (like "Application:"), which those files
probably won't contain. If they do, tough luck. What can I say?
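
The check itself could look something like this Python sketch. It
treats a file as "new style" only if it starts with well-formed
"Name: value" lines, includes the keyword, and ends the header with an
empty line. The function name and the exact rule for what counts as a
header line are assumptions of mine.

```python
def has_new_header(data: bytes, keyword: str = "Application") -> bool:
    """Guess whether `data` starts with a header containing `keyword`."""
    wanted = keyword.lower().encode("ascii")
    found = False
    for line in data.splitlines():
        if line == b"":
            # Empty line: the header is complete. New style only if
            # the keyword was seen before this point.
            return found
        name, sep, _ = line.partition(b":")
        # A field name must be non-empty printable ASCII without spaces.
        if not sep or not name or not all(33 <= c <= 126 for c in name):
            return False  # not a header line: treat as an old-style file
        if name.lower() == wanted:
            found = True
    return False  # ran out of data before the header was terminated


print(has_new_header(b"Application: FrameMaker\n\n...body..."))  # True
print(has_new_header(b"From: someone\nSubject: hi\n\nbody"))     # False
```

An old file that happens to begin with an email-like header fails the
keyword test and is read the old way, unless it really does contain
"Application:", which is the tough-luck case.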

Actually, what do *you* say? Is this idea just totally infeasible?


Regards,

Erik M. van der Poel                                      
erik(_at_)sra(_dot_)co(_dot_)jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692