Re: Comments on MIME/SGML

In message <9403072210.AA02671@Accurate.COM>, Ed Levinson writes:


The essence of my proposal is to replace the "dtd" parameter with "prolog"
and to require both prolog and instance.  The reason I suggest this
approach is practical: various implementations treat these two document
elements differently.


Hmmm... my reasons were chiefly practical too; they were based on
experience with the SGMLs package. Could you give some background (or
pointers to materials I should read) about these "various
implementations" that treat the prologue and the instance differently?

As to using text/sgml or application/sgml, I chose application to keep
within expressed boundaries others in the MIME community have
suggested.  Namely, that text be reserved for very simple things.


Formally speaking, it's a coin toss. I agree we should go with
whatever precedents are out there. application/sgml is fine. I just
don't like it when MH uses base64 encoding on my html body parts when
I know most of my audience can read html source -- perhaps I just need
to learn to use my tools better.

I like the correspondences you provided; it may be easier to explain
what is happening using your table.  I summarize it, with my own
suggestions, below.

Do you find my proposal acceptable?


It's acceptable, but I'm not sure it's optimal yet.
Let's take another whack at the SGML->MIME correspondence:
[I won't comment on the SDIF terms as I haven't read the SDIF standard.]

      SGML:                   MIME:
      notation (type)         Content-Type:
      SYSTEM identifier       Content-ID:
      data entity             Body Part


I can't find the term "notation type" in the SGML standard. I have
found:

        4.75 data content notation: An application-specific
        interpretation of an element's data content, or of a non-SGML
        data entity, that usually extends or differs from the normal
        meaning of the document character set.

and
        4.213 notation identifier: An _external identifier_ that
        identifies a data content notation in a _notation
        declaration_. It can be a _public identifier_ if the notation
        is public, and, if not, a description or other information
        sufficient to invoke a program to interpret the notation.

Also, "data entity" is not a term from the standard. We could use:

        4.134 external entity: An entity whose text is not
        incorporated directly in an entity declaration; its
        system identifier and/or public identifier is specified
        instead.

When I look at this closely, there's some redundancy: in SGML, the
choice of notations is expressed in the ENTITY declaration along with
the "filename" info. In MIME, the content type is expressed in the
referenced body part. When using MIME/SGML, we have to put it in both
places.

Is the connection between SGML notation identifiers and the MIME
Content-Type syntax supposed to be explicit, or is there an implicit
correspondence between a MIME content type and an SGML data content
notation? For example, does the MIME content type show up explicitly
in the NOTATION declaration, like this:

        --8<
        Content-Type: application/postscript
        Content-ID: id1

        %!PS-Adobe...

        --8<
        Content-Type: application/sgml

        <!DOCTYPE T SYSTEM [
        <!NOTATION ps SYSTEM "application/postscript">
        <!ENTITY fig1 SYSTEM "id1" NDATA ps>
        ]>
        ...
        --8<--

or is it sufficient to write:

        --8<
        Content-Type: application/postscript
        Content-ID: id1

        %!PS-Adobe...

        --8<
        Content-Type: application/sgml

        <!DOCTYPE T SYSTEM [
        <!NOTATION ps PUBLIC "-//Adobe/PostScript" -- exact syntax? -->
        <!ENTITY fig1 SYSTEM "id1" NDATA ps>
        ]>
        ...
        --8<--

Hmmm... the implicit connection is probably more practical, but it
introduces redundancy and the chance for errors. The explicit mapping
causes the namespace of SYSTEM identifiers to include MIME
content-types. Blech.

      marked up text          Application/SGML
      document                Multipart/SGML


Up to here, we have been using terms from the standard. Your
suggestion to introduce the term "marked up text" is a departure from
what seemed like an otherwise elegant proposal. It's still
well-defined, but in application-specific terms rather than in SGML
standard terms. The question is whether practical considerations
sufficiently motivate the departure.

I suggested that we make the formal correspondence between the following:

        SGML entity             body of Application/SGML body part
c.f.:
        4.284 SGML entity: An entity whose characters are interpreted
        as markup or data in accordance with this International
        Standard.

The idea here is that MIME plays the role of entity manager, and MIME
body parts map 1-1 to SGML entities. The first production in the
standard is:

        [1] SGML document = SGML document entity
                (SGML subdocument entity |
                SGML text entity | non-SGML data entity)*

You can't split the prologue and the instance across SGML entities.
But you _can_ split the SGML document entity across system-specific
objects:

        NOTES
        1 This International Standard does not constrain the physical
        organization of the document within the data stream, message
        handling protocol, filesystem, etc., that contains it. In
        particular, separate entities could occur in the same physical
        object, a single entity could be divided between multiple
        objects, and the objects could occur in any order

Using the example I originally sent, we had:
                                                                SGML term or
        Content-ID:                     Contents                App convention
        <10024.761615492.3@ulua>        SGML document           App
        <10024.761615492.4@ulua>        external entity         SGML
        <10024.761615492.5@ulua>        SGML document entity    SGML
        <10024.761615492.6@ulua>        SGML text entity        SGML
        <10024.761615492.7@ulua>        SGML declaration        App

Your suggestion makes it look like:
                                                                SGML term or
        Content-ID:                     Contents                App convention
        <10024.761615492.3@ulua>        SGML document           App
        <10024.761615492.4@ulua>        external entity         SGML
        <10024.761615492.5@ulua>        prolog                  App
        <10024.761615492.6@ulua>        external entity         SGML
        <10024.761615492.7@ulua>        declaration             App
        <10024.761615492.8@ulua>        instance                App

But in the end, it's not really critical that SGML text entities map
exactly to MIME body parts (even my proposal did app-specific stuff
with the SGML declaration). [Hmmm... until you start talking about
subdocument entities... I think a concrete example of this is in order.]

The critical thing is how all this interacts with available (and
conceivable) tools. For example, with either of the above examples, I
could do
        
        mhn store cur

and get several files: 4.sgml, 5.sgml, 6.sgml, ...
After I replace system identifiers (SYSTEM "10024.761615492.6@ulua")
with filenames (SYSTEM "6.sgml") in those files, I could validate the
document using:

        sgmls -s 7.sgml 5.sgml          # Connolly's version, or
        sgmls -s 7.sgml 5.sgml 8.sgml   # Levinson's version

Hmmm... about replacing system identifiers... this could be a _really_
tedious process. I wonder if we could get rid of this step somehow
(with something like the original Content-Reference stuff?). Let's
see... you could leave the SGML declaration body part alone. Then you
have to process the other parts in the order they will be presented to
the SGML parser... in fact, I think you have to parse them! Consider
the following pathological case:

foo.sgml:
        <!DOCTYPE T [
        <!ELEMENT T - - ANY>
        <!ENTITY example SYSTEM "ex1.sgml">
        ]>
        <T>blah blah, for example:
        <![ RCDATA [ &example; ]]>
        </T>

ex1.sgml:
        <!ENTITY foo SYSTEM "fake-file">

All the characters in ex1.sgml are data, even though they look like
markup.
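
To make the hazard concrete, here is a minimal Python sketch of a purely
textual rewriter. The function name rewrite_system_ids and the
cid_to_file mapping are hypothetical, not part of any existing tool, and
the sketch assumes system identifiers only ever appear as double-quoted
literals right after the keyword SYSTEM.

```python
import re

# Assumption: a system identifier is a double-quoted literal after SYSTEM.
# Single-quoted literals and other SGML syntax are not handled here.
SYSTEM_LITERAL = re.compile(r'(SYSTEM\s+)"([^"]*)"')


def rewrite_system_ids(text, cid_to_file):
    """Replace SYSTEM "content-id" with SYSTEM "filename" by text matching.

    cid_to_file maps a Content-ID string to a local file name.
    Literals that are not keys of the mapping are left untouched.
    """
    def swap(match):
        filename = cid_to_file.get(match.group(2))
        if filename is None:
            return match.group(0)  # unknown identifier: keep as written
        return match.group(1) + '"' + filename + '"'

    return SYSTEM_LITERAL.sub(swap, text)
```

Run over the text of ex1.sgml with a mapping that happens to contain
"fake-file", this rewrites the literal even though the parser will treat
those characters as data. A text-only tool cannot tell the two cases
apart; only something that parses the document can.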

        [AAARGH!!! My X server just died and emacs lost my last 3 hours'
        work on this message!]

Quickly, before I forget:

* As it stands, the MIME/SGML packer/unpacker cannot be implemented as
an SGML layer over MIME or as a MIME layer over SGML -- it must be a
piece of software that understands both simultaneously (see the above
entity usage). I suggest that instead of messing with the SYSTEM
identifiers in the data stream, we do an external mapping. Using the
above example, the packer would write:

        Content-Type: multipart/sgml; boundary="xxx";
                document="id2"; entity-map="id1"

        --xxx
        Content-Type: application/sgml-entity-map

        <id2>   "foo.sgml"
        <id3>   "ex1.sgml"

        --xxx
        Content-Type: application/sgml; name="foo.sgml"

        <!DOCTYPE T [
        <!ELEMENT T - - ANY>
        <!ENTITY example SYSTEM "ex1.sgml">
        ]>
        <T>blah blah, for example:
        <![ RCDATA [ &example; ]]>
        </T>

        --xxx
        Content-Type: application/sgml; name="ex1.sgml"

        <!ENTITY foo SYSTEM "fake-file">

        --xxx--

For most cases, this makes the packer and unpacker trivial -- it works
just like application/octet-stream. For cases where the sender's
filenames can't be encoded in the MIME name parameter, or cases where
the syntaxes of the sender and receiver's filesystems are different,
the entity-map provides sufficient information to make the necessary
translation.
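
As a rough illustration of how little the unpacker would need to do,
here is a Python sketch. It assumes the entity-map body holds one
<content-id> "filename" pair per line, as in the example above; that line
format, and the names parse_entity_map and plan_unpack, are hypothetical
and not defined by any draft.

```python
import re

# Assumed line format, taken from the example: <content-id>   "filename"
MAP_LINE = re.compile(r'^\s*<([^>]+)>\s+"([^"]*)"\s*$')


def parse_entity_map(text):
    """Return {content_id: sender_filename}; blank lines are skipped."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        match = MAP_LINE.match(line)
        if match is None:
            raise ValueError("unrecognized entity-map line: %r" % line)
        mapping[match.group(1)] = match.group(2)
    return mapping


def plan_unpack(mapping, parts, rename=lambda name: name):
    """Decide which local file each body part should be written to.

    parts maps content-id to body text. rename translates the sender's
    file name into something legal on the receiving filesystem; the
    default keeps the name unchanged. Parts absent from the map are
    ignored.
    """
    return {rename(mapping[cid]): body
            for cid, body in parts.items() if cid in mapping}
```

The bodies themselves are never edited: the SYSTEM identifiers inside
them keep referring to the sender's names, and only the rename step has
to know about the local filesystem.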

* The character set section of the MIME/SGML draft is overly brief and
uses the nebulous term "ASCII." It should use the term US-ASCII, which
is well-defined in the Internet community, and equate it to
ISO-646-1983, which is the character set from the default SGML
declaration. It should also give at least one complete example of
using another character set (for example ISO-Latin-1 -- I tried for
weeks to figure out how to spell that in SGML).

* We need examples of usage of subdocument entities. I think this is
another factor that motivates the mapping of an SGML document entity
onto a single MIME body part (the alternative is to represent an SGML
subdocument entity as another multipart/sgml body part, then extract
the prologue and instance body parts, and concatenate them together
-- then you have the subdocument entity. Workable, but clumsy...)

* It's not clear how the single application/sgml body part works. The
example given was:

        Content-Type: application/SGML;
          dtd="-//USA-DOD//DTD MIL-M-21742 911001//EN"

        <! ... an SGML instance >

This implies an algorithm for producing an SGML document entity from
a public identifier for a DTD and an instance. I don't quite see how
to do this in general (what's the name of the DOCTYPE?).
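
To show where the gap is, here is a Python sketch of the obvious
construction: prefix the instance with a DOCTYPE declaration that refers
to the DTD by public identifier. The function wrap_instance is
hypothetical, and its fallback of taking the document type name from the
first start-tag is purely my assumption; nothing in the draft says so,
and it fails whenever that start-tag is omitted.

```python
import re

# Assumption: the first thing that looks like a start-tag names the
# document element. "<!" and "<?" do not match, so comments and
# processing instructions are skipped.
FIRST_START_TAG = re.compile(r'<([A-Za-z][A-Za-z0-9.-]*)')


def wrap_instance(public_id, instance, doctype_name=None):
    """Build a document entity from a DTD public identifier and an instance."""
    if doctype_name is None:
        match = FIRST_START_TAG.search(instance)
        if match is None:
            raise ValueError("no start-tag found; supply the DOCTYPE name")
        doctype_name = match.group(1)
    return '<!DOCTYPE %s PUBLIC "%s">\n%s' % (doctype_name, public_id, instance)
```

The doctype_name argument is the piece of information the dtd parameter
alone does not carry.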

Dan