SGML compliance/compatibility

Dave Crocker <dcrocker(_at_)mordor(_dot_)stanford(_dot_)edu> writes:
|   
|   In casual discussion, I haven't found anyone particularly concerned
|   with SGML compatibility.  Maybe even some preference to avoid the claim.
|   (I don't have a strong opinion, just the sense that we may not need to
|   be tied to that boat.)

    I'm particularly concerned with SGML compatibility.  There has
been some discussion about it, and the authors of the RFC have
expressed quite clear animosity towards SGML, while paying lip service
to the syntax used by the language.  I'm not exactly pleased by that,
and Walt Daniels <dan(_at_)watson(_dot_)ibm(_dot_)com> has expressed what I
felt to begin with about this.  However, I think we could benefit from
moving closer to SGML.  We should at least not make it impossible to use
richtext with SGML tools.

    I've read the draft over a couple times, and I think the changes
that would have to be made to the draft in order to make it SGML
compatible (and to pave the way for a full SGML application, thus
benefiting from the many tools that can then be used on the text) are
minor.

    The <comment> "tag" is one example.  As it stands, a <comment>
will effectively hide all text up to and including the first
</comment>, regardless of nesting, and regardless of other </...>
tags.  Not only is this simple-minded, it's very difficult to handle
in SGML.  SGML has marked sections (<![ IGNORE [like this]]>), which
can be used for the purpose of hiding content, but they nest.  SGML
looks at delimiters and recognizes the </ followed by a letter as an
"end tag open", which will terminate a <comment> even if it's not
</comment>.  Spurious </...> end-tags in the middle of the comment
will thus prematurely terminate the comment.  The comment is thus also
a non-hierarchical element.  E.g. the following grossness is legal:
<a><comment><b></a><a></b></comment></a>.  I don't see any need for
this added complexity in handling one specific tag very differently
than the others.
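
    To make the difference concrete, here is a minimal Python sketch of
the two behaviours described above.  The function names are mine and
purely illustrative; the first mimics the draft's "hide everything up to
the first </comment>" rule, the second only locates the first </ followed
by a letter, which is where the delimiter recognition described above
would end the comment.  It is not a parser for either language.

```python
import re

OPEN = "<comment>"
CLOSE = "</comment>"

def strip_richtext_comments(text: str) -> str:
    """Drop each <comment> through the first </comment>, ignoring nesting."""
    out = []
    pos = 0
    while True:
        start = text.find(OPEN, pos)
        if start == -1:
            out.append(text[pos:])      # no more comments: keep the rest
            break
        out.append(text[pos:start])     # keep text before the comment
        end = text.find(CLOSE, start)
        if end == -1:
            break                       # unterminated comment hides the rest
        pos = end + len(CLOSE)
    return "".join(out)

# "</" followed by a letter, i.e. what is called "end tag open" above.
ETAGO = re.compile(r"</[A-Za-z]")

def first_end_tag_open(text: str, start: int) -> int:
    """Index of the first '</' + letter at or after start, or -1 if none."""
    match = ETAGO.search(text, start)
    return match.start() if match else -1

sample = "<a><comment><b></a><a></b></comment></a>"
content_start = sample.find(OPEN) + len(OPEN)
cut = first_end_tag_open(sample, content_start)
print(strip_richtext_comments(sample))   # richtext view of the sample
print(sample[content_start:cut])         # comment content up to the first "</" + letter
```

With the sample from the paragraph above, the richtext rule hides
everything through </comment>, while the second function stops at the
</a> in the middle of the comment.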

    Then there's the <nl> thing, which in SGML would have been an
"entity reference" (reference to an alternate input source, in this
case an internal one, consisting of a line break), such as &nl;.  Matter of
fact, SGML already has a built-in "character reference" for just this
purpose!  It's &#RE;.  The final ; is not necessary unless the char
ref is followed by more text, and if a new-line follows the reference
instead, it is eaten.  This can be used to break lines transparently
by having an empty (or ignored) entity or character reference.  &#RS
(note the missing ; at the end!) is already ignored in SGML, and may
serve this purpose.
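
    As an illustration of the &nl; idea only, here is a small Python
sketch that expands entity references from a table.  The table, the
name pattern, and the function name are my assumptions, and the sketch
handles just the simple case of a name closed by ; (it does not
implement SGML's rules for omitting the ; or for eating a following
new-line).

```python
import re

# Hypothetical entity table: "nl" stands for a line break, as suggested above.
ENTITIES = {"nl": "\n"}

# An ampersand, a name starting with a letter, and a closing semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def expand_entities(text: str, entities: dict[str, str] = ENTITIES) -> str:
    """Replace each &name; with its replacement text; leave unknown names alone."""
    def substitute(match: re.Match) -> str:
        return entities.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(substitute, text)

print(expand_entities("first line&nl;second line"))
```

A larger table, for instance one covering the Mnemonic characters,
would plug into the same function unchanged.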

    Another problem, hinted at by Guido van Rossum some time ago, is
the use of purely formatting-oriented tags.  Some of these make a lot
of sense.  <italic> and <bold> come to mind.  <center> and the other
adjusting tags are not so good.  <paragraph> is very close to the SGML
idea of descriptive and generic markup, if not exactly it.  This makes
the list of formatting tags in richtext a hodge-podge of tags of
several kinds.  I think a clean-up of this would make the language
cleaner, but it's possible to use pure SGML syntax for them as it
stands today.

    By making the syntax compliant with the rules of SGML, and adding
our own rules which limit the use of the full syntax (which is
complex), we can make use of many tools used for SGML documents, and
we can make our news and mail messages have a longer life span.

    I'm also concerned with formatting which is not representable on a
given display device.  If it is used, you'd have to invent some translation
from richtext's idea of things to whatever is displayable.  If we used
slightly less formatting-oriented tags, this would be much less of a
problem than it is today.  (<smaller>, <bigger>, <center>, <flush*>,
<samepage>, and <np> are very printed-page-formatting-oriented.
<indent*> and <outdent*> less so, but still.  I don't think we should
limit ourselves to printed-page layout.  After all, people don't print
out their mail that often, and especially not just to be able to read
it.)

    The use of character set "tags" is also problematic, but we can
handle that in SGML, too.  No big problem.

    One added benefit of using `&' as the escape character for
Mnemonic, is that SGML also uses it for entity references.  An entity
set with all the mnemonic characters would be simple to make and use.

    I'm willing to take the time to suggest these changes in detail.
I think richtext will be the same language, with the same, if not
enhanced, expressive power, only with some changes to the specific
syntax and some of the semantics it uses.  SGML, in my mind, has
already solved the syntax problem quite nicely, and it's quite
possible to decide to use a smaller set of the syntax than SGML
affords.

    I already know that the richtext inventors don't like this.  What
about the rest of the Internet community, who might find that richtext
is "oh-so-close" to SGML, and still, they can't use their SGML tools
on it, can't search it with intelligent searchers, etc.?

    I would favor a separate RFC for richtext, if that came up
seriously.

Best regards,
</Erik>

--
Erik Naggum       |  +47-295-0313     |  ISO 8879 SGML     |  Memento,
Naggum Software   |                   |  DIS 10744 HyTime  |  terrigena.
Boks 1570, Vika   | <erik(_at_)naggum(_dot_)no>  |  JTC 1/SC 18/WG 8  |  Memento,
0118 OSLO, NORWAY | <enag(_at_)ifi(_dot_)uio(_dot_)no> |  SGML UG SIGhyper  |  vita brevis.