Richtext and SGML

Hi,

    I promised a report on richtext and SGML, and a draft version is
attached.  It is not complete.  I was unfortunately interrupted on
February 13th (a Thursday, not a Friday, although it should've been),
and I'm leaving for TechDoc '92 at Fort Lauderdale, FL, tomorrow
morning, so I don't have time to finish it.  I'll return on March 11.
Most probably, I'll be relieved of mail and news in this period. :-)

    The report, as it exists today, does not contain the DTD or the
specific suggestions, but an outline of the suggestions can be
intuited from the extant text.

    Feel free to mail me.  A vacation daemon should reply only once,
if at all. :-)

Regards,
</Erik>
--
Erik Naggum       |  +47-295-0313     |  ISO 8879 SGML     |  Memento,
Naggum Software   |                   |  DIS 10744 HyTime  |  terrigena.
Boks 1570, Vika   | <erik(_at_)naggum(_dot_)no>  |  JTC 1/SC 18/WG 8  |  Memento,
0118 OSLO, NORWAY | <enag(_at_)ifi(_dot_)uio(_dot_)no> |  SGML UG SIGhyper  |  vita brevis.

---------------------------------------------------------------------------
<!--ID PUBLIC "+//ISBN 82-7640//DOCUMENT Richtext and SGML/19920213//EN"-->
<!DOCTYPE opinion PUBLIC "+//ISBN 82-7640-000//DTD General Document//EN" 
                          "ftp.ifi.uio.no:/pub/SGML/DTD/mod-general"
[
  <!-- modifications to public DTD -->
  <!ENTITY % doctype "opinion">
  <!ENTITY % p.em.ph "cit|emph|term|keyw"
        -- cit, emph: italic; term: fixed-width; keyw: bold -->

  <!-- additional short reference mapping -->
  <!ENTITY aline STARTTAG "aline">
  <!SHORTREF addrmap ";" aline -- semi starts new aline -->
  <!USEMAP addrmap address>

  <!-- literal strings -->
  <!ENTITY lt  CDATA "<">
  <!ENTITY amp CDATA "&">
]>

<opinion status="draft" version="4">
<frontm>
<titlep>
<title>Richtext and SGML
<date>1992-02-13
<author>Erik Naggum
<address>Naggum Software;Box 1570 Vika;N-0118  OSLO;NORWAY;
erik(_at_)naggum(_dot_)no;+47-295-0313
</address>
<abstract>Abstract
<p>
This paper discusses differences and similarities between the
languages richtext (as specified in the Internet draft document
draft-ietf-822ext-messagebodies-03) and SGML (as specified in ISO
8879:1986).  Several issues concerning the semantics of the language
are included in this discussion, as they will also need to be addressed
in designing an SGML document type.  A summary of changes to richtext
necessary to make it SGML-compliant concludes the paper.
<!>
<body>
<h1 id=intro>Introduction
<p>
The Internet Engineering Task Force (IETF) Working Group on RFC 822
Extensions (822ext) has collectively prepared an Internet Draft,
<cit>MIME (Multipurpose Internet Mail Extensions)</cit> edited by
Nathaniel Borenstein and Ned Freed.  MIME contains provisions for an
"enriched" text format called richtext, as a subtype of the
<term>text</term> message body type.
<p>
The working group has decided that richtext be SGML compliant, but
significantly simpler than full SGML.  This has predictably produced a
simple markup language with a distinct SGML flavor, but the language
defined also departs from SGML in significant ways.  Although
several SGML-knowledgeable people participate in the working group, the
issue of compliance with SGML has not been given full consideration.
It is the objective of the present paper to provide this.
<p>
This paper is divided into several sections.  <hdref refid=markup>
lists differences in terms and concepts which serve to separate
richtext conceptually from SGML, specifically in terms of procedural
and generic markup.  <hdref refid=lex> deals with the lexical level of
the two languages, and details differences and similarities.  <hdref
refid=syntax> looks at the syntax of richtext and describes this in
terms of SGML.  <hdref refid=semant> compares the semantics of the
richtext formatting commands.  Finally, <hdref refid=suggest> concludes
the paper with specific recommendations for the draft, as well as
suggesting a possible document type definition (DTD) for richtext,
which aims at preserving the semantics of the language, as it is
discussed in <hdref refid=semant>.
<!>
<h1 id=markup>Procedural vs Descriptive Markup
<p>
A number of concepts are different between richtext and SGML.
Richtext is essentially a procedural markup language, in which the
markup consists of relatively simple and "hard-wired" instructions for
a formatter, with a clearly defined meaning and result.  SGML is a
descriptive (also known as generic or generalized, each with further
qualifications) markup language, the basic premise of which is that
the presentation of a given piece of information depends on its type
(about which the markup provides information), that all pieces of the
same type should be given the same presentation, and that for any
given type, <emph>some</emph> presentation must be chosen, but it can
be <emph>any</emph> presentation.  Generic markup languages exhibit
the quality that their markup is rigorous, and that there is a
structure to the document which it shares with other documents of the
same document type.  The task of a generic markup language comprises
the specification of the document type, the markup language used
in it, and the syntax for documents conforming to the type.
<p>
Specifically, richtext speaks of "formatting commands", which stem
from its procedural markup language heritage.  A "formatting command"
gives an indication of how the following text is to be formatted and
presented to the reader of a richtext message.  Richtext formatting
commands come in pairs, effectively an enabling and a disabling
command for each formatting instruction.  An enabling formatting
command has the general syntax, "<", command name, ">", and the
disabling formatting command the general syntax "</", command name,
">".
<p>
SGML has the notion of an element named and accessed by its generic
identifier.  An element in SGML consists of three parts, the
start-tag, the content, and the end-tag.  (There are also empty
elements, which only have start-tags.)  A start-tag has the general
syntax, "<", generic identifier, ">", and an end-tag the general
syntax, "</", generic identifier, ">".  Superficially, therefore, the
two languages have the same syntax.  However, in SGML, the tags serve
to delimit the content of an element, whereas in richtext they serve
primarily to turn on and off formatting or presentation functions.
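<p>
The shared surface syntax can be illustrated with a short Python
sketch.  It is illustrative only and is not taken from the draft or
from ISO 8879; the function name <term>tokenize</term> and the
assumption that a name is a run of letters, digits and hyphens are
introduced here for the example.
<![ CDATA [
```python
import re

# One pattern covers both forms: "<name>" and "</name>".
# Assumption for this sketch: a name is 1 to 40 letters, digits, or hyphens.
TAG = re.compile(r"<(/?)([A-Za-z0-9-]{1,40})>")


def tokenize(text):
    """Split text into ("start", name), ("end", name) and ("data", text) tokens."""
    tokens = []
    pos = 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            # Everything between two tags is plain data.
            tokens.append(("data", text[pos:match.start()]))
        kind = "end" if match.group(1) else "start"
        tokens.append((kind, match.group(2)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("data", text[pos:]))
    return tokens
```
]]>
The token stream is the same whichever language is being read; the
difference lies in what a consumer does with it (delimit content, or
switch a presentation function on and off).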
<p>
The difference between element and formatted text can also be found in
the description of their content, and specifically of nested elements.
Richtext specifies that the enabling and disabling formatting commands
must match, but makes an exception to this rule for one particular
command (<term>comment</term>).  SGML specifies that element content
can either be sub-elements or data (i.e., text), or both.  In the case
of a sub-element, it is wholly contained in the parent element, and
both the semantics and the syntax of the language prohibit the
opposite.
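<p>
The containment rule amounts to a stack discipline, which the
following Python sketch illustrates.  It is a simplification under
stated assumptions: the input is a sequence of ("start", name) and
("end", name) pairs, empty elements are not modelled, and the name
<term>is_properly_nested</term> is hypothetical.
<![ CDATA [
```python
def is_properly_nested(tags):
    """Return True if every end-tag closes the innermost open element.

    `tags` is a sequence of ("start", name) or ("end", name) pairs.
    """
    open_elements = []
    for kind, name in tags:
        if kind == "start":
            open_elements.append(name)
        elif not open_elements or open_elements.pop() != name:
            # An end-tag with nothing open, or one that closes the wrong element.
            return False
    # Anything still open at the end was never closed.
    return not open_elements
```
]]>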
<p>
Yet, with this important conceptual difference between the languages,
it is possible, indeed quite simple, to describe richtext in terms of
SGML.  A few compromises have to be made, primarily in the lowest
level (lexical) and the highest level (semantics), and they will be
detailed below.
<p>
For the purposes of simplifying the following discussion, I will use
the term "element" to denote both an SGML element and the tuple
(enabling formatting command, content, disabling formatting command)
in richtext.  The term "start-tag" will be used to refer to both
start-tags in SGML and enabling formatting commands in richtext.
Likewise, "end-tag" will refer to both SGML end-tags and disabling
formatting commands.  The term "tag" will then be used to denote
either start- or end-tags, as fits the context.
<p>
Another difference resulting from the procedural versus generic markup
model is that there is no difference between the function of a
formatting command which produces data on its own and one which only
modifies the presentation of its contents.  Richtext has the same
syntax and representation for these two, whereas SGML keeps them
separate.<![ IGNORE [Include a discussion of entities and entity
references?]]>
<!>
<h1 id=lex>Lexical Similarities and Differences
<p>
This section documents differences between the exact meaning and
lexical value of delimiters used in both richtext and SGML.
<!>
<h2>Delimiters
<p>
As mentioned above, both richtext and SGML use the same delimiters
around their "tags".  Much of the complexity of SGML lies in its
lexical rules, and the ways in which it recognizes markup in text.
Richtext will always treat "<" as the start of a formatting command,
regardless of what follows it.  In SGML, a "<" is not recognized as a
"start-tag open" unless it is followed by a character which could
start a generic identifier.  The primary reason for this is that human
readers of an SGML document will always be able to see from context
whether a "<" opens a start-tag or is meant to be the literal
less-than sign.
<p>
In richtext, therefore, a literal less-than sign will always have to
be encoded as a special formatting command which produces data (a
literal less-than sign!), e.g. as "&lt;lt>".  In SGML, this is markup
of a different breed, and it is represented as "&amp;lt;".  To make an
SGML parser ignore a "<", it is only necessary to let it precede a
character which can not follow it in valid SGML markup, e.g. a space
character.  Since "<" is always special in richtext, error handling
code must be written to survive stray "<"s in the text.
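<p>
The two recognition rules can be contrasted in a short Python sketch.
It is deliberately simplified and rests on assumptions: only
start-tags and end-tags are considered (other SGML delimiters that
begin with "<" are ignored), a name is assumed to start with an ASCII
letter, and all function names are hypothetical.
<![ CDATA [
```python
def is_name_start(ch):
    """True for a single ASCII letter, assumed here to be what may start a name."""
    return len(ch) == 1 and ch.isascii() and ch.isalpha()


def richtext_markup_positions(text):
    """Richtext rule: every "<" opens a formatting command."""
    return [i for i, ch in enumerate(text) if ch == "<"]


def sgml_markup_positions(text):
    """Simplified SGML rule: "<" or "</" counts only when a name can follow."""
    positions = []
    for i, ch in enumerate(text):
        if ch != "<":
            continue
        following = text[i + 1:i + 2]
        if following == "/":
            # For an end-tag, look at the character after the slash.
            following = text[i + 2:i + 3]
        if is_name_start(following):
            positions.append(i)
    return positions
```
]]>
Applied to a line containing a less-than sign followed by a space, the
first function reports it as markup and the second leaves it as data.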
<!>
<h2>Length of Tags
<p>
The maximum length of tags is another difference between richtext and
SGML.  Richtext specifies that the maximum length of formatting
commands is 40 characters, excluding the delimiters.  The minimum
maximum length of generic identifiers in SGML is eight (8) characters,
and although this is deemed by many to be ridiculously low, this is
the value specified in the standard.  A formatting command name that
uses any of the extra 32 characters allowed by richtext will be an
error if the same document is parsed in an SGML context which retains
that limit.
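<p>
Several of the command names discussed in this paper are already
longer than eight characters, as a small Python check shows.  The
sketch uses only the two limits quoted above; the function name and
the result strings are hypothetical.
<![ CDATA [
```python
RICHTEXT_MAX = 40  # maximum command length stated for richtext
SGML_NAMELEN = 8   # name length limit cited above for SGML


def length_status(name):
    """Classify a name against the two length limits."""
    if len(name) > RICHTEXT_MAX:
        return "too long for both"
    if len(name) > SGML_NAMELEN:
        return "richtext only"
    return "fits both"


# Names taken from the commands discussed in this paper.
for name in ("bold", "underline", "flushright", "indentright"):
    print(name, len(name), length_status(name))
```
]]>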
<!>
<h2>Miscellaneous Syntax of Generic Identifiers and Tags
<p>
In SGML, a generic identifier consists of an initial letter, followed
by alphanumerics, hyphens or periods.  Richtext appears to allow
digits and hyphens as the first character in the name of a formatting
command.  A formatting command starting with a hyphen or digit will
not be recognized as markup in an SGML parsing context, and will
remain data; this is not an error, but it still produces results
different from those intended.
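<p>
The name rule described above can be written as a single pattern.
The following Python sketch assumes ASCII letters and digits only; the
name <term>is_sgml_name</term> is hypothetical.
<![ CDATA [
```python
import re

# An initial letter, followed by letters, digits, hyphens or periods.
SGML_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def is_sgml_name(name):
    """True if `name` has the form described for a generic identifier."""
    return SGML_NAME.fullmatch(name) is not None
```
]]>
A name such as "no-op" passes, whereas one beginning with a digit or a
hyphen does not, and would be left as data by an SGML parser.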
<p>
SGML allows the "tag close" delimiter (">") to be preceded by white
space.  In richtext, this would be an error.  However, richtext parsed
in an SGML context would not be erroneous in this respect.
<p>
SGML allows attributes in start-tags, to function as information about
the element while not being part of its content.  The generic
identifier is separated from the attribute list by white space, as are
the individual items in the attribute list.  The attribute list
consists of attribute values from a list of tokens, or as pairs of
attribute name and value, separated by "=", optionally surrounded by
white space.  Should there be a white space, or a missing ">" in a
richtext formatting command, a large number of errors will ensue, as
the words that follow will be parsed as attribute values for
non-existent attributes.  (Both richtext and SGML will thus misbehave
in the presence of unintended white space in tags -- so this issue
will have to be given special attention in richtext conformance
clauses, with some rule about the responsibility of sending systems to
ascertain conformance of the document.)
<!>
<h2>Lines and Attendant Problems
<p>
SGML as a language does not know about lines, or "records", as they're
known in the standard, but does support their existence, and means to
recognize them.  A "record" in SGML parlance is delimited by a Record
Start (RS) and a Record End (RE) character.  These characters are
virtual, but are represented by ASCII Line Feed (LF, 0x0a, '\n', etc),
and Carriage Return (CR, 0x0d, '\r', etc), respectively.  With a
CRLF-delimited sequence of lines, this maps intuitively to line
breaks.  There are special problems associated with handling records,
and the most important solution to these problems in SGML is to ignore
a Record End if it comes right after a start-tag, or right before an
end-tag.  (This is a simplification of the complete set of rules, but
true under general conditions, such as when comparing richtext and
SGML.)
<p>
The Record Start is consistently ignored in SGML, unless some special
provisions are introduced to interpret it as markup (a description of
this feature is omitted for brevity), but the Record End is, when it
is not ignored, treated as data in its own right.  It is up to the
SGML application to decide whether it's to be treated as a space or as
a line break.  (Generally, certain elements, such as paragraphs, will
produce breaks and separating vertical space around their content,
which serves the purpose of a "hard" line break in word processors.)
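<p>
The simplified record rules given above (Record Start ignored; Record
End ignored right after a start-tag or right before an end-tag) can be
sketched in Python as follows.  This covers only the simplification
stated here, not the complete rules of the standard, and the names
used are hypothetical.
<![ CDATA [
```python
import re

RS = "\n"  # Record Start, represented by LF as described above
RE = "\r"  # Record End, represented by CR as described above

NAME = r"[A-Za-z][A-Za-z0-9.-]*"
AFTER_START_TAG = re.compile(r"(<" + NAME + r">)" + RE)
BEFORE_END_TAG = re.compile(RE + r"(</" + NAME + r">)")


def normalize_records(text):
    """Apply the simplified rules; any Record End left over is data."""
    text = text.replace(RS, "")              # Record Start is ignored
    text = AFTER_START_TAG.sub(r"\1", text)  # RE right after a start-tag
    text = BEFORE_END_TAG.sub(r"\1", text)   # RE right before an end-tag
    return text
```
]]>
Whether a surviving Record End is then shown as a space or as a line
break remains the decision of the application.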
<p>
In richtext, a CRLF is deemed to be an alternate representation of a
space at all levels of the markup, and there is no way to specify an
alternate behavior only in some elements.  This is what would be done
in an SGML application if the content was data.  The present draft is
unclear on whether CRLF's occurring immediately next to formatting
commands will be interpreted as space or ignored.  The specification
implies that a sequence of CRLFs will be converted to one space
character, regardless of its length and location.  In light of the
discussion on the list on interpreting N CRLFs in sequence as meaning
(N-1) "hard" CRLFs, this point may need to be revisited.
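<p>
The two readings can be compared in a short Python sketch.  The first
function follows the reading that any run of CRLFs becomes one space;
the second follows the reading discussed on the list.  Representing a
"hard" break by "&lt;nl>", and treating a single CRLF as a space in
the second reading, are assumptions made for the example, as are the
function names.
<![ CDATA [
```python
import re

CRLF_RUN = re.compile(r"(?:\r\n)+")


def collapse_crlf(text):
    """Reading 1: any run of CRLFs becomes a single space."""
    return CRLF_RUN.sub(" ", text)


def crlf_with_hard_breaks(text, hard_break="<nl>"):
    """Reading 2: N CRLFs in sequence mean N-1 hard breaks."""
    def replace(match):
        count = len(match.group(0)) // 2  # each CRLF is two characters
        if count == 1:
            return " "  # assumption: a lone CRLF is still a space
        return hard_break * (count - 1)
    return CRLF_RUN.sub(replace, text)
```
]]>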
<!>
<h1 id=semant>Semantics of Richtext Formatting Commands
<p>
Richtext uses a variety of formatting commands to achieve a number of
typographical functions, and there are several classes of the
semantics involved, without documentation in the draft.  The following
attempt at a description builds on examples published on the list, as
well as an intuitive understanding of the functions desired, either of
which may be wrong.
<p>
Although not precisely SGML issues, the following discussion stresses
the issues of semantics of the formatting commands that would need to
be formalized in an SGML document type definition.  This is thus only
serving to provide a basis from which to build an SGML application.
<!>
<h2>Bold, Italic
<p>
The contents of the <term>bold</term> (<term>italic</term>) element
will be displayed with a boldface (italic) font.  The function seems
to be a modifier of whatever is the <emph>current typeface</emph>, in
that this should be emboldened (italicized), as opposed to selecting a
predefined "boldface" ("italic") font to replace it.
<p>
Unlike the <term>smaller</term> and <term>bigger</term> elements,
neither <term>bold</term> nor <term>italic</term> seems to be additive,
that is, to make the text more bold or more italic.
<!>
<h2>Fixed
<p>
<term>Fixed</term> seems to be different from both <term>bold</term>
and <term>italic</term> in that it selects a different typeface than
the current one.  It is not clear whether the effects of
<term>bold</term> and <term>italic</term> are retained inside
<term>fixed</term>.
<p>
<term>Fixed</term> is obviously not nestable.
<!>
<h2>Smaller, Bigger
<p>
<term>Smaller</term> (<term>bigger</term>) seems, like
<term>bold</term> and <term>italic</term>, to modify the <emph>current
typeface</emph>, and thus only to reduce (increase) its point-size.
These commands seem to be additive, in that nested elements further
reduce (increase) the point-size of the current typeface.
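<p>
The distinction between additive and non-additive commands, as
understood here, can be sketched in Python.  This reflects the
interpretation given in this paper rather than anything stated in the
draft; the base size, the step, and the function name are hypothetical
values chosen for the example.
<![ CDATA [
```python
def presentation_state(open_elements, base_size=10, step=2):
    """Derive bold, italic and size from the list of currently open elements."""
    # Additive: each nested "bigger" or "smaller" changes the size again.
    nesting = open_elements.count("bigger") - open_elements.count("smaller")
    return {
        # Not additive: nesting "bold" twice is the same as once.
        "bold": "bold" in open_elements,
        "italic": "italic" in open_elements,
        "size": base_size + step * nesting,
    }
```
]]>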
<!>
<h2>Underline
<p>
This element type has not been used widely enough to warrant
conclusive comments, but the question remains whether
<term>underline</term> is additive, i.e. whether nested
<term>underline</term> elements produce additional lines under their
contents.
<p>
Underlining is usually used as a replacement for italics in the
absence of the latter, so this may be an issue of rendition for
different output devices, and the two may in practice be merged to one
for any given output device, unless it supports both.
<!>
<h2>Center, FlushLeft, FlushRight
<p>
These elements have intuitive functions as long as they are mutually
exclusive.  The current draft says nothing about nesting them.  If
this is attempted, several interesting things might happen, depending
on how they are implemented.  One possible interpretation is that
these commands imply a nested <term>paragraph</term> element.
<p>
It is not clear whether the contents of a flushright element inside a
flushleft element are justified to both margins, or whether one of them
takes precedence over the other.
<!>
<h2>Indent, IndentRight
<p>
The intuitive function of these commands is to set off the left
(right) margin, and it's also intuitive that both are additive, and
not mutually exclusive.  To indent from both left and right margin,
one would nest <term>indent</term> within <term>indentright</term>,
for example.  A paragraph or line with some text prior to the
<term>indent</term> command might be interpreted as a hanging
indent.
<!>
<h2>Outdent, OutdentRight
<p>
I don't understand what the intended purpose of these is, so I'll
defer comments on them until I do.  And although the names are
intuitively understandable next to "indent", they are not good names
for the opposite of "indent".
<!>
<h2>Samepage
<p>
This element can have many functions, but I haven't seen it used, so
the following is speculation.  One function may be to prevent figures,
examples, and tables from being split across pages.  Another function
may be to preempt the need for widow/orphan logic in the presentation
module.  (A "widow" is a single line from a paragraph separated from
it by a page break.  An "orphan" is a title for a headed paragraph
separated from the paragraph text by a page break.)  However, these
functions should be addressed individually.  In a generic markup
language, one would tend to have a concept of "headed section", of
which the title be an integral part.  This higher-level "unit" is lost
to procedural markup languages.  <term>Samepage</term> is not likely
to nest, but what content should be expected inside it is hard to
predict.
<!>
<h2>Subscript, Superscript
<p>
The presentation associated with these is obvious: a reduction in
point size, and a vertical movement of the baseline.  The uses for
sub- and superscripting are many, including chemical formulae, array
and matrix indices, footnote indicators, area and volume units, power,
and more.  Depending on the presentation medium, the current typeface
may have separate characters for some, if not all, superscripted
digits, but may have to do more work to present the full range of
super- and subscripted text.
<p>
There is presently no provision for footnotes in richtext, so these
elements may conceivably be used to present their indicators, and
<term>smaller</term> to mark up their text.  This may be a serious
problem, since footnotes on display terminals may be given radically
different presentations than would footnotes in printed form, even
though printed footnotes are occasionally also found in the margin of
the text.
<p>
For mathematical formulae, it is occasionally necessary to allow
superscripted superscripts.  It is therefore not reasonable to
disallow nesting of these elements altogether; nesting should be
permitted, but only of elements of the same type as the outermost
element.
<!>
<h2>Heading, Footing
<p>
Both of these are inherently page-oriented, and since I haven't seen
them used, I can only conjecture how they would be used.  For a
page-layout system, a page is divided into three or more parts,
including at the least heading, text area, and footing.  The
specification of the heading will have to occur before any of the text
in the text area, if it should apply to the first page, and a
<term>heading</term> found after the start of the text area would then
be construed to apply only to subsequent pages, if any.  (It might
therefore be reasonable to use a new <term>heading</term> after each
title of major sections, so as to let the title of a continued headed
section be displayed in the page heading.)  The same applies to
<term>footing</term>, although to a lesser degree.  The question of
different headings and footings for verso and recto pages is not
addressed.  Likewise, page numbering issues should be addressed.
<p>
The function of a <term>heading</term> or <term>footing</term> element
would be to store the content for later presentation as page breaks
occur.  Since some presentation devices will not be germane to paged
display, these elements should perhaps be suppressed in the minimal
richtext interpreter (Appendix D in the draft), rather than be
displayed where they are found, as this might interfere with the
running text.
<p>
Then there is the effect of the current typeface, as modified by
several other formatting commands, as well as issues of margins with
respect to indented text.  Headings and footings are traditionally
processed in a different environment than the running text, and with
separate line length and positioning directives.  There might
therefore be a need for a <term>reset</term> formatting command which
would reset all states to a system default for its contents, unless
the semantics of <term>heading</term> and <term>footing</term> already
subsume this function.
<p>
Obviously, neither <term>heading</term> nor <term>footing</term>
nest.
<!>
<h2>ISO-8859-X, US-ASCII
<p>
<term>ISO-8859-X</term> is actually a family of formatting commands,
valid for each defined value of <term>X</term>.  <term>US-ASCII</term>
and <term>ISO-8859-X</term> both modify the mapping from coded
character to glyph performed by the display module.  The
<term>ISO-8859</term> family selects mappings for characters with the
8th bit set, some of which are not mapped to any glyph for some ISO
8859 coded character sets.  The <term>US-ASCII</term> formatting
command specifies no mapping for any character with the 8th bit set,
and a reversion to the ISO 646:1991 IRV coded character set.  Whether
character set encoding is a good thing to do with formatting commands
is an open question.
<!>
<h2>Excerpt
<p>
<term>Excerpt</term> is specified in the draft to indicate that the
element content is excerpted from another source.  However, it is also
used for the references to such text, and for paraphrases, including
ellipses and contextual information in square brackets.  The idea
behind this formatting command is then not to facilitate automatic
referencing to the excerpted text, but to let the presentation of the
excerpted material stand out from the running text of the message
containing it.
<p>
As with the discussion of <term>heading</term> and
<term>footing</term>, <term>excerpt</term> may imply a
<term>reset</term> of all or some presentation parameters, and the
draft suggests an "alternate font", which, presumably, would be
untouched by the use of <term>bold</term> or <term>italic</term>
surrounding it.  Examples to the contrary have been given, so the
semantics of this element are a little fuzzy.
<p>
Other than this, this is clearly descriptive markup, and may in
particular be displayed as the display module sees fit, probably under
specific user control.
<!>
<h2>Paragraph
<p>
<term>Paragraph</term> is also clearly descriptive markup.  From the
description in the draft, the presentation is also entirely up to the
user.  The paragraph displayed will presumably be subject to
previously selected indentation, but the semantics of a paragraph
inside, e.g., a <term>bold</term> or <term>smaller</term> element are
not clear.  <term>Paragraph</term> naturally occurs within excerpted
material, but should be disallowed in headings and footings.  Other
considerations and constraints may also be placed on this element.
<p>
For user typing, the name may be a little on the verbose side.
Compared with the <term>nl</term> formatting command, which has a
short name because it's expected to be frequently used (I guess), the
choice of name may affect its adoption.  A short name, such as
<term>p</term>, may better serve it.
<!>
<h2>Signature
<p>
Clearly another item of descriptive markup, this element serves mainly
to delimit signatures from the rest of the message, and has no clear
presentational function.  Some readers may decide to suppress display
of signatures by default, and display them only upon request, for
which the typing inherent in calling something a "signature" will be
of great utility.
<!>
<h2>Comment
<p>
The <term>comment</term> formatting command is used to suppress
presentation of the content, and, as specified, also suppresses
recognition of markup in this content.  That is, further formatting
commands will not be recognized, with the exception of nested
<term>comment</term>s.  The general rule that formatting commands must
be balanced is set aside.  This means that one formatting command
works radically differently from the other commands, and this may be
confusing to users, as well as increasing the complexity of richtext
interpreters.  <term>Comment</term> is also covered in <hdref
refid=syntax>.
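<p>
The extra work this imposes on an interpreter can be seen in a Python
sketch of the behavior as described here: inside a
<term>comment</term>, only further <term>comment</term> commands are
recognized, and a depth counter decides when presentation resumes.
The handling of a stray disabling command outside any comment (kept as
it stands) and of an unterminated comment (suppressed to the end) are
assumptions, as is the function name.
<![ CDATA [
```python
import re

COMMENT_TAG = re.compile(r"</?comment>")


def strip_comments(text):
    """Remove comment elements, honoring nested comments only."""
    kept = []
    depth = 0
    pos = 0
    for match in COMMENT_TAG.finditer(text):
        if depth == 0:
            # Text before this tag lies outside any comment, so it is kept.
            kept.append(text[pos:match.start()])
        if match.group(0) == "<comment>":
            depth += 1
        elif depth > 0:
            depth -= 1
        else:
            # Assumption: a stray disabling command is left untouched.
            kept.append(match.group(0))
        pos = match.end()
    if depth == 0:
        kept.append(text[pos:])
    return "".join(kept)
```
]]>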
<!>
<h2>No-op
<p>
The purpose of this formatting command is abstract.  It serves mainly
to delimit a block of text without changing any presentation
parameters.  All unknown formatting commands (i.e., extensions) will
map to the <term>no-op</term> command.  <term>No-op</term> is also
covered in <hdref refid=syntax>.
<!>
<h2>lt
<p>
The sole purpose of <term>lt</term> is to provide a mechanism to allow
literal "<" to occur in the text.  <term>lt</term> produces a
less-than sign in the current typeface.  <term>lt</term> is an
<emph>empty element</emph> in SGML parlance, and is also covered in
<hdref refid=syntax>.
<!>
<h2>nl
<p>
Like <term>lt</term>, the purpose of <term>nl</term> is to insert a
fixed string in the data stream.  The <term>nl</term> formatting
command represents a local new-line, i.e. a return to the first column
(as modified by the <term>indent</term> command), and starting a new
output presentation line, or, as it is commonly called, a "hard"
line-break, the "soft" line-break being implied by reaching the right
margin upon displaying the content of elements.  The <term>nl</term>
is an <emph>empty element</emph> in SGML parlance and is also covered
in <hdref refid=syntax>.
<!>
<h2>np
<p>
This formatting command is like <term>nl</term>, in that it introduces
a "hard" page-break to supplement the "soft" page-breaks which occur
when the presentation page is filled.  Presumably, the semantics of
the <term>np</term> command (and of "soft" page-breaks) is that the
contents of <term>footing</term> will be displayed, a new page
started, and contents of the most recent <term>heading</term> will be
displayed upon reaching an element which is not <term>heading</term>
(to allow for a <term>heading</term> immediately after the
<term>np</term>), or data.  The <term>np</term> formatting command
would intuitively be independent from the elements in which it occurs,
and could theoretically occur anywhere.  This may or may not be
desirable.  <term>np</term> is an <emph>empty element</emph> in SGML
parlance, and is also covered in <hdref refid=syntax>.
<!>
<![ IGNORE [To be continued.]]>
</body>
</opinion>

<!-- GNU Emacs support
Local variables:
mode: text
eval: (setq paragraph-separate
        (concat (setq paragraph-start "<\\(p\\|h[0-9]\\|!\\)>") "\n"))
eval: (load-file "typing-help.el")
End:
-->