Re: RT nested semantics

Dana S Emery <de19(_at_)umail(_dot_)umd(_dot_)edu> wrote:
|   
|   I leave the more complex commands to a later discourse, and remind
|   you of Erik Naggum's previous submission 'RichText and SGML', which
|   addresses the question of SGML interpretation of RT documents.
|   (thank you for the copy Erik).

I'm attaching a rudimentarily formatted version of the same report, for
the benefit of newcomers, and to refresh the memory of those who have
seen it before.  I haven't had time (or interest) to update it.  Maybe
if we get around to talking seriously about richtext, I'll continue
what I began.

BTW, I consider the richtext proponents' comments that "All x, where x
is any conceivable criticism of a particular richtext feature, is
terribly unimportant" a contradiction of the assertion that "richtext is
too important to be left out of MIME".

Best regards,
</Erik>












                           Richtext and SGML





                              Erik Naggum







                                ABSTRACT

      This paper discusses differences and similarities between
      the languages richtext (as specified in the Internet draft
      document draft-ietf-822ext-messagebodies-03) and SGML (as
      specified in ISO 8879:1986).  Several issues concerning the
      semantics of the language are included in this discussion, as
      they will also need to be addressed in designing an SGML
      document type.  A summary of changes to richtext necessary
      to make it SGML-compliant concludes the paper.

























DRAFT version 4.                                            1992-02-13

1   INTRODUCTION

    The Internet Engineering Task Force (IETF) Working Group on RFC
    822 Extensions (822ext) has collectively prepared an Internet
    Draft, {MIME (Multipurpose Internet Mail Extensions)} edited by
    Nathaniel Borenstein and Ned Freed.  MIME contains provisions
    for an "enriched" text format, called richtext, as a subtype of
    the |text| message body type.

    The working group has decided that richtext be SGML compliant,
    but significantly simpler than full SGML.  This has predictably
    produced a simple markup language with a distinct SGML flavor,
    but the language as defined also departs from SGML in significant
    ways.  Although several SGML-knowledgeable people participate in
    the working group, the issue of compliance with SGML has not
    been given full consideration.  It is the objective of the
    present paper to provide this.

    This paper is divided into several sections.  Section 2 lists
    differences in terms and concepts which serve to separate
    richtext conceptually from SGML, specifically in terms of
    procedural and generic markup.  Section 3 deals with the lexical
    level of the two languages, and details differences and
    similarities.  Section 4 compares the semantics of the richtext
    formatting commands.  Finally, Section 5 concludes the paper
    with specific recommendations for the draft, as well as
    suggesting a possible document type definition (DTD) for
    richtext, which aims at preserving the semantics of the
    language, as it is discussed in Section 4.



2   PROCEDURAL vs DESCRIPTIVE MARKUP

    A number of concepts are different between richtext and SGML.
    Richtext is essentially a procedural markup language, in which
    the markup consists of relatively simple and "hard-wired"
    instructions for a formatter, with a clearly defined meaning and
    result.  SGML is a descriptive (also known as generic or
    generalized, each with further qualifications) markup language,
    the basic premise of which is that the presentation of a given
    piece of information depends on its type (about which the markup
    provides information), that all pieces of the same type should
    be given the same presentation, and that for any given type,
    _some_ presentation must be chosen, but it can be _any_
    presentation.  Generic markup languages exhibit the quality that
    their markup is rigorous, and that there is a structure to the
    document which it shares with other documents of the same
    document type.  The task of a generic markup language comprises
    the specification of the document type, the markup language
    used in it, and the syntax for documents conforming to the type.




    Specifically, richtext speaks of "formatting commands", which
    stem from its procedural markup language heritage.  A
    "formatting command" gives an indication of how the following
    text is to be formatted and presented to the reader of a
    richtext message.  Richtext formatting commands come in pairs,
    effectively an enabling and a disabling command for each
    formatting instruction.  An enabling formatting command has the
    general syntax, "<", command name, ">", and the disabling
    formatting command the general syntax "</", command name, ">".

    SGML has the notion of an element named and accessed by its
    generic identifier.  An element in SGML consists of three parts,
    the start-tag, the content, and the end-tag.  (There are also
    empty elements, which only have start-tags.)  A start-tag has
    the general syntax, "<", generic identifier, ">", and an end tag
    the general syntax, "</", generic identifier, ">".
    Superficially, therefore, the two languages have the same
    syntax.  However, in SGML, the tags serve to delimit the content
    of an element, whereas in richtext they serve primarily to turn
    on and off formatting or presentation functions.

    The difference between element and formatted text can also be
    found in the description of their content, and specifically of
    nested elements.  Richtext specifies that the enabling and
    disabling formatting commands must match, but makes an exception
    to this rule for one particular command (|comment|).  SGML
    specifies that element content can either be sub-elements or
    data (i.e., text), or both.  In the case of a sub-element, it is
    wholly contained in the parent element, and both the semantics
    and the syntax of the language prohibit the opposite.
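
    As an illustration of the containment rule, the following Python
    sketch checks with a stack that every sub-element is wholly
    contained in its parent.  The function name check_nesting, the
    tag pattern, and the set of commands treated as empty are
    assumptions made for this example, not taken from the draft or
    from the standard, and the |comment| exception is deliberately
    left out.

```python
import re

# Hypothetical tag pattern: "<name>" or "</name>", no attributes.
TAG = re.compile(r"<(/?)([A-Za-z0-9-]+)>")

# Commands assumed, for this sketch, to produce data on their own
# and therefore to take no end-tag.
EMPTY = {"lt", "nl", "np"}


def check_nesting(text):
    """Return True if every start-tag is closed by a matching end-tag
    and each sub-element lies wholly inside its parent."""
    stack = []
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2)
        if name in EMPTY:
            continue
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack
```

    Under these assumptions, "<bold>a <italic>b</italic></bold>" is
    accepted, while the overlapping "<bold>a <italic>b</bold></italic>"
    is rejected.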

    Yet, with this important conceptual difference between the
    languages, it is possible, indeed quite simple, to describe
    richtext in terms of SGML.  A few compromises have to be made,
    primarily in the lowest level (lexical) and the highest level
    (semantics), and they will be detailed below.

    For the purposes of simplifying the following discussion, I will
    use the term "element" to denote both an SGML element and the
    tuple (enabling formatting command, content, disabling
    formatting command) in richtext.  The term "start-tag" will be
    used to refer to both start-tags in SGML and enabling formatting
    commands in richtext.  Likewise, "end-tag" will refer to both
    SGML end-tags and disabling formatting commands.  The term "tag"
    will then be used to denote either start- or end-tags, as fits
    the context.

    Another difference resulting from the procedural versus generic
    markup model is that, in richtext, there is no difference
    between the function of a formatting command which produces data
    on its own and one which only modifies the presentation of its
    contents.  Richtext has the same syntax and representation for
    these two, whereas SGML keeps them separate.


3   LEXICAL SIMILARITIES AND DIFFERENCES

    This section documents differences between the exact meaning and
    lexical value of delimiters used in both richtext and SGML.


3.1 Delimiters

    As mentioned above, both richtext and SGML use the same
    delimiters around their "tags".  Much of the complexity of SGML
    lies in its lexical rules, and the ways in which it recognizes
    markup in text.  Richtext will always treat "<" as the start of
    a formatting command, regardless of what follows it.  In SGML, a
    "<" is not recognized as a "start-tag open" unless it is
    followed by a character which could start a generic identifier.
    The primary reason for this is that human readers of an SGML
    document will always be able to see from context whether a "<"
    opens a start-tag or is meant to be the literal less-than sign.

    In richtext, therefore, a literal less-than sign will always
    have to be encoded as a special formatting command which
    produces data (a literal less-than sign!), e.g. as "<lt>".  In
    SGML, this is markup of a different breed, and it is represented
    as "&lt;".  To make an SGML parser ignore a "<", it is only
    necessary to let it precede a character which cannot follow it
    in valid SGML markup, e.g. a space character.  Since "<" is
    always special in richtext, error handling code must be written
    to survive stray "<"s in the text.
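
    The difference in delimiter recognition can be sketched in
    Python as follows.  Both function names are hypothetical, and
    the SGML rule is simplified to start-tags and end-tags only
    (markup declarations and processing instructions, which also
    begin with "<", are ignored here).

```python
import string

# Characters assumed to be able to start a generic identifier.
NAME_START = set(string.ascii_letters)


def richtext_markup_starts(text):
    """Richtext rule: every "<" opens a formatting command."""
    return [i for i, ch in enumerate(text) if ch == "<"]


def sgml_markup_starts(text):
    """Simplified SGML rule: "<" (or "</") is markup only when a
    name start character follows; otherwise it remains data."""
    starts = []
    for i, ch in enumerate(text):
        if ch != "<":
            continue
        j = i + 2 if text.startswith("</", i) else i + 1
        if j < len(text) and text[j] in NAME_START:
            starts.append(i)
    return starts
```

    For the text "if a < b then <bold>yes</bold>", the richtext rule
    reports three markup positions, including the stray less-than
    sign, whereas the simplified SGML rule reports only the two
    tags.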


3.2 Length of Tags

    The maximum length of tags is another difference between
    richtext and SGML.  Richtext specifies that the maximum length
    of formatting commands is 40 characters, excluding the
    delimiters.  The minimum maximum length of generic identifiers
    in SGML is eight (8) characters, and although this is deemed by
    many to be ridiculously low, this is the value specified in the
    standard.  The extra 32 characters allowed by richtext will be
    an error if the same document is parsed in an SGML context.


3.3 Miscellaneous Syntax of Generic Identifiers and Tags

    In SGML, a generic identifier consists of an initial letter,
    followed by alphanumerics, hyphens or periods.  Richtext appears
    to allow digits and hyphen as the first character in the name of
    a formatting command.  A formatting command starting with a
    hyphen or digit will not be recognized as markup in an SGML
    parsing context, and will remain data; thus not being an error,
    but still causing results different from the intended.
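
    The name rules of Sections 3.2 and 3.3 can be compared with a
    short Python sketch.  The SGML pattern follows the description
    above (initial letter, then letters, digits, hyphens or periods,
    eight characters in all).  The richtext pattern is an assumption
    for illustration only: letters, digits and hyphens in any
    position, up to 40 characters.  The function name classify is
    likewise hypothetical.

```python
import re

# SGML generic identifier as described above: an initial letter, then
# letters, digits, hyphens or periods, at most 8 characters in all.
SGML_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,7}")

# Assumed richtext rule for this sketch: letters, digits and hyphens
# in any position, at most 40 characters.
RICHTEXT_NAME = re.compile(r"[A-Za-z0-9-]{1,40}")


def classify(name):
    """Report whether a command name is acceptable to each language."""
    return {
        "richtext": RICHTEXT_NAME.fullmatch(name) is not None,
        "sgml": SGML_NAME.fullmatch(name) is not None,
    }
```

    With these patterns, |bold| and |no-op| pass both tests, while
    |indentright| and |iso-8859-1| are valid only under the assumed
    richtext rule because they exceed eight characters.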




    SGML allows the "tag close" delimiter (">") to be preceded by
    white space.  In richtext, this would be an error.  However,
    richtext parsed in an SGML context would not be erroneous in
    this respect.

    SGML allows attributes in start-tags, to function as information
    about the element while not being part of its content.  The
    generic identifier is separated from the attribute list by white
    space, as are the individual items in the attribute list.  The
    attribute list consists of attribute values from a list of
    tokens, or as pairs of attribute name and value, separated by
    "=", optionally surrounded by white space.  Should there be a
    white space, or a missing ">", in a richtext formatting command,
    a large number of errors will ensue, as the parser will attempt
    to parse subsequent words as attribute values for non-existent
    attributes.  (Both richtext and SGML will thus misbehave in the
    presence of unintended white space in tags -- so this issue will
    have to be given special attention in richtext conformance
    clauses, with some rule about the responsibility of sending
    systems to ascertain conformance of the document.)


3.4 Lines and Attendant Problems

    SGML as a language does not know about lines, or "records", as
    they're known in the standard, but does support their existence
    and provides means to recognize them.  A "record" in SGML
    parlance is delimited by a Record Start (RS) and a Record End
    (RE) character.  These are virtual, but are represented by
    ASCII Line Feed (LF, 0x0a, '\n', etc), and Carriage Return (CR,
    0x0d, '\r', etc), respectively.  With a CRLF-delimited sequence
    of lines, this maps intuitively to line breaks.  There are
    special problems associated with handling records, and the most
    important solution to these problems in SGML is to ignore a
    Record End if it comes right after a start-tag, or right before
    an end-tag.  (This is a simplification of the complete set of
    rules, but true under general conditions, such as when comparing
    richtext and SGML.)

    The Record Start is consistently ignored in SGML, unless some
    special provisions are introduced to interpret it as markup (a
    description of this feature is omitted for brevity), but the
    Record End is, when it is not ignored, treated as data in its
    own right.  It is up to the SGML application to decide whether
    it's to be treated as a space or as a line break.  (Generally,
    certain elements, such as paragraphs, will produce breaks and
    separating vertical space around their content, which serves the
    purpose of a "hard" line break in word processors.)

    In richtext, a CRLF is deemed to be an alternate representation
    of a space at all levels of the markup, and there is no way to
    specify an alternate behavior only in some elements.  This is
    what would be done in an SGML application if the content was
    data.  The present draft is unclear on whether CRLFs occurring
    immediately next to formatting commands will be interpreted as
    space or ignored.  The specification implies that a sequence of
    CRLFs will be converted to one space character, regardless of
    its length and location.  In light of the discussion on the list
    on interpreting N CRLFs in sequence as meaning (N-1) "hard"
    CRLFs, this point may need to be revisited.
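
    The two treatments of line ends can be contrasted in a Python
    sketch.  The function names are hypothetical.  The first
    function encodes only the reading of the draft given above (any
    run of CRLFs becomes one space), and the second encodes only the
    simplified SGML rule from this section, not the complete set of
    rules in the standard.

```python
import re


def richtext_lines(text):
    """Reading of the draft discussed above: any run of CRLFs
    becomes a single space, wherever it occurs."""
    return re.sub(r"(?:\r\n)+", " ", text)


def sgml_records(text):
    """Simplified SGML record handling: drop every Record Start (LF),
    drop a Record End (CR) right after a start-tag or right before an
    end-tag, and keep any other Record End as data."""
    text = text.replace("\n", "")
    # Record End immediately after a start-tag is ignored.
    text = re.sub(r"(<[A-Za-z][^<>]*>)\r", r"\1", text)
    # Record End immediately before an end-tag is ignored.
    text = re.sub(r"\r(</[A-Za-z][^<>]*>)", r"\1", text)
    return text
```

    Given "<bold>", "Hello", "world" and "</bold>" on four
    CRLF-terminated lines, the first function yields a space at
    every line end, including those next to the tags, while the
    second keeps a Record End only between "Hello" and "world" and
    after the end-tag.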



4   SEMANTICS OF RICHTEXT FORMATTING COMMANDS

    Richtext uses a variety of formatting commands to achieve a
    number of typographical functions, and several classes of
    semantics are involved, none of them documented in the draft.
    The following attempt at a description builds on examples
    published on the list, as well as an intuitive understanding of
    the functions desired, either of which may be wrong.

    Although not precisely SGML issues, the following discussion
    stresses the issues of semantics of the formatting commands that
    would need to be formalized in an SGML document type definition.
    It thus serves only to provide a basis from which to build an
    SGML application.


4.1 Bold, Italic

    The contents of the |bold| (|italic|) element will be displayed
    with a boldface (italic) font.  The function seems to be a
    modifier of whatever is the _current typeface_, in that this
    should be emboldened (italicized), as opposed to selecting a
    predefined "boldface" ("italic") font to replace it.

    Unlike the |smaller| and |bigger| elements, neither |bold| nor
    |italic| seems to be additive, that is, to make the text more
    bold or more italic.


4.2 Fixed

    |Fixed| seems to be different from both |bold| and |italic| in
    that it selects a different typeface than the current.  It is
    not clear whether the effects of |bold| and |italic| are
    retained inside |fixed|.

    |Fixed| is obviously not nestable.


4.3 Smaller, Bigger

    |Smaller| (|bigger|) seems, like |bold| and |italic|, to modify
    the _current typeface_, and thus only to reduce (increase) its
    point-size.  These commands seem to be additive, in that nested
    elements further reduce (increase) the point-size of the current
    typeface.
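
    The distinction between additive and non-additive commands can
    be modelled in Python as a small state record.  The class and
    function names are hypothetical, and counting one "step" per
    nesting level is an assumption; the draft does not say by how
    much each level changes the point-size.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Typeface:
    bold: bool = False
    italic: bool = False
    size_steps: int = 0   # net count of |bigger| minus |smaller|


def enter(face, command):
    """Return the typeface in effect inside a nested element."""
    if command == "bold":
        return replace(face, bold=True)      # not additive
    if command == "italic":
        return replace(face, italic=True)    # not additive
    if command == "bigger":
        return replace(face, size_steps=face.size_steps + 1)
    if command == "smaller":
        return replace(face, size_steps=face.size_steps - 1)
    return face                              # other commands: no change here
```

    Entering |bold| twice leaves the typeface exactly as bold as
    entering it once, whereas entering |smaller| twice yields a
    size two steps below the starting point.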


4.4 Underline

    This element type has not been used widely enough to warrant
    conclusive comments, but the question remains whether
    |underline| is additive, i.e. that nested |underline| elements
    produce additional lines under their contents.

    Underlining is usually used as a replacement for italics in the
    absence of the latter, so this may be an issue of rendition for
    different output devices, and the two may in practice be merged
    to one for any given output device, unless it supports both.


4.5 Center, FlushLeft, FlushRight

    These elements have intuitive functions as long as they are
    mutually exclusive.  The current draft says nothing about
    nesting them.  If this is attempted, several interesting things
    might happen, depending on how they are implemented.  One
    possible interpretation is that these commands imply a nested
    |paragraph| element.

    It is not clear whether the contents of a flushright element
    inside a flushleft are justified to both margins, or whether one
    of them takes precedence over the other.


4.6 Indent, IndentRight

    The intuitive function of these commands is to set off the left
    (right) margin, and it's also intuitive that both are additive,
    and not mutually exclusive.  To indent from both left and right
    margin, one would nest |indent| within |indentright|, for
    example.  A paragraph or line with some text prior to the
    |indent| command might be interpreted as a hanging indent.


4.7 Outdent, OutdentRight

    I don't understand what the intended purpose of these is, so
    I'll defer comments on them until I do.  And although the names
    are intuitively understandable next to "indent", they are not
    good names for the opposite of "indent".


4.8 Samepage

    This element can have many functions, but I haven't seen it
    used, so the following is speculation.  One function may be to
    prevent figures, examples, and tables from being split across
    pages.  Another function may be to preempt the need for
    widow/orphan logic in the presentation module.  (A "widow" is a
    single line from a paragraph separated from it by a page break.
    An "orphan" is a title for a headed paragraph separated from the
    paragraph text by a page break.)  However, these functions
    should be addressed individually.  In a generic markup language,
    one would tend to have a concept of "headed section", of which
    the title be an integral part.  This higher-level "unit" is lost
    to procedural markup languages.  |Samepage| is not likely to
    nest, but what content should be expected inside it is hard to
    predict.


4.9 Subscript, Superscript

    The presentation associated with these is obvious: a reduction
    in point size, and a vertical movement of the baseline.  The
    uses for sub- and superscripting are many, including chemical
    formulae, array and matrix indices, footnote indicators, area
    and volume units, power, and more.  Depending on the
    presentation medium, the current typeface may have separate
    characters for some, if not all, superscripted digits, but may
    have to do more work to present the full range of super- and
    subscripted text.

    There is presently no provision for footnotes in richtext, so
    these elements may conceivably be used to present their
    indicators, and |smaller| to mark up their text.  This may be a
    serious problem, since footnotes on display terminals may be
    given radically different presentations than would footnotes in
    printed form, even though printed footnotes are occasionally
    also found in the margin of the text.

    For mathematical formulae, it is occasionally necessary to allow
    superscripted superscripts.  It is therefore not reasonable to
    disallow nesting of these elements, but only of the type of the
    outermost element.


4.10 Heading, Footing

    Both of these are inherently page-oriented, and since I haven't
    seen them used, I can only conjecture how they would be used.
    For a page-layout system, a page is divided into three or more
    parts, including at the least heading, text area, and footing.
    The specification of the heading will have to occur before any
    of the text in the text area, if it should apply to the first
    page, and a |heading| found after the start of the text area
    would then be construed to apply only to subsequent pages, if
    any.  (It might therefore be reasonable to use a new |heading|
    after each title of major sections, so as to let the title of a
    continued headed section be displayed in the page heading.)  The
    same applies to |footing|, although to a lesser degree.  The
    question of different headings and footings for verso and recto
    pages is not addressed.  Likewise, page numbering issues should
    be addressed.

    The function of a |heading| or |footing| element would be to
    store the content for later presentation as page breaks occur.
    Since some presentation devices will not be germane to paged
    display, these elements should perhaps be suppressed in the
    minimal richtext interpreter (Appendix D in the draft), rather
    than be displayed where they are found, as this might interfere
    with the running text.

    Then there is the effect of the current typeface, as modified by
    several other formatting commands, as well as issues of margins
    with respect to indented text.  Headings and footings are
    traditionally processed in a different environment than the
    running text, and with separate line length and positioning
    directives.  There might therefore be a need for a |reset|
    formatting command which would reset all states to a system
    default for its contents, unless the semantics of |heading| and
    |footing| already subsume this function.

    Obviously, neither |heading| nor |footing| nest.


4.11 ISO-8859-X, US-ASCII

    |ISO-8859-X| is actually a family of formatting commands, valid
    for each defined value of |X|.  |US-ASCII| and |ISO-8859-X| both
    modify the mapping from coded character to glyph performed by
    the display module.  The |ISO-8859| family selects mappings for
    characters with the 8th bit set, some of which are not mapped to
    any glyph for some ISO 8859 coded character sets.  The
    |US-ASCII| formatting command specifies no mapping for any
    character with the 8th bit set, and a reversal to the ISO
    646:1991 IRV coded character set.  Whether character set
    encoding is a good thing to do with formatting commands is an
    open question.
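
    The effect of these commands on the character-to-glyph mapping
    can be imitated in Python, with the caveat that Python's
    built-in codecs merely stand in for the display module's mapping
    tables here.  The function name to_glyphs is hypothetical, and
    rendering an unmapped octet as U+FFFD is a choice made for this
    sketch, not something the draft prescribes.

```python
def to_glyphs(octets, charset):
    """Map coded characters to printable characters under the named
    character set; octets without a mapping become U+FFFD."""
    return octets.decode(charset, errors="replace")
```

    The octet 0xE5 is mapped to a letter under "iso-8859-1", but
    has no mapping under "us-ascii" because its 8th bit is set.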


4.12 Excerpt

    |Excerpt| is specified in the draft to indicate that the element
    content is excerpted from another source.  However, it is also
    used for the references to such text, and for paraphrases,
    including ellipses and contextual information in square
    brackets.  The idea behind this formatting command is then not
    to facilitate automatic referencing to the excerpted text, but
    to let the presentation of the excerpted material stand out from
    the running text of the message containing it.



    As with the discussion of |heading| and |footing|, |excerpt| may
    imply a |reset| of all or some presentation parameters, and the
    draft suggests an "alternate font", which, presumably, would be
    untouched by the use of |bold| or |italic| surrounding it.
    Examples to the contrary have been given, so the semantics of
    this element are a little fuzzy.

    Other than this, this is clearly descriptive markup, and may in
    particular be displayed as the display module sees fit, probably
    under specific user control.


4.13 Paragraph

    |Paragraph| is also clearly descriptive markup.  From the
    description in the draft, the presentation is also entirely up
    to the user.  The paragraph displayed will presumably be subject
    to previously selected indentation, but the semantics of a
    paragraph inside, e.g., a |bold| or |smaller| element are not
    clear.  |Paragraph| naturally occurs within excerpted material,
    but should be disallowed from headings and footings.  Other
    considerations and constraints may also be placed on this
    element.

    For user typing, the name may be a little on the verbose side.
    Compared with the |nl| formatting command, which has a short
    name because it's expected to be frequently used (I guess), the
    choice of name may affect its adoption.  A short name, such as
    |p|, may better serve it.


4.14 Signature

    Clearly another item of descriptive markup, this element serves
    mainly to delimit signatures from the rest of the message, and
    has no clear presentational function.  Some readers may decide
    to suppress display of signatures by default, and display them
    only upon request, for which the typing inherent in calling
    something a "signature" will be of great utility.


4.15 Comment

    The |comment| formatting command is used to suppress
    presentation of the content, and, as specified, also suppresses
    recognition of markup in this content.  That is, further
    formatting commands will not be recognized, with the exception
    of nested |comment|s.  The general rule that formatting commands
    must be balanced is set aside.  This means that one formatting
    command works radically differently from the other commands, and
    this may be confusing to users, as well as increasing the
    complexity of richtext interpreters.
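
    The special handling that |comment| requires can be sketched in
    Python with a depth counter.  The function name strip_comments
    is hypothetical, the tag is matched in lower case only, and
    dropping a stray end-tag or an unterminated comment silently is
    an assumption of this sketch, not behavior specified by the
    draft.

```python
import re

COMMENT_TAG = re.compile(r"<(/?)comment>")


def strip_comments(text):
    """Remove |comment| elements, honoring nested |comment|s and
    treating every other tag inside a comment as suppressed text."""
    out, depth, pos = [], 0, 0
    for match in COMMENT_TAG.finditer(text):
        if depth == 0:
            # Text before this tag lies outside any comment: keep it.
            out.append(text[pos:match.start()])
        if match.group(1):
            depth = max(depth - 1, 0)
        else:
            depth += 1
        pos = match.end()
    if depth == 0:
        out.append(text[pos:])
    return "".join(out)
```

    Note that an unbalanced "<bold>" inside a comment does not
    disturb the result, which is precisely where |comment| departs
    from the balancing rule that applies to every other command.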


4.16 No-op

    The purpose of this formatting command is abstract.  It serves
    mainly to delimit a block of text without changing any
    presentation parameters.  All unknown formatting commands (i.e.,
    extensions) will map to the |no-op| command.


4.17 lt

    The sole purpose of |lt| is to provide a mechanism to allow
    literal "<" to occur in the text.  |lt| produces a less-than
    sign in the current typeface.  |lt| is an _empty element_ in
    SGML parlance.


4.18 nl

    Like |lt|, the purpose of |nl| is to insert a fixed string in
    the data stream.  The |nl| formatting command represents a local
    new-line, i.e. a return to the first column (as modified by the
    |indent| command), and starting a new output presentation line,
    or, as it is commonly called, a "hard" line-break, the "soft"
    line-break being implied by reaching the right margin upon
    displaying the content of elements.  The |nl| is an _empty
    element_ in SGML parlance.


4.19 np

    This formatting command is like |nl|, in that it introduces a
    "hard" page-break to supplement the "soft" page-breaks which
    occur when the presentation page is filled.  Presumably, the
    semantics of the |np| command (and of "soft" page-breaks) is
    that the contents of |footing| will be displayed, a new page
    started, and contents of the most recent |heading| will be
    displayed upon reaching an element which is not |heading| (to
    allow for a |heading| immediately after the |np|), or data.  The
    |np| formatting command would intuitively be independent from
    the elements in which it occurs, and could theoretically occur
    anywhere.  This may or may not be desirable.  |np| is an _empty
    element_ in SGML parlance.




    Author's address

    Erik Naggum
    Naggum Software
    Boks 1570 Vika
    0118  OSLO
    NORWAY

