Dana S Emery <de19(_at_)umail(_dot_)umd(_dot_)edu> wrote:
|
| I leave the more complex commands to a later discourse, and remind
| you of Erik Naggum's previous submission 'RichText and SGML', which
| addresses the question of SGML interpretation of RT documents.
| (thank you for the copy Erik).
I'm attaching a rudimentarily formatted version of the same report, for
the benefit of newcomers, and to refresh the memory of those who have
seen it before. I haven't had time (or interest) to update it. Maybe
if we get around to talking seriously about richtext, I'll continue what I
began.
BTW, I consider the richtext proponents' comments that "All x, where x
is any conceivable criticism of a particular richtext feature, is
terribly unimportant" a contradiction of the assertion that "richtext is
too important to be left out of MIME".
Best regards,
</Erik>
Richtext and SGML
Erik Naggum
ABSTRACT
This paper discusses differences and similarities between
the languages richtext (as specified in the Internet draft
document draft-ietf-822ext-messagebodies-03) and SGML (as
specified in ISO 8879:1986). Several issues concerning the
semantics of the language are included in this discussion, as
they will also need to be addressed in designing an SGML
document type. A summary of changes to richtext necessary
to make it SGML-compliant concludes the paper.
DRAFT version 4. 1992-02-13
1 INTRODUCTION
The Internet Engineering Task Force (IETF) Working Group on RFC
822 Extensions (822ext) has collectively prepared an Internet
Draft, {MIME (Multipurpose Internet Mail Extensions)} edited by
Nathaniel Borenstein and Ned Freed. MIME contains provisions
for an "enriched" text format, called richtext, as a subtype of
the |text| message body type.
The working group has decided that richtext be SGML compliant,
but significantly simpler than full SGML. This has predictably
produced a simple markup language with a distinct SGML flavor, but
the language defined also departs from SGML in significant
ways. Although several SGML-knowledgeable people participate in
the working group, the issue of compliance with SGML has not
been given full consideration. It is the objective of the
present paper to provide this.
This paper is divided into several sections. Section 2 lists
differences in terms and concepts which serve to separate
richtext conceptually from SGML, specifically in terms of
procedural and generic markup. Section 3 deals with the lexical
level of the two languages, and details differences and
similarities. Section 4 compares the semantics of the richtext
formatting commands. Finally, Section 5 concludes the paper
with specific recommendations for the draft, as well as
suggesting a possible document type definition (DTD) for
richtext, which aims at preserving the semantics of the
language, as it is discussed in Section 4.
2 PROCEDURAL vs DESCRIPTIVE MARKUP
A number of concepts are different between richtext and SGML.
Richtext is essentially a procedural markup language, in which
the markup consists of relatively simple and "hard-wired"
instructions for a formatter, with a clearly defined meaning and
result. SGML is a descriptive (also known as generic or
generalized, each with further qualifications) markup language,
the basic premise of which is that the presentation of a given
piece of information depends on its type (about which the markup
provides information), that all pieces of the same type should
be given the same presentation, and that for any given type,
_some_ presentation must be chosen, but it can be _any_
presentation. Generic markup languages exhibit the quality that
their markup is rigorous, and that there is a structure to the
document which it shares with other documents of the same
document type. The task of a generic markup language comprises
the specification of the document type, the markup language
used in it, and the syntax for documents conforming to the type.
Specifically, richtext speaks of "formatting commands", which
stem from its procedural markup language heritage. A
"formatting command" gives an indication of how the following
text is to be formatted and presented to the reader of a
richtext message. Richtext formatting commands come in pairs,
effectively an enabling and a disabling command for each
formatting instruction. An enabling formatting command has the
general syntax, "<", command name, ">", and the disabling
formatting command the general syntax "</", command name, ">".
SGML has the notion of an element named and accessed by its
generic identifier. An element in SGML consists of three parts,
the start-tag, the content, and the end-tag. (There are also
empty elements, which only have start-tags.) A start-tag has
the general syntax, "<", generic identifier, ">", and an end tag
the general syntax, "</", generic identifier, ">".
Superficially, therefore, the two languages have the same
syntax. However, in SGML, the tags serve to delimit the content
of an element, whereas in richtext they serve primarily to turn
on and off formatting or presentation functions.
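As an illustration only, the shared surface syntax can be
captured in a few lines of Python. The name tokenize and the
token labels are hypothetical, taken neither from the draft nor
from ISO 8879, and the sketch recognizes only the bare "<name>"
and "</name>" forms described above.

```python
import re

# Matches "<name>" or "</name>"; no attributes, no white space handling.
TAG = re.compile(r"<(/?)([^<>]+)>")

def tokenize(text):
    """Split text into ("start" | "end" | "data", value) tuples."""
    tokens = []
    pos = 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            tokens.append(("data", text[pos:match.start()]))
        kind = "end" if match.group(1) else "start"
        tokens.append((kind, match.group(2)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("data", text[pos:]))
    return tokens
```

The token stream looks the same for both languages; whether the
pairs are read as on/off switches or as element boundaries is
the conceptual difference at issue.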
The difference between element and formatted text can also be
found in the description of their content, and specifically of
nested elements. Richtext specifies that the enabling and
disabling formatting commands must match, but takes exception
from this rule with one particular command (|comment|). SGML
specifies that element content can either be sub-elements or
data (i.e., text), or both. In the case of a sub-element, it is
wholly contained in the parent element, and both the semantics
and the syntax of the language prohibit the opposite.
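The SGML containment rule can be sketched as a simple stack
check. This is a minimal illustration under stated assumptions:
the function name is hypothetical, the commands |lt|, |nl| and
|np| are assumed to have no disabling form, and the special
treatment of |comment| is ignored.

```python
import re

# Assumption: these commands produce data and have no disabling form.
EMPTY = {"lt", "nl", "np"}

def is_properly_nested(text):
    """True if each end-tag closes the most recently opened element."""
    open_elements = []
    for slash, name in re.findall(r"<(/?)([^<>]+)>", text):
        if name in EMPTY:
            continue
        if not slash:
            open_elements.append(name)
        elif not open_elements or open_elements.pop() != name:
            return False
    return not open_elements
```

A sequence such as "<bold><italic>x</bold></italic>" has matching
pairs but fails this check, because the inner element is not
wholly contained in its parent.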
Yet, with this important conceptual difference between the
languages, it is possible, indeed quite simple, to describe
richtext in terms of SGML. A few compromises have to be made,
primarily in the lowest level (lexical) and the highest level
(semantics), and they will be detailed below.
For the purposes of simplifying the following discussion, I will
use the term "element" to denote both an SGML element and the
tuple (enabling formatting command, content, disabling
formatting command) in richtext. The term "start-tag" will be
used to refer to both start-tags in SGML and enabling formatting
commands in richtext. Likewise, "end-tag" will refer to both
SGML end-tags and disabling formatting commands. The term "tag"
will then be used to denote either start- or end-tags, as fits
the context.
Another difference resulting from the procedural versus generic
markup model is that there is no difference between the function
of a formatting command which produces data on its own and one
which only modifies the presentation of its contents.
Richtext has the same syntax and representation for these two,
whereas SGML keeps them separate.
3 LEXICAL SIMILARITIES AND DIFFERENCES
This section documents differences between the exact meaning and
lexical value of delimiters used in both richtext and SGML.
3.1 Delimiters
As mentioned above, both richtext and SGML use the same
delimiters around their "tags". Much of the complexity of SGML
lies in its lexical rules, and the ways in which it recognizes
markup in text. Richtext will always treat "<" as the start of
a formatting command, regardless of what follows it. In SGML, a
"<" is not recognized as a "start-tag open" unless it is
followed by a character which could start a generic identifier.
The primary reason for this is that human readers of an SGML
document will always be able to see from context whether a "<"
opens a start-tag or is meant to be the literal less-than sign.
In richtext, therefore, a literal less-than sign will always
have to be encoded as a special formatting command which
produces data (a literal less-than sign!), e.g. as "<lt>". In
SGML, this is markup of a different breed, an entity reference,
and it is represented as "&lt;". To make an SGML parser ignore a
"<", it is only necessary to let it precede a character which
cannot follow it
in valid SGML markup, e.g. a space character. Since "<" is
always special in richtext, error handling code must be written
to survive stray "<"s in the text.
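The two recognition rules can be contrasted in a short sketch.
The function names are hypothetical, and the SGML side is
deliberately simplified: it looks only at start-tags and
end-tags and ignores other SGML markup that also begins with "<".

```python
import string

NAME_START = set(string.ascii_letters)

def richtext_opens_tag(text, i):
    """Richtext reading: every "<" starts a formatting command."""
    return text[i] == "<"

def sgml_opens_tag(text, i):
    """Simplified SGML reading: "<" or "</" counts as a tag opener
    only when a name start character (a letter) follows."""
    if text[i] != "<":
        return False
    j = i + 1
    if text[j:j + 1] == "/":
        j += 1
    return text[j:j + 1] in NAME_START
```

For the text "1 < 2", the richtext rule sees the start of a
command at index 2, while the simplified SGML rule sees plain
data.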
3.2 Length of Tags
The maximum length of tags is another difference between
richtext and SGML. Richtext specifies that the maximum length
of formatting commands is 40 characters, excluding the
delimiters. The minimum maximum length of generic identifiers
in SGML is eight (8) characters, and although this is deemed by
many to be ridiculously low, this is the value specified in the
standard. The extra 32 characters allowed by richtext will be
an error if the same document is parsed in an SGML context.
3.3 Miscellaneous Syntax of Generic Identifiers and Tags
In SGML, a generic identifier consists of an initial letter,
followed by alphanumerics, hyphens or periods. Richtext appears
to allow digits and hyphen as the first character in the name of
a formatting command. A formatting command starting with a
hyphen or digit will not be recognized as markup in an SGML
parsing context, and will remain data; thus not being an error,
but still causing results different from the intended.
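The length and naming constraints of sections 3.2 and 3.3 can be
combined into one hedged check. The function name and the result
labels are hypothetical; the limits of 40 and 8 characters and
the name syntax are those cited in the text.

```python
import re

RICHTEXT_MAX = 40   # maximum command length cited above
SGML_NAMELEN = 8    # generic identifier limit cited above

# Initial letter, then letters, digits, hyphens or periods.
SGML_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")

def sgml_view(name):
    """Describe how an SGML parser might treat a richtext command name."""
    if len(name) > RICHTEXT_MAX:
        return "too long for richtext"
    if not SGML_NAME.fullmatch(name):
        return "not an SGML name"
    if len(name) > SGML_NAMELEN:
        return "too long for SGML"
    return "ok"
```

Under these assumptions "bold" passes, "flushright" (ten
characters) exceeds the SGML limit, and a name beginning with a
digit or hyphen is not an SGML name at all.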
SGML allows the "tag close" delimiter (">") to be preceded by
white space. In richtext, this would be an error. However,
richtext parsed in an SGML context would not be erroneous in
this respect.
SGML allows attributes in start-tags, to function as information
about the element while not being part of its content. The
generic identifier is separated from the attribute list by white
space, as are the individual items in the attribute list. The
attribute list consists of attribute values from a list of
tokens, or as pairs of attribute name and value, separated by
"=", optionally surrounded by white space. Should there be a
white space, or a missing ">" in a richtext formatting command,
a large number of errors will ensue, as the parser will attempt
to parse subsequent words as attribute values for non-existent
attributes. (Both richtext and SGML will thus misbehave in the
presence of unintended white space in tags -- so this issue will
have to be given special attention in richtext conformance
clauses, with some rule about the responsibility of sending
systems to ascertain conformance of the document.)
3.4 Lines and Attendant Problems
SGML as a language does not know about lines, or "records", as
they're known in the standard, but does support their existence,
and means to recognize them. A "record" in SGML parlance is
delimited by a Record Start (RS) and a Record End (RE)
character. These characters are virtual, but are represented by
ASCII Line Feed (LF, 0x0a, '\n', etc), and Carriage Return (CR,
0x0d, '\r', etc), respectively. With a CRLF-delimited sequence
of lines, this maps intuitively to line breaks. There are
special problems associated with handling records, and the most
important solution to these problems in SGML is to ignore a
Record End if it comes right after a start-tag, or right before
an end-tag. (This is a simplification of the complete set of
rules, but true under general conditions, such as when comparing
richtext and SGML.)
The Record Start is consistently ignored in SGML, unless some
special provisions are introduced to interpret it as markup (a
description of this feature is omitted for brevity), but the
Record End is, when it is not ignored, treated as data in its
own right. It is up to the SGML application to decide whether
it's to be treated as a space or as a line break. (Generally,
certain elements, such as paragraphs, will produce breaks and
separating vertical space around their content, which serves the
purpose of a "hard" line break in word processors.)
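The simplified record rules of the two preceding paragraphs can
be sketched as follows. This is an illustration only, not the
complete SGML rule set; the function name is hypothetical, and
any Record End that survives is left in the output as data for
the application to interpret.

```python
import re

RS = "\n"  # Record Start, represented by LF as described above
RE = "\r"  # Record End, represented by CR as described above

def apply_record_rules(text):
    """Apply the simplified record rules to CRLF-delimited text."""
    # The Record Start is ignored throughout.
    text = text.replace(RS, "")
    # Drop a Record End that comes right after a start-tag.
    text = re.sub(r"(<[A-Za-z][^<>]*>)" + RE, r"\1", text)
    # Drop a Record End that comes right before an end-tag.
    text = re.sub(RE + r"(</[A-Za-z][^<>]*>)", r"\1", text)
    return text
```

Given a paragraph element whose tags sit on lines of their own,
only the Record End between the two text lines remains.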
In richtext, a CRLF is deemed to be an alternate representation
of a space at all levels of the markup, and there is no way to
specify an alternate behavior only in some elements. This is
what would be done in an SGML application if the content was
data. The present draft is unclear on whether CRLFs occurring
immediately next to formatting commands will be interpreted as
space or ignored. The specification implies that a sequence of
CRLFs will be converted to one space character, regardless of
its length and location. In light of the discussion on the list
on interpreting N CRLFs in sequence as meaning (N-1) "hard"
CRLFs, this point may need to be revisited.
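The two readings of CRLF sequences can be put side by side.
Both functions are hypothetical sketches. In the second, a
single CRLF is assumed to remain a space, and a "hard" line
break is represented by "\n" purely for illustration.

```python
import re

def crlf_as_space(text):
    """Draft reading: any run of CRLFs becomes one space."""
    return re.sub(r"(?:\r\n)+", " ", text)

def crlf_n_minus_one(text):
    """List proposal: N CRLFs in a row mean N-1 hard line breaks."""
    def replace_run(match):
        count = len(match.group(0)) // 2
        # Assumption: a lone CRLF (zero hard breaks) is still a space.
        return " " if count == 1 else "\n" * (count - 1)
    return re.sub(r"(?:\r\n)+", replace_run, text)
```

The two readings agree on single CRLFs and differ only on runs
of two or more.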
4 SEMANTICS OF RICHTEXT FORMATTING COMMANDS
Richtext uses a variety of formatting commands to achieve a
number of typographical functions, and there are several classes
of the semantics involved, without documentation in the draft.
The following attempt at a description builds on examples
published on the list, as well as an intuitive understanding of
the functions desired, either of which may be wrong.
Although not precisely SGML issues, the following discussion
stresses the issues of semantics of the formatting commands that
would need to be formalized in an SGML document type definition.
This is thus only serving to provide a basis from which to build
an SGML application.
4.1 Bold, Italic
The contents of the |bold| (|italic|) element will be displayed
with a boldface (italic) font. The function seems to be a
modifier of whatever is the _current typeface_, in that this
should be emboldened (italicized), as opposed to selecting a
predefined "boldface" ("italic") font to replace it.
Unlike the |smaller| and |bigger| elements, neither |bold| nor
|italic| seems to be additive, that is, to make the text more bold or
more italic.
4.2 Fixed
|Fixed| seems to be different from both |bold| and |italic| in
that it selects a different typeface than the current. It is
not clear whether the effects of |bold| and |italic| are
retained inside |fixed|.
|Fixed| is obviously not nestable.
4.3 Smaller, Bigger
|Smaller| (|bigger|) seems, like |bold| and |italic|, to modify
the _current typeface_, and thus only to reduce (increase) its
point-size. These commands seem to be additive, in that nested
elements further reduce (increase) the point-size of the current
typeface.
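The distinction between additive and non-additive commands, as
guessed at in sections 4.1 and 4.3, can be modelled as a small
state computation. The function name, the base size of 12 points
and the step of 2 points are all hypothetical; the draft
specifies none of them.

```python
def font_state(open_elements, base_size=12, step=2):
    """Compute the typeface state from the currently open elements,
    listed outermost first."""
    state = {"bold": False, "italic": False, "size": base_size}
    for name in open_elements:
        if name in ("bold", "italic"):
            # Not additive: nesting the same element changes nothing.
            state[name] = True
        elif name == "bigger":
            # Additive: each nested element adds another step.
            state["size"] += step
        elif name == "smaller":
            state["size"] -= step
    return state
```

Two nested |bold| elements give the same result as one, whereas
two nested |smaller| elements reduce the size twice.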
4.4 Underline
This element type has not been used widely enough to warrant
conclusive comments, but the question remains whether
|underline| is additive, i.e. that nested |underline| elements
produce additional lines under its contents.
Underlining is usually used as a replacement for italics in the
absence of the latter, so this may be an issue of rendition for
different output devices, and the two may in practice be merged
to one for any given output device, unless it supports both.
4.5 Center, FlushLeft, FlushRight
These elements have intuitive functions as long as they are
mutually exclusive. The current draft says nothing about
nesting them. If this is attempted, several interesting things
might happen, depending on how they are implemented. One
possible interpretation is that these commands imply a nested
|paragraph| element.
It is not clear whether the contents of a flushright element
inside a flushleft is justified to both margins, or whether one
of them takes precedence over the other.
4.6 Indent, IndentRight
The intuitive function of these commands is to set off the left
(right) margin, and it's also intuitive that both are additive,
and not mutually exclusive. To indent from both left and right
margin, one would nest |indent| within |indentright|, for
example. A paragraph or line with some text prior to the
|indent| command might be interpreted as a hanging indent.
4.7 Outdent, OutdentRight
I don't understand what the intended purpose of these is, so
I'll defer comments on them until I do. And although the names
are intuitively understandable next to "indent", they are not
good names for the opposite of "indent".
4.8 Samepage
This element can have many functions, but I haven't seen it
used, so the following is speculation. One function may be to
prevent figures, examples, and tables from being split across
pages. Another function may be to preempt the need for
widow/orphan logic in the presentation module. (A "widow" is a
single line from a paragraph separated from it by a page break.
An "orphan" is a title for a headed paragraph separated from the
paragraph text by a page break.) However, these functions
should be addressed individually. In a generic markup language,
one would tend to have a concept of "headed section", of which
the title be an integral part. This higher-level "unit" is lost
to procedural markup languages. |Samepage| is not likely to
nest, but what content should be expected inside it is hard to
predict.
4.9 Subscript, Superscript
The presentation associated with these is obvious: a reduction
in point size, and a vertical movement of the baseline. The
uses for sub- and superscripting are many, including chemical
formulae, array and matrix indices, footnote indicators, area
and volume units, power, and more. Depending on the
presentation medium, the current typeface may have separate
characters for some, if not all, superscripted digits, but may
have to do more work to present the full range of super- and
subscripted text.
There is presently no provision for footnotes in richtext, so
these elements may conceivably be used to present their
indicators, and |smaller| to mark up their text. This may be a
serious problem, since footnotes on display terminals may be
given radically different presentations than would footnotes in
printed form, even though printed footnotes are occasionally
also found in the margin of the text.
For mathematical formulae, it is occasionally necessary to allow
superscripted superscripts. It is therefore not reasonable to
disallow nesting of these elements, but only of the type of the
outermost element.
4.10 Heading, Footing
Both of these are inherently page-oriented, and since I haven't
seen them used, I can only conjecture how they would be used.
For a page-layout system, a page is divided into three or more
parts, including at the least heading, text area, and footing.
The specification of the heading will have to occur before any
of the text in the text area, if it should apply to the first
page, and a |heading| found after the start of the text area
would then be construed to apply only to subsequent pages, if
any. (It might therefore be reasonable to use a new |heading|
after each title of major sections, so as to let the title of a
continued headed section be displayed in the page heading.) The
same applies to |footing|, although to a lesser degree. The
question of different headings and footings for verso and recto
pages is not addressed. Likewise, page numbering issues should
be addressed.
The function of a |heading| or |footing| element would be to
store the content for later presentation as page breaks occur.
Since some presentation devices will not be germane to paged
display, these elements should perhaps be suppressed in the
minimal richtext interpreter (Appendix D in the draft), rather
than be displayed where they are found, as this might interfere
with the running text.
Then there is the effect of the current typeface, as modified by
several other formatting commands, as well as issues of margins
with respect to indented text. Headings and footings are
traditionally processed in a different environment than the
running text, and with separate line length and positioning
directives. There might therefore be a need for a |reset|
formatting command which would reset all states to a system
default for its contents, unless the semantics of |heading| and
|footing| already subsume this function.
Obviously, neither |heading| nor |footing| nest.
4.11 ISO-8859-X, US-ASCII
|ISO-8859-X| is actually a family of formatting commands, valid
for each defined value of |X|. |US-ASCII| and |ISO-8859-X| both
modify the mapping from coded character to glyph performed by
the display module. The |ISO-8859| family selects mappings for
characters with the 8th bit set, some of which are not mapped to
any glyph for some ISO 8859 coded character sets. The
|US-ASCII| formatting command specifies no mapping for any
character with the 8th bit set, and a reversal to the ISO
646:1991 IRV coded character set. Whether character set
encoding is a good thing to do with formatting commands is an
open question.
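The effect of these commands on 8-bit characters can be
illustrated with Python's built-in codecs, which here merely
stand in for the display module's mapping tables. The function
name is hypothetical, and returning None for unmappable input is
an assumption of this sketch, not behaviour specified by the
draft.

```python
def render(raw, charset_command):
    """Map coded characters (bytes) to text under the character set
    named by a richtext command such as "ISO-8859-1" or "US-ASCII".
    Returns None when some byte has no mapping."""
    try:
        return raw.decode(charset_command)
    except UnicodeDecodeError:
        return None
```

A byte with the 8th bit set, such as 0xE9, maps to a glyph under
|ISO-8859-1| but has no mapping under |US-ASCII|.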
4.12 Excerpt
|Excerpt| is specified in the draft to indicate that the element
content is excerpted from another source. However, it is also
used for the references to such text, and for paraphrases,
including ellipses and contextual information in square
brackets. The idea behind this formatting command is then not
to facilitate automatic referencing to the excerpted text, but
to let the presentation of the excerpted material stand out from
the running text of the message containing it.
As with the discussion of |heading| and |footing|, |excerpt| may
imply a |reset| of all or some presentation parameters, and the
draft suggests an "alternate font", which, presumably, would be
untouched by the use of |bold| or |italic| surrounding it.
Examples to the contrary have been given, so the semantics of
this element is a little fuzzy.
Other than this, this is clearly descriptive markup, and may in
particular be displayed as the display module sees fit, probably
under specific user control.
4.13 Paragraph
|Paragraph| is also clearly descriptive markup. From the
description in the draft, the presentation is also entirely up
to the user. The paragraph displayed will presumably be subject
to previously selected indentation, but the semantics of a
paragraph inside, e.g., a |bold| or |smaller| element are not
clear. |Paragraph| elements naturally occur within excerpted material,
but should be disallowed from headings and footings. Other
considerations and constraints may also be placed on this
element.
For user typing, the name may be a little on the verbose side.
Compared with the |nl| formatting command, which has a short
name because it's expected to be frequently used (I guess), the
choice of name may affect its adoption. A short name, such as
|p|, may better serve it.
4.14 Signature
Clearly another item of descriptive markup, this element serves
mainly to delimit signatures from the rest of the message, and
has no clear presentational function. Some readers may decide
to suppress display of signatures by default, and display them
only upon request, for which the typing inherent in calling
something a "signature" will be of great utility.
4.15 Comment
The |comment| formatting command is used to suppress
presentation of the content, and, as specified, also suppresses
recognition of markup in this content. That is, further
formatting commands will not be recognized, with the exception
of nested |comment|s. The general rule that formatting commands
must be balanced is set aside. This means that one formatting
command works radically differently from the other commands, and
this may be confusing to users, as well as increasing the
complexity of richtext interpreters.
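The extra state this requires in an interpreter can be seen in a
minimal sketch. The function name is hypothetical, and dropping
a stray "</comment>" outside any comment is an assumption; the
draft's error handling is not modelled.

```python
import re

def strip_comments(text):
    """Remove |comment| content, recognizing only nested comments
    inside it and no other formatting commands."""
    kept = []
    depth = 0
    pos = 0
    for match in re.finditer(r"<(/?)comment>", text):
        if depth == 0:
            # Outside any comment: keep the text before this tag.
            kept.append(text[pos:match.start()])
        if match.group(1):
            depth = max(depth - 1, 0)
        else:
            depth += 1
        pos = match.end()
    if depth == 0:
        kept.append(text[pos:])
    return "".join(kept)
```

An unbalanced "<bold>" inside the comment is discarded along
with the rest of the content, which is exactly where the general
balancing rule is set aside.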
4.16 No-op
The purpose of this formatting command is abstract. It serves
mainly to delimit a block of text without changing any
presentation parameters. All unknown formatting commands (i.e.,
extensions) will map to the |no-op| command.
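The mapping of unknown commands can be sketched as a lookup.
The set of known names is simply the list of commands discussed
in this section; treating names case-insensitively and accepting
any numeric suffix for the |ISO-8859-X| family are assumptions
of this sketch.

```python
import re

KNOWN = frozenset({
    "bold", "italic", "fixed", "smaller", "bigger", "underline",
    "center", "flushleft", "flushright", "indent", "indentright",
    "outdent", "outdentright", "samepage", "subscript", "superscript",
    "heading", "footing", "us-ascii", "excerpt", "paragraph",
    "signature", "comment", "no-op", "lt", "nl", "np",
})

def effective_command(name):
    """Return the command an interpreter would act on for this name."""
    name = name.lower()
    if name in KNOWN or re.fullmatch(r"iso-8859-\d+", name):
        return name
    # Unknown commands (i.e., extensions) behave like no-op.
    return "no-op"
```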
4.17 lt
The sole purpose of |lt| is to provide a mechanism to allow
literal "<" to occur in the text. |lt| produces a less-than
sign in the current typeface. |lt| is an _empty element_ in
SGML parlance.
4.18 nl
Like |lt|, the purpose of |nl| is to insert a fixed string in
the data stream. The |nl| formatting command represents a local
new-line, i.e. a return to the first column (as modified by the
|indent| command), and starting a new output presentation line,
or, as it is commonly called, a "hard" line-break, the "soft"
line-break being implied by reaching the right margin upon
displaying the content of elements. The |nl| is an _empty
element_ in SGML parlance.
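Since |lt| and |nl| both insert a fixed string, a hedged sketch
of their expansion is a plain substitution. The function name is
hypothetical, and "\n" stands in for whatever the presentation
module uses as a hard line-break.

```python
import re

# Fixed strings produced by the two data-producing commands.
EXPANSION = {"lt": "<", "nl": "\n"}

def expand_empty(text):
    """Replace <lt> and <nl> by the data they stand for."""
    return re.sub(r"<(lt|nl)>", lambda m: EXPANSION[m.group(1)], text)
```

For example, "1 <lt> 2<nl>done" expands to "1 < 2" followed by a
line break and "done".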
4.19 np
This formatting command is like |nl|, in that it introduces a
"hard" page-break to supplement the "soft" page-breaks which
occur when the presentation page is filled. Presumably, the
semantics of the |np| command (and of "soft" page-breaks) is
that the contents of |footing| will be displayed, a new page
started, and contents of the most recent |heading| will be
displayed upon reaching an element which is not |heading| (to
allow for a |heading| immediately after the |np|), or data. The
|np| formatting command would intuitively be independent from
the elements in which it occurs, and could theoretically occur
anywhere. This may or may not be desirable. |np| is an _empty
element_ in SGML parlance.
Author's address
Erik Naggum
Naggum Software
Boks 1570 Vika
0118 OSLO
NORWAY