simpletext: an alternative to Richtext

I've been pondering this richtext (*richmail*) problem for a while
now.  I am not satisfied that it is a reasonable sub-type of `text'
(as opposed to `application'), in the sense that the raw form is not
terribly useful when presented to the user.  For example, look at this
richtext fragment:

, <bold><excerpt>Excerpts from internet.ietf-822: 15-Jul-92 MIME Clarifications =
, wanted Dana S Emery(_at_)umail(_dot_)umd(_dot_)e (1218*)</excerpt></bold><nl>
, <nl>
, <excerpt>The following is properly nested, but is it proper useage? if not, wh=
, y?<nl>
, </excerpt><nl>
, <excerpt>       <lt>underline>__text1__<lt>bold>__Text2__<lt>/bold>__Text3__<=
, lt>/underline><nl>
, </excerpt><nl>
, I don't see why this wouldn't be proper usage -- what  in particular concerns =
, you?<excerpt><nl>
, </excerpt><nl>
, <excerpt>Also, to a Mac or Next (et al) implementation the following might mak=
, e sense:<nl>
, </excerpt><nl>
, <excerpt>       ___implicit12ptText___<lt>smaller><lt>smaller><lt>smaller>__9=
, ptText__<nl>
, </excerpt><nl>

I've come up with an alternative suggestion.  I believe it deals with
the goals of richtext at an appropriate level, while not attempting to
usurp functionality better reserved to a full-function markup
language.  Thanks to Erik van der Poel, I've even written it down.
This note is an exposition on the design.

Let me first distinguish between three types of markup [ see Coombs,
Renear, & DeRose, "Markup Systems and the Future of Scholarly Text
Processing", CACM, Nov. 1987 ]: presentational, procedural, and
descriptive.  The first, presentational markup, is in a sense not
`markup' at all; it is the appearance of something that is supposed to be
presented directly to a person.  An image or a sheet of paper with
text on it or a book is formatted in `presentational' markup.  This is
the kind of markup that has been used for text in E-mail for years
now: the reader sees the raw form of the message contents.  This is
the type of text markup that almost all mail UA's support.  The second
type of markup, procedural, is more of what richtext proposes.  In
this scheme, the text+markup is intended to be read by some processor
which produces something for the user to see, and the markup consists
of directives to the processor to change into and out of fonts, or to
indent, or whatever.  The third form of markup, descriptive, is what
most adherents of SGML believe in (and what most right-thinking people
everywhere believe in :-).  The markup in this scheme consists of
labels or tags that describe the meaning of various sections of the
text.  Again, this type of markup is often processed by some program,
but in this case the responses of the program to the markup are part
of the program, not part of the document, as they are with procedural
markup.  Descriptive markup can be further divided into generic markup,
where the set of tags is well-specified in advance, and generalized
markup, where the set of tags is extensible.

For example, here is a sentence fragment with each of the three kinds
of markup:

  Presentational:   ...sinking of the _Lusitania_, hostilities...

  Procedural:       ...sinking of the <italic>Lusitania</italic>,
                    hostilities...

  Descriptive:      ...sinking of the <shipname>Lusitania</shipname>,
                    hostilities...

Clearly, an advantage of the procedural markup scheme is that a small
set of markup capabilities can be defined, limited to some common set
of typesetting commands.  Descriptive markup, though preferable
because content-preserving, is difficult to obtain from most editors,
and suffers from one of the big drawbacks of SGML, the explosion of
possible markup tags (though of course with SGML, agreement on a
standard DTD can remove this drawback).

Well, with this in mind, let's consider markup for text in E-mail.  I
am only considering markup for simple text messages, the kind of thing
that one might dash off quickly, the kinds of uses foreseen for
richtext.

What is the real purpose of simple text markup in mail messages?  I
will argue that it is to allow people to exchange messages that are
literate; that is, in which they can write what they'd like to say.  A
guideline for this kind of markup exists, in written material.  For
hundreds of years, textual writing has employed a small number of
simple devices, in which all the major ideas of western civilization
have been expressed.  Let us consider a literary
genre of a similar length to a mail note, the essay.  Typically, the
markup devices in an essay consist of sentences organized into
paragraphs.  In the sentence, punctuation such as quotation marks
suffices for most markup.  Occasionally a word or phrase is
*emphasized* by use of italics (or, in rare cases, underlining or bold
face).  Paragraphs are typically indented on their first line, and
sometimes separated by a blank line.  Sometimes extended quotations
are inserted, typically both left and right indented.  All text is
typically right and left justified.  Almost always, normal and italic
faces of a single font are used -- the typical exception occurs when
an example of some sort is presented as a quotation, or in technical
works on computing when a `machine-related' term is used [ other
disciplines seem to use italics in the cases where computer writers
shift briefly to a typewriter type face ].  Footnotes are often used.

Suppose we were going to design a simple markup scheme for E-mail, to
allow it the kind of power that printed essays have.  What are the
design constraints we should consider?  First, we want to provide, at
least, all the features enumerated above typically present in an
essay.  Second, we are designing this as a subtype of MIME `text',
which has the peculiar property that it might often be presented
directly to users in its raw form [ otherwise it would be of type
`application', I suppose, in the way that PostScript is ], so we want
our markup to be, as much as possible, presentational.  Thirdly, we
recognize the power of tagging information semantically, so we want
our markup to be, as much as possible, descriptive.  Fourthly, we are
designing for an Internet community, so we want our markup to be, as
much as possible, in accord with current Internet practice.  Fifthly,
our markup is intended for use with simple mail systems, so we want it
to be possible to easily add our markup to a message in a simple text
editor such as vi or GNU Emacs (hmm, simple?).  Sixthly, we want our
markup to satisfy the vast pent-up need for nice-looking messages
surmised by richtext advocates.

So, let's begin a construction.  Suppose we were to send raw text, in
the way that pre-Content-Type mailers did.  This would be in accord with
constraints 2, 4, and 5.  What features of constraint 1 are we
missing?  First of all, left and right justification -- so we'll
postulate that text is presumed to be justified.  Line breaks are all
soft, except after a blank line (blank lines should be preserved), or
at the beginning of a paragraph.  Hmm, paragraphs.  Drawing on current
Internet mail practice, we'll postulate that paragraphs are signalled
either by a blank line, or by an indentation of two spaces or a tab at
the beginning of a line.  Next, emphasis.  We'll postulate that
surrounding a word or phrase with asterisks means that that word or
phrase is emphasized (again, choice dictated by constraints 2, 4 and
5).  [ We might strengthen that rule a bit to say that only an
asterisk preceded by a non-alphanumeric character, and followed by an
alphanumeric character, signals the beginning of emphasis. ]  There
seems to be a different non-emphatic form of italics often used in
printed text, which signals a shift from the normal language or
vocabulary into a different language or vocabulary, used for things
like ship names or Latin words.  We'll draw on current Internet usage
again and indicate this _alternate vocabulary use_ by surrounding a
word or phrase with underscores [ a processor may render both of these
in the same type face, or might decide to use, for example, small
block caps for emphasis, and italic face of the regular font for
alternate vocabulary use ].  Extended quotations and excerpts from
various sources are often seen in mail or news articles with each line
preceded by some character such as `>', so we'll adopt this usage as
well and say that each line of a quotation or excerpt will begin with
the two characters _greaterthan_ and _space_.  It is also useful in
computing e-mail (close to necessary) to have a form of quotation
which is literally formatted, that is, in which asterisks surrounding
a word or phrase have no markup meaning; this is commonly used for
sending things like code or transcripts of a terminal session.  It
might reasonably be argued that this should be a separate content-type
subtype of text, but I think it fits nicely into the model we are
developing, so we'll postulate that lines beginning with the
characters _comma_ _space_ are a _literal quotation_, and should be
formatted with a typewriter face when presented, but that other
formatting codes should not be interpreted inside the line.  Line
breaks are treated as hard on lines marked as literal quotations.
Finally, footnotes.  I am not aware of any current internet standard
practice for this, but a reasonable practice might be to say that the
characters _leftsquarebracket_ _space_ begin a footnote, and in a
footnote, the characters _space_ _rightsquarebracket_ terminate it,
the scheme that I've used throughout this posting.  And we'll add
another feature to enhance its value as presentational markup: Each
non-literal-quotation line should be less than 78 characters in
length.
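
As a rough illustration, the line-level rules above might be sketched
in Python as follows.  This is only a sketch under stated assumptions:
the names `classify_line', `classify' and `too_long' and the category
labels are hypothetical, the first line of a message is assumed to
start a paragraph, and the prefix tests are simply tried before the
indentation test, since nothing above says how the rules interact.

```python
def classify_line(line, prev_blank=False):
    """Classify one raw line according to the line-level rules."""
    if line.strip() == "":
        return "blank"
    if line.startswith(", "):
        return "literal"      # literal quotation: hard break, no inline markup
    if line.startswith("> "):
        return "quote"        # extended quotation or excerpt
    if prev_blank or line.startswith("  ") or line.startswith("\t"):
        return "para-start"   # blank line before, or two-space / tab indent
    return "text"             # continuation line; the line break is soft


def too_long(line):
    """True if a non-literal line breaks the 'less than 78 characters' rule."""
    return not line.startswith(", ") and len(line) >= 78


def classify(text):
    """Return (kind, line) pairs for a whole message body."""
    result = []
    prev_blank = True         # assumption: the first line starts a paragraph
    for line in text.split("\n"):
        kind = classify_line(line, prev_blank)
        result.append((kind, line))
        prev_blank = (kind == "blank")
    return result
```

Every line falls into exactly one category, and the decision needs
only the line itself and whether the previous line was blank.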

This design, which Erik suggests we call `simpletext', seems to
satisfy all of our six design constraints.  Because of its
presentational nature, it can be presented by unmodified non-MIME UA's
without surprising or offending anyone.  Because of its descriptive
nature, it can be processed accurately by a UA into a cute
richtext-like form with different type faces and justified text [ I've
built a processor which turns simpletext into Andrew text, for example
].  It contains all the textual features used by the best minds in the
history of western civilization to present their ideas.  It can be
composed by simple text editors, with some small attention to detail.
It draws on current Internet practice for its markup ideas.
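
To give a feel for that kind of processing, here is a minimal Python
sketch that turns the inline markup into tagged text.  It is not the
Andrew processor mentioned above.  The tag names <emphasis>,
<altvocab> and <footnote>, and the function names, are hypothetical,
and applying the strengthened asterisk rule to underscores as well is
an assumption of the sketch.

```python
import re

# Opening delimiter: preceded by a non-alphanumeric (or start of text) and
# followed by an alphanumeric, per the strengthened rule for asterisks.
# Using the same rule for underscores is an assumption of this sketch.
EMPHASIS = re.compile(r"(?<![0-9A-Za-z])\*(?=[0-9A-Za-z])([^*]+?)\*")
ALTVOCAB = re.compile(r"(?<![0-9A-Za-z])_(?=[0-9A-Za-z])([^_]+?)_(?![0-9A-Za-z])")
# Footnote: "[ " opens it and " ]" closes it.
FOOTNOTE = re.compile(r"\[ (.+?) \]", re.DOTALL)


def render_inline(text):
    """Replace simpletext inline markup with (hypothetical) descriptive tags."""
    text = FOOTNOTE.sub(r"<footnote>\1</footnote>", text)
    text = EMPHASIS.sub(r"<emphasis>\1</emphasis>", text)
    text = ALTVOCAB.sub(r"<altvocab>\1</altvocab>", text)
    return text


def render_line(line):
    """Literal quotation lines (', ' prefix) are passed through untouched."""
    return line if line.startswith(", ") else render_inline(line)
```

Because an opening delimiter must follow a non-alphanumeric character,
strings such as 2*3*4 or snake_case_name are left alone.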

This idea isn't new, of course.  I've been told of a system at
Dartmouth in the early 70's, which performed as a heuristic text
formatter, running on a GE timeshare system, which formatted a text
document written on a tty for output on a daisy-wheel printer.  Other
mail systems have been built which incorporated this heuristic
formatting system for mail messages.  What is different is that a
simpletext formatter need not be heuristic, as the rules are
well-specified.
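
To make the non-heuristic point concrete, a formatter's first pass
could group lines into blocks using nothing but the stated rules.  The
Python below is a sketch, not a reference implementation: the name
`blocks' and the block labels are hypothetical, and for brevity it
uses blank lines only as separators rather than preserving them.

```python
def blocks(text):
    """Group raw lines into ('para' | 'quote' | 'literal', content) blocks."""
    out = []

    def add(kind, content, new_block):
        # Join onto the previous block of the same kind unless a new one starts.
        if out and out[-1][0] == kind and not new_block:
            sep = "\n" if kind == "literal" else " "   # hard vs. soft break
            out[-1] = (kind, out[-1][1] + sep + content)
        else:
            out.append((kind, content))

    prev_blank = True
    for line in text.split("\n"):
        if line.strip() == "":
            prev_blank = True
            continue
        if line.startswith(", "):
            add("literal", line[2:], prev_blank)
        elif line.startswith("> "):
            add("quote", line[2:].strip(), prev_blank)
        else:
            indented = line.startswith("  ") or line.startswith("\t")
            add("para", line.strip(), prev_blank or indented)
        prev_blank = False
    return out
```

Soft line breaks become single spaces, so a renderer is free to refill
and justify paragraphs, while literal quotations keep their line
breaks exactly as written.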

Note that this system does *not* address the problems of multiple
character set usage.  We assume that the character set specified in
the charset parameter of the text Content-Type header contains all
the characters necessary for composition of the message.  If it
doesn't, alternate text subtypes are available.

Comments?

Bill