...has a number of disadvantages:
* It's closely tied to how Microsoft Word handles styles,
footnotes, etc.
* It's distinctly not human-parseable.
* It's overly complex for this purpose, and not easy to parse
without reimplementing all of Word in your UA. (There are
~280 formatting directives in the specification.)
If you want to look at RTF, you should get the spec from Microsoft;
I've typed in the beginning of the one I have below. The complete
document is about 20 pages, without much room for explanations and
examples. It's definitely opaque and prone to misinterpretation. If
someone is willing to type it in in its entirety, I will be happy to
fax it to that person...
My comments are enclosed in [] brackets. "Sic" stands for "but it
really says so", for those who don't know...
------------------------------------------------------------------------------
Rich Text Format [1988]
================
The Rich Text Format (RTF) standard is a method of encoding formatted
text and graphics for easy transfer between applications. Currently,
users depend on special translation software to move word processing
documents between different DOS applications and Apple Macintosh
applications.
The RTF standard provides a standard format for text and graphics
interchange that can be used with different output devices, operating
environments, and operating systems. RTF uses the ANSI [sic: Microsoft
throughout uses ANSI, when they mean ISO-8859-1], Macintosh, or IBM PC
character sets to control the representation and formatting of a
document, both on the screen and in print. With the RTF standard,
documents composed under different operating systems and with
different software applications can be transferred between those
operating systems and applications.
RTF Syntax
----------
An RTF file consists of unformatted text, "control words", "control
symbols", and "groups". A standard RTF file consists of only 7-bit
ASCII characters for ease of transport.
A "control word" is a specially formatted command that RTF uses to
mark printer control codes and information that applications use to
manage documents. A control word consists of a backslash followed by
an alphabetic string and a delimiter, as shown in the following
example:
\rtf1...
A B C
A Backslash begins each control word
B Alphabetic string
C Numeric delimiter
The delimiter can be space or one or more nonalphabetic characters. If
a numeric parameter immediately follows the control word, this
parameter is the delimiter, and is itself followed by a delimiter,
also consisting of a space or one or more alphabetic characters.
[sic!!!]
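[A minimal Python sketch of how I read the rule above; it is not part
of the Microsoft text. It assumes the parameter is a run of digits
with an optional leading minus sign (as in \fi-729 further down), and
that a single space after the control word is a delimiter to be
swallowed, while any other nonalphabetic character ends the word but
is left in place. The name read_control_word is my own.]

```python
import re

# Assumed grammar: backslash, letters, optional (possibly negative) numeric
# parameter, then at most one delimiting space that belongs to the word.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?\d+)?( ?)")

def read_control_word(text, pos=0):
    """Return (name, parameter or None, next position) for the word at pos."""
    m = CONTROL_WORD.match(text, pos)
    if m is None:
        raise ValueError("no control word at position %d" % pos)
    param = int(m.group(2)) if m.group(2) is not None else None
    return m.group(1), param, m.end()
```

[For example, read_control_word(r"\rtf1\ansi") gives the name "rtf",
the parameter 1, and a next position pointing at the backslash of
\ansi.]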
A "control symbol" consists of a backslash followed by a single,
nonalphabetic character. For example, \~ represents a nonbreaking
space. Control characters take no delimiters.
A "group" consists of text and control words or control symbols
enclosed in braces ({}). Formatting specified within a group affects
only the text within the group. Generally, text within a group
inherits any formatting of the text preceding the group. However,
Microsoft implementations of RTF assume that the footnote,
header/footer, and annotation groups (described later in this
document) do not inherit formatting of the preceding group. Therefore,
to ensure that those groups will always be formatted correctly, you
should set the formatting within these groups with the \sect, \pard,
and \plain control words and then add any desired formatting.
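[A rough Python sketch of group scoping, not part of the Microsoft
text: a reader can keep a stack of formatting states, push on "{" and
pop on "}". It is deliberately simplified: it only records which
control words are active, it does not implement the resets done by
\sect, \pard and \plain, and it ignores the special non-inheriting
groups mentioned above. TOKEN and text_runs are my own names.]

```python
import re

# One token per match: a control word (with optional parameter and one
# delimiting space), a brace, a control symbol, or a run of plain text.
TOKEN = re.compile(r"\\([A-Za-z]+)(-?\d+)? ?|([{}])|\\(.)|([^\\{}]+)", re.S)

def text_runs(rtf):
    """Return (text, active control words) pairs, scoping formatting to groups."""
    stack, current, runs = [], (), []
    for m in TOKEN.finditer(rtf):
        word, param, brace, symbol, text = m.groups()
        if brace == "{":
            stack.append(current)           # the new group inherits what is active
        elif brace == "}":
            current = stack.pop()           # formatting set inside the group ends
        elif word is not None:
            current += (word + (param or ""),)
        elif symbol is not None and symbol in "\\{}":
            runs.append((symbol, current))  # escaped character is plain text
        elif text is not None:
            runs.append((text, current))
        # other control symbols (such as \~) are ignored in this sketch
    return runs
```

[With the input {\fs20 Normal {\qr right} again}, the word "right" is
reported with both fs20 and qr active, and " again" with fs20 only.]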
Any other characters in the file are plain text. As mentioned above,
the backslash (\) and braces ({}) have a special meaning in RTF. To
use these characters as text, precede them with a backslash.
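[The escaping rule is simple enough to show in a few lines of Python.
This is my own sketch, not part of the Microsoft text, and
escape_rtf_text is a made-up name.]

```python
def escape_rtf_text(text):
    """Prefix backslash and braces with a backslash so they read as plain text."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\")
        out.append(ch)
    return "".join(out)
```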
Software that takes a formatted file and turns it into an RTF file is
called a "writer". Software that translates an RTF file into a
formatted file is called a "reader". An RTF writer separates the
application's control information from the plain text and writes a new
file containing the plain text and the RTF
groups associated with that text. An RTF reader does the converse of
this procedure.
An entire RTF file is considered a group and must be enclosed in
braces. The control word \rtf n [in the original document there is no
space between rtf and n: rtf is bold and n italic; I have inserted spaces
where necessary to separate components] must follow the
first open brace. The numeric parameter identifies the version of the
RTF standard used. The RTF standard described in this document
corresponds to version 1.
The order of groups within an RTF file is important. Each group
specifies the part of the document affected by the group and the
different attributes of that text. An RTF file must begin with the
following two control words in the following order:
RTF version (\rtf n)
Character set
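[A minimal writer sketch in Python, not part of the Microsoft text. It
wraps plain text in the outermost group and emits the two required
control words in order. It assumes version 1, one of the character
set words from "The Character Set" section, and 7-bit ASCII input; it
does not turn line breaks into paragraphs. make_rtf is my own name.]

```python
CHARSETS = ("ansi", "mac", "pc", "pca")

def make_rtf(plain_text, charset="ansi"):
    """Wrap 7-bit plain text in a minimal version 1 RTF group."""
    if charset not in CHARSETS:
        raise ValueError("unknown character set: " + charset)
    if not plain_text.isascii():
        raise ValueError("this sketch only handles 7-bit ASCII text")
    # Backslash and braces must be escaped to be read as text.
    body = "".join("\\" + ch if ch in "\\{}" else ch for ch in plain_text)
    # The space ends the character set control word and is not part of the text.
    return "{\\rtf1\\" + charset + " " + body + "}"
```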
The RTF file can also include groups for fonts, styles, screen color,
pictures, footnotes, annotations, headers and footers, summary
information, fields, and bookmarks, as well as document, section,
paragraph, and character formatting properties. If the font, style,
screen color, and summary information groups and document formatting
properties are included, they must precede the first plain text
character in the document. If included, the group for fonts should
precede the group for styles.
The groups are discussed in the following sections. If a group isn't
used, it can be omitted.
Certain groups, referred to as "destinations", mark the beginning of a
collection of related text. An example of this is the \footnote group,
where the footnote text follows the control word. Destinations added
after the RTF specification published in the March 1987 Microsoft
Systems Journal may be preceded by the control symbol \*. This control
symbol identifies destinations whose related text should be ignored if
the RTF reader does not recognize the destination. RTF writers should
follow this convention when adding new control words. Destinations
whose related texts should be inserted into the document even if the
destination is not recognized should not use \*. In this document, all
destinations that use \* will be shown with \* as part of the control
word.
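[A Python sketch of what a reader might do with \*, not part of the
Microsoft text: when a group opens with \* followed by a destination
the reader does not know, drop the whole group, nested groups
included. The names skip_group and drop_unknown_destinations, and the
destination \mystery in the example, are made up for illustration.]

```python
import re

STAR_DEST = re.compile(r"\{\\\*\\([A-Za-z]+)")

def skip_group(rtf, start):
    """Given the index of an opening brace, return the index past its match."""
    depth, i = 0, start
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            i += 2                  # skip an escape or the start of a control word
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced braces")

def drop_unknown_destinations(rtf, known=()):
    """Remove {\\*\\dest ...} groups whose destination is not in known."""
    out, i = [], 0
    while i < len(rtf):
        m = STAR_DEST.match(rtf, i)
        if m and m.group(1) not in known:
            i = skip_group(rtf, i)          # ignore the group and its text
        elif rtf[i] == "\\" and i + 1 < len(rtf):
            out.append(rtf[i:i + 2])        # copy escapes such as \{ untouched
            i += 2
        else:
            out.append(rtf[i])
            i += 1
    return "".join(out)
```

[Destinations without \* are left alone here, so their text stays in
the document, as the paragraph above requires.]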
The Character Set
-----------------
After specifying the RTF writer you must declare the character set.
The RTF specification currently supports the following character sets:
Control Word   Character set
--------------------------------------------------------------
\ansi          ANSI (default)
\mac           Apple Macintosh
\pc            IBM PC
\pca           IBM PC page 850, used by IBM Personal System/2
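[A reader-side sketch in Python, not part of the Microsoft text: read
the version and character set from the start of the file and map the
character set to a codec. The codec names are my assumptions, since
the spec only names the sets; in particular \ansi is mapped to
ISO-8859-1 following my note above, and \pc to code page 437.
header_info is my own name.]

```python
import re

# Assumed mapping from character set control words to Python codec names.
CHARSET_CODECS = {
    "ansi": "latin_1",
    "mac": "mac_roman",
    "pc": "cp437",
    "pca": "cp850",
}

HEADER = re.compile(r"\{\\rtf(\d+) ?\\([a-z]+)")

def header_info(rtf):
    """Return (version, codec name) from the first two control words."""
    m = HEADER.match(rtf)
    if m is None or m.group(2) not in CHARSET_CODECS:
        raise ValueError("not an RTF header with a known character set")
    return int(m.group(1)), CHARSET_CODECS[m.group(2)]
```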
The Font Table
--------------
This group contains descriptions of fonts and begins with the control
word \fonttbl. All fonts available to the RTF writer can be included
in the font table, even if the document doesn't use all the fonts.
A font is defined by its name, a font number, and a font family, as
shown in the following example. Semicolons are used as delimiters
between fonts.
{\fonttbl\f0\froman Tms Rmn;}...
 A       B  C       D
A Control word
B Font number
C Font family
D Font name
The font numbers represent the full font definitions in the group, and
vary with each document. The font families are listed below:
Control Word   Font Family
------------------------------------------------------------
\fnil          Unknown or default fonts (default)
\froman        Roman, proportionally spaced serif fonts
               (TmsRmn, Palatino, etc.)
\fswiss        Swiss, proportionally spaced sans serif fonts
               (Swiss, etc.)
\fmodern       Fixed-pitch serif and sans serif fonts
               (Courier, Elite, Pica, etc.)
\fscript       Script fonts (Cursive, etc.)
\fdecor        Decorative fonts (Old English, Zapf Chancery,
               etc.)
\ftech         Technical, symbol, and mathematical fonts
               (Symbol, etc)
If an RTF file uses a default font, the default font number is
specified with the \deff n control word, which must precede the font
table group. The RTF writer supplies the default font number used in
the creation of the document as the numeric argument. The RTF reader
then translates this number through the font table into the most
similar font available on the reader's system.
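[A Python sketch of reading the font table and the default font, not
part of the Microsoft text. It assumes each entry has exactly the
shape shown in the example (font number, family, a space, the name,
a semicolon), which real files may not respect. The function names
are mine, and the second font in the usage note is made up.]

```python
import re

# \f<number>\f<family> <name>;  as in the example above
FONT = re.compile(r"\\f(\d+)\\(f[a-z]+) ([^;{}]+);")

def parse_font_table(rtf):
    """Map font number to (family control word, font name)."""
    return {int(n): (family, name) for n, family, name in FONT.findall(rtf)}

def default_font(rtf):
    """Return the font table entry named by \\deff n, or None."""
    m = re.search(r"\\deff(\d+)", rtf)
    if m is None:
        return None
    return parse_font_table(rtf).get(int(m.group(1)))
```

[For {\fonttbl\f0\froman Tms Rmn;\f1\fswiss Helv;} this gives font 0
as ("froman", "Tms Rmn") and font 1 as ("fswiss", "Helv"). Mapping
the family to the most similar local font is left to the reader.]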
The Style Sheet
---------------
The style sheet group begins with the control word \stylesheet. This
group contains definitions and descriptions of the various styles used
in the document. The style sheet is declared only once, in the RTF
file header. All styles in the document's style sheet can be included,
even if not all the styles are used.
In some applications, styles are based on, or are the basis for, other
styles. In these cases, two other control words can be used:
Control Word   Meaning
------------------------------------------------------------
\sbasedon n    Defines the number of the style on which
               current style is based
\snext n       Defines next style associated with current
               style; if omitted, next style is the current
               style.
An example of an RTF style sheet and styles is shown in the following
example. In this example, Postscript is declared but not used. Some of
the control words in this example are discussed in the following
sections.
...
A {\stylesheet{\fs20\sbasedon222\snext0 Normal;}{\s1\qr\fs20
\sbasedon0\snext1 FLUSHRIGHT;}{\s2\fi-729\li720\fs20\ri2880\fs20
\sbasedon0\snext2 IND;}}
...
\widowctrl\ftnbj\ftnrestart\sect\linex0\endnhere
B \pard\plain\fs20 This is Normal style.
\par\pard\plain \s1
This is right justified. I call this style FLUSHRIGHT.
\par\pard\plain \s2
This is an indented paragraph. I call this style IND.
It produces a hanging indent.
\par}
...
A Style sheet
B Styles applied to text
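[A Python sketch of pulling the style definitions out of a style sheet
group like the one in part A, not part of the Microsoft text. It
assumes each entry is a group of control words followed by the style
name and a semicolon, that an entry without \s n is style 0, and,
per the table above, that a missing \snext means the style itself.
parse_stylesheet is my own name.]

```python
import re

# {<control words> <style name>;}
STYLE = re.compile(r"\{((?:\\[a-z]+-?\d*\s*)+)([^;{}\\]+);\}")

def parse_stylesheet(group):
    """Map style number to its name, \\sbasedon and \\snext values."""
    styles = {}
    for controls, name in STYLE.findall(group):
        def num(word, default):
            m = re.search(r"\\" + word + r"(\d+)", controls)
            return int(m.group(1)) if m else default
        number = num("s", 0)    # assumption: no \s n means style 0
        styles[number] = {
            "name": name,
            "basedon": num("sbasedon", None),
            "next": num("snext", number),
        }
    return styles
```

[Run on the style sheet in part A, this yields three styles: 0
(Normal), 1 (FLUSHRIGHT) and 2 (IND), the last two based on style 0.]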
------------------------------------------------------------------------------
____________________________________________________________________________
_ :
/ Mats Ohrman : matoh(_at_)sssab(_dot_)se
: {mcvax,munnari,uunet}!sunic!sssab!matoh
Scandinavian System Support AB :
Box 535 _ : Phone: Nat. 013-11 16 60
S-581 06 Linkoping, Sweden : Int. +46 13 11 16 60