Multilingual richtext

The richtext/simpletext discussion seems to have died off in recent weeks.
I'm preparing a description of richtext semantics, which I hope to have ready
for posting to the list soon.  If you are interested in participating in
the semantics effort, please let me know.

One of the richtext issues that is very close to my heart is support for
multilingual rich mail.  This document is the latest iteration towards this
end, and is much cleaner than my earlier kludges involving '<'.

Cheers,

Rhys.
----------------------------------------------------------------------------
                Multilingual support in MIME richtext

           Last update: 9th March, 1993 by Rhys Weatherley

Disclaimer:

I'm "neutral" on the character set debate.  I can see the merits of all sides
that have been presented.  This document is not meant to advocate any
particular side, but to try to produce something that solves the technical
problems of multilingual richtext, and provide some support for language
tagging.  It is based on ISO-10646 and ISO-639 because they are standardised
(warts and all), and because I don't want to have to define some new character
set or sets solely for richtext use.

The particular technical issues this document attempts to address are as
follows:

        1. It should be possible for '<' to be interpreted as the start of a
           richtext command without ambiguity.  This is important for backwards
           compatibility.
        2. It should be possible to insert language information if the sender
           so desires, but it shouldn't be required.
        3. ISO-10646 encodings should handle UCS-2 with a minimal number
           of bits and be extensible to UCS-4 in a trivial fashion.

I've previously prototyped support for ISO-2022-JP in richtext, but that
prototype goes against issue 1 above unless steps are taken to handle '<'
specially, so I've left it out of this draft.  It and similar character
sets could be reintroduced at a later date once the technical problems have
been solved.

Many of the richtext commands stated below have both "long" and "short"
forms.  I'm as yet undecided whether to allow both forms or just one form.

Indicating ISO-8859-X codepoint encodings:

<iso-8859-x>

        Indicates that the block consists of codepoints from the ISO-8859-X
        character set, for a suitable value of "X".  Richtext commands may
        appear anywhere within the block.  Transport encodings such as
        quoted-printable and base64 may be necessary.  The ISO-8859-X commands
        are provided for backwards compatibility with the first version of
        richtext in RFC-1341.  ISO-10646 commands are preferred for the future.

Indicating ISO-10646 codepoint encodings:

<iso-10646-fss-utf>
<ucs-fss>

        Indicates that the File System Safe version of UTF is used to
        encode ISO-10646 codepoints in the block.  Richtext commands may
        appear anywhere within the block except in the middle of a single
        FSS-UTF character's encoding.  Transport encodings such as
        quoted-printable and base64 may be necessary.

<iso-10646-hex>
<ucs-hex>

        Indicates that the block consists of hex representations of
        ISO-10646 codepoints.  Any white space in the block is treated
        as a codepoint separator, and richtext commands may appear anywhere
        within the block.  This is intended for short sequences of ISO-10646
        codepoints, especially in 7-bit environments where FSS-UTF cannot
        be transported without applying quoted-printable or base64 encodings.
        Between 1 and 8 hex digits may be used to indicate a codepoint, with
        the most-significant hex digit first.  Sending systems should use
        upper case for hex digits 'A' through 'F', and receiving systems
        should recognise either case.  Any characters other than hex digits,
        white space or richtext commands should be ignored.  For example:

                <iso-10646-hex>F6 23F 20 12067A2D</iso-10646-hex>

Further "<iso-10646-xxx>" commands may be defined for other encodings.
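
As a rough illustration of the <iso-10646-hex> rules above, the following
Python sketch decodes the contents of such a block into a list of codepoint
values.  The function name is hypothetical.  Two behaviours are assumptions
that the text does not settle: a richtext command ends any codepoint that is
still being collected, and a run of more than 8 hex digits is rejected.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # accepts both upper and lower case


def decode_hex_block(text):
    """Return the codepoints (as integers) found in a hex block's contents."""
    codepoints = []
    digits = ""
    in_command = False

    def flush():
        nonlocal digits
        if digits:
            codepoints.append(int(digits, 16))
            digits = ""

    for ch in text:
        if in_command:
            # Skip everything up to the end of the richtext command.
            if ch == ">":
                in_command = False
        elif ch == "<":
            # Assumption: a command also terminates a pending codepoint.
            flush()
            in_command = True
        elif ch in HEX_DIGITS:
            digits += ch
            if len(digits) > 8:
                # Assumption: more than 8 hex digits is treated as an error.
                raise ValueError("codepoint has more than 8 hex digits")
        elif ch.isspace():
            flush()  # white space separates codepoints
        # Any other character is ignored and does not end the codepoint.
    flush()
    return codepoints
```

Applied to the example above, decode_hex_block("F6 23F 20 12067A2D") gives
the four values 0xF6, 0x23F, 0x20 and 0x12067A2D.  Integers are returned
rather than characters because UCS-4 values such as 12067A2D are larger than
Python's chr() accepts.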

Indicating language information:

<iso-639-xx>
<lang-xx>

        Indicates a block in the language identified by the 2-letter ISO-639
        code "xx".  For example:

        <iso-10646-hex>aaaa<iso-639-ja>bbbb</iso-639-ja>cccc</iso-10646-hex>

        In this example, "aaaa" and "cccc" are rendered in whatever default
        ISO-10646 conventions the user desires, while "bbbb" should be
        rendered with Japanese-oriented ISO-10646 conventions, i.e. unified
        Han characters are rendered using Japanese glyph conventions.  If a
        suitable alternative font is not available, no change occurs and
        it is up to the human reader to make the distinction.

        Use of <iso-639-xx> commands is not required, but it is recommended
        that richtext body parts be enclosed in appropriate <iso-639-xx>
        commands if unified characters are used.

        Multiple languages may be in effect at the same time.  For example,
        "aaaa<iso-639-ru>bbbb<iso-639-ja>cccc</iso-639-ja></iso-639-ru>".
        Here, "cccc" uses both Russian and Japanese glyph conventions on
        their respective ISO-10646 codepoint ranges.
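
The nesting behaviour described above can be sketched with a simple stack of
active languages.  The Python below is only an illustration: the function
name is hypothetical, it handles only the <iso-639-xx> and <lang-xx>
commands (any other richtext commands are left in the text untouched), and
it assumes that a closing command cancels the innermost matching open
command.

```python
import re

# Matches <iso-639-xx>, <lang-xx> and their closing forms, in either case.
LANG_COMMAND = re.compile(r"<(/?)(?:iso-639|lang)-([a-z]{2})>", re.IGNORECASE)


def language_segments(text):
    """Split text into (segment, active_languages) pairs."""
    segments = []
    active = []  # stack of language codes currently in effect
    pos = 0
    for match in LANG_COMMAND.finditer(text):
        if match.start() > pos:
            segments.append((text[pos:match.start()], tuple(active)))
        closing, code = match.group(1), match.group(2).lower()
        if closing:
            # Assumption: close the innermost open command for this language.
            for i in range(len(active) - 1, -1, -1):
                if active[i] == code:
                    del active[i]
                    break
        else:
            active.append(code)
        pos = match.end()
    if pos < len(text):
        segments.append((text[pos:], tuple(active)))
    return segments
```

For the example string above, this yields "aaaa" with no language, "bbbb"
with ("ru",) and "cccc" with ("ru", "ja"), which a renderer could then use
to pick glyph conventions for each codepoint range.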

Miscellaneous:

The Content-Type "charset" parameter is not required for multilingual richtext.
All text sequences that are intended to be something other than US-ASCII
must be enclosed within appropriate <iso-10646-xxx> and <iso-639-xx> commands.
However, the "charset" parameter can serve a useful purpose to indicate those
character sets that are in use in the body of the message.  A "languages"
parameter may also be useful as informative data to indicate the languages
specified by <iso-639-xx> commands in the body.  This needs discussion.

Observations:

Because multi-octet FSS-UTF encodings do not intersect with US-ASCII, a
case could be made that the default character encoding for richtext is always
FSS-UTF, thus making ISO-10646 the "native" character set of richtext with
<iso-10646-hex> available to provide 7-bit representations of short sequences
in otherwise mostly US-ASCII body parts.  This needs discussion.
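
The property relied on here can be checked with a short Python sketch.  It
uses Python's built-in "utf-8" codec as a stand-in for FSS-UTF, which is an
assumption about the encoding's octet layout rather than something defined
in this document, and the function name is hypothetical.  Because every
octet of a multi-octet character has the high bit set, scanning the raw
octets for 0x3C finds only genuine '<' characters.

```python
def command_starts(body: bytes) -> list:
    """Return the offsets of every '<' octet (0x3C) in an encoded body."""
    return [i for i, octet in enumerate(body) if octet == 0x3C]


# Sample text: three Han characters, <bold>, four katakana, </bold>.
# Escapes are used so that this listing itself stays 7-bit clean.
sample = "\u65e5\u672c\u8a9e<bold>\u30c6\u30ad\u30b9\u30c8</bold>"
encoded = sample.encode("utf-8")  # stand-in for FSS-UTF (assumption)

# Every octet belonging to a non-ASCII character has the high bit set.
high_bit_only = all(
    octet >= 0x80
    for ch in sample
    if ord(ch) > 0x7F
    for octet in ch.encode("utf-8")
)
```

In this sample command_starts(encoded) reports exactly two offsets, one for
each richtext command, so requirement 1 above ('<' without ambiguity) holds
without any special escaping.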

The minimal richtext implementation in RFC-1341 requires only a very small
change to process ISO-8859-X and ISO-10646 encodings in a way which eliminates
problems arising from characters with the high bit set.  Specifically, the
minimal implementation ignores any character with the high bit set.  In
practice, something smarter
than this will be required, but it certainly keeps the minimal implementation
fairly simple.
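
A minimal sketch of that change, in Python, is a filter applied to the body
after any transport encoding has been removed and before command parsing.
The function name is hypothetical and this is not the RFC-1341 code itself.

```python
def strip_high_bit_octets(body: bytes) -> bytes:
    """Drop every octet with the high bit set, keeping only US-ASCII."""
    return bytes(octet for octet in body if octet < 0x80)
```

For example, strip_high_bit_octets(b"caf\xe9 <bold>ol\xc3\xa9</bold>")
leaves b"caf <bold>ol</bold>": the commands survive intact and the
non-ASCII text is silently lost, which is why something smarter will be
wanted in practice.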
----------------------------------------------------------------------------