How to handle a lot of character set Content-types (long)


There is a lot of talk about using universal character sets. I think
this stems from concern about how to handle the large number of
character sets in use today (from 7-bit Swedish to Japanesified ISO2022).
I wonder whether universal character sets really solve the problem
or just move it to a different level where it might be harder to
control. Anyway, the following is a proposal for how to conveniently
handle a large and increasing number of character sets, including
universal ones.

I don't want to pollute this list with too much discussion of how
to write an rfc-xxxx UA. I am sure someone will start a mailing
list for that later on. However, I felt I should put in enough to
be convincing. Well, enough to convince me anyway.

Names and the DNS
-----------------

Smart's Law says that whenever you define a flat name space you later
wish you'd made it tree structured.

I will argue that Content-type should be a tree-structured name
embedded in the DNS, and so should a supporting structure of fonts.
This won't actually change anything for things defined centrally
because a Content-type without a "." will have ".content-type.arpa" 
added implicitly. It does mean that user-defined Content-types are
distinguished from standard ones by not ending in ".content-type.arpa"
(explicitly or implicitly) rather than by starting with "x-".
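
The naming rule is small enough to sketch. This is only an illustration,
not part of the proposal: the function names are made up here, and folding
names to lower case is an assumption borrowed from the DNS being
case-insensitive.

```python
STANDARD_SUFFIX = "content-type.arpa"


def qualify(content_type: str) -> str:
    """Return the fully qualified name for a Content-type.

    A name with no "." gets ".content-type.arpa" added implicitly.
    Lower-casing is an assumption, not something the proposal states.
    """
    name = content_type.strip().rstrip(".").lower()
    if "." not in name:
        name = f"{name}.{STANDARD_SUFFIX}"
    return name


def is_standard(content_type: str) -> bool:
    """True if the name ends in .content-type.arpa, explicitly or implicitly."""
    return qualify(content_type).endswith("." + STANDARD_SUFFIX)
```

So qualify("iso2022") gives "iso2022.content-type.arpa", while a name
such as "mayan.content-type.umex.mx" is left alone and is_standard()
reports it as user-defined.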

The following use of the DNS uses standard RRs in the Hesiod style.
My only reference for how that works is the Ultrix 4.x bind/hesiod
manual. I'm sure the proposal would look neater with new RRs designed
for the job, but that might be hard to arrange.

This document is not meant to suggest that UAs would often talk to
the DNS for this information. Normally it would be cached in local
files designed to be efficiently accessed by the UA. The design of
the system is that existing information will always work: the only
time the UA needs to go back to the DNS is when it meets an unknown
Content-type or an unknown font or an unknown character within a font.
UAs which don't have access to the DNS would have to get their local
tables pre-loaded with everything they think they need -- this is
not a subject for IETF standardization but it doesn't hurt to
remember such people.

On an earlier suggested application of the DNS I was told that we
should remember that the last addition to the DNS, WKS records, was
not a success. This proposal is very different: it uses the DNS to
distribute information from the center, whereas WKS expected the
edges to provide information about themselves.

Transformations should be done in the Recipient's UA
----------------------------------------------------

A particular case we are all concerned about is where the sender's
hardware/software interface is inconsistent with the recipient's
hardware/software so that the recipient can't see the message the way
the sender wants. If the sender knows about the recipient's limitations
then he can perhaps adjust his message (e.g. restrict it to ASCII).
But finding out about this is not in general possible (and I'm as
sceptical as anyone about handling that problem with X.500). If the
message is going to be converted in some way to make it displayable
to the recipient then the change should take place in the recipient's
UA. [This argument has nothing to do with the argument for not changing
Transport Encoding on the way -- that argument is altogether more
theoretical and ethereal.] In fact it may often be better for the
message to be modified by the recipient's UA under his control than for
the sender to try to hack his message into mangled ASCII. Consider:

Suppose the sender wants to send a c-cedilla. Suppose the recipient
can't display a c-cedilla on his terminal (he understands French but is
visiting the US, say). We all agree that a nice way to get around this might
be for the thing the recipient sees to be \,c (which I think is the way
you'd write it in TeX).

The recipient will know the behaviour of his own UA. So if the conversion
of c-cedilla to \,c is done in the recipient's UA, the recipient will see
\,c but in his mind's eye he'll see the c-cedilla -- and mind to mind
communication is the ultimate objective. He knows that if the sender
had wanted him to actually SEE the 3 characters \,c his UA would have shown
\\,c. [And it is possible that the sender might want him to see \,c if
they were discussing how to write a UA :-)].

If the sender's UA (or MTA or anything else on the way) turns the c-cedilla
into \,c then the recipient might wonder "what am I meant to see here".
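
Here is a rough sketch of what the recipient's UA might do. The fallback
table, the render() name and the "?" for characters with no fallback are
all assumptions made up for illustration; the only things taken from the
text above are the \,c substitute and the doubling of a literal backslash.

```python
# Hypothetical fallback table: undisplayable character -> readable substitute.
FALLBACKS = {"\u00e7": "\\,c"}  # c-cedilla shown as \,c


def render(text, displayable):
    """Render text for a terminal that can only show some characters.

    displayable(ch) says whether the terminal can show ch. A literal
    backslash is doubled so the reader can tell a real "\\,c" from a
    substituted c-cedilla.
    """
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")      # literal backslash is escaped
        elif displayable(ch):
            out.append(ch)
        else:
            out.append(FALLBACKS.get(ch, "?"))  # "?" is an assumption
    return "".join(out)


def ascii_only(ch):
    """A terminal that can only show 7-bit ASCII."""
    return ord(ch) < 128
```

With this, a message containing a c-cedilla comes out with \,c in its
place, while a message that really contains the three characters \,c
comes out as \\,c.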

So the theory is that the sender should be allowed to compose the message
he wants the recipient to see and let the recipient's UA work out how to
display
it. He should use his knowledge of the human at the other end to judge
what he wants the message to look like, but not try to guess the
capabilities of the recipient's UA. For the future we will want the
recipient to have a UA that will understand any character-set style 
Content-type that gets thrown at him. This is about how to achieve that.

Character Sets
--------------

In this document ISO2022 will be called a Character set. Perhaps it
is really an encoding allowing the representation of multiple
character sets, but the word encoding is getting overworked. So I
am going to do a Humpty Dumpty [promise to pay it extra].

A font in my terms is altogether simpler than a Character Set. A font
is exactly the same as in TeX: a mapping from 0-255 to a set of
glyphs. Some octets may be undefined in a given font.

Now I am going to assert/guess that all Character sets are very
simple. Each Character set has a small set of states. A message always
starts in state 0 and in a font which is the default font for that
Character set. Each successive character then has a limited effect,
which will vary (perhaps considerably) depending on the state. The
possible effects are:

        1. print the glyph at the octet's position in the current font.

        2. change the font.

        3. change the state.

Examples:

i)  Unicode consists of 2-byte characters. To handle this we define
    256 fonts -- unicode-0 to unicode-255. In state 0, an octet X does:
    (1) change state to state 1; (2) change font to unicode-X. In state 1
    each octet does: (1) print; (2) change state to state 0.

ii) Here's all I know about ISO2022

        ESC ( B         introduces ASCII
        ESC ( G         introduces a Swedish set
        ESC $ B         introduces 2-byte Japanese Kanji

    ESC goes to state 1. In state 1 "(" goes to state 2, "$" to state 3.
    In state 2: "B" goes to font ASCII; "G" goes to font Swedish (and
    for both of these we go back to state 0). In state 3 a "B" puts us
    into state 4 and we go into a 2-byte waltz between states 4 and 5
    similar to unicode.
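
The ISO2022 fragment above can be sketched as a little state machine.
This is a hand-coded illustration only. Two things are my assumptions
and not in the description: an ESC met in state 4 goes to state 1 (so
there is a way out of the 2-byte mode), and the 2-byte fonts are named
kanji-X in the same style as unicode-X.

```python
ESC = 0x1B


def step(state, font, octet):
    """Process one octet; return (new_state, new_font, printed).

    printed is a (font, octet) pair when the octet prints, else None.
    """
    ch = chr(octet)
    if state in (0, 4) and octet == ESC:   # ESC from state 4: assumption
        return 1, font, None
    if state == 0:
        return 0, font, (font, octet)      # plain print
    if state == 1 and ch == "(":
        return 2, font, None
    if state == 1 and ch == "$":
        return 3, font, None
    if state == 2 and ch == "B":
        return 0, "mailascii", None
    if state == 2 and ch == "G":
        return 0, "swedish7bit", None
    if state == 3 and ch == "B":
        return 4, font, None
    if state == 4:                         # first byte selects the font
        return 5, f"kanji-{octet}", None   # kanji-X names are hypothetical
    if state == 5:                         # second byte prints
        return 4, font, (font, octet)
    raise ValueError(f"octet {octet} undefined in state {state}")


def decode(data, default_font="mailascii"):
    """Run the machine from state 0 and collect the printed pairs."""
    state, font, out = 0, default_font, []
    for octet in data:
        state, font, printed = step(state, font, octet)
        if printed is not None:
            out.append(printed)
    return out
```

Feeding it b"a\x1b(Gb\x1b(Bc" prints "a" from mailascii, "b" from
swedish7bit and "c" from mailascii again.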

Proposed DNS Structure
----------------------

There are various bits of information we could store about fonts: outlines,
METAFONT definitions, ... The initial proposal is to store various bitmaps.
The assumption is that characters are placed next to each other -- if you
want space between them put it in the bitmap. Here's an "o" in a 5 by 5 
bitmap:

        $origin 5x5.mailascii.font.arpa.
        111             IN        TXT   "-----/-XX--/X--X-/X--X-/-XX--"

"X"s represent 1s, "-"s represent 0s, "/" goes to the next line of the
bitmap. Perhaps "IN" should be "HS"?
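
Decoding such a TXT string is straightforward. A minimal sketch follows;
the parse_bitmap name and the insistence that every row have the same
width are my assumptions.

```python
def parse_bitmap(txt):
    """Turn "-----/-XX--/..." into a list of rows of 0s and 1s."""
    rows = txt.split("/")
    width = len(rows[0])
    bitmap = []
    for row in rows:
        # Equal-width rows are an assumption; only "X" and "-" are allowed.
        if len(row) != width or set(row) - {"X", "-"}:
            raise ValueError(f"bad bitmap row: {row!r}")
        bitmap.append([1 if c == "X" else 0 for c in row])
    return bitmap


# The "o" from the 5x5.mailascii.font.arpa example.
O_BITMAP = parse_bitmap("-----/-XX--/X--X-/X--X-/-XX--")
```

O_BITMAP is then five rows of five bits each, ready to be placed next to
its neighbours on the display.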

Another entry would be "base":

        $origin swedish7bit.font.arpa.
        base            IN      PTR     mailascii.font.arpa.

Says: for undefined characters in swedish7bit look in mailascii. This
is useful when one character set is a slight modification of another.
Too many levels of this could lead to inefficiency.
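
A sketch of how a UA might follow "base" pointers, using an in-memory
table in place of the DNS. The table layout, the find_glyph name, the
depth limit and the placeholder glyph for octet 91 are all invented for
illustration; only the "o" bitmap and the swedish7bit-to-mailascii link
come from the examples above.

```python
# Stand-in for cached font data: font name -> base font and known glyphs.
FONTS = {
    "mailascii": {
        "base": None,
        "glyphs": {111: "-----/-XX--/X--X-/X--X-/-XX--"},
    },
    "swedish7bit": {
        "base": "mailascii",
        "glyphs": {91: "<placeholder, not a real bitmap>"},
    },
}


def find_glyph(font, octet, fonts, max_depth=4):
    """Look for octet in font, then in its base, and so on.

    Returns None if no font in the chain defines the octet. max_depth
    guards against long or looping chains (the limit is an assumption).
    """
    depth = 0
    while font is not None:
        if depth > max_depth:
            raise RuntimeError("base chain too deep")
        entry = fonts[font]
        if octet in entry["glyphs"]:
            return entry["glyphs"][octet]
        font = entry["base"]
        depth += 1
    return None
```

Looking up octet 111 in swedish7bit falls through to the mailascii "o",
while octet 91 is found in swedish7bit itself.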

A Content-type which is a character set has a default font, and a
"program" for each octet/state combination.

        $origin iso2022.content-type.arpa.
        default-font    IN      PTR     mailascii.font.arpa.
        71.2            IN      TXT     "s=0; f=swedish7bit"
        73.0            IN      TXT     "p"

The 71.2 entry says: octet 71 ("G") in state 2 should set the state to 0
and the font to swedish7bit. The 73.0 entry says "in state 0 octet 73
just does a print". Sometimes we're lucky:

        $origin 1.unicode.content-type.arpa.
        *               IN      TXT     "p; s=0"

In unicode every octet in state 1 does a print and then sets the state
to 0. State 0 has to be enumerated exhaustively.

        $origin 0.unicode.content-type.arpa.
        0               IN      TXT     "s=1; f=unicode-0"
        1               IN      TXT     "s=1; f=unicode-1"
        ...

There probably should be alternatives to "p": say "t" for tab behaviour
and "l" for new page behaviour. We don't have to add ".font.arpa" to
the fonts because that is the default for font names with no ".".
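
To show that these little programs are enough to drive a UA, here is a
sketch of an interpreter, run on the unicode entries. The dictionary
keyed by (octet, state), the function names and the choice of unicode-0
as the starting font are assumptions for illustration; the program
syntax ("p", "t", "l", "s=N", "f=NAME" separated by ";") is as above.

```python
def parse_program(txt):
    """Split "p; s=0" into [("p", None), ("s", "0")]."""
    actions = []
    for part in txt.split(";"):
        part = part.strip()
        if part:
            op, _, arg = part.partition("=")
            actions.append((op.strip(), arg.strip() or None))
    return actions


def run(data, table, default_font):
    """Interpret octets using table[(octet, state)] -> program text.

    A ("*", state) key plays the part of the DNS wildcard entry.
    Returns a list of (action, font, octet) for p, t and l actions.
    """
    state, font, out = 0, default_font, []
    for octet in data:
        program = table.get((octet, state)) or table.get(("*", state))
        if program is None:
            raise KeyError(f"octet {octet} undefined in state {state}")
        for op, arg in parse_program(program):
            if op == "s":
                state = int(arg)
            elif op == "f":
                font = arg
            elif op in ("p", "t", "l"):
                out.append((op, font, octet))
            else:
                raise ValueError(f"unknown action {op!r}")
    return out


# State 0 enumerated exhaustively, state 1 covered by the wildcard.
UNICODE = {(x, 0): f"s=1; f=unicode-{x}" for x in range(256)}
UNICODE[("*", 1)] = "p; s=0"
```

Running run(b"\x00A", UNICODE, "unicode-0") prints octet 65 from font
unicode-0 and leaves the machine back in state 0.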

Most Content-types just refer to the appropriate font. They don't
have to say which characters are undefined since that follows from
which positions are undefined in the font.

        $origin mailascii.content-type.arpa.
        default-font    IN      PTR     mailascii.font.arpa.
        $origin 0.mailascii.content-type.arpa.
        9               IN      TXT     "t"
        12              IN      TXT     "l"
        *               IN      TXT     "p"
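
For what it's worth, here is how a UA might turn an octet and state into
a name to look up, with the zone above flattened into a dictionary. The
ZONE dictionary and the program_for name are made up for illustration. A
real name server would apply the "*" entry itself; the explicit fallback
here just imitates that for a local table.

```python
# The mailascii entries above, written out as fully qualified names.
ZONE = {
    "default-font.mailascii.content-type.arpa": "mailascii.font.arpa",
    "9.0.mailascii.content-type.arpa": "t",
    "12.0.mailascii.content-type.arpa": "l",
    "*.0.mailascii.content-type.arpa": "p",
}


def program_for(zone, content_type, state, octet):
    """Return the program text for (octet, state), or None if undefined.

    Names have the form octet.state.content-type, with "*" as wildcard.
    """
    exact = f"{octet}.{state}.{content_type}"
    if exact in zone:
        return zone[exact]
    return zone.get(f"*.{state}.{content_type}")
```

So octet 9 in state 0 gives "t", octet 12 gives "l", and anything else
in state 0 falls through to the wildcard "p".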

Programming the General UA
--------------------------

When processing a new Content-type the UA can simply acquire each
character/state pair as it needs them. Alternatively it can try to
be clever and do zone transfers. Either way it should save positive
information it finds in a local database optimized for the UA. It
shouldn't remember negative information (it should be very rare
anyway to get a message with an undefined octet) since new octets
can be added to content-types and fonts, and new fonts can be 
added. Because of this behaviour by UAs, if a glyph or a program
for an octet needs to be changed you have to create a new
Content-type or font.
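
The caching rule can be sketched in a few lines. The class name and the
resolver callback are assumptions, and a plain dictionary stands in for
the local database optimized for the UA.

```python
class PositiveCache:
    """Remember answers that were found; never remember "not found"."""

    def __init__(self, resolver):
        # resolver(name) returns the record text, or None if undefined.
        self._resolver = resolver
        self._store = {}

    def lookup(self, name):
        if name in self._store:
            return self._store[name]
        answer = self._resolver(name)
        if answer is not None:
            # Only positive information is saved, so an octet or font
            # added later will be found on the next attempt.
            self._store[name] = answer
        return answer
```

A name that was undefined yesterday is simply asked for again today,
while a name that has been seen once is never fetched twice, which is why
an existing entry must not change.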

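And here's a recent question:

        There may also be the complaint that ISO 2022 allows for a
        non-fixed number of character sets, since new sets are
        registered with ISO every now and then.

No problem: a newly registered set just means a new font and a few new
entries under the Content-type, and UAs pick those up the first time
they meet them.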

Flexibility
-----------

Suppose a Mayan scholar at the University of Mexico wants to send
some Mayan hieroglyphics. No problem:

        Content-type: mayan.content-type.umex.mx
        Content-encoding: base64

He can send his message (or body part within a multipart message)
to someone on the other side of the world who can immediately
read it (depending slightly on connectivity to umex's name server).

Readable 7-bit Encodings
------------------------

It is possible to imagine a reversible encoding which is specific
to a Content-type and driven by DNS entries in the same way. This would
enable every Content-type to have a private readable encoding.

One could also imagine an input program for a Content-type which
allowed the user to pick glyphs from a menu and would then generate
the correct octets for the Content-type. This could also be arranged
to be generated on the fly from DNS information.

Conclusion
----------

There are lots of Character sets now. If we can't get rid of them
we can live with them.

Bob Smart