Re: 16/32-bit charsets and MIME-encoding

Erik van der Poel some time ago defined a Base64-like encoding for
16-bit codes.  Something along those lines is what is needed, though
the encoding should probably have several escape codes to reduce
overhead.  With 6 bits per encoding character we could encode as
follows (x means a Base64 character carrying 6 bits):
  =xx           (encode 12 bits)
  ?xxx          (encode 18 bits)


I made a similar proposal, actually.  On the Unicode@Sun.COM list I
proposed the following structure:

    Characters                        Codes            No. of bytes

    <e1>                              <e1><e1>         2
    <e2>                              <e2><e2>         2
    <e3>                              <e3><e3>         2
    All other ASCIIs                  <a>              1
    Latin-1 letters                   <e1><b>          2
    All other Unicodes up to 0xfff    <e2><b><b>       3
    All other Unicodes                <e3><b><b><b>    4

Note that there are exactly 64 LETTERS in Latin-1, so these can be
encoded in 6 bits (i.e. one Base64 character).

From a political point of view, it may be better to cover all of
10646, though (not just the 16-bit Unicode part).
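
The table above can be sketched in Python roughly as follows.  Several
things here are assumptions, not part of the proposal: the escape
characters "=", "?" and "_" standing in for <e1>, <e2> and <e3>, the
standard Base64 alphabet for <b>, and the mapping of the 64 Latin-1
"letters" to the code points 0xC0..0xFF (a range that also contains
the multiplication and division signs).  It only covers 16-bit codes.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
E1, E2, E3 = "=", "?", "_"          # hypothetical escape characters
WIDTH = {E1: 1, E2: 2, E3: 3}       # Base64 digits following each escape

def b64_digits(value: int, count: int) -> str:
    """Write value as `count` Base64 digits, most significant first."""
    return "".join(B64[(value >> s) & 0x3F]
                   for s in range(6 * (count - 1), -1, -6))

def encode(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in WIDTH:                      # escape character: doubled
            out.append(ch * 2)
        elif cp < 0x80:                      # all other ASCII: as is
            out.append(ch)
        elif 0xC0 <= cp <= 0xFF:             # assumed Latin-1 letter range
            out.append(E1 + b64_digits(cp - 0xC0, 1))
        elif cp <= 0xFFF:
            out.append(E2 + b64_digits(cp, 2))
        elif cp <= 0xFFFF:
            out.append(E3 + b64_digits(cp, 3))
        else:
            raise ValueError("only 16-bit codes are covered in this sketch")
    return "".join(out)

def decode(data: str) -> str:
    out, i = [], 0
    while i < len(data):
        ch = data[i]
        if ch not in WIDTH:                  # plain ASCII
            out.append(ch)
            i += 1
        elif data[i + 1:i + 2] == ch:        # doubled escape character
            out.append(ch)
            i += 2
        else:
            n = WIDTH[ch]
            digits = data[i + 1:i + 1 + n]
            if len(digits) != n:
                raise ValueError("truncated escape sequence")
            value = 0
            for d in digits:
                value = value * 64 + B64.index(d)   # ValueError if not Base64
            out.append(chr(value + 0xC0 if ch == E1 else value))
            i += 1 + n
    return "".join(out)

sample = "caf\u00e9 = \u03b1\u3042?"
assert decode(encode(sample)) == sample
```

Because none of the three assumed escape characters occurs in the
Base64 alphabet, a doubled escape can never be mistaken for an escape
followed by a Base64 digit.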

For declaring ISO 10646 in MIME we could use UCS as the character code
name.  But we may also have to include a "level", as Unicode wants to
use what will probably be level 3 in the IS, which allows any
combination of combining characters.  It would be better if we could
restrict ourselves to allowing combining characters only when no code
exists for the combined character (probably level 2 in the IS), as
this would simplify UCS <-> local coding conversion.


People probably have differing opinions about this.  Combining
character purists would probably want even the Latin-1 precomposed
characters to be represented as 2 characters each (i.e. base and
combining).

On the other hand, some may want a compact representation for email.


Erik