How to handle a lot of character set Content-types

1991-05-02 14:55:15
smart(_at_)mel(_dot_)dit(_dot_)csiro(_dot_)au writes:

There is a lot of talk about using universal character sets. I
think this stems from concern about how to handle the large
number of character sets in use today (from 7-bit Swedish to
Japanesified ISO2022). I wonder whether universal character sets
really solve the problem or just move it to a different level
where it might be harder to control.
The theory is that the sender should be allowed to compose the
message he wants the user to see and let the recipient work out how
to display it.

Much as I hate to get into these discussions (mail protocols are a  
bit outside my field).

Unicode, one of the two major competing universal codesets, has a few  
characteristics that make it ideal for text interchange between  
systems that aren't guaranteed to speak anything like the same  

1. Unicode includes the entire repertoires of virtually every major  
standard in use now anywhere in the world, including ASCII, 8859  
series, JIS, Chinese & Korean standards, etc., etc.

2. It's a flat 16 bits, so it avoids all the weirdo decoding problems  
that other schemes might have.

Here's another method of handling a bunch of codesets simultaneously:

1. map your stuff into Unicode by using a mapping table (it's just  
about guaranteed that any codeset you ever heard of is included in  
Unicode)
2. compress it with some easily available compression protocol
3. uuencode it or whatever you do, with an easily available protocol
4. ship it however you ship it (7 or 8-bit protocols, whatever)
Then, as Bob says,
5. let the receiver worry about how to map from Unicode to whatever  
the local jargon is.
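
The five steps above can be sketched in a few lines of Python. Treat this as one possible reading, not a specification: pack and unpack are hypothetical names, Python's built-in codecs play the part of the mapping tables, zlib is an arbitrary choice of compressor, and UTF-16 big-endian stands in for the flat 16-bit form. Step 4 (shipping) is left out because any 7-bit channel will do.

```python
import binascii
import zlib


def pack(raw: bytes, source_charset: str) -> bytes:
    """Sender side: legacy bytes in, 7-bit-safe uuencoded lines out."""
    text = raw.decode(source_charset)      # step 1: map into Unicode
    flat = text.encode("utf-16-be")        # flat 16-bit code units
    squeezed = zlib.compress(flat)         # step 2: compress
    # step 3: uuencode, at most 45 input bytes per output line
    lines = [binascii.b2a_uu(squeezed[i:i + 45])
             for i in range(0, len(squeezed), 45)]
    return b"".join(lines)


def unpack(armored: bytes, local_charset: str) -> bytes:
    """Receiver side (step 5): map from Unicode to the local codeset."""
    squeezed = b"".join(binascii.a2b_uu(line)
                        for line in armored.splitlines())
    text = zlib.decompress(squeezed).decode("utf-16-be")
    # Characters the local codeset lacks are replaced with "?".
    return text.encode(local_charset, errors="replace")


# A Shift JIS sender and a receiver that happens to prefer UTF-8.
wire = pack("日本語".encode("shift_jis"), "shift_jis")
print(unpack(wire, "utf-8").decode("utf-8"))
```

Note that the sender never needs to know what the receiver's codeset is, and the receiver never needs to know what the sender started from; only the Unicode form in the middle is shared.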

I find it distressing that there's so much intense discussion of how  
to deal with zillions of these codesets simultaneously using all  
kinds of baroque header information and scheming when a wonderful  
answer is lying around waiting to be picked up.  Just map your codes  
into something, like Unicode, that already includes the repertoire of  
any other standard you'd possibly want to use.  This is one of the  
major wonderful features of Unicode: it includes more existing  
standards than the competing universal set.  All you need to  
do is define the wrapper to put around text that's encoded that way.