
Re: How to handle a lot of character set Content-types

1991-05-02 18:20:54
Tim Kehres @ writes:

I have heard that in order to properly implement Unicode, all
implementations need to be able to always recognize and render all
incorporated character sets.

This is Hilarious!  And definitely 100% wrong.  It is so obviously  
tripe cooked up by the competition to make Unicode look unattractive!   
Unicode requires NO SUCH THING!  DON'T YOU BELIEVE IT for a minute!!   
Let's stamp out this horrifying rumor once and for all!

I'm relatively intimate with Unicode and feel reasonably qualified to  
comment on this, so I'll take a crack at explaining...

Formal conformance to Unicode means that when you are supposed to be  
passing things through, you agree to pass through codes that you
don't understand without damaging them.  There is a BIG difference  
between agreeing to pass things along uninjured, and claiming to be  
able to actually "Interpret" them in any way.  ("Interpretation" of a  
character means that the system or application understands the  
character well enough to display it, sort it, or otherwise operate  
upon it with the character's intended semantics.) A Unicode  
implementation must be able to pass 16-bit codes through unharmed.
Whether or not it can Interpret those codes is a completely separate  
issue.  Wow!  Why would anyone FORCE all implementations to carry  
around all of the baggage for displaying all possible characters?    
Some implementations may be able to interpret only Latin 1, or ASCII,  
or even the single letter "a"!  BUT, if they can pass through 16-bit codes
unharmed, and promise never to display GARBAGE when they don't  
understand a character, voila! they're conformant.  If an  
implementation doesn't understand a particular character, as may  
sometimes be the case, it can do any number of things, such as print  
a little box or ring the bell.  It's just prohibited from spitting  
out random garbage AS IF it could interpret the characters.
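
To make the distinction concrete, here is a minimal Python sketch of an
implementation that interprets only Latin 1 but still passes every
16-bit code through unharmed.  The function names (can_interpret,
pass_through, display) and the choice of U+25A1 as the "little box"
are assumptions made for illustration; nothing here is taken from the
Unicode text itself.

```python
# Hypothetical sketch: all names and the placeholder choice are assumptions.
PLACEHOLDER = "\u25a1"  # a "little box" shown for codes we cannot interpret


def can_interpret(code: int) -> bool:
    """This toy implementation understands only printable Latin 1."""
    return 0x20 <= code <= 0x7E or 0xA0 <= code <= 0xFF


def pass_through(codes: list[int]) -> list[int]:
    """Forward every 16-bit code unchanged, whether understood or not."""
    out = []
    for code in codes:
        if not 0 <= code <= 0xFFFF:
            raise ValueError(f"not a 16-bit code: {code!r}")
        out.append(code)  # never altered, never dropped
    return out


def display(codes: list[int]) -> str:
    """Show understood codes as characters and everything else as a box."""
    return "".join(chr(c) if can_interpret(c) else PLACEHOLDER for c in codes)


codes = [0x0061, 0x0995, 0x0062]  # "a", a Bengali letter, "b"
print(display(codes))             # the middle code appears as a box
print(pass_through(codes) == codes)
```

The point of keeping pass_through and display separate is that the
first never looks at what a code means, while the second admits what
it does not know instead of guessing.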

So let's say I pass you a plaintext Unicode file that contains a  
bunch of Bengali.  You bring it up on your Unicode system, which has  
fonts and facilities for only Latin 1 and Japanese.  What do you see?   
You might see little boxes wherever I have written a Bengali
character.  The system does not PRETEND that it knows what it's  
doing.  Hence, you DO NOT SEE Japanese or Latin 1, you see your  
system's unmistakable signal that it has encountered character codes  
which it is unable to interpret for you.  Sorry.  You could fix the  
problem maybe by purchasing a Bengali font or something, but hey, if  
someone sends you Bengali once in a blue moon, and you can't read it  
anyway, why should you pay good money for a Bengali font?
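
The Bengali scenario above can be sketched roughly as follows, in
Python.  The code ranges in SUPPORTED_RANGES are assumptions taken
from present-day Unicode code charts, and has_glyph, show and BOX are
hypothetical names chosen for this example only.

```python
# Hypothetical sketch of a system with fonts for Latin 1 and Japanese only.
# The ranges are assumed from current Unicode code charts.
SUPPORTED_RANGES = [
    (0x0020, 0x007E),  # printable ASCII
    (0x00A0, 0x00FF),  # the rest of Latin 1
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK ideographs (used for kanji)
]
BOX = "\u25a1"  # the system's signal for "I cannot interpret this code"


def has_glyph(ch: str) -> bool:
    """True if this system has a font covering the character."""
    return any(lo <= ord(ch) <= hi for lo, hi in SUPPORTED_RANGES)


def show(text: str) -> str:
    """Render text, replacing every uninterpretable character with a box."""
    return "".join(ch if has_glyph(ch) else BOX for ch in text)


# "Hi: " followed by five Bengali code points
message = "Hi: \u09ac\u09be\u0982\u09b2\u09be"
print(show(message))
```

The stored message is left untouched; only its rendering changes, so a
system that does own a Bengali font can still read the same file.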