RE: Charset mandatory in unix/linux

2006-03-27 04:13:57

[John Cowan has written this reply to Bruce's mail and 
has given me permission to forward it.  Misha]

These typically (and specifically for tar and zip) do not include
media type information or charset or other type parameter information.
The information in the tar format, for example, carries time stamps,
permissions, and file type (where "type" means plain file vs. directory
vs. device, etc.).

ZIP format, however, has the ability to store extended attributes,
though this is normally only used on VMS and OS/2 systems.  See for the

The holy grail of a single unified character set that will supposedly
solve the problem sounds nice until one looks at the details.

I think this line, and the rant which follows, is a serious exaggeration
of the facts.

"Unicode" is itself a "vast array" (ever-increasing in number) of
character code sets.  Saying "Unicode" doesn't tell me if that's
pre-"Korean mess" (see RFC 2279) "Unicode" or post-"Korean mess"

The pre-Korean-mess versions (1.0 and 1.1) do in fact have their own
charsets: Unicode-1-1 and Unicode-1-1-UTF-8.  However, no one has ever
come forward with actual text encoded in anger that contains pre-mess
Korean syllables, so it's almost entirely academic.

Or whether that's the "Unicode" that has among its design principles
a uniform code width of 16 bits and an encoding strictly of text
(specifically excluding musical notation), or the "Unicode" that has a
much wider code width and includes non-textual cruft such as (yes, you
guessed it) musical notation.  Or whether it's one of the "Unicode"s
that has an attempt at encoding language information (versions 3.1
and 3.2), or one of the "Unicode"s (earlier and later) that do not.
And so on.

All this is about one thing and one thing only:  what characters
your implementation can handle.  If you have an old 8-bit or 16-bit
implementation trying to process text involving new characters, it
won't know what to make of them, but it won't be seriously confused.
New implementations can of course handle old text fine using their more
advanced models, because everything is backward compatible.
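A rough Python sketch of that point, assuming UTF-16 as the 16-bit form.
The helper names (utf16_units, is_surrogate, combine) are made up for
illustration and are not from any particular implementation.  A character
beyond the 16-bit range reaches an older consumer as two code units drawn
from a reserved range, which it can recognize as "not a character I know"
and pass through untouched, while a newer consumer combines them:

```python
# Hypothetical illustration: how a character outside the 16-bit range looks
# to a 16-bit-only consumer versus a consumer with the wider model.

def utf16_units(text):
    """Return the text as a list of 16-bit code units (UTF-16, big-endian)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

def is_surrogate(unit):
    """True if the 16-bit unit lies in the reserved surrogate range."""
    return 0xD800 <= unit <= 0xDFFF

def combine(high, low):
    """Combine a high/low surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

clef = "\U0001D11E"          # MUSICAL SYMBOL G CLEF, needs more than 16 bits
units = utf16_units(clef)    # what an old 16-bit implementation receives

# The old implementation sees two units it cannot interpret, but both fall
# in a range reserved for exactly this purpose, so nothing is misread.
old_view_unknown = all(is_surrogate(u) for u in units)

# The new implementation recovers the original code point.
new_view = combine(units[0], units[1])
```

Text that stays within the 16-bit range yields the same numbers under
either model, which is the backward-compatibility half of the argument.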

As for language tagging in plain text, it was demanded by an IETF WG,
was in effect born deprecated, and is formally deprecated as of 4.0 but
of course remains present and always will.

as far as I know, it's not even possible to have multiple versions of
Unicode and to transcode between them on the same machine.

It's not possible because (other than the Korean mess) it's not necessary.

[S]uppose Jacob has received a text file in Korean and the issue of
labeling the charset and language is solved.  If it is labeled as
"ISO-2022-KR", he can proceed to make sense of the file; conversely
if it is labeled as "utf-7" he cannot because he lacks information
to determine whether the result of the transformation to "Unicode"
should be interpreted as groups of 16 bits or some other code width,
as well as which code points represent various hangul characters.

The former point is not relevant: you can interpret it as 16-bit or
21-bit codes without any difference in effect.  The latter point I have
already addressed.
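To make the code-width point concrete, here is a small Python sketch.
The sample string and variable names are my own choices for illustration.
It decodes UTF-7 text containing two hangul syllables and compares the
16-bit view with the full code point view:

```python
# Hypothetical illustration: UTF-7 text decoded and then read both as
# 16-bit code units and as full (up to 21-bit) code points.

text = "\ud55c\uae00"                 # two hangul syllables, U+D55C and U+AE00

utf7_bytes = text.encode("utf-7")     # what would travel labeled as "utf-7"
decoded = utf7_bytes.decode("utf-7")  # the result of the transformation

# View 1: full code points.
code_points = [ord(ch) for ch in decoded]

# View 2: 16-bit code units (UTF-16, big-endian, no byte order mark).
raw = decoded.encode("utf-16-be")
code_units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

# For these characters the two views are the same list of numbers.
same = code_points == code_units
```

Both lists come out identical here, so the choice of code width does not
change which hangul syllables the receiver ends up with.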

John Cowan  cowan(_at_)ccil(_dot_)org
The competent programmer is fully aware of the strictly limited size of his own
skull; therefore he approaches the programming task in full humility, and among
other things he avoids clever tricks like the plague.  --Edsger Dijkstra
