Markus Kuhn writes:
Then why use the very restricted Unicode BOMs, which can only signal the
various Unicode encodings, but nothing else? ISO 2022 provides ESC
sequences that you can place at the start of a file to signal EVERY
encoding in the ECMA registry. Several hundred different ASCII
extensions have registered ISO 2022 codes to announce them. If you want
to have a stateful encoding with all its ugliness, then better say so by
admitting that what you really want is ISO 2022. ISO 2022 is in no way
worse than BOMs. It has exactly the same problems. A "grep hello *.txt"
still won't work with either BOMs or ISO 2022 unless grep is converted
into a very different piece of software.
Don't misunderstand me: I don't recommend the use of ISO 2022. All I say
is that ISO 2022 is a much better mechanism than BOMs for declaring the
encoding of the following text.
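To make the grep point above concrete, here is a minimal sketch in Python. The function names (sniff_bom, naive_grep, bom_aware_grep) are made up for illustration and are not from any real grep; the "ascii" fallback is likewise just an assumption. It only handles Unicode BOMs, not ISO 2022 announcers, but the shape of the problem is the same: a byte-oriented search has to become signature-aware before it can find anything.

```python
import codecs

# BOM byte signatures, longest first, so that the UTF-32 LE BOM
# (FF FE 00 00) is not mistaken for the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, payload) if data starts with a BOM, else (None, data)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


def naive_grep(pattern: bytes, data: bytes) -> bool:
    """What a byte-oriented grep effectively does: a raw substring search."""
    return pattern in data


def bom_aware_grep(pattern: str, data: bytes, default: str = "ascii") -> bool:
    """Decode according to the BOM (or an assumed default), then search text."""
    encoding, payload = sniff_bom(data)
    text = payload.decode(encoding or default, errors="replace")
    return pattern in text
```

For a file holding "hello world" as UTF-16 LE with a BOM, naive_grep(b"hello", ...) finds nothing because every letter is followed by a zero byte, while bom_aware_grep("hello", ...) matches. A file with no signature at all still leaves the tool guessing, which is where the assumed default comes in.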
The basic problem is that we are trying to convey several layers
and types of information in one flat stream of bytes.
Yes, the UNIX/POSIX/C model of a flat stream of bytes is beautiful,
and not only that, a *very* useful paradigm (*) indeed that simplifies
many tasks and architectures enormously, for computers and programmers.
Unfortunately the real world is not simple or smooth; it is clumpy,
and the clumps come in different colors and sizes (= different encodings
and data types, in case I carried my metaphor too far). If we are to
handle clumpiness well, we must accept them all. All legacy data of
the world is not going to be magically converted into Unicode by fiat.
If the character set encoding (for text) or the media type (for other
types of media) were stored in a separate fork/attribute of files,
we would be much closer to a solution. That's a mighty big "if".
(But of course even that wouldn't be enough: real documents consist
of several, not just concatenated, but hierarchical pieces,
each having their own encodings/types).
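A rough sketch of what "hierarchical pieces, each with its own encoding/type" could look like once the labels live outside the byte stream. Everything here (the Part class, its fields, extract_text) is hypothetical and invented for illustration; it is not an existing file format or API, just the idea in Python.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Part:
    """One piece of a document, labelled out-of-band (hypothetical structure)."""
    media_type: str                    # e.g. "text/plain", "image/png"
    charset: Optional[str] = None      # only meaningful for text parts
    payload: bytes = b""
    children: List["Part"] = field(default_factory=list)


def extract_text(part: Part) -> List[str]:
    """Walk the tree depth-first, decoding each text part with its own charset."""
    texts = []
    if part.media_type.startswith("text/") and part.charset:
        texts.append(part.payload.decode(part.charset))
    for child in part.children:
        texts.extend(extract_text(child))
    return texts


# A document whose pieces use different encodings and types, nested two deep.
doc = Part("multipart/mixed", children=[
    Part("text/plain", "iso-8859-1", "na\u00efve".encode("iso-8859-1")),
    Part("multipart/mixed", children=[
        Part("text/plain", "utf-16-le", "hello".encode("utf-16-le")),
        Part("image/png", None, b"\x89PNG"),
    ]),
])
```

Calling extract_text(doc) decodes the two text parts with their own charsets and skips the image. No in-band signature is needed, but only because every piece carries its label separately.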
(*) That fills my quota of using the p-word for today.
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen