Re: Use of UTF-8 under Perl and Unix

Bram Moolenaar wrote on 1999-11-02 11:53 UTC:

> The rule could be:
> - ASCII only or 8 bit encoding: no BOM
> - non-ASCII characters present: add a BOM


But BOMs get lost far too easily under Unix. How do you want to preserve
them across cut&paste, &c.?

> This would work for Vim.  When reading a file, the BOM can be used to set the
> 'fileencoding' option.  Without the BOM it can use a default encoding, or use
> another method to detect the encoding.  When writing a file, Vim can check if
> non-ASCII characters are present, and prepend a BOM only when needed.


Then why use the very restricted Unicode BOMs, which can only signal the
various Unicode encodings, but nothing else? ISO 2022 provides ESC
sequences that you can place at the start of a file to signal EVERY
encoding in the ECMA registry. Several hundred different ASCII
extensions have registered ISO 2022 codes to announce them. If you want
to have a stateful encoding with all its ugliness, then better say so by
admitting that what you really want is ISO 2022. ISO 2022 is in no way
worse than BOMs. It has exactly the same problems. A "grep hello *.txt"
still won't work with either BOMs or ISO 2022 unless grep is converted
into a very different piece of software.

Don't misunderstand me: I don't recommend the use of ISO 2022. All I say
is that ISO 2022 is a much better mechanism than BOMs for declaring the
encoding of the following text.

If you receive files via transport mechanisms that have no well-defined
text encoding (e.g., tar, ftp, etc.), then we have to do what we always
have done under these circumstances: by manual intervention, we make
sure that the characters are displayed correctly. So far, this manual
intervention came in the form of selecting the right font for xterm; in
the future it will come in the form of selecting the right conversion
table to Unicode, assuming that more and more applications (xterm, perl,
etc.) will prefer to process text directly in Unicode. To some degree
you can do autodetection, but not to the degree where it makes manual
selection unnecessary. UTF-8 can be fairly easily autodetected, by
checking that no malformed UTF-8 sequences are present. The various
ISO 8859-* sets on the other hand cannot be autodetected without
including something close to a full spell-checker. ISO 8859-* files
would need additional marking much more urgently than UTF-8, therefore I
have real problems seeing any practical advantages of using BOMs (an
ugly hack copied from a CJK standard into Unicode to make a committee
happy that couldn't agree about endian issues), which can only mark
Unicode files.
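
The UTF-8 check mentioned above can be sketched in a few lines. This is
only an illustration in Python, not a proposal for any particular tool;
the function names looks_like_utf8 and has_non_ascii are made up for this
example. Note that pure ASCII always passes the check, so the result
only tells you something when non-ASCII bytes are present.

```python
def has_non_ascii(data: bytes) -> bool:
    """True if at least one byte lies outside the 7-bit ASCII range."""
    return any(b >= 0x80 for b in data)


def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes contain no malformed UTF-8 sequence.

    This relies on Python's strict UTF-8 decoder to reject malformed input.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    sample = "Grüße".encode("iso8859-1")
    # Only a non-ASCII file that passes the check is a strong UTF-8 hint.
    print(has_non_ascii(sample), looks_like_utf8(sample))
```

A file with non-ASCII bytes that passes is very likely UTF-8; a file
that fails is in some other encoding, and which one remains for the
user to decide.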

I am perfectly aware that we will live for some time in a mixed
environment. I have myself been using such a mixed environment for a
long time. Many of my files are ISO 8859-15 encoded, many others are
UTF-8 encoded, some are CP437 encoded, others ISO 8859-7. On my window
manager, I have two options to start up xterm: One is in 8-bit mode with
an ISO 8859-1 locale (that is the xterm that I use to work on 8859-15
files), the other is in -u8 mode with a UTF-8 locale (that is what I
use to work on my UTF-8 files). Note: By "locale", I mean *not* just
POSIX locales (LANG), but also all the other environment variables, shell
aliases, and xrdb settings that affect my applications! I have separated
8859-15 and UTF-8 files in different directories and manually take care
of not mixing them up too much. I am looking forward to the day when
emacs and vim can operate in a UTF-8 mode, because then I will convert
most of my files and will more and more rarely use the ISO 8859-15 xterm
environment.

When I unpack tar files, I assume that they are only in ASCII (which
works 99% of the time). Only when I encounter non-ASCII characters do I
start deciding manually what the character encoding most likely is and
then either switch to the appropriate xterm environment or more likely
call iconv or recode to convert the file to a more convenient encoding.
It doesn't matter which environment I use initially, because as long as
the tar file contained only ASCII, I do not notice any difference
between the two environments, and very few tar files I encounter contain
non-ASCII characters, and if they do, it is usually clear from the
context what the most likely encoding is.

People who think that I live in some UTF-8 only dreamworld have severely
misunderstood me, because I practice the peaceful coexistence of ISO
8859-1, ISO 8859-15, ISO 8859-7, CP437/850, and ISO 10646/UTF-8 every
day.

I am not sure what a charset autodetection facility in vim would give
me, because this still doesn't provide autodetection for the many other
tools that I use. Vim is just a tiny part of the entire toolkit. If vim
autodetects UTF-8, this is of no use to me as long as I haven't manually
chosen an appropriate character set for xterm, started xterm in UTF-8
mode, told the applications that read vim's output what the encoding is, &c.

Based on my practical experience with switching between different
encodings every day (and supporting fellow international students with a
huge variety of email character requirements (Greek, Thai, Arabic,
Korean, Chinese, French, German, &c.)), I am extremely sceptical about
the usability of adding automatic character-set detection to the Unix
plain text semantics. I like to control file encoding issues manually
until we get to a state where using only UTF-8 becomes feasible.

As for Perl, my wish list would be:

  - allow the use of UTF-8 as the only internal string encoding
  - make sure that the regex and string manipulation functions can deal with
    UTF-8 strings well
  - provide an easy facility to put highly-configurable converters onto
    every I/O path that Perl supports, including a library of good
    many-to-one conversion tables
  - If you do some autodetection, make sure that the detection of an encoding
    and the activation of automatic conversion are two separate issues
    that are fully under the user's control. For instance, I could imagine
    a number of library functions that

      - Check for malformed UTF-8 sequences
      - Check for various types of BOMs
      - Check for various types of ISO 2022 announcers
      - Cut good example spots out of a long string of unknown encoding,
        convert them to UTF-8 under a list of candidate encodings, and
        present them to the user for manual selection of the most likely
        encoding.

    what to do with the results of these library functions should be
    completely up to the programmer of the application (who can interact with
    the user and has background knowledge on some channels from the
    protocol specification).
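
To illustrate the separation of detection from conversion, here is a
rough sketch in Python (not Perl, and not a proposed API). The names
detect_bom, has_utf8_announcer, and inspect are hypothetical. The
functions only report what they find; nothing is decoded or converted
unless the caller decides to do so. For ISO 2022, only the single
announcer ESC % G (switch to UTF-8) is checked here; a real library
would need the full registry.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
# (UTF-16 LE text beginning with U+0000 is therefore ambiguous.)
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# ISO 2022 "ESC % G": the announcer for switching to UTF-8.
_ESC_UTF8 = b"\x1b%G"


def detect_bom(data: bytes):
    """Return (encoding name, BOM length) or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def has_utf8_announcer(data: bytes) -> bool:
    """True if the data starts with the ISO 2022 ESC % G sequence."""
    return data.startswith(_ESC_UTF8)


def inspect(data: bytes) -> dict:
    """Collect detection results only; the caller decides what to do."""
    name, length = detect_bom(data)
    return {
        "bom": name,
        "bom_length": length,
        "iso2022_utf8_announcer": has_utf8_announcer(data),
    }


if __name__ == "__main__":
    print(inspect(codecs.BOM_UTF8 + b"hello"))
```

The application can combine such a report with whatever it knows from
the protocol or from the user before it activates any converter.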

For instance, if I receive a file that might be either ISO 8859-1 or ISO
8859-15, then there should be a Perl function that cuts out a few
example words of this file that contain characters where ISO 8859-1 and
ISO 8859-15 differ, such that the user can decide based on a display of
these example characters in various decodings what the most likely
encoding was. If the file does not contain any of the characters in
which, say, ISO 8859-1 and ISO 8859-15 differ, then the question whether
the file was encoded in ISO 8859-1 or ISO 8859-15 is obviously
irrelevant for converting it to something else.
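
Such a function might look roughly like the following Python sketch.
The names differing_bytes and example_words are invented for this
illustration, and the approach assumes single-byte encodings (it
compares the two code tables byte by byte). An empty result means the
choice between the two encodings does not matter for this file.

```python
import re


def differing_bytes(enc_a: str, enc_b: str) -> set:
    """Byte values that the two single-byte encodings decode differently."""
    diff = set()
    for value in range(256):
        raw = bytes([value])
        a = raw.decode(enc_a, errors="replace")
        b = raw.decode(enc_b, errors="replace")
        if a != b:
            diff.add(value)
    return diff


def example_words(data: bytes, enc_a="iso8859-1", enc_b="iso8859-15", limit=5):
    """Return up to `limit` (decoded as enc_a, decoded as enc_b) word pairs.

    Only whitespace-separated words containing a byte on which the two
    encodings disagree are returned.
    """
    diff = differing_bytes(enc_a, enc_b)
    examples = []
    for match in re.finditer(rb"\S+", data):
        word = match.group()
        if any(b in diff for b in word):
            examples.append((word.decode(enc_a, errors="replace"),
                             word.decode(enc_b, errors="replace")))
            if len(examples) == limit:
                break
    return examples


if __name__ == "__main__":
    for as_a, as_b in example_words(b"price: 5 \xa4 for caf\xe9"):
        print(as_a, "|", as_b)
```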

I frequently have to do this procedure manually:

  a) make a list of potential candidate encodings

  b) look for a place in the file where the differences between candidate
     encodings do matter

  c) display these places using the various decoding alternatives

  d) select the one that leads to something that looks like correct spelling
     to me
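
Steps a) to c) can be sketched for an arbitrary list of candidates as
follows (Python, with the hypothetical function name
preview_candidates). It looks for the first non-ASCII byte whose
surrounding context decodes differently under the candidates and shows
that context in each decoding; step d) stays with the user. With
multi-byte candidates such as UTF-8, the edges of the context may be cut
in the middle of a sequence and show up as replacement characters.

```python
def preview_candidates(data: bytes, candidates, context: int = 20) -> dict:
    """Return {encoding: decoded context} for the first place that matters.

    Returns an empty dict if all candidates decode every non-ASCII spot
    identically (or if the data is pure ASCII), i.e. the choice among
    the candidates is irrelevant for this data.
    """
    for pos, byte in enumerate(data):
        if byte < 0x80:
            continue  # ASCII decodes the same under all ASCII extensions
        snippet = data[max(0, pos - context):pos + context]
        decoded = {enc: snippet.decode(enc, errors="replace")
                   for enc in candidates}
        if len(set(decoded.values())) > 1:
            return decoded
    return {}


if __name__ == "__main__":
    # Step a): the candidate list is supplied by the user.
    views = preview_candidates(b"stra\xe1e", ["iso8859-1", "cp437", "iso8859-7"])
    # Step c): display the alternatives for manual selection.
    for encoding, text in views.items():
        print(f"{encoding:12} {text}")
```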

It would be very nice if Perl could provide library functions that
assist in the implementation of computer-assisted manual encoding
detection (especially step b)). The world is full of mislabeled MIME
documents and files without or with incorrect character set indication,
and many applications would do well in offering a semiautomatic
character-set guessing facility.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>