Markus Kuhn writes:
: As for Perl, my wish list would be:
:
: - allow the use of UTF-8 as the only internal string encoding
By default, it'll look like that, but there are some considerations.
First, for efficiency we'll actually still be able to represent ISO-8859-1
in 8 bits internally. Each string will know whether it's in
an 8-bit encoding or in UTF-8. Each interface will know whether it's
supposed to be producing UTF-8, so we can convert transparently as
needed in a lazy fashion. Logically, Perl is processing abstract characters
that can be as large as UTF-8 can represent (and even larger, by your
definition of UTF-8 :-).
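To make that concrete, here's a rough sketch of how the dual
representation might show through (the utf8::is_utf8 name is purely
illustrative; nothing like it is settled yet):

    my $s = "caf\xE9";          # fits in 8 bits, so stored as ISO-8859-1
    print utf8::is_utf8($s) ? "UTF-8\n" : "8-bit\n";    # 8-bit

    $s .= "\x{263A}";           # a character above 255 forces UTF-8
    print utf8::is_utf8($s) ? "UTF-8\n" : "8-bit\n";    # UTF-8
    print length($s), "\n";     # 5 -- still counts abstract characters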
Second, we have to have a way of enforcing the old 8-bit semantics in
code that depends on it, so such code will have to be in the scope of
a "use bytes" declaration.
Third, Perl currently allows ISO-8859-1 (or whatever) in string
constants. There has to be a way to unambiguously tell Perl that the
program is really written in UTF-8. You argue well for the
unlikelihood of legal UTF-8 sequences in ISO-8859-1, but if they're
gonna occur anywhere, it might be in a Perl script, where the characters
in a string constant might not be German but just binary gobbledygook
that someone decided to hardwire in.
Anyway, we're currently thinking that a "use utf8" declaration will
tell Perl to start expecting UTF-8. It's also possible we could
automatically switch to UTF-8 processing if we see UTF-8 sequences, but
that's more problematic, and we haven't thought it through yet.
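Assuming the declaration works as just described, the effect would be
roughly:

    use utf8;                   # this script is itself written in UTF-8

    my $greeting = "grüß dich"; # UTF-8 in the source, read as characters
    print length($greeting), "\n";  # 9 characters, not the 11 bytes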
: - make sure the regex and string manipulation functions can deal with
:   UTF-8 strings well
That we can already do, though currently only under explicit declaration
that the code in the lexical scope is UTF-8 aware. We're moving from that
to having to declare that code is explicitly UTF-8 *unaware*.
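Within such a UTF-8 aware scope, the character-level operations would
simply operate on characters; a sketch:

    use utf8;                           # this scope is UTF-8 aware

    my $word = "cœur";
    print "word\n" if $word =~ /^\w+$/; # œ matches \w as one character
    print length($word), "\n";          # 4 characters, not the 5 bytes
    print scalar reverse($word), "\n";  # "ruœc" -- reversed by character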
: - provide an easy facility to put highly-configurable converters onto
: every I/O path that Perl supports, including a library of good
: many-to-one conversion tables
We're still working out what you mean by "easy facility", but yes,
that's important.
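One shape it might take is a converter named per I/O path, something
like this (the layer syntax in this sketch is illustrative, not
settled):

    # Recode a Latin-1 file to UTF-8 on the fly; each handle
    # carries its own converter.
    open my $in,  '<:encoding(ISO-8859-1)', 'legacy.txt'    or die $!;
    open my $out, '>:encoding(UTF-8)',      'converted.txt' or die $!;

    print $out $_ while <$in>;  # recoded transparently in transit

    close $in;
    close $out;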
: - If you do some autodetection, make sure that the detection of an encoding
: and the activation of automatic conversion are two separate issues
: that are fully under the user's control. For instance, I could imagine
:   a number of library functions that
:
: - Check for malformed UTF-8 sequences
: - Check for various types of BOMs
: - Check for various types of ISO 2022 announcers
: - Cut good example spots out of a long string of unknown encoding,
: convert them to UTF-8 under a list of candidate encodings, and
: present them to the user for manual selection of the most likely
: encoding.
:
: What to do with the results of these library functions should be
: completely up to the programmer of the application (who can interact
: with the user and, for some channels, has background knowledge from
: the protocol specification).
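A couple of those checks are easy enough to sketch. The names here are
placeholders, both operate on raw byte buffers, and the validation
regex takes the conservative definition of UTF-8 that stops at
U+10FFFF:

    sub bom_type {
        my ($buf) = @_;
        return 'UTF-8'    if $buf =~ /^\xEF\xBB\xBF/;
        return 'UTF-16BE' if $buf =~ /^\xFE\xFF/;
        return 'UTF-16LE' if $buf =~ /^\xFF\xFE/;
        return undef;       # no byte order mark found
    }

    sub looks_like_utf8 {   # rejects malformed UTF-8 sequences
        my ($buf) = @_;
        return $buf =~ /\A(?:
              [\x00-\x7F]                        # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # 2-byte
            | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte
            | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
            | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, no overlongs
            | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte
            | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
        )*\z/x;
    }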
There could certainly be any number of initial "guessers". The way I
tend to see it working is that you install some kind of generic guesser
as your line input routine, and as soon as it decides that it knows
what the input really is, it just swaps in a more specific input
routine for efficiency. This is good not just for detecting
ISO-whatever, but also for line-ending conventions. It's also good for
optimizing various input algorithms such as paragraph mode.
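In code, the swap might look like this (reusing the looks_like_utf8
sketch above; process() stands in for whatever the application does
with each line):

    open my $fh, '<', 'input.txt' or die $!;

    my $read_line;                      # the current line-input routine

    my $guesser = sub {
        my ($handle) = @_;
        my $line = <$handle>;
        return undef unless defined $line;
        if (looks_like_utf8($line)) {
            # Confident enough (a real guesser would want more evidence):
            # swap in a cheap reader that skips the checks from here on.
            $read_line = sub { scalar readline($_[0]) };
        }
        return $line;
    };

    $read_line = $guesser;              # start out guessing

    while (defined(my $line = $read_line->($fh))) {
        process($line);                 # hypothetical per-line handler
    }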
Larry