Re: ISO 2022 versus UTF-8 autodetection heuristics

martin_hosken(_at_)sil(_dot_)org writes:
:        Maybe I am simply confused, but why are we talking about
:        lumbering the Perl kernel with ISO 2022?

Well, one datapoint is that I haven't been talking about it.  :-)

:        As far as application specific data is concerned, is it not the
:        duty of the application to address the encoding issues rather
:        than requiring Perl to do all the work for you?

Perl is all about making things easy for the programmer.  Which comes
down to the question of whether there is some dwimmish behavior that
can be factored out from every program people will write in some
particular domain.  And the flip side of that, whether factoring out
the dwim will cause other people more pain than it's worth.  I probably
wouldn't hack ISO 2022 into Perl just for the fun of it, but if some
input guesser wanted to use that as one of its criteria for guessing,
that sort of thing might be useful to some people without giving undue
pain to others, especially if they choose a different default guesser
for their input streams.
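
To make the "input guesser" idea concrete, here is a minimal sketch of one,
written in Python purely for illustration. The function name, the list of
escape sequences, and the latin-1 fallback are all assumptions made for this
example, not anything Perl provides. It uses ISO 2022 escapes as just one
criterion among several, and the caller can swap in a different fallback.

```python
# Hypothetical guesser: names and defaults are illustrative assumptions.

# A few ISO 2022 designation escapes (these particular ones occur in ISO-2022-JP).
ISO2022_ESCAPES = (b"\x1b$@", b"\x1b$B", b"\x1b(J", b"\x1b(B")


def guess_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Guess the encoding of a chunk of input bytes."""
    # ISO 2022 text is 7-bit, so it would also pass a UTF-8 validity check.
    # The escape sequences therefore have to be tested first.
    if any(esc in data for esc in ISO2022_ESCAPES):
        return "iso2022_jp"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: defer to whatever default the caller chose.
        return fallback
    # Valid UTF-8; report plain ASCII separately when no high bytes appear.
    return "ascii" if all(b < 0x80 for b in data) else "utf-8"
```

Anyone who dislikes these guesses can pass a different `fallback` or replace
the function wholesale, which is the point of keeping the guesser a
per-stream choice rather than wiring it into the core.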

:        There is only one file format that Perl needs to be concerned
:        with internally, and that is source code. If we are needing to
:        handle source coding in a variety of encodings, then perhaps
:        the solution would be to have a short Perl program which
:        decides which filter to use on the source files it is asked to
:        open. I'm not sure whether this is currently in 5.6. This then
:        allows people to write any filter they want for any encoding of
:        a source program.

There are already ways to do source filters, but they could stand to be
generalized to input filters.

:        The problem then becomes what encoding to write the filter
:        identifier in. But I would think the default Perl encoding
:        (ASCII/UTF8) is sufficient for this. In addition there is the
:        problem of getting the filter identifier program to be used
:        when the Perl code is being read. If the program could be
:        expressed as a module, or via a new command line option (say
:        -f), then Perl can be told which program to filter code through
:        on start up. In fact, in many cases, where the encoding is
:        ASCII conformant (for lower ASCII that is), then the #! can do
:        the work for you.

As I say, there's already a mechanism for that, but nobody's written
source filters to translate arbitrary codings to UTF-8 because the UTF-8
stuff is so new.

:        For all the criticisms made against it, I think an approach
:        similar to that used in XML for identifying the encoding to use
:        to search for the #! would serve very well.

That's a no-brainer.  I don't think Perl has to reduce itself to the
ambiguities of arbitrary text files.  It is, after all, a programming
language.
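
As a rough sketch of what an XML-style approach could look like for scripts:
XML identifies the encoding family by looking at the first few bytes for a
byte order mark or for the known opening characters in each candidate
encoding. The Python below applies the same trick to "#!". The function
name, the tables, and the choice of candidate encodings are assumptions made
for this example only.

```python
import codecs

# Byte order marks checked first, as in XML's autodetection.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)

# How the two characters "#!" look at the start of a file in each family.
SHEBANG_PATTERNS = (
    (b"#!", "utf-8"),              # any ASCII-compatible encoding
    (b"\x00#\x00!", "utf-16-be"),
    (b"#\x00!\x00", "utf-16-le"),
)


def sniff_script_encoding(head: bytes):
    """Return an encoding family for the start of a script, or None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    for pattern, name in SHEBANG_PATTERNS:
        if head.startswith(pattern):
            return name
    return None  # no BOM and no recognizable "#!": caller must decide
```

Once the family is known, the rest of the "#!" line can be decoded and read
for a more specific encoding name, much as an XML parser goes on to read the
encoding declaration.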

:        I don't know whether this is too high a price to pay, but I
:        think it may solve the multiplicity of encodings for code,
:        issue.

I'm not really terribly worried about it.  As soon as we get arbitrary
translation of input streams, we just treat the script as another
arbitrarily translated input stream, since it already is one.  No biggie.

Larry