Re: ISO 2022 versus UTF-8 autodetection heuristics



       Maybe I am simply confused, but why are we talking about
       lumbering the Perl kernel with ISO 2022? If Markus wants to
       write applications which conform to such standards, in Perl,
       then he can. There is nothing to stop him, and thanks to the
       early design decisions of the Unicode committee he will also
       get round-trip compatibility for many of his encodings.

       As far as application specific data is concerned, is it not the
       duty of the application to address the encoding issues rather
       than requiring Perl to do all the work for you?

       There is only one file format that Perl needs to be concerned
       with internally, and that is source code. If we need to handle
       source code in a variety of encodings, then perhaps the
       solution would be to have a short Perl program which decides
       which filter to use on the source files it is asked to open.
       I'm not sure whether such a mechanism is currently in 5.6. It
       would then allow people to write any filter they want for any
       encoding of a source program.

       The problem then becomes what encoding to write the filter
       identifier in. But I would think the default Perl encoding
       (ASCII/UTF-8) is sufficient for this. In addition there is the
       problem of getting the filter identifier program to be used
       when the Perl code is being read. If the program could be
       expressed as a module, or named via a new command line option
       (say -f), then Perl could be told which program to filter code
       through on start up. In fact, in many cases where the encoding
       is ASCII conformant (for lower ASCII, that is), the #! line can
       do the work for you.
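
       To make the #! idea concrete, here is a rough sketch of pulling
       a filter name off the first line of a source file. It is in
       Python only to keep it short, and everything in it is an
       assumption for illustration: -f is the option floated above and
       not an existing perl switch, and the function name and the
       filter name in the usage line are invented.

```python
import re

# Hypothetical "-f NAME" option, as suggested above; perl has no such
# switch today. The name is limited to 7-bit ASCII characters so that it
# can be read before the encoding of the rest of the file is known.
FILTER_OPTION = re.compile(rb"(?:^|\s)-f\s*([A-Za-z0-9_:.\-]+)")


def filter_from_shebang(source: bytes):
    """Return the filter named on the #! line, or None if there is none."""
    # Only the first line is inspected; everything after it may be in an
    # encoding we cannot interpret yet.
    first_line = source.split(b"\n", 1)[0].rstrip(b"\r")
    if not first_line.startswith(b"#!"):
        return None
    match = FILTER_OPTION.search(first_line[2:])
    if match is None:
        return None
    # The pattern only admits ASCII bytes, so this decode cannot fail.
    return match.group(1).decode("ascii")
```

       For example, filter_from_shebang(b"#!/usr/bin/perl -w -f
       SJIS::Filter\n...") would hand back the invented name
       "SJIS::Filter", which the start-up code could then map to
       whatever filter module it likes. This only works when the
       bytes of the #! line itself are ASCII conformant.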

       For all the criticisms made against it, I think an approach
       similar to the one XML uses to identify a document's encoding
       would serve very well for working out which encoding to use
       when searching for the #!.
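
       The XML-style approach could look roughly like the sketch
       below: check the first few bytes for a byte-order mark, and
       failing that, for the byte pattern that "#!" would have in a
       handful of encoding families. Again this is Python for brevity,
       and the candidate list and function name are my own assumptions
       rather than anything Perl does; XML applies the same trick to
       the "<?xml" that opens a document.

```python
import codecs

# Byte-order marks. UTF-32 comes first because the UTF-32LE mark begins
# with the same two bytes as the UTF-16LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Assumed candidate families. "ascii" stands for anything ASCII
# conformant in the lower range (UTF-8, ISO 8859-x, and so on);
# cp037 is one EBCDIC code page, picked as an example.
CANDIDATES = ["utf-32-le", "utf-32-be", "utf-16-le", "utf-16-be",
              "cp037", "ascii"]
SHEBANGS = [("#!".encode(name), name) for name in CANDIDATES]


def sniff_source_encoding(head: bytes):
    """Guess the encoding family from the first bytes of a source file.

    Returns a codec name, or None when neither a byte-order mark nor a
    recognisable "#!" is present.
    """
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    for pattern, name in SHEBANGS:
        if head.startswith(pattern):
            return name
    return None
```

       Once the family is known, the rest of the #! line can be
       decoded and searched for a more precise encoding or filter
       name. A file with no byte-order mark and no #! gives None, and
       the caller would have to fall back on a default.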

       I don't know whether this is too high a price to pay, but I
       think it may solve the issue of the multiplicity of encodings
       for source code.

       Martin Hosken