perl-unicode

Re: Keeping byte-wise processing as an option

2004-01-02 23:30:04
On Fri, 02 Jan 2004 18:17:13 -0500, Martin Duerst 
<duerst(_at_)w3(_dot_)org> said:

Jungshik has also reported that
it fails with Perl 5.8.0 under a UTF-8 locale.

Perl 5.8.0 was very broken with UTF-8 locales since it "auto-PERL_UNICODEd".
We saw (and keep seeing) a lot of that since RedHat 8 and 9 had the unfortunate
combination of both Perl 5.8.0 _and_ UTF-8 locales (which the users didn't
expect/know about/care about).  Lots of code that expected to produce e.g.
0xff started to produce 0xc3 0xbf.  Bang!
Better to use 5.8.1 or later.
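
To make the 0xff versus 0xc3 0xbf symptom concrete, here is a minimal sketch. It assumes Perl 5.8 or later (the Encode module is not in the core before that), and the helper name hexdump is made up for illustration. It shows only the two encodings of U+00FF and does not reproduce the 5.8.0 locale behaviour itself.

```perl
use strict;
use Encode ();

# Hypothetical helper: render each byte of a string as 0xNN.
sub hexdump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '0x%02x', ord($_) } split //, $bytes;
}

# U+00FF written out as Latin-1 is the single byte the old code expected.
my $latin1 = Encode::encode('ISO-8859-1', chr(0xFF));

# The same character written out as UTF-8 takes two bytes.
my $utf8 = Encode::encode('UTF-8', chr(0xFF));

print hexdump($latin1), "\n";   # 0xff
print hexdump($utf8),   "\n";   # 0xc3 0xbf
```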

  > If it were just me, that would be easy. But stating on an FAQ
  > page 'use Perl 5.8.1 or later' for something that worked
  > probably even in Perl 4 doesn't look like a good idea.

I seem to remember I heard Matt Sergeant (CC'd; Hi Matt, sorry if I
misremember) say that he has a large codebase that works with perl
5.00503, 5.6.x and 5.8.x. I don't think that the tricks you need to
program around the Unicode cliffs through perl versions are collected
in a document.

I can say for sure that I have managed to have the PAUSE code
(ftp://pause.perl.org/pub/PAUSE/PAUSE-code/) run under both 5.6.1 and
5.8.x.

The typical idiom I used was:

    if ($] > 5.007) {
      require Encode;
      # let Encode do some tweaking
    }

The tricks that I used have found their way into
perlunicode.pod/"Porting code from perl-5.6.X".
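
As an illustration of what the "tweaking" inside such a guard might look like, here is a minimal sketch. The function name as_octets is hypothetical, and the choice of Encode::is_utf8 plus Encode::encode_utf8 is an assumption, not necessarily what the PAUSE code does. On perls older than 5.8 the guard is skipped and the string passes through untouched.

```perl
use strict;

# Hypothetical helper: return the argument as a plain byte string.
sub as_octets {
    my ($s) = @_;
    if ($] > 5.007) {
        # Encode only exists on 5.8-era perls, so load it at run time.
        require Encode;
        # Character strings are turned into their UTF-8 octets;
        # strings that are already plain bytes are left alone.
        $s = Encode::encode_utf8($s) if Encode::is_utf8($s);
    }
    return $s;
}

my $octets = as_octets("\x{263A}");   # WHITE SMILING FACE
printf "%d bytes\n", length $octets;
```

The run-time require matters here: a compile-time "use Encode" would make the file fail to compile on 5.6.1 and earlier, where the module is not available.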

I suppose your one-liner would work with (untested)

       #!/usr/bin/perl -pi~ -0777
       # program to remove a leading UTF-8 BOM from a file
       # works both STDIN -> STDOUT and on the spot (with filename as argument)
       if ($] > 5.007) {
         require Encode;
         Encode::_utf8_off($_);
       }
       s/^\xEF\xBB\xBF//s;


What I'm looking for is a very simple way to write perl programs
that work on byte streams. This should be possible without depending
on versions, working both on very old versions as well as future
versions.

Off-hand I can say that getting both 5.6 and 5.8 to work at the same time
may be impossible in spots simply because 5.6 was badly unfinished as
regards Unicode.  No, it won't get fixed.  Beyond 5.8, I don't.

  > Sorry, I think you missed something in the last sentence. Did you
  > want to say "I don't know"?

Some people may have some tricks they use to get Unicode code working both
in 5.6 and 5.8, but _in_principle_ the bytes pragma should tell Perl in
both 5.6 and 5.8 that "I want bytes, darn it."

  > Yes, that seems to do the job. But is this available in 5.0 or earlier?
  > Or is it possible to write some little code at the start that says
  > something like:

  > if (eval "use bytes;") { use bytes; }

That would be

  use if $] >= 5.006, "bytes";

But you would have to make sure that if.pm is available, which is not an
option IMO.

  > (without making the actual invocation restricted to the { ... })?
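
One way to get the same effect without if.pm is to do the work of "use" by hand inside a BEGIN block. This is a sketch of a commonly used idiom, not something proposed in this thread. It relies on the understanding that hints set during BEGIN apply to the rest of the enclosing file or block, just as with a plain "use bytes", so the pragma is not confined to the braces of the if. The function name strip_bom is made up for illustration.

```perl
# Enable the bytes pragma only where it exists (Perl 5.6 and later).
# On older perls the require is never reached, so nothing fails.
BEGIN {
    if ($] >= 5.006) {
        require bytes;
        bytes->import;
    }
}

# Hypothetical helper: remove a leading UTF-8 BOM, treating the data as bytes.
sub strip_bom {
    my ($data) = @_;
    $data =~ s/^\xEF\xBB\xBF//;
    return $data;
}

print strip_bom("\xEF\xBB\xBFhello\n");
```

A string eval such as eval "use bytes;" would not help here, since the pragma would then only be in effect for the code inside the eval string.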


-- 
andreas