Re: Reading/writing non-Unicode files with perl5.8?

Hi,

Some months ago Jarkko posted a message to this group that proved quite
useful to me; see the message below.
I am no expert on this, but I figure you can set the binmode of the files you
open to binary. And maybe use
  if ($] > 5.007) {
    require Encode;
    Encode::_utf8_off($s);
  }
on strings that aren't UTF-8.
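
A minimal sketch of the binmode suggestion, assuming the breakage comes from
the UTF-8 translation that 5.8.0 applies to filehandles under a UTF-8 locale
(which I believe is the default on RH8.0). The file name foo.bin is made up
for the example. Calling binmode without a layer argument is harmless on
5.6.1, so no version guard should be needed here.

```perl
use strict;
use warnings;

# Hypothetical file name, used only for this illustration.
my $file = "foo.bin";

# Write exactly three bytes: 0x31 0xa9 0x31.
open(my $out, ">", $file) or die "cannot write $file: $!";
binmode($out);                       # treat the handle as a plain byte stream
print {$out} pack("H6", "31a931");
close($out) or die "cannot close $file: $!";

# Read the bytes back, again without any translation.
open(my $in, "<", $file) or die "cannot read $file: $!";
binmode($in);
my $data = do { local $/; <$in> };   # slurp the whole file
close($in);

printf "%d bytes: %s\n", length($data), unpack("H*", $data);
# prints: 3 bytes: 31a931

unlink $file;
```

For the one-liners in the original question, the same idea would presumably
be a binmode(STDOUT) or binmode(STDIN) before the first print or read.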

I hope this is useful to you,

Greetings, Merijn (see below for the message)

subject Re: 2 Suprises w/5.8.0
On Thu, 1 Aug 2002 06:33:07 +0300, Jarkko Hietaniemi
<jhi(_at_)iki(_dot_)fi> said:

  > Pre-5.8 way of Unicode (or, even worse, pre-5.6 way of Unicode) simply
  > is not compatible, and trying to bridge the gap is probably worse than
  > it's worth.

I agree with Jarkko if you write new code. But for old code the answer
must be different.

I guess Daniel has code that works under pre-5.8 perls and he now wants to
have it run under 5.8 without breaking compatibility with previous
perls. The reason is easy to understand: you cannot port code to 5.8
in one go; it can take several months until you have found all the
spots in your code that need some change. So he needs to keep
compatibility with older perls until he can switch to 5.8 safely.

I ported the code of PAUSE to 5.8.0 within a few hours, but just
yesterday I discovered a missing encode_utf8(). It took me many hours to
find it. I was glad that I could still run the whole of PAUSE under 5.6.1.

Daniel, if this is the background of your request, I'd say:

- keep using Unicode::String
- keep using the utf8 pragma if 5.6.1 needed it
- don't throw away old code until you feel really safe
- enclose all changes you try out for 5.8.0 in

    if ($] > 5.007) {
      # code that isn't understood by 5.6.1
    }

- don't hesitate to ask for practical advice on this list.

These are typical changes that you might need:

A filehandle that should read or write UTF-8:

  if ($] > 5.007) {
    binmode $fh, ":utf8";
  }

A scalar that is going to be passed to some extension, be it
Compress::Zlib, Apache::Request or any extension that has no mention
of Unicode in the manpage:

  if ($] > 5.007) {
    require Encode;
    $self->{CONTENT} = Encode::encode_utf8($self->{CONTENT}); # make octets
  }

A scalar that we got back from an extension and that we believe comes
back as UTF-8:

  if ($] > 5.007) {
    require Encode;
    $val = Encode::decode_utf8($val);
  }

Same thing, if you are really sure that it is UTF-8:

  if ($] > 5.007) {
    require Encode;
    Encode::_utf8_on($s);
  }
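
To make the difference between these three calls concrete, here is a small
sketch. It assumes perl 5.8 or later (so Encode is loaded unconditionally
rather than behind the $] guard), and the sample string is made up for the
illustration.

```perl
use strict;
use warnings;
use Encode ();

# Six characters: "caf", e-acute, a space, and U+263A.
my $chars = "caf\x{e9} \x{263a}";

# Characters -> octets, for extensions that know nothing about Unicode.
my $octets = Encode::encode_utf8($chars);

# Octets -> characters, checking and converting the data.
my $decoded = Encode::decode_utf8($octets);

# Only flip the internal flag on a copy: no check, no conversion.
# This is safe only if the octets really are valid UTF-8.
my $flagged = $octets;
Encode::_utf8_on($flagged);

printf "characters: %d\n", length $chars;     # 6
printf "octets:     %d\n", length $octets;    # 9
printf "decoded:    %d\n", length $decoded;   # 6
printf "flagged:    %d\n", length $flagged;   # 6
```

The octet string is longer because the e-acute takes two bytes and U+263A
takes three in UTF-8. Both decode_utf8() and _utf8_on() get back to six
characters, but only decode_utf8() looks at the data on the way.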

A wrapper-function for fetchrow_array and fetchrow_hashref when the
database contains only UTF-8:

  sub fetchrow {
    my($self,$sth,$what) = @_; # $what is one of fetchrow_{array,hashref}
    if ($] < 5.007) {
      return $sth->$what;
    } else {
      require Encode;
      if (wantarray) {
        my @arr = $sth->$what;
        for (@arr) {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
        }
        return @arr;
      } else {
        my $ret = $sth->$what;
        if (ref $ret) {
          for my $k (keys %$ret) {
            defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
          }
          return $ret;
        } else {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
          return $ret;
        }
      }
    }
  }


If you have large scalars that you know can only contain ASCII but that
might be marked as UTF-8:

  utf8::downgrade($sort) if $] > 5.007;
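
A small sketch of what that call does, assuming perl 5.8 or later. The
utf8::upgrade() call is there only to simulate a scalar that got marked as
UTF-8 somewhere along the way, and the string content is made up.

```perl
use strict;
use warnings;
use Encode ();

my $sort = "plain ascii";

# Simulate an ASCII-only scalar that ended up marked as UTF-8.
utf8::upgrade($sort);
my $before = Encode::is_utf8($sort) ? "on" : "off";

# Switch back to the byte representation; the content is unchanged.
utf8::downgrade($sort);
my $after = Encode::is_utf8($sort) ? "on" : "off";

print "flag before: $before, flag after: $after\n";
```

Note that utf8::downgrade() dies if the scalar contains characters above
0xFF, unless a true second argument is passed, so it should only be used
where the ASCII-only assumption really holds.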

That's all I needed. You are not alone:-)


--
andreas

----- Original Message -----
From: "Deneb Meketa" <dmeketa(_at_)macromedia(_dot_)com>
To: <perl-unicode(_at_)perl(_dot_)org>
Sent: Tuesday, January 14, 2003 12:35 AM
Subject: Reading/writing non-Unicode files with perl5.8?

I'm a longtime 5.005/5.6.1 user.  I recently upgraded my
Linux system to RH8.0 and got perl5.8 in the bargain.  I
have many perl scripts that read or write non-Unicode files,
mostly ANSI files.  Many of those scripts have broken,
seemingly because of Unicode-forcing behavior in perl5.8.

(It is possible that some other part of my system upgrade is
responsible, like maybe my shell; if anyone knows of some
kind of system-wide Unicode infestation that could be the
cause of these problems, please let me know!)


WRITING:
perl -e 'print pack("H6", "31a931")' > foo

This produces a file with four bytes: 31, c2, a9, 31,
whereas 5.6 would just write exactly the three bytes I
specified.  I have tried all manner of tricks but I just
cannot seem to write a file from perl containing just those
three bytes.  I understand the Unicode translation that is
happening here, I just don't want it!


READING:
perl -e '$c = <STDIN>; while ($c =~ m/./g) {print pos($c), "\n"}' < foo

(This requires a file 'foo' with exactly the three bytes I
listed above: 31, a9, 31)

Output:
1
Malformed UTF-8 character (unexpected continuation byte 0xa9,
with no preceding start byte) in match position at -e line 1,
<STDIN> line 1.
2
Malformed UTF-8 character (unexpected continuation byte 0xa9,
with no preceding start byte) in match position at -e line 1,
<STDIN> line 1.
3

In this case the "malformed UTF-8 character" messages don't
seem to be causing any harm, but they're certainly annoying,
and I have seen other cases (can provide if necessary) where
the script in fact behaves differently.

What I'm reading is not a UTF-8 file - it's an ANSI file!
Is there some way to tell perl to just read the bytes without
translation?


Many thanks in advance.
d.