perl-unicode

Re: CGI and UTF

2002-12-26 09:30:05
On Sun, 24 Nov 2002, Jarkko Hietaniemi wrote:

1) x.0 release. I haven't seen a x.0 release of _any_ software I was
   willing to put the family jewels on without quite a bit of testing
   first.

So are you conducting testing?

Slowly, informally. Work schedules leave little time to explore 5.8, 
except like now when I am actually on 'vacation' (which is why it is a 
month since the original message).

2) The very first machine I installed it on immediately had script
   breakage _specifically_ because the rather broken (IMHO) behavior
   re making the use of either 'use bytes' or 'binmode' mandatory

Could you please specify the circumstances of the breakage further?
What got broken, what had to be changed?

Stripped of most irrelevant code and cleaned up slightly, this is
essentially what happened (with the necessary 'binmode' commented out just
to point to the change). Yes - I know about (and frequently use)
Image::Size, et al. This is a fragment of a script that is distributed
'standalone' and so could not depend on anything not distributed with Perl
5.005 to be present.

#!/usr/bin/perl -w

use strict;

my $file = '/home/snowhare/images/test.jpg';
my ($width,$height) = jpegsize($file);

print "width = $width, height = $height\n";
exit 0;

sub readfile {
    my ($filename)=@_;
    if (! open (NEWFILE,$filename)) {
        print STDERR "$filename could not be opened for reading\n$!";
        return;
    }
#    binmode NEWFILE;
    my ($savedreadstate) = $/;
    undef $/;
    my $data = <NEWFILE>;
    $/ = $savedreadstate;
    close (NEWFILE);

    return ($data);
}

sub jpegsize {
    my ($filename) = @_;

    my $jpeg = readfile($filename);

    my($count) = 2;
    my($length)= length($jpeg);
    my($ch)    = "";

    while (($ch ne "\xda") && ($count<$length)) {
        # Find next marker (jpeg markers begin with 0xFF)
        while (($ch ne "\xff") && ($count < $length)) {
            $ch=substr($jpeg,$count,1); 
            $count++;
        }
        # jpeg markers can be padded with unlimited 0xFF's
        while (($ch eq "\xff") && ($count<$length)) {
            $ch=substr($jpeg,$count,1); 
            $count++;
        }
        # Now, $ch contains the value of the marker.
        if ((ord($ch) >= 0xC0) && (ord($ch) <= 0xC3)) {
            $count          += 3;
            my ($a,$b,$c,$d) = unpack("C"x4,substr($jpeg,$count,4));
            my $width        = $c<<8 | $d;
            my $height       = $a<<8 | $b;
            return($width,$height);
        } else {
            # We **MUST** skip variables, since FF's within variable names are
            # NOT valid jpeg markers
            my ($c1,$c2)= unpack("C"x2,substr($jpeg,$count,2));
            $count += $c1<<8|$c2;
        }
    }   
}
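
For comparison, here is a minimal sketch of the same read with the handle declared binary before the first read. The name readfile_binary is hypothetical and not part of the original script. It assumes only constructs believed to be available as far back as Perl 5.005: two-argument open, a bareword handle, and binmode.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: slurp a whole file as raw octets.
sub readfile_binary {
    my ($filename) = @_;
    local *NEWFILE;                      # keep the bareword handle scoped to this call
    unless (open(NEWFILE, "< $filename")) {
        print STDERR "$filename could not be opened for reading: $!\n";
        return;
    }
    binmode NEWFILE;                     # declare the handle binary before reading
    local $/;                            # slurp mode; restored when the sub returns
    my $data = <NEWFILE>;
    close(NEWFILE);

    return $data;
}
```

With the handle marked binary, length() and substr() on the returned data count octets, which is what the marker scan in jpegsize assumes.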

   the last few years to 'magically' try to muck with charset encodings; 
   5.8.0 has specifically realized those fears as quite justified.

I'm sorry but you are not being very helpful at all.  You "distrust"
"magic" but you do not really say what behaviour of Perl 5.8.0 you
find disturbing.

Treating a 'string' as anything but a sequence of 'bytes/octets' _without 
my explicit request or a runtime warning that I haven't specified fh 
semantics_.

The only obvious 'magic' I can think of is the behaviour where Perl
checks your locale settings, and if they indicate use of UTF-8, Perl
switches the default encoding of the STD* streams, and any further
file opens to UTF-8.  This bit of magic was specifically requested by
Larry Wall, and also by the Linux "Unicodification" project.

This is Bad Juju (tm). It _guarantees_ script breakage (potentially
silently!) for Unix people doing _anything_ but ASCII text manipulation.  

If you want to break something as fundamental to *nix boxes as binary mode
filehandles - _at least_ force the script writer to acknowledge this _deep_
change to FH semantics. Then they are forced to become aware of the issue
_before_ a script gets its operating assumptions yanked out from under it.

I would lobby for a mandatory runtime warning to be issued on any
filehandle where neither 'binmode FH;' nor 'binmode FH, LAYER;' has been
seen before a filehandle is used for the first time with an explanation of
the issue.

The locale-induced UTF-8 magic can lead to a situation where you have
to explicitly mark your filehandles "binary" (with binmode, please
don't use bytes), because otherwise any data going out would be
expected to be Unicode, that is, *text*.  If you are pushing out
binary bits and bytes, you should tell Perl about it.   You are
also simultaneously complaining about "wanting to specify things
yourself" and "having to use binmode"?

Yes. Because _needing_ to 'tell Perl' that I am pushing binary rather than
text _is a change_ for *nix platforms. I should have to 'tell Perl' I am
pushing _anything else_ than binary. Or _at a minimum_ a mandatory warning
should be issued that I didn't declare the filehandle's encoding layer and
it is now using encoding 'X' if I haven't explicitly indicated that I
*WANT* the system environment changing my filehandle's encodings.

Back to the 'UNIX' way of I/O: I'm sorry but I think the UNIX way and
the Unicode can't transparently cohabit.  I'm very much a UNIX geek
and systems programmer, and I like the simple symmetrical world of
UNIX I/O, but I cannot see how the byte streams of UNIX and the
multiple variable and fixed length encodings of Unicode can work
simultaneously without some sort of explicit switching.

_Explicit_ switching is what I am asking for. _Implicit_ switching is what
I am complaining about. If you want to switch based on the system env -
fine: _But at least warn me with a good immediate warning_ before
changing my fh semantics if I haven't said something like
 
   binmode FH, ':crlf|:raw|:env';

before I go my $data = <FH>;
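
As an illustration of explicit switching with layers that do exist, here is a minimal sketch assuming Perl 5.8 or later, where three-argument open accepts a layer specification. The helper names slurp_raw and slurp_utf8 are hypothetical. The ':env' layer in the line above is part of the proposal and is not used here.

```perl
use strict;
use warnings;

# Hypothetical helper: read a file as raw octets, whatever the locale says.
sub slurp_raw {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;                            # slurp mode
    my $bytes = <$fh>;
    close($fh);
    return $bytes;
}

# Hypothetical helper: read a file as text, decoding it from UTF-8.
sub slurp_utf8 {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
    local $/;                            # slurp mode
    my $text = <$fh>;
    close($fh);
    return $text;
}
```

Naming the layer in the open call states at the point of use whether length() and substr() will count octets or characters, rather than leaving that to the environment.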

"Malformed UTF-8 character (unexpected end of string) at
./error-example.pl line 40." isn't useful: It is obscure and is produced
distantly from the actual breakage.

If I hadn't been lurking on the P5P and Perl-Unicode lists for the last
few years, I could have easily been tearing my hair out for hours trying
to a) Figure out what the hell it was talking about and b) Figure out a
workaround.

-- 
Benjamin Franz

"If the code and the comments disagree, then both are probably wrong."
                                        -- Norm Schryer, Bell Labs 
