perl-unicode

Re: CGI and UTF

2003-01-04 21:30:04
Treating a 'string' as anything but a sequence of 'bytes/octets' _without 
my explicit request or a runtime warning that I haven't specified fh 
semantics_.

I'm still not quite following what you are upset about.

(I'm starting to suspect that it must be because I've so completely
bought in to the Unicode model Perl has now and am unable to see what
could be the problem...)  Perl does *not* haphazardly handle a string
as anything other than bytes/octets.  It does so only if you either

(1) explicitly inject Unicode into it with chr(), \x{...}, etc., or
(2) either explicitly (binmode) or implicitly (locale) twiddle
    a filehandle so that it converts.
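
A minimal sketch of case (1) next to the default byte case, assuming a Perl that provides utf8::is_utf8(); describe() is a made-up helper name used only for illustration:

```perl
use strict;
use warnings;

# Report the length of a string and whether Perl has flagged it
# internally as UTF-8 (that is, whether Unicode has been injected).
sub describe {
    my ($s) = @_;
    return sprintf "length=%d utf8_flag=%s",
        length($s), (utf8::is_utf8($s) ? "on" : "off");
}

my $bytes  = "caf\xE9";            # four octets, nothing Unicode about it
my $wide   = "caf" . chr(0x263A);  # chr() above 255 injects Unicode
my $escape = "\x{263A}";           # so does a \x{...} escape above 0xFF

print describe($bytes),  "\n";
print describe($wide),   "\n";
print describe($escape), "\n";
```

Only the last two strings carry the flag; the first stays a plain sequence of octets.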

As far as I can understand, you were bitten by the locale.  As I told
you, that is as wanted by Larry, and also (independently of Perl)
by the Linux Unicode people.

The only obvious 'magic' I can think of is the behaviour where Perl
checks your locale settings, and if they indicate use of UTF-8, Perl
switches the default encoding of the STD* streams, and any further
file opens to UTF-8.  This bit of magic was specifically requested by
Larry Wall, and also by the Linux "Unicodification" project.
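
The locale check described above can be approximated in a few lines. This is only a sketch: locale_says_utf8() is a hypothetical helper, and the precedence LC_ALL, then LC_CTYPE, then LANG is an assumption based on the usual locale rules, not a copy of what perl itself does internally.

```perl
use strict;
use warnings;

# Hypothetical helper: does the locale environment name UTF-8?
# Assumption: the first non-empty variable among LC_ALL, LC_CTYPE
# and LANG decides, in that order.
sub locale_says_utf8 {
    my ($env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$var};
        next unless defined $value && length $value;
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;   # no locale variables set at all
}

printf "locale indicates UTF-8: %s\n",
    locale_says_utf8(\%ENV) ? "yes" : "no";
```

Running a script with LC_ALL=C in its environment makes this check come out "no" regardless of what LANG says.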

This is Bad Juju (tm). It _guarantees_ script breakage (potentially
silently!) for Unix people doing _anything_ but ASCII text manipulation.  

I repeat: I don't think you can do "more than ASCII" by clinging tooth
and nail to the "everything is bytes" credo.

The locale-induced UTF-8 magic can lead to a situation where you have
to explicitly mark your filehandles "binary" (with binmode, please
don't use bytes), because otherwise any data going out would be
expected to be Unicode, that is, *text*.  If you are pushing out
binary bits and bytes, you should tell Perl about it.   You are
also simultaneously complaining about "wanting to specify things
yourself" and "having to use binmode"?
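
A small sketch of what marking a filehandle "binary" changes. The ':utf8' given at open time is my stand-in for the locale-induced default, and octets_written() is an illustrative name, not an existing function:

```perl
use strict;
use warnings;

# Write $data to an in-memory file and return how many octets arrived.
# The ':utf8' layer at open time stands in for the locale-induced default.
sub octets_written {
    my ($data, $mark_binary) = @_;
    my $buffer = '';
    open(my $fh, '>:utf8', \$buffer)
        or die "cannot open in-memory file: $!";
    binmode($fh) if $mark_binary;   # no layer argument means ':raw'
    print {$fh} $data;
    close($fh) or die "close failed: $!";
    return length $buffer;
}

my $payload = "\x89PNG";   # four octets of binary data

printf "text handle:   %d octets\n", octets_written($payload, 0);
printf "binary handle: %d octets\n", octets_written($payload, 1);
```

On the text handle the octet 0x89 is treated as a character and goes out as two octets of UTF-8; after binmode the four octets pass through untouched.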

Yes. Because _needing_ to 'tell Perl' that I am pushing binary rather than
text _is a change_ for *nix platforms. I should have to 'tell Perl' I am
pushing _anything else_ than binary. Or _at a minimum_ a mandatory warning
should be issued that I didn't declare the filehandle's encoding layer and
it is now using encoding 'X' if I haven't explicitly indicated that I
*WANT* the system environment changing my filehandle's encodings.

I repeat: all your filehandles are still 'binary' unless you either
explicitly (binmode) or implicitly (locale) command them not to be.
If you try to push Unicode (data marked as UTF-8, such as characters
beyond 255) on such a filehandle, you'll get a 'Wide character' warning.
If you do not like the locale implicit switching, reset your locale
to something without /utf-?8/i in it before running the script.
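
To see the 'Wide character' warning in isolation, something like the following should do. It assumes neither -C nor PERL_UNICODE is changing the default layers, and warnings_when_printing() is a name invented for this sketch:

```perl
use strict;
use warnings;

# Print $text to an in-memory filehandle opened with $mode and return
# the warnings Perl emitted while doing so.
sub warnings_when_printing {
    my ($mode, $text) = @_;
    my @caught;
    local $SIG{__WARN__} = sub { push @caught, $_[0] };
    my $buffer = '';
    open(my $fh, $mode, \$buffer)
        or die "cannot open in-memory file: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return @caught;
}

my $smiley = chr(0x263A);   # a character beyond 255

my @plain = warnings_when_printing('>',      $smiley);  # byte-oriented handle
my @utf8  = warnings_when_printing('>:utf8', $smiley);  # explicitly text

print "default handle: ", (@plain ? $plain[0] : "no warning\n");
print "utf8 handle:    ", (@utf8  ? $utf8[0]  : "no warning\n");
```

The warning appears only for the handle that was never told to expect text.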

Back to the 'UNIX' way of I/O: I'm sorry but I think the UNIX way and
the Unicode can't transparently cohabit.  I'm very much a UNIX geek
and systems programmer, and I like the simple symmetrical world of
UNIX I/O, but I cannot see how the byte streams of UNIX and the
multiple variable and fixed length encodings of Unicode can work
simultaneously without some sort of explicit switching.

_Explicit_ switching is what I am asking for. _Implicit_ switching is what
I am complaining about. If you want to switch based on the system env -
fine: _But at least warn me with a good immediate warning_ before
changing my fh semantics if I haven't said something like

The assumption is that if you have a locale setup that indicates
UTF-8, Perl is going to assume you knew what you were doing when
you set up the locale.  *All* locale effects are 'implicit'.

   binmode FH, ':crlf|:raw|:env';

before I go my $data = <FH>;

"Malformed UTF-8 character (unexpected end of string) at
./error-example.pl line 40." isn't useful: It is obscure and is produced
distantly from the actual breakage.

perldiag has this:

=item Malformed UTF-8 character (%s)

Perl detected something that didn't comply with UTF-8 encoding rules.

One possible cause is that you read in data that you thought to be in
UTF-8 but it wasn't (it was for example legacy 8-bit data).  Another
possibility is careless use of utf8::upgrade().
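
One way to catch the first cause at the point where the data comes in, rather than far downstream, is to decode strictly with the Encode module. This is a sketch of that idea only; decode_utf8_strictly() is a hypothetical wrapper, and it is not something the diagnostic itself prescribes:

```perl
use strict;
use warnings;
use Encode ();

# Decode $octets as strict UTF-8.  Returns (text, undef) on success and
# (undef, error message) when the input is not valid UTF-8.
sub decode_utf8_strictly {
    my ($octets) = @_;
    my $text = eval {
        # FB_CROAK makes decode() die on bad input; LEAVE_SRC keeps
        # $octets from being modified in place.
        Encode::decode('UTF-8', $octets,
                       Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $text ? ($text, undef) : (undef, $@);
}

my ($ok)            = decode_utf8_strictly("caf\xC3\xA9");  # valid UTF-8
my ($bad, $message) = decode_utf8_strictly("caf\xE9");      # legacy 8-bit

print "valid input decoded to ", length($ok), " characters\n";
print "legacy input rejected: $message" unless defined $bad;
```

The error then names the offending octet and is raised where the data was read.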

-- 
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'.  It is 'dead'."
-- Jack Cohen
