perl-unicode

Re: CGI and UTF

2002-11-24 13:30:04
1) x.0 release. I haven't seen a x.0 release of _any_ software I was
   willing to put the family jewels on without quite a bit of testing
   first.

So are you conducting testing?

2) The very first machine I installed it on immediately had script
   breakage _specifically_ because the rather broken (IMHO) behavior
   re making the use of either 'use bytes' or 'binmode' mandatory

Could you please specify the circumstances of the breakage further?
What got broken, what had to be changed?

   if you want to get the same filehandle behavior semantics on *nix 
   boxes that Perl (and virtually all other *nix programs) have had 
   historically. I don't relish the prospect of identifying essentially
   every use of 'open' in every program we have ever written just to
   add 'binmode' or 'use bytes' to them to proof them against 5.8.0
   originated dain bramage. When I open a file handle and read a file
   I expect (by default) to get _exactly_ what is in the file. If I
   want Unicode semantics, I'll explicitly specify them myself 
   "thenkyouverramuch".

I'm afraid here you can't both have your cake and eat it, see below.

   Unicode is great - I am a huge believer in it - but don't
   go mucking up *nix semantics by making 'text mode' filehandles the 
   default: It _breaks_ things that were running 100% clean under
   warnings and strict for years. I've distrusted the trend in Perl for 
   the last few years to 'magically' try to muck with charset encodings; 
   5.8.0 has specifically realized those fears as quite justified.

I'm sorry but you are not being very helpful at all.  You "distrust"
"magic" but you do not really say what behaviour of Perl 5.8.0 you
find disturbing.

The only obvious 'magic' I can think of is the behaviour where Perl
checks your locale settings and, if they indicate use of UTF-8,
switches the default encoding of the STD* streams and of any further
file opens to UTF-8.  This bit of magic was specifically requested by
Larry Wall, and also by the Linux "Unicodification" project.
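
As a rough sketch of the kind of locale check meant here (this is not
Perl's actual implementation; the function name locale_suggests_utf8
and the exact pattern are assumptions made for illustration), one can
look at LC_ALL, LC_CTYPE and LANG in decreasing order of precedence:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the first locale variable that is
# set mentions UTF-8, 0 otherwise. The environment is passed in as a
# hash so the function can be exercised without touching %ENV.
sub locale_suggests_utf8 {
    my %env = @_;
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env{$name};
        next unless defined $value && length $value;
        # The first variable that is set decides the answer.
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;
}

print "UTF-8 locale: ", (locale_suggests_utf8(%ENV) ? "yes" : "no"), "\n";
```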

Other than that, you *do* need to *explicitly* turn on any encoding
conversions on filehandles.  Perl doesn't "guess" on input, or do
any implicit conversions on output.
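
A minimal sketch of what "explicitly turning on" a conversion looks
like, assuming a scratch file called example-utf8.txt (a name made up
for this illustration).  The same file is read once through an
encoding layer and once as raw bytes:

```perl
use strict;
use warnings;

# Hypothetical scratch file, used only for this illustration.
my $file = "example-utf8.txt";

# Writing: the :encoding(UTF-8) layer turns characters into UTF-8 bytes.
open(my $out, ">:encoding(UTF-8)", $file) or die "open $file: $!";
print $out "\x{263A}\n";              # one character plus a newline
close($out) or die "close $file: $!";

# Reading with the same layer gives characters back.
open(my $in, "<:encoding(UTF-8)", $file) or die "open $file: $!";
my $text = <$in>;
close($in);

# Reading with :raw gives exactly the bytes that are in the file.
open(my $raw, "<:raw", $file) or die "open $file: $!";
my $bytes = <$raw>;
close($raw);

printf "characters: %d, bytes: %d\n", length($text), length($bytes);
unlink $file;
```

The character count is smaller than the byte count because U+263A
takes three bytes in UTF-8.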

The other magic I can think of is that Perl scripts can now be
saved in BOM-marked UTF-16, and Perl knows how to parse them.

The locale-induced UTF-8 magic can lead to a situation where you have
to explicitly mark your filehandles "binary" (with binmode; please
don't use 'use bytes' for this), because otherwise any data going out
would be expected to be Unicode, that is, *text*.  If you are pushing
out binary bits and bytes, you should tell Perl about it.  You are
also simultaneously complaining about "wanting to specify things
yourself" and "having to use binmode"?
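
For illustration, marking a handle as binary might look like the
following sketch.  The file name example-binary.dat and the payload
are invented for the example:

```perl
use strict;
use warnings;

# Hypothetical scratch file and payload, used only for this illustration.
my $file    = "example-binary.dat";
my $payload = pack("C*", 0x00, 0xFF, 0x80, 0x0A, 0xC3);

open(my $out, ">", $file) or die "open $file: $!";
binmode($out);                        # this handle carries bytes, not text
print $out $payload;
close($out) or die "close $file: $!";

open(my $in, "<", $file) or die "open $file: $!";
binmode($in);                         # read the bytes back untranslated
my $read_back = do { local $/; <$in> };   # slurp the whole file
close($in);

print $read_back eq $payload ? "round trip ok\n" : "data changed\n";
unlink $file;
```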

If you are not affected by the locale UTF-8 magic, all handles are
just like they used to be.  In this case you do have to explicitly
tell Perl that a filehandle is Unicode, just like you say you wanted.

Back to the 'UNIX' way of I/O: I'm sorry but I think the UNIX way and
the Unicode way can't transparently cohabit.  I'm very much a UNIX geek
and systems programmer, and I like the simple symmetrical world of
UNIX I/O, but I cannot see how the byte streams of UNIX and the
multiple variable and fixed length encodings of Unicode can work
simultaneously without some sort of explicit switching.
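
To make the mismatch concrete, here is a small sketch using the Encode
module: one character, three encodings, three different byte lengths.
The choice of U+20AC as the sample character is arbitrary:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character (EURO SIGN) chosen arbitrarily for the illustration.
my $char = "\x{20AC}";

# Byte length of that single character in each encoding.
my %bytes_in;
for my $encoding ("UTF-8", "UTF-16BE", "UTF-32BE") {
    $bytes_in{$encoding} = length(encode($encoding, $char));
    print "$encoding: $bytes_in{$encoding} bytes\n";
}
```

A plain byte stream carries no hint of which of these was used, so
something has to say so explicitly.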

-- 
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen
