perl-unicode

Re: CGI and UTF

2003-01-05 11:30:05
On Sun, Jan 05, 2003 at 12:16:38PM -0600, Earl Hood wrote:
This is Bad Juju (tm). It _guarantees_ script breakage (potentially
silently!) for Unix people doing _anything_ but ASCII text manipulation.  

I repeat: I don't think you can do "more than ASCII" by clinging tooth
and nail to the "everything is bytes" credo.

This statement assumes someone is working with characters.  It is
common for many to use regexes and other operators (substr, index,
et al.) on binary data directly.
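
To make the bytes-versus-characters distinction concrete, here is a minimal sketch. It assumes the core Encode module is available; the sample string and variable names are made up for illustration. The same operators (length, index, substr) count bytes on undecoded data and count characters only after an explicit decode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A byte string: "caf" followed by the two bytes C3 A9,
# which happen to be the UTF-8 encoding of U+00E9.
my $bytes = "caf\xC3\xA9";

# Nothing has been decoded, so the operators work on bytes.
my $byte_len = length $bytes;          # counts bytes
my $pos      = index $bytes, "\xA9";   # byte offset of the A9 byte
my $tail     = substr $bytes, 3;       # the last two bytes

# Only after asking explicitly do the same operators see characters.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;          # counts characters

print "bytes: $byte_len, offset: $pos, characters: $char_len\n";
```

Code that never decodes keeps treating its data as bytes, which is the usage described above.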

True.  I think what I was referring to (somewhere earlier in my
message) is that you won't get Unicode data mixed into your data
unless you ask for it, explicitly or implicitly.

I repeat: all your filehandles are still 'binary' unless you either
explicitly (binmode) or implicitly (locale) command them not to be.
If you try to push Unicode (data marked as UTF-8, such as characters
beyond 255) on such a filehandle, you'll get a 'Wide character' warning.
If you do not like the implicit locale switching, reset your locale
to something that does not match /utf-?8/i before running the script.
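
A minimal sketch of the filehandle behaviour described above, assuming no -C switch, PERL_UNICODE setting, or 'use open' pragma is in effect. The temporary files and variable names are only for illustration. The first handle is left as it comes, the second is explicitly given a UTF-8 layer with binmode, and a warning handler counts how often 'Wide character' is reported.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $wide = "\x{2603}";    # a character beyond 255

# 1. A plain filehandle: no layer was requested.
my ($raw_fh, $raw_name) = tempfile(UNLINK => 1);
print {$raw_fh} $wide;
close $raw_fh;

# 2. The same write after explicitly asking for UTF-8 output.
my ($utf8_fh, $utf8_name) = tempfile(UNLINK => 1);
binmode $utf8_fh, ':encoding(UTF-8)';
print {$utf8_fh} $wide;
close $utf8_fh;

my $count = grep { /Wide character/ } @warnings;
print "Wide character warnings seen: $count\n";
```

Under those assumptions only the first print should trigger the warning; the explicit binmode is what tells Perl the handle is meant to carry characters.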

I think this reasoning is flawed since it assumes the author of
the script has complete control over the environment.  For example,
the script can be used by others in environments the author does not
control.  Therefore, older programs can quietly break, or behave
differently.

According to the perllocale manpage, locale should have no effect
unless the 'use locale' pragma is specified.  It appears from
Benjamin's script that he is not using the pragma, so even if the
environment has a utf-8 locale, the script should be unaffected.
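
For anyone who wants to see whether a given environment would count as "a utf-8 locale", here is a rough sketch. The helper locale_looks_utf8 is hypothetical, not part of Perl. It assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) and the /utf-?8/i test mentioned earlier in the thread; the check Perl itself performs may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper, not a Perl built-in: returns 1 if the first
# non-empty variable among LC_ALL, LC_CTYPE and LANG matches /utf-?8/i,
# otherwise 0. The precedence order is an assumption.
sub locale_looks_utf8 {
    my (%env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env{$var};
        next unless defined $value && length $value;
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;
}

printf "current environment: %s\n",
    locale_looks_utf8(%ENV) ? "looks like a UTF-8 locale"
                            : "does not look like a UTF-8 locale";
```

A script author cannot change what users export, but a check like this at least makes the dependency on the environment visible.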

True, too.  The enabling of UTF-8ness based on locale is an
exception to how things were done before.  But I'm delegating
responsibility for that decision to Larry Wall :-)
I'm trying to get an opinion about this from him, and I just logged
a problem ticket about this issue.

--ewh

-- 
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'.
It is 'dead'." -- Jack Cohen
