perl-unicode

Re: My favorite bug to fix for 5.8.0

2002-03-10 00:33:14
On Sun, Mar 10, 2002 at 10:06:14AM +0900, Dan Kogai wrote:
(and the very lack thereof) to blame.  Locale might have worked on roman 
systems but hardly ever so on CJK (yeah, I know some do use locale, such 
as tcsh but I never dared use it or my serial console may be rendered 
useless).

Seems that our mileage does vary. :) Almost all programs in my Chinese
environment on FreeBSD recognize/depend on locale settings to some degree:
tcsh, xcin, mutt, mozilla, openoffice, the melix BBS &c. It's true that
I have to add env LC_ALL=zh_TW.Big5/en_US.ISO_8859-1/zh_CN.EUC in front
of each alias, though.

   eval {
    binmode IN, ":utf8";
    binmode STDIN, ":utf8";
    binmode STDOUT, ":utf8";
   };

  I am not going to yell at something already fixed but I can't keep 
grumbling this new use of binmode sucks.  Reminds me too much of DOS.

Interesting. I've been putting

    use open ':locale';

in front of some programs now; it works reasonably well.
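
To make the two styles concrete, here is a minimal sketch that writes one non-ASCII character both ways and reads the raw bytes back. The helper bytes_written and the use of a temporary file are my own assumptions for illustration. It uses an explicit ':utf8' default rather than ':locale', so the result does not depend on the environment.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: run $writer on a fresh temp file, then return the
# raw bytes that ended up on disk.
sub bytes_written {
    my ($writer) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;
    $writer->($path);
    open my $in, '<:raw', $path or die "open $path: $!";
    local $/;                      # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# Style 1: explicit binmode on each handle.
my $explicit = bytes_written(sub {
    my ($path) = @_;
    open my $out, '>', $path or die "open $path: $!";
    binmode $out, ':utf8';
    print {$out} "\x{263A}";
    close $out;
});

# Style 2: a lexically scoped default layer from the open pragma.
my $pragma = bytes_written(sub {
    my ($path) = @_;
    use open OUT => ':utf8';       # applies to opens inside this block only
    open my $out, '>', $path or die "open $path: $!";
    print {$out} "\x{263A}";
    close $out;
});
```

Both handles should produce the same three UTF-8 bytes; the difference is only in where the layer is declared.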

In the more-difficult-but-oh-so-user-friendly category, it would also
be lovely if someone came up with a dwimmish layer that could recognize
when it isn't getting UTF-8 and attempt autorecognition of other

  I beg not to squeeze locale into Unicode features.

Agree with you here. Telling a short UTF-8 file from any CJK encoding is
bound to be error-prone (many byte sequences are legal in both), and if
we rely on locale information as the final judgement, things will quickly
get out of control.
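
A small sketch of that overlap, under stated assumptions: looks_like_big5 is a hypothetical helper that only checks byte ranges (lead 0xA1-0xF9, trail 0x40-0x7E or 0xA1-0xFE) and never consults a mapping table, and looks_like_utf8 leans on the core utf8::decode well-formedness check. The two bytes 0xC4 0xA1 are well-formed UTF-8 (U+0121) and also fall inside those Big5 ranges, so neither check can rule the other out.

```perl
use strict;
use warnings;

# Structural check only (assumed byte ranges, no mapping table):
# ASCII bytes, or a lead byte 0xA1-0xF9 plus a trail byte 0x40-0x7E/0xA1-0xFE.
sub looks_like_big5 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:[\x00-\x7F]|[\xA1-\xF9][\x40-\x7E\xA1-\xFE])*\z/
        ? 1 : 0;
}

# utf8::decode works in place and returns false on malformed input,
# so run it on a copy and keep only the verdict.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    return utf8::decode($copy) ? 1 : 0;
}

for my $sample ("\xC4\xA1", "\xE4\xB8\xAD", "\xA4\xA4") {
    printf "%-8s utf8=%d big5=%d\n",
        join(' ', map { sprintf '%02X', ord } split //, $sample),
        looks_like_utf8($sample), looks_like_big5($sample);
}
```

Only the first sample passes both checks, but in a short file one such sequence is enough to make the guess a coin toss.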

However, if we limit the :guess layer to UTF-8 *and* ISO-8859-1 only, then
I think it's probably okay, as I'd just stay away from using such features.
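
For what it's worth, the restricted case is easy to sketch, because every byte string is valid ISO-8859-1 and the fallback therefore never fails. This is not a PerlIO layer, just a hypothetical guess_decode function over a whole buffer, using the Encode module's strict decoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try strict UTF-8 first, fall back to ISO-8859-1.
# Returns (encoding name, decoded string).
sub guess_decode {
    my ($bytes) = @_;
    my $text = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $bytes intact.
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ('UTF-8', $text) if defined $text;
    return ('ISO-8859-1', decode('ISO-8859-1', $bytes));
}

my ($enc_a, $text_a) = guess_decode("\xC3\xA9");   # well-formed UTF-8
my ($enc_b, $text_b) = guess_decode("\xE9");       # lone high byte
print "$enc_a $enc_b\n";
```

Pure ASCII input decodes identically either way, so the guess only matters once high bytes appear. A Latin-1 text that happens to contain a well-formed UTF-8 sequence would still be misread, which is the price of guessing.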

  We already have utf8 pragma.  If we really need something to make utf8 
stream by default (yet leave other things in 'use bytes;' realm),  why 
don't we just extend it like

use utf8 qw(:filehandle);

Currently, use open ':locale'; pretty much fits my needs (with a small
patch; see attached). If the OS already supports the en_US.UTF-8 locale,
why don't we use that information?

/Autrijus/

---

The patch does the following:
- Nix the unnecessary diagnostics line
- Quell -w warnings if the first environment variable checked isn't set
- While zh_CN means euc-cn, zh_TW almost invariably means big5, as euc-tw
  is too baroque and bloated for daily use (and for perl core inclusion).
- "Cannot figure out an encoding to use" when locale is 'C' is rendered
  non-fatal.
- Consequently, ${^OPEN} is set only when needed.

--- /usr/local/lib/perl5/5.7.3/open.pm  Tue Mar  5 14:00:13 2002
+++ ./open.pm   Sun Mar 10 15:18:54 2002
@@ -18,6 +18,6 @@
        };
-       unless ($@) {
-           print "# locale_encoding = $locale_encoding\n";
-       }
        my $country_language;
+
+       no warnings 'uninitialized';
+
         if (not $locale_encoding && in_locale()) {
@@ -47,4 +47,6 @@
                $locale_encoding = 'euc-kr';
+           } elsif ($country_language =~ /^zh_CN|chin(?:a|ese)?$/i) {
+               $locale_encoding = 'euc-cn';
            } elsif ($country_language =~ /^zh_TW|taiwan(?:ese)?$/i) {
-               $locale_encoding = 'euc-tw';
+               $locale_encoding = 'big5';
            }
@@ -77,3 +79,3 @@
                    unless defined $locale_encoding;
-               croak "Cannot figure out an encoding to use"
+               (carp("Cannot figure out an encoding to use"), last)
                    unless defined $locale_encoding;
@@ -108,3 +110,3 @@
     }
-    ${^OPEN} = join("\0",$in,$out);
+    ${^OPEN} = join("\0",$in,$out) if $in or $out;
 }
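
The fallback behaviour the patch aims for can be sketched standalone as follows. This is not the code in open.pm: encoding_for_locale is a hypothetical helper, the table holds only the mappings mentioned above, and an explicit codeset is simply lowercased rather than normalized to a canonical encoding name.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of open.pm: guess an encoding name from a
# locale string such as "zh_TW.Big5" or "zh_CN". Returns undef (non-fatal)
# when nothing sensible can be guessed, e.g. for the 'C' locale.
sub encoding_for_locale {
    my ($locale) = @_;
    return unless defined $locale && length $locale;
    return if $locale eq 'C' || $locale eq 'POSIX';

    # An explicit codeset after the dot wins, e.g. "en_US.UTF-8".
    return lc $1 if $locale =~ /^[^.@]+\.([^@]+)/;

    # Otherwise fall back on language_COUNTRY defaults.
    my %default_for = (
        zh_tw => 'big5',
        zh_cn => 'euc-cn',
        ko_kr => 'euc-kr',
    );
    (my $lang = lc $locale) =~ s/@.*//;   # drop any @modifier
    return $default_for{$lang};
}

for my $locale (qw(zh_TW zh_CN zh_TW.Big5 en_US.UTF-8 C)) {
    my $enc = encoding_for_locale($locale);
    printf "%-12s => %s\n", $locale, defined $enc ? $enc : '(none)';
}
```

A caller would then install default layers only when an encoding comes back, which matches the intent of the last hunk.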
