perl-unicode

Re: UTF-8 and matching [^\s]

2005-02-02 13:11:45
Stuart Hughes <seh(_at_)zee2(_dot_)com> writes:
Hi everyone,

I've run into problems matching the regex [^\s] on RedHat 8/9 and the
version of perl shipped with it (5.8.0).  

It isn't 5.8.0, it is 5.8.0-with-RedHatBugs :-(
To be fair to them, it is some development-track thing - there was
an experimental scheme to honour the locale's UTF-8-ness.

But the experiment failed, and no "released" perl ever had the mis-feature.


The problem:
------------
Given the string: $_ = "%define pfx x";
The regex: m,^%define\s+([^\s]+),;

Does not match on RH8/9 unless you change the LANG environment variable
to a non-UTF-8 entry.

For some reason, the pragma: no utf8; doesn't seem to make any difference.

I can get it to work by changing the pattern to: m,^%define\s+([\S]+),;
but this is not what I want because I have legacy scripts that I can't
easily change. Furthermore, I want to use patterns like: [^\s/] (i.e.
with more than one negated character type).

It breaks Tk's Makefile.PL too.
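
For reference, a minimal sketch of what these patterns are expected to do on a perl without the RedHat patch. The sample path string is made up for illustration; only the %define line comes from the report above.

```perl
use strict;
use warnings;

# On an unaffected perl, [^\s] and \S match the same characters,
# so both patterns capture the same word.
my $line = "%define pfx x";
my ($neg)   = $line =~ m,^%define\s+([^\s]+),;
my ($short) = $line =~ m,^%define\s+(\S+),;
print "negated class: $neg\n";
print "\\S shorthand:  $short\n";

# A negated class with more than one member has no single-escape
# equivalent, which is why rewriting everything to \S is not enough.
my $path = "usr/local/bin perl";    # hypothetical sample input
my ($first) = $path =~ m,^([^\s/]+),;
print "first component: $first\n";
```

On a stock perl both captures should be "pfx" and the last one "usr"; on the affected build the negated-class forms are the ones reported to fail under a UTF-8 locale.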


I found a workaround.  If I change the start-up line to include LANG=C,
it works:

eval 'LANG=C exec perl -w -S $0 ${1+"$@"}'
      if $running_under_some_shell;
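
One caveat worth checking with this prefix: under the usual POSIX rules, LC_ALL and LC_CTYPE take precedence over LANG, so if either is set to a UTF-8 locale, setting LANG=C alone would presumably not change the effective character-type locale. The sketch below reports which setting wins; effective_ctype and looks_utf8 are hypothetical helper names, and the sketch cannot tell whether the perl binary itself carries the RedHat patch.

```perl
use strict;
use warnings;

# POSIX precedence for the character-type locale:
# LC_ALL first, then LC_CTYPE, then LANG.
sub effective_ctype {
    my ($env) = @_;
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$name};
        return $value if defined $value && length $value;
    }
    return 'C';    # nothing set: the default "C" locale applies
}

# Rough check on the locale name only (e.g. en_US.UTF-8, en_US.utf8).
sub looks_utf8 {
    my ($locale) = @_;
    return $locale =~ /utf-?8/i ? 1 : 0;
}

my $ctype = effective_ctype(\%ENV);
printf "effective character-type locale: %s (%s)\n",
    $ctype, looks_utf8($ctype) ? "UTF-8" : "not UTF-8";
```

If this reports a UTF-8 locale even with LANG=C in place, prefixing LC_ALL=C instead of LANG=C may be the safer form of the same workaround.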

I've attached a test script that shows the problem (remove the LANG=C to
make it break).

Unless one is already in a non-UTF-8 locale, when of course it works.
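
The attachment is not reproduced in this archive. A minimal sketch of such a test, built only from the string and pattern quoted above, might look like the following; first_define_name is a hypothetical helper name and this is not the original script.

```perl
#!/usr/bin/perl -w
use strict;

# Return the word following "%define", or undef if the pattern fails.
sub first_define_name {
    my ($line) = @_;
    return $line =~ m,^%define\s+([^\s]+), ? $1 : undef;
}

my $name = first_define_name("%define pfx x");
my $lang = defined $ENV{LANG} ? $ENV{LANG} : 'unset';

if (defined $name) {
    print "ok: captured '$name' (LANG=$lang)\n";
} else {
    print "FAILED: [^\\s] did not match (LANG=$lang)\n";
}
```

Running it once as-is and once with LANG=C in the environment should show whether the installed perl is affected.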


Question:
---------
Does anyone know a better way of working around this problem (e.g.
getting 'no utf8;' to work)?

You can only do that by changing the binary (e.g. to REAL 5.8.0 
or any later 5.8.*) and you said you didn't want to do that.




  • Re: UTF-8 and matching [^\s], Nick Ing-Simmons <=