perl-unicode

UTF-8 and matching [^\s]

2005-01-26 04:20:59
Hi everyone,

I've run into problems matching the regex [^\s] on RedHat 8/9 and the
version of perl shipped with it (5.8.0).  I've googled around and am
aware that there are some problems with UTF-8 on this platform.

I'm trying to write a script that will work with this version and
earlier versions of Perl (I can't install a new version as I'm sending
out scripts to people who won't want to do this).

The problem:
------------
Given the string: $_ = "%define pfx x";
The regex: m,^%define\s+([^\s]+),;

This does not match on RH8/9 unless you change the LANG environment
variable to a non-UTF-8 value.

For some reason, the pragma: no utf8; doesn't seem to make any difference.

I can get it to work by changing the pattern to: m,^%define\s+([\S]+),;
but this is not what I want because I have legacy scripts that I can't
easily change. Furthermore, I want to use patterns like: [^\s/] (i.e.
a negated class containing more than one character type).
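
For reference, a minimal sketch of the behaviour expected from these patterns. The variable names and the second sample string are only for illustration and are not taken from the legacy scripts. On a perl that is not affected by the problem, all three captures succeed and the line printed is neg='pfx' plain='pfx' dir='usr'.

```perl
# Illustrative sketch; variable names and the "usr/local bin" string
# are hypothetical.
my $line = "%define pfx x";

# The negated class and \S should capture the same token.
my ($neg)   = $line =~ m,^%define\s+([^\s]+),;
my ($plain) = $line =~ m,^%define\s+(\S+),;

# A negated class with more than one member, which \S alone cannot express.
my ($dir) = "usr/local bin" =~ m,^([^\s/]+),;

print "neg='$neg' plain='$plain' dir='$dir'\n";
```

On an affected perl the first capture would come back undefined, which is the failure described above.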

I found a workaround.  If I change the start-up line to include LANG=C,
it works:

eval 'LANG=C exec perl -w -S $0 ${1+"$@"}'
      if $running_under_some_shell;
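
The same workaround can also be sketched inside the script itself, by checking the environment and re-executing perl under LANG=C. This is only a sketch: needs_c_locale() is a hypothetical helper, and clearing LC_ALL and LC_CTYPE goes beyond the LANG=C line above (it assumes those variables, when set, would otherwise override LANG).

```perl
# Sketch only: re-run the script under LANG=C when a UTF-8 locale is detected.
# needs_c_locale() is a hypothetical helper, not part of any module.
sub needs_c_locale {
    my ($env) = @_;
    foreach my $var ('LC_ALL', 'LC_CTYPE', 'LANG') {
        my $val = $env->{$var};
        next unless defined $val && length $val;
        # The first variable that is set decides the character-type locale.
        return $val =~ /utf-?8/i ? 1 : 0;
    }
    return 0;
}

# Only re-exec when $0 is a real file, so "perl -e" style runs are left alone.
if (-f $0 && needs_c_locale(\%ENV)) {
    delete @ENV{'LC_ALL', 'LC_CTYPE'};   # assumption: these would override LANG
    $ENV{LANG} = 'C';
    exec $^X, '-w', $0, @ARGV;
    die "cannot re-exec $0: $!";
}
```

After the re-exec the check returns false, so the script runs only one extra time.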

I've attached a test script that shows the problem (remove the LANG=C to
make it break).

Question:
---------
Does anyone know a better way of working around this problem (e.g.
getting 'no utf8;' to work)?
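
One avenue that may be worth testing is sketched below. It rests on an unconfirmed assumption: that the match fails because the data is read back through a UTF-8 input layer that perl 5.8.0 applies under a UTF-8 locale. The utf8 pragma only concerns the encoding of the script's source text, which would explain why 'no utf8;' has no effect on data read from a file. read_raw() is a hypothetical helper, and the temp file name is made up.

```perl
# Sketch under an assumption: the match fails because the data is read
# through an implicit UTF-8 layer. read_raw() is a hypothetical helper.
sub read_raw {
    my ($path) = @_;
    local *IN;
    open(IN, $path) or die "cannot open $path: $!";
    binmode(IN);          # one-argument form, also accepted by older perls
    local $/ = undef;     # slurp mode, restored when the sub returns
    my $data = <IN>;
    close(IN);
    return $data;
}

my $tmpfile = "/tmp/trash991.$$";
open(OUT, ">$tmpfile") or die "cannot write $tmpfile: $!";
print OUT "%define pfx x";
close(OUT);

my $text = read_raw($tmpfile);
unlink $tmpfile;

my ($name) = $text =~ m,^%define\s+([^\s]+),;
print defined $name ? "'$name'\n" : "no match\n";
```

Whether this helps on the RH8/9 perl is untested here; it would still mean touching every place a legacy script opens a file.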

TIA, Stuart






eval 'LANG=C exec perl -w -S $0 ${1+"$@"}'
    if $running_under_some_shell;
$running_under_some_shell = 0;

# Uncommenting this doesn't seem to make any difference
#no utf8;

# The pattern to match in
$_       = "%define pfx x";

# Write the string to a temp file and read it back in
$tmpfile = "/tmp/trash991";
open F, ">$tmpfile" or die "cannot write $tmpfile: $!";
print F $_;
close F;
$/ = undef;    # slurp the whole file in one read
open F, $tmpfile or die "cannot read $tmpfile: $!";
$_ = <F>;
close F;
unlink $tmpfile;

# fails on Perl 5.8.0 when LANG is set to any UTF-8 locale
m,^%define\s+([^\s]+),;

# fails on 5.6
#m,^%define\s+([^\p{IsSpace}]+),;
# fails on 5.6
#m,^%define\s+([\P{IsSpace}]+),;
# works on all versions, but I don't understand why it behaves differently
#m,^%define\s+([\S]+),;

print "'$1'\n";



  • UTF-8 and matching [^\s], Stuart Hughes <=