Dear all,
I am new to (explicit) Unicode handling, and I am facing the following
problem.
I have some data (lots of data) that in theory should be in ASCII (with
entity references in place of non-ASCII characters). I have no easy way
to find out exactly how these data were generated.
When running the following script:
************************************************************************
#!/usr/bin/perl -w
binmode(STDOUT,":utf8");
while (<>) {
    # ...
    $line = $_;
    # ...
    # line 22 follows:
    $line =~ s/([,;:!\?\)\]])/ $1/g;
    $line =~ s/(\.?<\/)/ $1/g;
    #...
}
# ...
************************************************************************
I get thousands of warnings like these (they only pertain to line 22,
not line 23 or other lines):
Malformed UTF-8 character (unexpected continuation byte 0xb0, with no
preceding start byte) in substitution (s///) at
look_for_probable_nprs.pl line 22, <> line 86851.
Malformed UTF-8 character (unexpected continuation byte 0xb0, with no
preceding start byte) in substitution iterator at
look_for_probable_nprs.pl line 22, <> line 339770.
Malformed UTF-8 character (unexpected non-continuation byte 0x6f,
immediately after start byte 0xe7) in substitution iterator at
look_for_probable_nprs.pl line 22, <> line 754455.
I looked at a few of the corresponding lines, and they all have some
character that is beyond the ASCII range, and that was not converted
into an entity reference (for example, a c with cedilla, and the like).
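A check along the following lines can list such lines. This is only a
minimal sketch, not part of the script above: the helper names
has_non_ascii and show_bytes and the sample strings are made up for
illustration, and it assumes the data are handled as raw bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the line contains any byte outside 7-bit ASCII.
sub has_non_ascii {
    my ($line) = @_;
    return $line =~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical helper: make each non-ASCII byte visible as a \xNN escape.
sub show_bytes {
    my ($line) = @_;
    (my $copy = $line) =~ s/([^\x00-\x7F])/sprintf("\\x%02X", ord($1))/ge;
    return $copy;
}

# Made-up sample data: one clean line, one with a raw byte 0xE7
# (c with cedilla in Latin-1), one with the entity reference instead.
my @sample = (
    "plain ascii line\n",
    "gar\xE7on with a raw byte\n",
    "gar&ccedil;on with an entity\n",
);

# Keep only the lines that contain a non-ASCII byte and print them escaped.
my @flagged = grep { has_non_ascii($_) } @sample;
print show_bytes($_) for @flagged;   # prints: gar\xE7on with a raw byte
```

To scan the real data, the @sample list could be replaced by a loop over
the input lines, printing $. (the input line number) next to each hit.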
The same script does not complain about the Unicode characters that I
insert myself (using the \x{HEX} notation).
Commenting out the
binmode(STDOUT,":utf8");
line does not change anything (I get the same type of warning).
I run perl v5.8.0 on Mac OS X 10.2.6.
My questions are:
1) What is going on? Is there some documentation I can read that would
help me understand and perhaps fix the problem? (A Google search did
not return anything that seemed particularly illuminating.)
2) Is the worst that can happen that the offending characters will
not be handled properly by the regexp on line 22 (which should be
ignoring them anyway), or is this a sign that worse things are going on?
Thanks a lot!
Regards,
Marco
---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni