It took a while to show up, but I now see that other regexps (such as
the one on line 23) occasionally complain as well:
Malformed UTF-8 character (unexpected end of string) at
look_for_probable_nprs.pl line 121, <> line 1021152.
(Line 121 of script:
$w =~ s/^\x{e0}/\x{c0}/;
)
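For reference, here is a minimal sketch of the same substitution applied to text that has been decoded explicitly. It assumes that the stray bytes are ISO-8859-1, which has not been confirmed for this data. The sub name capitalize_initial_a_grave is hypothetical and not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# ASSUMPTION: the input bytes are ISO-8859-1; the real encoding of the
# data is unknown. Hypothetical helper, not part of the original script.
sub capitalize_initial_a_grave {
    my ($bytes) = @_;
    my $w = decode('ISO-8859-1', $bytes);   # raw bytes -> Perl characters
    $w =~ s/^\x{e0}/\x{c0}/;                # same substitution as on line 121
    return encode('UTF-8', $w);             # characters -> UTF-8 bytes for output
}
```

Decoding first means the regexp sees characters rather than raw bytes, so it never has to guess whether a byte such as 0xe0 starts a UTF-8 sequence.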
On Sunday, Oct 26, 2003, at 01:12 Europe/Rome, Marco Baroni wrote:
Dear all,
I am new to (explicit) Unicode handling, and right now I am facing
this problem.
I have some data (lots of data) that in theory should be in ASCII
(with entity references in place of non-ASCII characters). I have no
easy way of finding out exactly how these data were generated.
When running the following script:
****************************************************************************
#!/usr/bin/perl -w
binmode(STDOUT,":utf8");
while (<>) {
# ...
$line = $_;
# ...
# line 22 follows:
$line =~ s/([,;:!\?\)\]])/ $1/g;
$line =~ s/(\.?<\/)/ $1/g;
#...
}
# ...
****************************************************************************
I get thousands of warnings like these (they only pertain to line 22,
not line 23 or other lines):
Malformed UTF-8 character (unexpected continuation byte 0xb0, with no
preceding start byte) in substitution (s///) at
look_for_probable_nprs.pl line 22, <> line 86851.
Malformed UTF-8 character (unexpected continuation byte 0xb0, with no
preceding start byte) in substitution iterator at
look_for_probable_nprs.pl line 22, <> line 339770.
Malformed UTF-8 character (unexpected non-continuation byte 0x6f,
immediately after start byte 0xe7) in substitution iterator at
look_for_probable_nprs.pl line 22, <> line 754455.
I looked at a few of the corresponding lines, and they all have some
character that is beyond the ASCII range, and that was not converted
into an entity reference (for example, a c with cedilla, and the like).
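A minimal sketch of how such lines could be located automatically, assuming the input is read as raw bytes and checked with the core Encode module. The sub names is_valid_utf8 and bad_line_numbers are hypothetical and not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string is well-formed (strict) UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $bytes intact.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical helper: line numbers of lines that contain non-ASCII bytes
# which do not form valid UTF-8 sequences.
sub bad_line_numbers {
    my ($fh) = @_;
    my @bad;
    while (my $line = <$fh>) {
        push @bad, $. if $line =~ /[\x80-\xFF]/ && !is_valid_utf8($line);
    }
    return @bad;
}
```

Calling bad_line_numbers(\*STDIN) after binmode(STDIN, ':raw') would list the suspect input lines, which could then be compared with the line numbers in the warnings.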
The same script does not complain about the Unicode characters that I
insert myself (using the \x{HEX} notation).
Commenting out the
binmode(STDOUT,":utf8");
line does not change anything (I get the same type of warning).
I run perl v5.8.0 on Mac OS X 10.2.6.
My questions are:
1) What is going on? Is there some documentation I can read that would
make me understand and perhaps fix the problem? (A google search did
not return anything that seemed particularly illuminating.)
2) Is the worst that can happen that the offending characters will
not be handled properly by the regexp on line 22 (which should be
ignoring them anyway), or is this a sign that worse things are going on?
Thanks a lot!
Regards,
Marco
---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni