PS (Malformed UTF-8 character)

It took a while to start happening, but I see that other reg exps (such as the one on line 23) occasionally complain too:

Malformed UTF-8 character (unexpected end of string) at look_for_probable_nprs.pl line 121, <> line 1021152.


(Line 121 of script:

    $w =~ s/^\x{e0}/\x{c0}/;

)

On Sunday, Oct 26, 2003, at 01:12 Europe/Rome, Marco Baroni wrote:

Dear all,
I am new to (explicit) Unicode handling, and right now I am facing this problem.
I have some data (lots of data) that in theory should be in ASCII (with entity references in place of non-ASCII characters). I have no easy way of finding out exactly how these data were generated.
When running the following script:
****************************************************************************
#!/usr/bin/perl -w
binmode(STDOUT,":utf8");

while (<>) {

# ...

    $line = $_;

# ...

# line 22 follows:
    $line =~ s/([,;:!\?\)\]])/ $1/g;
    $line =~ s/(\.?<\/)/ $1/g;

#...

}

# ...
****************************************************************************
I get thousands of warnings like these (they only pertain to line 22, not line 23 or other lines):
Malformed UTF-8 character (unexpected continuation byte 0xb0, with no preceding start byte) in substitution (s///) at look_for_probable_nprs.pl line 22, <> line 86851.
Malformed UTF-8 character (unexpected continuation byte 0xb0, with no preceding start byte) in substitution iterator at look_for_probable_nprs.pl line 22, <> line 339770.
Malformed UTF-8 character (unexpected non-continuation byte 0x6f, immediately after start byte 0xe7) in substitution iterator at look_for_probable_nprs.pl line 22, <> line 754455.
I looked at a few of the corresponding lines, and they all contain some character that is beyond the ASCII range and that was not converted into an entity reference (for example, a c with cedilla, and the like).
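To find every such line rather than checking a few by hand, something along these lines might help. It is only a sketch: classify_line is a hypothetical helper, not part of look_for_probable_nprs.pl, and the sample strings are made up around the bytes named in the warnings (0xb0, and 0xe7 followed by 0x6f). It labels a raw byte string as pure ASCII, well-formed UTF-8, or something else.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of look_for_probable_nprs.pl):
# classify a raw byte string as 'ascii', 'utf8' or 'other'.
sub classify_line {
    my ($bytes) = @_;

    # Nothing above 0x7f: plain ASCII.
    return 'ascii' unless $bytes =~ /[^\x00-\x7f]/;

    # utf8::decode works in place and returns false if the bytes
    # are not well-formed UTF-8, so try it on a copy.
    my $copy = $bytes;
    return utf8::decode($copy) ? 'utf8' : 'other';
}

# Made-up sample byte strings; the high bytes are the ones
# named in the warnings.
my @samples = (
    "plain ascii",
    "gar\xc3\xa7on",   # c with cedilla encoded as UTF-8
    "gar\xe7on",       # 0xe7 followed by 0x6f ("o")
    "30\xb0",          # a lone 0xb0
);

for my $s (@samples) {
    print classify_line($s), "\n";
}
```

On real data the lines would have to be read as raw bytes (no :utf8 layer on the input handle) for this check to mean anything. Lines that come out as 'other' would be the ones to inspect.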
The same script does not complain about the Unicode characters that I insert (using the \x{HEX} notation).
Commenting out the

binmode(STDOUT,":utf8");

line does not change anything (I get the same type of warning).
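Since binmode on STDOUT only affects output, one thing that might be worth trying is to decode the input explicitly before the substitutions run. The sketch below assumes the stray bytes are ISO-8859-1 (Latin-1), which is only a guess, since the origin of the data is unknown. decode_latin1_line is a hypothetical helper and the sample string is made up; decode and encode come from the Encode module that ships with perl 5.8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw bytes into a character string,
# ASSUMING the bytes are ISO-8859-1 (a guess about the data).
sub decode_latin1_line {
    my ($bytes) = @_;
    return decode('iso-8859-1', $bytes);
}

# Made-up sample: 0xe7 is c with cedilla in Latin-1.
my $raw  = "gar\xe7on, ici";
my $line = decode_latin1_line($raw);

# Same substitution as line 22 of the script, now applied
# to decoded characters rather than raw bytes.
$line =~ s/([,;:!\?\)\]])/ $1/g;

# Encode explicitly on the way out instead of relying on a layer.
my $out = encode('UTF-8', $line);
print $out, "\n";
```

If the data turn out to be in some other 8-bit encoding, only the encoding name passed to decode would need to change.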

I run perl v5.8.0 on Mac OS X 10.2.6.

My questions are:
1) What is going on? Is there some documentation I can read that would help me understand and perhaps fix the problem? (A Google search did not return anything that seemed particularly illuminating.)
2) Is the worst that can happen that the offending characters will not be handled properly by the reg exp on line 22 (which should be ignoring them anyway), or is this a sign that worse things are going on?
Thanks a lot!

Regards,

Marco


---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni
