perl-unicode

PS (Malformed UTF-8 character)

2003-10-25 16:30:05
It took a while to start happening, but I now see that other regexps (such as the one on line 23) occasionally complain as well:

Malformed UTF-8 character (unexpected end of string) at look_for_probable_nprs.pl line 121, <> line 1021152.

(Line 121 of script:

    $w =~ s/^\x{e0}/\x{c0}/;

)

On Sunday, Oct 26, 2003, at 01:12 Europe/Rome, Marco Baroni wrote:

Dear all,

I am new to (explicit) Unicode handling, and right now I am facing the following problem.

I have some data (lots of data) that in theory should be in ASCII (with entity references in place of non-ASCII characters). I have no easy way to find out exactly how these data were generated.

When running the following script:

****************************************************************************
#!/usr/bin/perl -w
binmode(STDOUT,":utf8");

while (<>) {

# ...

    $line = $_;

# ...

# line 22 follows:
    $line =~ s/([,;:!\?\)\]])/ $1/g;
    $line =~ s/(\.?<\/)/ $1/g;

#...

}

# ...
****************************************************************************

I get thousands of warnings like these (they only pertain to line 22, not line 23 or other lines):

Malformed UTF-8 character (unexpected continuation byte 0xb0, with no preceding start byte) in substitution (s///) at look_for_probable_nprs.pl line 22, <> line 86851.

Malformed UTF-8 character (unexpected continuation byte 0xb0, with no preceding start byte) in substitution iterator at look_for_probable_nprs.pl line 22, <> line 339770.

Malformed UTF-8 character (unexpected non-continuation byte 0x6f, immediately after start byte 0xe7) in substitution iterator at look_for_probable_nprs.pl line 22, <> line 754455.

I looked at a few of the corresponding lines, and they all contain some character that is beyond the ASCII range and that was not converted into an entity reference (for example, a c with cedilla, and the like).
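One way to find every such line, rather than inspecting a few by hand, is to classify each input line by its raw bytes. The following is only a minimal sketch: the helper name classify_line and the sample strings are hypothetical, and the samples merely reuse the byte values reported in the warnings above (0xe7 followed by 0x6f, and a lone 0xb0). It relies on the Encode module that ships with perl 5.8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: classify a string of raw bytes as
#   'ascii' - only bytes in the range 0x00-0x7F
#   'utf8'  - contains high bytes, and they form well-formed UTF-8
#   'other' - contains high bytes that are not well-formed UTF-8
sub classify_line {
    my ($bytes) = @_;
    return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;

    # decode() may modify its argument when a CHECK value is given,
    # so work on a copy.
    my $copy = $bytes;
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return $ok ? 'utf8' : 'other';
}

# Sample data (assumed, not taken from the real corpus).
my @samples = (
    "plain text",        # ASCII only
    "fran\xC3\xA7ois",   # c with cedilla encoded as UTF-8
    "fran\xE7ois",       # start byte 0xe7 followed by 0x6f ("o")
    "25\xB0",            # lone continuation byte 0xb0
);

for my $sample (@samples) {
    printf "%-6s %d bytes\n", classify_line($sample), length $sample;
}
```

Running the same check inside the while (<>) loop and printing $. for every line that is not 'ascii' would list the offending input lines. Lines classified as 'other' are the ones that cannot be UTF-8 at all.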

The same script does not complain about the Unicode characters that I insert myself (using the \x{HEX} notation).

Commenting out the

binmode(STDOUT,":utf8");

line does not change anything (I get the same type of warning).
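Since the output layer makes no difference, one thing worth trying is to make the encoding of the input explicit before any substitution runs. The sketch below is an assumption-laden illustration, not a confirmed fix: it guesses that the stray bytes are ISO-8859-1 (where 0xe7 is c with cedilla and 0xb0 is the degree sign), and the helper name space_punctuation is hypothetical. It decodes the raw bytes, applies the line 22 substitution to the decoded string, and encodes the result as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode raw input bytes under an assumed encoding,
# put a space before punctuation (the substitution from line 22), and
# return the result as UTF-8 encoded bytes.
sub space_punctuation {
    my ($bytes, $assumed_encoding) = @_;

    # Turn bytes into a proper character string first, so the regexp
    # engine never sees half of a multi-byte sequence.
    my $line = Encode::decode($assumed_encoding, $bytes);

    $line =~ s/([,;:!\?\)\]])/ $1/g;

    return Encode::encode('UTF-8', $line);
}

# Sample input (assumed): ISO-8859-1 bytes for c with cedilla and degree sign.
my $out = space_punctuation("gar\xE7on, 25\xB0!", 'iso-8859-1');
print $out, "\n";
```

Because the function already returns encoded bytes, STDOUT should not also carry a :utf8 layer in this variant, or the output would be encoded twice. If the data turn out to be in some other encoding, only the second argument needs to change.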

I run perl v5.8.0 on Mac OS X 10.2.6.

My questions are:

1) What is going on? Is there some documentation I can read that would help me understand and perhaps fix the problem? (A Google search did not return anything that seemed particularly illuminating.)

2) Is the worst that can happen that the offending characters will not be handled properly by the regexp on line 22 (which should be ignoring them anyway), or is this a sign that worse things are going on?

Thanks a lot!

Regards,

Marco


---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni

