perl-unicode

Re: Workaround to a unicode bug needed

2010-09-06 11:31:23
You need a 'use utf8;' statement at the beginning of your program to tell Perl that the source code is encoded in UTF-8.

I tested it with that, and it works.
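
A minimal sketch of the suggested fix, assuming the script is saved as UTF-8. The `tokenize` wrapper and the sample string are illustrative and not part of the original program. The `binmode` call is an addition: `use utf8` only affects how the source text is read, so the output stream still needs an encoding layer if the result is to be written as UTF-8.

```perl
use strict;
use warnings;
use utf8;                               # the source file itself is UTF-8

binmode(STDOUT, ':encoding(UTF-8)');    # encode characters as UTF-8 on output

# Hypothetical wrapper around the three substitutions of the original tokenizer.
# It expects a string of decoded characters, not raw UTF-8 bytes.
sub tokenize {
    my ($text) = @_;
    # Replace every run of characters outside the list with a single newline.
    $text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
    # Put each punctuation sign on a line of its own.
    $text =~ s/([,.?!:;()'\-])/\n$1\n/g;
    # Collapse consecutive newlines.
    $text =~ s/\n+/\n/g;
    return $text;
}

# Sample taken from the last sentence of the Swedish test text.
print tokenize("»Just den rätta!» håna de.");
```

When the text comes from standard input instead of a literal, it has to be decoded as well, for example with `binmode(STDIN, ':encoding(UTF-8)')` before reading.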

Pierre Nugues wrote:
Dear All,

I wrote a simple tokenizer for texts containing Latin9 characters. It does not 
behave as expected with the Swedish text below and I would like to find a 
workaround.

More precisely, Perl does not properly remove the Swedish quotation marks, » 
(RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, U+00BB), from the text. See the first 
character of the first line of this text.

When I run the program with Perl 5.8.8 on Mac OS X Snow Leopard, on the text encoded 
in UTF-8, Perl outputs a malformed UTF-8 sequence for this character: <BB>
I could work around the problem by removing the û character (LATIN SMALL LETTER U 
WITH CIRCUMFLEX, U+00FB) from the tr// list.
Do you know of a better, cleaner way to work around this bug?

Thank you for your help
Pierre
--

### The Perl Program
### An elementary tokenizer. Save it in UTF-8
___BEGIN

while ($line = <>) {
    $text .= $line;
}
$text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
  # The dash character must be quoted
$text =~ s/([,.?!:;()'\-])/\n$1\n/g;
$text =~ s/\n+/\n/g;
print $text;

___END
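
One likely explanation for the stray <BB>, offered as an assumption rather than a confirmed diagnosis: without `use utf8`, Perl reads the tr// list as individual bytes, so the second byte of û ends up among the bytes to keep, and » happens to end with the same byte. The sketch below only prints the UTF-8 encodings of the two characters; the helper name `utf8_bytes_hex` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 encoding of a code point as space-separated hex bytes.
sub utf8_bytes_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

printf "U+00FB -> %s\n", utf8_bytes_hex(0xFB);   # prints "U+00FB -> C3 BB"
printf "U+00BB -> %s\n", utf8_bytes_hex(0xBB);   # prints "U+00BB -> C2 BB"
```

Both encodings share the trailing byte BB, which would fit the observation that removing û from the list makes the symptom disappear.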

### The text to reproduce the bug. Save it in UTF-8

___BEGIN
»Tjuvgömmare!» säga skatorna och se ut som samvetet självt. »Vi äro polisbetjänter, vi. Hit med tjuvgodset!» »Å, tyst, era rackare! Jag är gårdsfogden.» »Just den rätta!» håna de.
___END
--
Pierre Nugues, Lunds Tekniska Högskola, Institutionen för datavetenskap, Box 118, 
S-221 00 Lund, Sweden.
Tel. (0046) 46 222 96 40, http://www.cs.lth.se/~pierre
Visitors: Lunds Tekniska Högskola, E-huset, rum 4134A, Ole Römers väg 3, S-223 
63 Lund.
My book: http://www.cs.lth.se/home/Pierre_Nugues/ilppp/