Re: Workaround to a unicode bug needed

Pierre Nugues wrote on 06.09.2010 at 11:09 (+0200):

I wrote a simple tokenizer for texts containing Latin9 characters.


You probably mean "non-ASCII characters". Latin9, alias ISO-8859-15, is
an encoding, not a set of characters. It's worthwhile to make that
distinction here.

When I run the program on a Mac Snow Leopard, with version 5.8.8 on
the text encoded in UTF-8, Perl outputs a defective UTF-8 code for
this character: <BB>


As someone said, provided the input was read and decoded correctly, the
output depends on your binmode setting and your terminal. Use ":utf8" or
":encoding(UTF-8)" for a UTF-8 terminal.
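
To make the point about output layers concrete, here is a minimal sketch
of what happens to U+00BB (the character from your report) on the way
out. It uses the core Encode module to show the bytes. The helper
hex_bytes is only a name made up for this illustration, and the Latin-1
encoding stands in for what a handle without an encoding layer writes
for this character.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return uc(join(' ', unpack('(H2)*', $bytes)));
}

my $char = "\x{BB}";    # U+00BB, the character from the original report

# Without an encoding layer, Perl writes this character as one byte,
# the same byte Latin-1 uses. A UTF-8 terminal sees it as malformed.
my $latin1 = hex_bytes(encode('ISO-8859-1', $char));

# With ':encoding(UTF-8)' on the handle, two bytes are written instead.
my $utf8 = hex_bytes(encode('UTF-8', $char));

print "Latin-1: $latin1\n";    # BB
print "UTF-8:   $utf8\n";      # C2 BB
```

A lone BB byte is not valid UTF-8, which would match the <BB> a UTF-8
terminal or pager shows.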

### An elementary tokenizer. Save it in UTF-8


Then it should indeed have "use utf8", because the source itself
contains non-ASCII literals (the character list in tr///).
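
A small sketch of what the pragma changes, assuming the script file is
saved as UTF-8. The variable names are invented for the illustration.

```perl
use strict;
use warnings;

# 'œ' is U+0153; saved as UTF-8 it occupies the two bytes 0xC5 0x93.
my $with_pragma    = do { use utf8; length 'œ' };
my $without_pragma = do { no utf8;  length 'œ' };

print "with use utf8:    $with_pragma\n";       # 1 (one character)
print "without use utf8: $without_pragma\n";    # 2 (two bytes)
```

Without the pragma, tr/// sees a list of bytes rather than characters.
That is one plausible source of a stray <BB>: in UTF-8, U+00BB is the
two bytes C2 BB, and BB is also the second byte of û (C3 BB), which is
in your list, while C2 is not. A byte-wise tr///c would then replace C2
and keep BB.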

__BEGIN

while ($line = <>) { 
   $text .= $line;
}


If your data is text in UTF-8, you'll want to set your filehandle
(STDIN in this case) to UTF-8:

  binmode STDIN, ':encoding(UTF-8)'; # ':utf8' also works but is less strict
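
The effect of the input layer can be sketched without a real file by
reading from an in-memory filehandle. The helper slurp_with_layer and
the sample bytes are assumptions made up for this example; the same
layers apply to STDIN via binmode.

```perl
use strict;
use warnings;

# Hypothetical helper: read all of $bytes through the given PerlIO layer.
sub slurp_with_layer {
    my ($bytes, $layer) = @_;
    open my $fh, "<$layer", \$bytes or die "open: $!";
    local $/;                      # slurp mode
    my $text = <$fh>;
    close $fh;
    return $text;
}

my $bytes = "\xC5\x93uf\n";        # "œuf\n" encoded as UTF-8

my $raw     = slurp_with_layer($bytes, ':raw');
my $decoded = slurp_with_layer($bytes, ':encoding(UTF-8)');

print length($raw), "\n";          # 5 bytes
print length($decoded), "\n";      # 4 characters
```

Only the decoded string lets tr/// and regular expressions treat œ as a
single character.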

$text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
  # The dash character must be quoted
$text =~ s/([,.?!:;()'\-])/\n$1\n/g;
$text =~ s/\n+/\n/g;
print $text;


I don't know your goal, but consider that \w in a regular expression
works fine to catch words:

          \,,,/
          (o o)
------oOOo-(_)-oOOo------
use strict;
use warnings;
my $fn = shift or die "Usage: $0 FILE\n"; # file in some encoding
open my $fh, '<:encoding(iso-8859-15)', $fn or die "open $fn: $!";
my $txt = do { local $/; <$fh> };
close $fh;
binmode STDOUT, ':utf8'; # UTF-8 terminal
print $txt;
my @words = $txt =~ m/\w+/gms;
print $_, "\n" for @words;
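
For a quick check without an input file, here is the same \w idea on a
string literal. The sample text is made up, and the sketch assumes the
script is saved as UTF-8 and run in a UTF-8 terminal.

```perl
use strict;
use warnings;
use utf8;                               # this source contains non-ASCII literals

binmode STDOUT, ':encoding(UTF-8)';     # assumes a UTF-8 terminal

my $txt   = "L'œuf, déjà vu!";          # made-up sample text
my @words = $txt =~ m/\w+/g;

print "$_\n" for @words;                # L, œuf, déjà, vu (one per line)
```

Note that the apostrophe is not a word character, so "L" and "œuf" come
out as separate tokens, and the punctuation is dropped rather than
emitted as tokens of its own.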

-- 
Michael Ludwig