Re: Workaround to a unicode bug needed

Dear Michael,

> Pierre Nugues wrote on 06.09.2010 at 11:09 (+0200):
>
>> I wrote a simple tokenizer for texts containing Latin9 characters.
>
> You probably mean "non-ASCII characters". Latin9, alias ISO-8859-15, is
> the encoding. It's worthwhile making a distinction here.

I meant what I wrote. The text contains a subset of the Latin9 characters. The 
list I consider in tr// is:
a-zåàâäæçéèêëîïôöœßùüûÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ

It is not a subset of Latin1, because of œ, Œ and Ÿ.
The encoding of the text is UTF-8.
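
To make the Latin1/Latin9 point concrete, here is a minimal sketch using the core Encode module. The helper name fits is made up for illustration; it simply reports whether a character can be encoded in a given 8-bit encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: 1 if $char has a code point in $encoding, else 0.
sub fits {
    my ($encoding, $char) = @_;
    my $copy = $char;    # encode() with a CHECK value may modify its argument
    return eval { encode($encoding, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# U+0153 (oe ligature), U+0152 (OE ligature), U+0178 (Y with diaeresis)
for my $char ("\x{153}", "\x{152}", "\x{178}") {
    printf "U+%04X Latin1=%d Latin9=%d\n",
        ord($char),
        fits('iso-8859-1',  $char),
        fits('iso-8859-15', $char);
}
```

The characters are written as \x{...} escapes so that the sketch does not depend on how the file itself is saved.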

>> When I run the program on a Mac running Snow Leopard, with version
>> 5.8.8, on the text encoded in UTF-8, Perl outputs a defective UTF-8
>> code for this character: <BB>
>
> As someone said, given correct input acquisition, the output depends on
> your binmode setting and your terminal. Use ":utf8" or ":encoding(UTF-8)"
> for a UTF-8 terminal.

I tried different combinations of binmode input and output and none of them 
works. Here is my terminal configuration:
Pierres-MBP-3:ch04 pierre$ locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=

The setting LC_ALL='C' does not work either.

>> ### An elementary tokenizer. Save it in UTF-8
>
> Then it should have "use utf8" indeed.

I tried this too.

> If your data is text in UTF-8, you'll want to set your filehandle
> (STDIN in this case) to UTF-8:
>
>  binmode STDIN, ':encoding(UTF-8)'; # ':utf8' also works but does not validate the input

This does not work either.
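
For reference, a minimal sketch of the setup being discussed: "use utf8" for the source, an :encoding(UTF-8) layer on input and output, and the tr// list from above. This is not the original tokenizer. The helper name tokenize and the sample sentence are made up for illustration, and the input is simulated with an in-memory filehandle carrying the same layer one would pass to binmode STDIN. Whether Perl 5.8.8 handles this correctly is precisely what is in question.

```perl
use strict;
use warnings;
use utf8;                        # the tr// list below is written in UTF-8
use Encode qw(encode);

# Hypothetical helper: replace every run of characters outside the
# letter list with a single newline, giving one token per line.
sub tokenize {
    my ($text) = @_;
    $text =~ tr/a-zåàâäæçéèêëîïôöœßùüûÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ/\n/cs;
    return $text;
}

# Simulate UTF-8 input without touching STDIN: raw bytes read through
# an in-memory filehandle that decodes them as UTF-8.
my $bytes = encode('UTF-8', "L'œuvre de Noël\n");
open my $in, '<:encoding(UTF-8)', \$bytes or die "open: $!";
binmode STDOUT, ':encoding(UTF-8)';

while (my $line = <$in>) {
    print tokenize($line);
}
close $in;
```

The /c modifier complements the letter list, and /s squeezes each run of replaced characters into one newline.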

> I don't know your goal, but consider that \w in a regular expression
> works fine to catch words:

I know there are workarounds, but I would like to find one for tr//.
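
For comparison, a minimal sketch of the kind of regex-based workaround meant here. It is an assumption about what such a workaround could look like, not code from this thread, and the helper name words is hypothetical. Since \w also matches digits and the underscore, the class [^\W\d_] is used to keep letters only.

```perl
use strict;
use warnings;
use utf8;                        # the sample string is written in UTF-8

# Hypothetical helper: return the list of words found in a decoded string.
# Call it in list context; in scalar context the match returns a boolean.
sub words {
    my ($text) = @_;
    return $text =~ /([^\W\d_]+)/g;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for words("L'œuvre de Noël, 2010");
```

Unlike the tr// list, this accepts any Unicode letter, not only the Latin9 subset given above.
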
Pierre