Re: Workaround to a unicode bug needed

Pierre Nugues wrote:

Dear Michael,

Pierre Nugues wrote on 06.09.2010 at 11:09 (+0200):

I wrote a simple tokenizer for texts containing Latin9 characters.

You probably mean "non-ASCII characters". Latin9, alias ISO-8859-15, is
the encoding. It's worthwhile making a distinction here.

I meant what I wrote. The text contains a subset of the Latin9 characters. The 
list I consider in tr// is:
a-zåàâäæçéèêëîïôöœßùüûÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ

It is not a subset of Latin1, because of œ and Œ. The encoding of the text is UTF-8.

When I run the program on a Mac Snow Leopard, with version 5.8.8 on
the text encoded in UTF-8, Perl outputs a defective UTF-8 code for
this character: <BB>

As someone said, given correct input acquisition, the output depends on
your binmode setting and your terminal. Use ":utf8" or ":encoding(UTF-8)"
for a UTF-8 terminal.

I tried different combinations of binmode input and output and none of them 
works. Here is my terminal configuration:
Pierres-MBP-3:ch04 pierre$ locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=

The setting LC_ALL='C' does not work either.

### An elementary tokenizer. Save it in UTF-8

Then it should have "use utf8" indeed.

I tried this too.

If your data is text in UTF-8, you'll want to set your filehandle
(STDIN in this case) to UTF-8:

 binmode STDIN, ':encoding(UTF-8)'; # ':utf8' also works but is less strict

This does not work either.
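
For anyone following along, here is a minimal sketch of what the encoding layer is supposed to change. It is an illustration, not the program from this thread: the sample bytes and the in-memory filehandles are assumptions chosen so that it runs without any input file. If the two lengths it prints are equal on your machine, the input is not being decoded.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for "œuvre à Noël" (hypothetical sample data),
# standing in for what arrives on STDIN or from a file.
my $bytes = "\xC5\x93uvre \xC3\xA0 No\xC3\xABl\n";

# Without a layer, Perl reads bytes: œ, à and ë count as two each.
open my $raw, '<', \$bytes or die "open: $!";
my $undecoded = <$raw>;
close $raw;

# With the layer, each UTF-8 sequence is decoded to one character.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $decoded = <$in>;
close $in;

chomp($undecoded, $decoded);
printf "bytes: %d, characters: %d\n", length $undecoded, length $decoded;
```

tr// and regular expressions compare characters, so they can only match œ or Œ from the list above when the string has been decoded as in the second read.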

I don't know your goal, but consider that \w in a regular expression
works fine to catch words:
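
The example that followed is not preserved here. A minimal sketch of the \w approach might look like the following; the sub name words() and the sample sentence are my own assumptions, not taken from the thread.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal non-ASCII characters

# Return the words found in a decoded (character) string.
sub words {
    my ($text) = @_;
    # Make sure the copy is stored as UTF-8 internally, so that \w treats
    # characters in the range 128-255 (ä, å, ...) as letters on older Perls.
    utf8::upgrade($text);
    return $text =~ /(\w+)/g;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for words("Jag är gårdsfogden. Œuvre, cœur!");
```

Note that \w also matches digits and the underscore, so it is broader than the letter list given for tr//.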

I know there are workarounds, but I would like to find one for tr//.
Pierre

Attached are two files. One is my revision of your program that seemed to work on my machine. The other is the output, so you can verify that it worked.

I ran it on 5.8.12, which is the only 5.8 version I have installed, and also on the latest 5.13.4 blead. I got the same results with both. Instead of worrying about the I/O, I had just put your data into the same utf8 program. I did use the -CO flag on the perl command line to get utf8 output.
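
For readers without the attachment, the setup described above can be sketched roughly as follows. This is not the attached workaround.pl. The character list is the one given earlier in the thread; the punctuation handling, the sub name tokenize() and the sample sentence are assumptions. The binmode call stands in for the -CO flag so that the sketch runs without command-line options.

```perl
use strict;
use warnings;
use utf8;    # the tr/// list below is literal UTF-8 in the source

# Split a decoded string into one token per line.
sub tokenize {
    my ($text) = @_;
    # Replace each run of characters outside the list with one newline.
    $text =~ tr/a-zåàâäæçéèêëîïôöœßùüûÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ,.?!:;/\n/cs;
    # Put each punctuation mark on a line of its own.
    $text =~ s/([,.?!:;])/\n$1\n/g;
    # Collapse repeated newlines, then trim a leading or trailing one.
    $text =~ s/\n+/\n/g;
    $text =~ s/\A\n|\n\z//g;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';    # used here in place of -CO
print tokenize("Jag är gårdsfogden. Just den rätta! håna de."), "\n";
```

Because the data is a literal in the same file, no input layer is involved, which helps separate a tr// problem from an I/O problem.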

If this program works for you, the problem is either in the I/O or it is a bug in your perl that was fixed by 5.8.12. If it doesn't work for you, then it seems that it is a bug in your perl; and I don't know how to work around it. You could use a Latin9 locale with a "use locale" and not encode in utf8.

workaround.pl
Description: Perl program


Tjuvgömmare
!
säga
skatorna
och
se
ut
som
samvetet
självt
.
Vi
äro
polisbetjänter
,
vi
.
Hit
med
tjuvgodset
!
Å
,
tyst
,
era
rackare
!
Jag
är
gårdsfogden
.
Just
den
rätta
!
håna
de
.