perl-unicode

Re: Workaround to a unicode bug needed

2010-09-06 16:08:06
Pierre Nugues wrote:
Dear Michael,

Pierre Nugues wrote on 06.09.2010 at 11:09 (+0200):
I wrote a simple tokenizer for texts containing Latin9 characters.
You probably mean "non-ASCII characters". Latin9, alias ISO-8859-15, is
the encoding. It's worthwhile making a distinction here.
I meant what I wrote. The text contains a subset of the Latin9 characters. The 
list I consider in tr// is:
a-zåàâäæçéèêëîïôöœßùüûÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ

It is not a subset of Latin1, because of œ and Œ. The encoding of the text is UTF-8.
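
For readers following along, here is a minimal sketch of a tr//-based tokenizer built on the letter list above. It is not the program discussed in this thread: the sub name tokenize and the punctuation set are assumptions made for illustration, and the source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # the literals below are UTF-8 in the source file

# Hypothetical helper, not the original program from this thread.
sub tokenize {
    my ($text) = @_;
    # Replace every run of characters outside the list by a single newline.
    # The punctuation kept here is an assumption.
    $text =~ tr/a-zåàâäæçéèêëîïôöœßùüûÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'(),.?!:;/\n/cs;
    # Put each punctuation mark on a line of its own.
    $text =~ s/([,.?!:;()'])/\n$1\n/g;
    # Collapse empty lines and trim the ends.
    $text =~ s/\n+/\n/g;
    $text =~ s/\A\n|\n\z//g;
    return split /\n/, $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for tokenize("Hit med tjuvgodset! Jag är gårdsfogden.");
```

On a perl without the reported defect, this should print one token per line, with each punctuation mark as a separate token.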

When I run the program on Mac OS X Snow Leopard, with Perl 5.8.8, on
the text encoded in UTF-8, Perl outputs a defective UTF-8 sequence for
this character: <BB>
As someone said, given correct input acquisition, the output depends on
your binmode setting and your terminal. Use ":utf8" or ":encoding(UTF-8)"
for a UTF-8 terminal.
I tried different combinations of binmode input and output and none of them 
works. Here is my terminal configuration:
Pierres-MBP-3:ch04 pierre$ locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=

The setting LC_ALL='C' does not work either.

### An elementary tokenizer. Save it in UTF-8
Then it should have "use utf8" indeed.
I tried this too.

If your data is text in UTF-8, you'll want to set your filehandle
(STDIN in this case) to UTF-8:

 binmode STDIN, ':encoding(UTF-8)'; # ':utf8' also works but is less correct
This does not work either.
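
One way to check whether a decoding layer takes effect at all is to read the same UTF-8 bytes with and without it and compare the character counts. The sketch below is only an illustration: the sub name read_line is hypothetical, and an in-memory filehandle stands in for STDIN.

```perl
use strict;
use warnings;

# UTF-8 bytes for "œuvre" followed by a newline.
my $bytes = "\xC5\x93uvre\n";

# Hypothetical helper: read one line from an in-memory file through a layer.
sub read_line {
    my ($layer, $data) = @_;
    open(my $fh, "<$layer", \$data) or die "cannot open in-memory file: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $raw     = read_line(':raw', $bytes);               # undecoded bytes
my $decoded = read_line(':encoding(UTF-8)', $bytes);   # decoded characters

printf "raw: %d, decoded: %d\n", length($raw), length($decoded);
# prints "raw: 6, decoded: 5"
```

If the decoded length equals the raw length, the layer was not applied and tr// is working on bytes rather than characters.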

I don't know your goal, but consider that \w in a regular expression
works fine to catch words:
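
A minimal sketch of what a \w-based tokenizer might look like; this is not the code from the original message, and the sub name tokenize_w is hypothetical. It assumes the text has already been decoded into characters.

```perl
use strict;
use warnings;
use utf8;    # the literal in the usage line is UTF-8 in the source file

# Hypothetical helper: a token is either a run of word characters
# (letters, digits, underscore) or a single non-space, non-word character.
sub tokenize_w {
    my ($text) = @_;
    my @tokens = $text =~ /(\w+|[^\w\s])/g;
    return @tokens;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for tokenize_w("Vi äro polisbetjänter, vi.");
```

Unlike the tr// list, \w also accepts digits, the underscore, and letters outside Latin9, so the two approaches are not strictly equivalent.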
I know there are workarounds, but I would like to find one for tr//.
Pierre



Attached are two files. One is my revision of your program that seemed to work on my machine. The other is the output, so you can verify that it worked.

I ran it on 5.8.12, which is the only 5.8 version I have installed, and also on the latest 5.13.4 blead. I got the same results with both. Instead of worrying about the I/O, I just put your data directly into the program itself, which is saved as UTF-8. I did use the -CO flag on the perl command line to get UTF-8 output.

If this program works for you, the problem is either in the I/O or a bug in your perl that was fixed by 5.8.12. If it doesn't work for you, then it seems to be a bug in your perl, and I don't know how to work around it. You could use a Latin9 locale with a "use locale" and not encode in utf8.
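
As a rough illustration of the "not encode in utf8" route, the data could be converted to Latin9 bytes with the core Encode module, so that every character in the list occupies a single byte. This is a sketch under that assumption only; the sub names are hypothetical, and the locale setup itself is system-dependent and not shown.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helpers: convert between UTF-8 bytes and Latin9 bytes.
sub utf8_to_latin9 {
    my ($bytes) = @_;
    return encode('iso-8859-15', decode('UTF-8', $bytes));
}

sub latin9_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode('iso-8859-15', $bytes));
}

my $latin9 = utf8_to_latin9("\xC5\x93");    # UTF-8 bytes of œ
printf "%d byte(s), first byte 0x%02X\n", length($latin9), ord($latin9);
# prints "1 byte(s), first byte 0xBD"
```

After such a conversion, tr// would see only single-byte characters, and the result can be converted back with latin9_to_utf8 before printing to a UTF-8 terminal.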

Attachment: workaround.pl
Description: Perl program


Tjuvgömmare
!
säga
skatorna
och
se
ut
som
samvetet
självt
.
Vi
äro
polisbetjänter
,
vi
.
Hit
med
tjuvgodset
!
Ã?
,
tyst
,
era
rackare
!
Jag
är
gårdsfogden
.
Just
den
rätta
!
håna
de
.