perl-unicode

Re: Bug in Encode::encode("MIME-Q", $string)

2004-02-19 00:30:06
On Feb 18, 2004, at 09:19, Marc Langer wrote:
Hello,

Thanks for your report.

the following example code produces a wrong RFC2047 encoded string:

use Encode qw(encode);
my $string = Encode::encode("UTF-8", "ääääääääääääääääääääääääääääääääääää");
print Encode::encode("MIME-Q", $string), "\n";

The output is:

=?UTF-8?Q? =C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3?= =?UTF-8?Q? =A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?= =?UTF-8?Q? =C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3?=
 =?UTF-8?Q?=A4=C3=A4=C3=A4=C3=A4=C3=A4?=

It LOOKS LIKE a bug but it is not. To demonstrate that, consider the following script:

use Encode qw(encode);
use utf8;
my $string = "ääääääääääääääääääääääääääääääääääää";
print Encode::encode("MIME-Q", $string), "\n";

It prints:

=?UTF-8?Q? =C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?= =?UTF-8?Q? =C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?= =?UTF-8?Q? =C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?=
 =?UTF-8?Q?=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?=

Now comment out "use utf8;" and the output will be the same as your script's.

RFC 2047 states:

A multi-octet character may not be split across adjacent 'encoded-word's.

So there must not be a line break after =C3 in the first and third lines.
When such a wrongly encoded header is presented to MUAs like mutt or Gnus,
they show question marks instead of the real characters at the position
of the wrong line break.
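
One way to see the rule in action is to decode each encoded-word on its own and check that the result is valid UTF-8. The following is only a rough sketch under some assumptions: the helper name encoded_words_are_whole is made up for this illustration, it looks only at UTF-8 "Q" encoded-words, and the two sample headers are shortened versions of the outputs shown in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if every UTF-8 "Q" encoded-word in $header
# decodes to valid UTF-8 on its own, i.e. no character is split.
sub encoded_words_are_whole {
    my ($header) = @_;
    for my $word (split ' ', $header) {
        # Skip anything that is not a UTF-8 "Q" encoded-word.
        next unless $word =~ /\A=\?UTF-8\?Q\?(.*)\?=\z/i;
        my $bytes = $1;
        $bytes =~ tr/_/ /;                              # "_" stands for a space
        $bytes =~ s/=([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # =XX becomes one octet
        # FB_CROAK makes decode() die on malformed UTF-8 such as a lone \xC3.
        return 0 unless eval { decode('UTF-8', $bytes, FB_CROAK); 1 };
    }
    return 1;
}

# Shortened samples: the first splits a character between =C3 and =A4.
my $split = "=?UTF-8?Q?=C3=A4=C3?=\n =?UTF-8?Q?=A4=C3=A4?=";
my $whole = "=?UTF-8?Q?=C3=A4=C3=A4?=\n =?UTF-8?Q?=C3=A4?=";

print encoded_words_are_whole($split) ? "ok" : "split character", "\n";
print encoded_words_are_whole($whole) ? "ok" : "split character", "\n";
```

The first header should be reported as "split character" and the second as "ok", which is roughly what an MUA runs into when it decodes each encoded-word separately.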

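That happens because you use Encode::encode("UTF-8"), which means the utf8 flag of the resulting $string is OFF. Without the UTF-8 flag, perl sees each ä as the two separate bytes "\xC3\xA4", not as the single character "\N{LATIN SMALL LETTER A WITH DIAERESIS}", so nothing tells the encoder that those two bytes belong together. Now try the code below.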

use Encode qw(encode);
my $string = Encode::decode("UTF-8", "ääääääääääääääääääääääääääääääääääää");
#                    ^^^^^^
print Encode::encode("MIME-Q", $string), "\n";

Now it works correctly. From perl's point of view, it is decode(), not encode(), that turns the data into perl's INTERNAL (UTF-8 flagged) character string.
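
The difference between the two forms can be checked directly. This is a small sketch rather than anything taken from the script above: it builds the bytes with \x escapes so that it does not depend on the encoding of the source file, and it uses utf8::is_utf8() only to peek at the flag.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Six octets, the kind of data encode("UTF-8", ...) hands back.
my $bytes = "\xC3\xA4" x 3;

# decode() turns those octets into three characters (U+00E4 each).
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';

# encode() goes the other way and gives back the same six octets.
print "round trip ok\n" if encode('UTF-8', $chars) eq $bytes;
```

It should report length 6 with the flag off for the bytes and length 3 with the flag on for the characters.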

To avoid this kind of confusion, I recommend that you simply put "use utf8;" in your script (saved as UTF-8) and skip the explicit Encode::decode("UTF-8") call.
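
What "use utf8;" changes can be seen by comparing the same literal with and without the pragma. This is a minimal sketch that assumes the file itself is saved as UTF-8; the variable names are only for illustration.

```perl
use strict;
use warnings;

# The pragma is lexically scoped, so each block sees its own setting.
my $with    = do { use utf8; "ä" };   # literal read as one character
my $without = do { no utf8;  "ä" };   # literal read as two raw octets

printf "with use utf8:    length %d\n", length $with;
printf "without use utf8: length %d\n", length $without;
```

With the pragma the literal is already a character string, so no separate decode() step is needed before handing it to encode("MIME-Q", ...).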

Dan the Encode Maintainer
