Re: Bug in Encode::encode("MIME-Q", $string)

On Feb 18, 2004, at 09:19, Marc Langer wrote:

Hello,


Thanks for your report.

the following example code produces a wrong RFC2047 encoded string:

use Encode qw(encode);
my $string = Encode::encode("UTF-8","ääääääääääääääääääääääääääääääääääää");
print Encode::encode("MIME-Q", $string), "\n";

The output is:
=?UTF-8?Q?=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3?==?UTF-8?Q?=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?==?UTF-8?Q?=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3?=
 =?UTF-8?Q?=A4=C3=A4=C3=A4=C3=A4=C3=A4?=

It LOOKS LIKE a bug but it is not. To demonstrate that it is not, consider the following script:


use Encode qw(encode);
use utf8;
my $string = "ääääääääääääääääääääääääääääääääääää";
print Encode::encode("MIME-Q", $string), "\n";

It prints:

=?UTF-8?Q?=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?==?UTF-8?Q?=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?==?UTF-8?Q?=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?=
 =?UTF-8?Q?=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?=

Now comment out "use utf8;" and the printout will be the same as that of your script.

RFC 2047 states:
A multi-octet character may not be split across adjacent 'encoded-word's.
So there must not be a line break after =C3 in the first and third line.
When such a wrongly encoded header is presented to MUAs like mutt or Gnus,
they show question marks instead of the real characters at the position
of the wrong line break.

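This happens because you use Encode::encode("UTF-8"). That means the utf8 flag of the resulting $string is OFF. Without the utf8 flag, perl takes each "ä" as the two bytes "\xC3\xA4", not as the single character "\N{LATIN SMALL LETTER A WITH DIAERESIS}", so nothing stops the encoder from breaking between those two bytes. Now try the code below.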


use Encode qw(encode);

my $string = Encode::decode("UTF-8","ääääääääääääääääääääääääääääääääääää");
#                    ^^^^^^
print Encode::encode("MIME-Q", $string), "\n";

Now it works correctly. From perl's point of view, it is decode(), not encode(), that makes the string INTERNALLY UTF-8.
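
The difference between the two kinds of string can be seen directly. The following is only a minimal sketch, not part of either script above; the variable names $bytes and $chars are made up for illustration, and the "ä" is written with escapes so the result does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode ();

# One "ä" as UTF-8 octets. This is what a non-"use utf8" literal,
# or the result of Encode::encode("UTF-8", ...), looks like to perl.
my $bytes = "\xC3\xA4";

# decode() turns the octets into a character string (utf8 flag ON).
my $chars = Encode::decode("UTF-8", $bytes);

printf "bytes: length=%d utf8_flag=%s\n",
    length($bytes), utf8::is_utf8($bytes) ? "on" : "off";
printf "chars: length=%d utf8_flag=%s\n",
    length($chars), utf8::is_utf8($chars) ? "on" : "off";

# Prints:
#   bytes: length=2 utf8_flag=off
#   chars: length=1 utf8_flag=on
```

With the byte string, perl sees two separate units per "ä"; with the character string it sees one, which is what the MIME-Q encoder needs in order to keep a character whole.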

To avoid this kind of confusion, I recommend that you simply "use utf8;" and not use Encode::decode("UTF-8").
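
If you want to check a generated header against the RFC 2047 rule quoted above, one possible approach is to decode each encoded-word on its own and see whether it is well-formed UTF-8. This is a rough sketch under stated assumptions: encoded_words_are_whole() is a hypothetical helper, not something Encode provides, and it only looks at UTF-8 "Q" encoded-words.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if every UTF-8 "Q" encoded-word in
# $header decodes to well-formed UTF-8 by itself, 0 otherwise.
sub encoded_words_are_whole {
    my ($header) = @_;
    while ($header =~ /=\?UTF-8\?Q\?([^?]*)\?=/gi) {
        my $payload = $1;
        $payload =~ tr/_/ /;                             # "_" stands for a space
        $payload =~ s/=([0-9A-Fa-f]{2})/chr(hex($1))/ge; # "=XX" is one octet
        # FB_CROAK makes decode() die on malformed input, such as a
        # trailing \xC3 or a leading \xA4 left by a split character.
        my $ok = eval {
            Encode::decode("UTF-8", $payload, Encode::FB_CROAK);
            1;
        };
        return 0 unless $ok;
    }
    return 1;
}

# A header split in the middle of "ä" fails the check ...
print encoded_words_are_whole(
    "=?UTF-8?Q?=C3=A4=C3?=\n =?UTF-8?Q?=A4=C3=A4?="), "\n";
# ... while one split on a character boundary passes.
print encoded_words_are_whole(
    "=?UTF-8?Q?=C3=A4?=\n =?UTF-8?Q?=C3=A4?="), "\n";
```

The first call should print 0 and the second 1.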


Dan the Encode Maintainer