On Feb 18, 2004, at 09:19, Marc Langer wrote:
Hello,
Thanks for your report.
the following example code produces a wrong RFC2047 encoded string:
use Encode qw(encode);
my $string = Encode::encode("UTF-8",
"ääääääääääääääääääääääääääääääääääää");
print Encode::encode("MIME-Q", $string), "\n";
The output is:
=?UTF-8?Q?
=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3?=
=?UTF-8?Q?
=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?=
=?UTF-8?Q?
=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3?=
=?UTF-8?Q?=A4=C3=A4=C3=A4=C3=A4=C3=A4?=
It LOOKS LIKE a bug but it is not. To demonstrate that, consider the
following script:
use Encode qw(encode);
use utf8;
my $string = "ääääääääääääääääääääääääääääääääääää";
print Encode::encode("MIME-Q", $string), "\n";
It prints:
=?UTF-8?Q?
=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?=
=?UTF-8?Q?
=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?=
=?UTF-8?Q?
=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?=
=?UTF-8?Q?=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4=C3=A4?=
Now comment out "use utf8;" and the printout will be the same as that
of your script.
RFC 2047 states:
A multi-octet character may not be split across adjacent
'encoded-word's.
So there must not be a line break after =C3 in the first and third
lines.
When such a wrongly encoded header is presented to MUAs like mutt or
Gnus, they show question marks instead of the real characters at the
position of the wrong line break.
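The RFC 2047 rule quoted above can be checked mechanically: an encoded-word that ends or starts in the middle of a character will not decode as UTF-8 on its own. The following is only a sketch; the helper name words_are_self_contained is made up for illustration, and it handles only Q-encoded UTF-8 words (no B encoding, no other charsets).

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper (not part of Encode): returns 1 if every Q-encoded
# UTF-8 encoded-word in $header is well-formed UTF-8 by itself, 0 otherwise.
sub words_are_self_contained {
    my ($header) = @_;
    for my $payload ($header =~ /=\?UTF-8\?Q\?([^?]*)\?=/gi) {
        (my $octets = $payload) =~ tr/_/ /;               # "_" means space in Q encoding
        $octets =~ s/=([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # undo the =XX escapes
        # FB_CROAK makes decode() die on malformed or truncated sequences.
        return 0 unless eval { decode("UTF-8", $octets, FB_CROAK); 1 };
    }
    return 1;
}

# Shortened versions of the two outputs discussed in this thread.
my $bad  = "=?UTF-8?Q?" . ("=C3=A4" x 10) . "=C3?= =?UTF-8?Q?=A4" . ("=C3=A4" x 10) . "?=";
my $good = "=?UTF-8?Q?" . ("=C3=A4" x 10) . "?= =?UTF-8?Q?" . ("=C3=A4" x 10) . "?=";

for my $header ($bad, $good) {
    my $verdict = words_are_self_contained($header) ? "ok" : "split character";
    print "$verdict\n";
}
```

The first header is reported as "split character" because its first word ends in a lone \xC3; the second is reported as "ok".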
That happens because you use Encode::encode("UTF-8"), which means the
UTF-8 flag of the resulting $string is OFF. Without the UTF-8 flag,
perl takes each character as the two octets "\xC3\xA4", not as
"\N{LATIN SMALL LETTER A WITH DIAERESIS}". Now try the code below.
use Encode qw(encode);
my $string = Encode::decode("UTF-8",
#                    ^^^^^^
"ääääääääääääääääääääääääääääääääääää");
print Encode::encode("MIME-Q", $string), "\n";
Now it works correctly. From perl's point of view, it is decode(), not
encode(), that makes the string INTERNALLY UTF-8, that is, a string of
characters with the UTF-8 flag on.
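The difference between the two directions can be seen by comparing lengths and flags. This is a minimal sketch; the octets are written as escapes ("\xC3\xA4" is the UTF-8 encoding of U+00E4) so that the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three copies of the two-octet UTF-8 sequence for U+00E4.
my $octets = "\xC3\xA4" x 3;

# decode(): octets in, character string out (UTF-8 flag on).
my $chars = decode("UTF-8", $octets);

# encode(): character string in, octets out (UTF-8 flag off).
my $bytes = encode("UTF-8", $chars);

printf "chars: length=%d flag=%s\n", length($chars), utf8::is_utf8($chars) ? "on" : "off";
printf "bytes: length=%d flag=%s\n", length($bytes), utf8::is_utf8($bytes) ? "on" : "off";
```

The character string has length 3 with the flag on, while the encoded string has length 6 with the flag off: perl sees six separate octets there, not three letters.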
To avoid this kind of confusion, I recommend that you simply put
"use utf8;" in your script rather than calling Encode::decode("UTF-8")
on the literal yourself.
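Whichever way the character string is obtained, a quick sanity check is to encode it with "MIME-Q" and decode the result with "MIME-Header". This is only a sketch: the exact line folding of the header may vary between Encode versions, so it compares the round trip rather than the header text.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# 36 copies of U+00E4 as a character string, built from escaped octets
# so the encoding of the script file does not matter.
my $chars = decode("UTF-8", "\xC3\xA4" x 36);

# Character string in, RFC 2047 header text out.
my $header = encode("MIME-Q", $chars);

# Decoding the header should give back the same character string.
my $round_trip = decode("MIME-Header", $header);

my $status = $round_trip eq $chars ? "round trip ok" : "round trip FAILED";
print "$header\n";
print "$status\n";
```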
Dan the Encode Maintainer