Tim Bunce <Tim(_dot_)Bunce(_at_)pobox(_dot_)com> writes:
> On Wed, Dec 01, 2004 at 01:28:05PM -0800, Gisle Aas wrote:
> > As you probably know perl's version of UTF-8 is not the real thing. I
> > thought I would hack up a patch to support the encoding as defined by
> > Unicode. That involves rejecting illegal chars (like surrogates,
> > "\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
> > and such.
>
> It's worth remembering that overlong sequences are a potential security risk.
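The checks named in the quoted proposal can be sketched as a byte-level validator. This is only an illustration under stated assumptions: the function name is_strict_utf8 is hypothetical (it is not part of Encode or the perl core), it expects a plain byte string, and the set of rejected code points simply follows the list in the proposal.

```perl
use strict;
use warnings;

# Hypothetical helper (not an Encode or core API): returns 1 if the byte
# string $bytes passes the strict checks listed above, 0 otherwise.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my @b = unpack 'C*', $bytes;    # expects bytes, not a flagged string
    my $i = 0;
    while ($i < @b) {
        my $lead = $b[$i];
        my ($len, $cp, $min);
        if    ($lead < 0x80)           { $i++; next }    # ASCII
        elsif (($lead & 0xE0) == 0xC0) { ($len, $cp, $min) = (2, $lead & 0x1F, 0x80) }
        elsif (($lead & 0xF0) == 0xE0) { ($len, $cp, $min) = (3, $lead & 0x0F, 0x800) }
        elsif (($lead & 0xF8) == 0xF0) { ($len, $cp, $min) = (4, $lead & 0x07, 0x10000) }
        else                           { return 0 }      # stray continuation or bad lead byte
        return 0 if $i + $len > @b;                      # truncated sequence
        for my $j (1 .. $len - 1) {
            my $c = $b[$i + $j];
            return 0 unless ($c & 0xC0) == 0x80;         # must be 10xxxxxx
            $cp = ($cp << 6) | ($c & 0x3F);
        }
        return 0 if $cp < $min;                          # overlong sequence
        return 0 if $cp > 0x10FFFF;                      # beyond Unicode
        return 0 if $cp >= 0xD800 && $cp <= 0xDFFF;      # surrogates
        return 0 if $cp >= 0xFDD0 && $cp <= 0xFDEF;      # noncharacters
        return 0 if ($cp & 0xFFFE) == 0xFFFE;            # U+xxFFFE and U+xxFFFF
        $i += $len;
    }
    return 1;
}

# "\xC0\xAF" is an overlong two-byte form of "/" (U+002F).
print is_strict_utf8("\xC0\xAF") ? "accepted\n" : "rejected\n";   # prints "rejected"
```

The last line hints at the security angle: a filter that looks for the single byte "/" never sees it in "\xC0\xAF", yet a decoder that tolerates overlong forms would still turn those two bytes into a slash.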
The current Encode utf8 decoder already refuses these, as this is one of
the things that perl's internal is_utf8_char() actually checks for.
The current encoder does not check anything so it might emit overlong
sequences.
The ':utf8' layer does not check its input and is happy to accept
overlong sequences. It just slaps on the SvUTF8 flag. Using the
':encoding(utf8)' layer instead will reject these, since it
invokes Encode.
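The difference between trusting input and validating it can be sketched by calling Encode directly. This is a minimal illustration, not the code path the layers take internally: the helper name decode_or_reason is hypothetical, and because the wording of Encode's error message varies between versions the sketch only reports whether decoding succeeded.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: strictly decode a byte string. Returns
# (decoded text, undef) on success or (undef, error message) on failure.
sub decode_or_reason {
    my ($bytes) = @_;
    my $text = eval {
        # 'UTF-8' (with the hyphen) selects Encode's strict decoder.
        # FB_CROAK makes malformed input fatal; LEAVE_SRC stops Encode
        # from modifying $bytes in place.
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? ($text, undef) : (undef, $@);
}

# The same bytes that xxx.pl prints: "foo", an overlong 4-byte form, "bar".
my ($text, $err) = decode_or_reason("foo\xf0\x80\x80\x80bar\n");
print defined $text ? "accepted\n" : "rejected\n";

# To get this validation while reading a handle, use the Encode-backed layer:
#   binmode(STDIN, ':encoding(UTF-8)');
```

Wrapping the call in eval keeps a malformed line from killing the program, which is usually what you want when the input comes from an untrusted pipe.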
While checking this out I found that Data::Dumper will actually
segfault when given overlong UTF-8. There are probably other issues
like this to be found if you start looking.
bash-2.05b$ perl -v
This is perl, v5.8.6 built for i686-linux
Copyright 1987-2004, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
bash-2.05b$ cat xxx.pl
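if (@ARGV) {
    print "Hi\n";
    if ($ARGV[0] eq "encoding") {
        binmode(STDIN, ':encoding(utf8)');   # decode through Encode, which validates
    }
    elsif ($ARGV[0] eq "utf8") {
        binmode(STDIN, ':utf8');             # only sets the UTF-8 flag, no validation
    }
    # any other argument (e.g. "raw") leaves STDIN as plain bytes
    my $data = <STDIN>;
    use Data::Dumper;
    print Dumper($data);
}
else {
    print "foo\xf0\x80\x80\x80bar\n";        # F0 80 80 80 is an overlong encoding of U+0000
}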
bash-2.05b$ perl xxx.pl | perl xxx.pl raw
Hi
$VAR1 = 'fooðbar
';
bash-2.05b$ perl xxx.pl | perl xxx.pl encoding
Hi
utf8 "\xF0" does not map to Unicode at xxx.pl line 10.
utf8 "\xF0" does not map to Unicode at xxx.pl line 10.
$VAR1 = 'foo';
bash-2.05b$ perl xxx.pl | perl xxx.pl utf8
Hi
Segmentation fault
Valgrind says:
==17318==
==17318== Invalid write of size 1
==17318==    at 0x1B90A97B: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90CDB0: DD_dump (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90DFC5: XS_Data__Dumper_Dumpxs (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x80B4FE0: Perl_pp_entersub (in /opt/perl/5.8.6/bin/perl)
==17318==  Address 0x1BC1684B is 0 bytes after a block of size 11 alloc'd
==17318==    at 0x1B9059FF: realloc (vg_replace_malloc.c:197)
==17318==    by 0x809FB2A: Perl_safesysrealloc (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x80B7664: Perl_sv_grow (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x1B90A952: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==
==17318== Invalid write of size 1
==17318==    at 0x1B90A984: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90CDB0: DD_dump (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90DFC5: XS_Data__Dumper_Dumpxs (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x80B4FE0: Perl_pp_entersub (in /opt/perl/5.8.6/bin/perl)
==17318==  Address 0x1BC1684C is 1 bytes after a block of size 11 alloc'd
==17318==    at 0x1B9059FF: realloc (vg_replace_malloc.c:197)
==17318==    by 0x809FB2A: Perl_safesysrealloc (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x80B7664: Perl_sv_grow (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x1B90A952: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==
==17318== Invalid write of size 1
==17318==    at 0x1B90A98A: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90CDB0: DD_dump (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90DFC5: XS_Data__Dumper_Dumpxs (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x80B4FE0: Perl_pp_entersub (in /opt/perl/5.8.6/bin/perl)
==17318==  Address 0x1BC1684D is 2 bytes after a block of size 11 alloc'd
==17318==    at 0x1B9059FF: realloc (vg_replace_malloc.c:197)
==17318==    by 0x809FB2A: Perl_safesysrealloc (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x80B7664: Perl_sv_grow (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x1B90A952: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==
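
Until the XS code is fixed, a caller that cannot control how its input was read could check the string before handing it to Dumper. The following is a defensive sketch, not a fix for esc_q_utf8: the wrapper name dump_if_valid is hypothetical, and it assumes that utf8::valid() reports a flagged string containing an overlong sequence as invalid.

```perl
use strict;
use warnings;
use Data::Dumper ();

# Hypothetical wrapper: dump a scalar only if its internal UTF-8 is
# well formed; return undef instead of risking a crash otherwise.
sub dump_if_valid {
    my ($data) = @_;
    # utf8::valid() is true for unflagged byte strings and for flagged
    # strings whose internal encoding is consistent.
    return undef unless utf8::valid($data);
    return Data::Dumper::Dumper($data);
}

my $out = dump_if_valid("foo");
print defined $out ? $out : "refused: malformed UTF-8\n";
```

The check costs one extra pass over the string, and it only guards this one call site; any other XS code that trusts the SvUTF8 flag would need the same treatment.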