perl-unicode

Overlong UTF-8 (Re: Make Encode.pm support the real UTF-8)

2004-12-03 05:30:05
Tim Bunce <Tim(_dot_)Bunce(_at_)pobox(_dot_)com> writes:

On Wed, Dec 01, 2004 at 01:28:05PM -0800, Gisle Aas wrote:
As you probably know perl's version of UTF-8 is not the real thing.  I
thought I would hack up a patch to support the encoding as defined by
Unicode.  That involves rejecting illegal chars (like surrogates,
"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
and such.

It's worth remembering that overlong sequences are a potential security risk.

The current Encode utf8 decoder already refuses these, as this is one of
the things that perl's internal is_utf8_char() actually checks for.
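
To make "overlong" concrete at the byte level, here is a rough sketch.
The helper name has_overlong_utf8 is made up for this example and is
not part of perl or Encode; it only looks at 2-, 3- and 4-byte forms
and does not validate anything else about the input.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of perl or Encode): true if the byte
# string contains a 2-, 3- or 4-byte sequence whose code point would
# have fit in a shorter sequence.
sub has_overlong_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /
          [\xC0\xC1] [\x80-\xBF]            # 2 bytes, code point below 0x80
        | \xE0 [\x80-\x9F] [\x80-\xBF]      # 3 bytes, code point below 0x800
        | \xF0 [\x80-\x8F] [\x80-\xBF]{2}   # 4 bytes, code point below 0x10000
    /x ? 1 : 0;
}

# F0 80 80 80 is a 4-byte spelling of U+0000; C3 A9 is a normal U+00E9.
print has_overlong_utf8("foo\xF0\x80\x80\x80bar") ? "overlong\n" : "ok\n";
print has_overlong_utf8("foo\xC3\xA9bar")         ? "overlong\n" : "ok\n";
```

The first call reports "overlong" and the second "ok".  A real decoder
checks much more than this (truncated sequences, stray continuation
bytes and so on); the sketch only shows the one property discussed here.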

The current encoder does not check anything so it might emit overlong
sequences.

The ':utf8' layer does not check its input and is happy to accept
overlong sequences.  It just slaps on the SvUTF8 flag.  Using the
':encoding(utf8)' layer instead will reject these, since this will
invoke Encode.
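
For data that has already been read as raw bytes, the same check can be
done by hand.  The following is a minimal sketch, assuming the Encode
utf8 decoder rejects overlong input as described above; the helper name
decode_or_undef is invented for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode raw bytes through Encode and return undef
# when the input is malformed, instead of a string with SvUTF8 set.
sub decode_or_undef {
    my ($bytes) = @_;
    my $copy  = $bytes;    # decode() may modify its argument when CHECK is set
    my $chars = eval { Encode::decode('utf8', $copy, Encode::FB_CROAK()) };
    return $chars;         # undef if decode() croaked
}

my $bad  = "foo\xF0\x80\x80\x80bar\n";    # same bytes as xxx.pl prints
my $good = "foo\xC3\xA9bar\n";

print defined(decode_or_undef($bad))  ? "accepted\n" : "rejected\n";
print defined(decode_or_undef($good)) ? "accepted\n" : "rejected\n";
```

FB_CROAK turns the "does not map to Unicode" condition into an
exception, so the caller decides what to do with bad input rather than
getting a truncated or silently flagged string.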

While checking this out I found that Data::Dumper will actually
segfault when given overlong UTF-8.  There are probably other issues
like this to be found if you start looking.

bash-2.05b$ perl -v

This is perl, v5.8.6 built for i686-linux

Copyright 1987-2004, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'.  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

bash-2.05b$ cat xxx.pl
if (@ARGV) {
    print "Hi\n";
    if ($ARGV[0] eq "encoding") {
        binmode(STDIN, ':encoding(utf8)');
    }
    elsif ($ARGV[0] eq "utf8") {
        binmode(STDIN, ':utf8');
    }

    my $data = <STDIN>;

    use Data::Dumper;
    print Dumper($data);
}
else {
    print "foo\xf0\x80\x80\x80bar\n";
}
bash-2.05b$ perl xxx.pl | perl xxx.pl raw
Hi
$VAR1 = 'fooðbar
';
bash-2.05b$ perl xxx.pl | perl xxx.pl encoding
Hi
utf8 "\xF0" does not map to Unicode at xxx.pl line 10.
utf8 "\xF0" does not map to Unicode at xxx.pl line 10.
$VAR1 = 'foo';
bash-2.05b$ perl xxx.pl | perl xxx.pl utf8
Hi
Segmentation fault


Valgrind says:

==17318==
==17318== Invalid write of size 1
==17318==    at 0x1B90A97B: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90CDB0: DD_dump (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90DFC5: XS_Data__Dumper_Dumpxs (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x80B4FE0: Perl_pp_entersub (in /opt/perl/5.8.6/bin/perl)
==17318==  Address 0x1BC1684B is 0 bytes after a block of size 11 alloc'd
==17318==    at 0x1B9059FF: realloc (vg_replace_malloc.c:197)
==17318==    by 0x809FB2A: Perl_safesysrealloc (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x80B7664: Perl_sv_grow (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x1B90A952: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==
==17318== Invalid write of size 1
==17318==    at 0x1B90A984: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90CDB0: DD_dump (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90DFC5: XS_Data__Dumper_Dumpxs (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x80B4FE0: Perl_pp_entersub (in /opt/perl/5.8.6/bin/perl)
==17318==  Address 0x1BC1684C is 1 bytes after a block of size 11 alloc'd
==17318==    at 0x1B9059FF: realloc (vg_replace_malloc.c:197)
==17318==    by 0x809FB2A: Perl_safesysrealloc (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x80B7664: Perl_sv_grow (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x1B90A952: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==
==17318== Invalid write of size 1
==17318==    at 0x1B90A98A: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90CDB0: DD_dump (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x1B90DFC5: XS_Data__Dumper_Dumpxs (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==    by 0x80B4FE0: Perl_pp_entersub (in /opt/perl/5.8.6/bin/perl)
==17318==  Address 0x1BC1684D is 2 bytes after a block of size 11 alloc'd
==17318==    at 0x1B9059FF: realloc (vg_replace_malloc.c:197)
==17318==    by 0x809FB2A: Perl_safesysrealloc (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x80B7664: Perl_sv_grow (in /opt/perl/5.8.6/bin/perl)
==17318==    by 0x1B90A952: esc_q_utf8 (in /opt/perl/5.8.6/lib/5.8.6/i686-linux/auto/Data/Dumper/Dumper.so)
==17318==