Hi! As we know, utf8::encode() does not produce strict UTF-8 (it uses
Perl's lax encoding, which permits e.g. surrogates and code points above
U+10FFFF), so Encode::encode("UTF-8", ...) should be used instead.
Likewise, files should be opened with the :encoding(UTF-8) layer instead
of :utf8.
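A minimal sketch of that difference, assuming a sample string of my choosing that contains a lone surrogate (U+D800), which strict UTF-8 does not allow. With the default CHECK value, Encode's strict "UTF-8" is expected to substitute a replacement character, while utf8::encode() passes the surrogate through unchanged:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: "A", a lone surrogate (not valid in UTF-8), "B".
my $chars = do { no warnings; "A" . chr(0xD800) . "B" };

# Lax: utf8::encode() converts in place and does not validate code points.
my $lax = $chars;
utf8::encode($lax);

# Strict: Encode checks each code point (default CHECK means substitute).
my $strict = do { no warnings; Encode::encode("UTF-8", $chars) };

printf "utf8::encode:  %s\n", unpack("H*", $lax);
printf "Encode UTF-8:  %s\n", unpack("H*", $strict);
print  "outputs differ: ", ($lax ne $strict ? "yes" : "no"), "\n";
```

Comparing the two hex dumps shows which bytes each encoder emitted for the invalid code point.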
However, the strict UTF-8 implementation in the Encode module is horribly
slow compared to utf8::encode(). It is implemented in Encode.xs, and for
benchmarking the XS function can be called directly:
use Encode;
my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input);
(this bypasses the overhead of the Encode module's Perl-level wrapper)
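The benchmark script itself was not posted; the following is only a sketch of how such a comparison could be run with the core Benchmark module. The input string (80 two-byte characters, i.e. 160 bytes once encoded), the default iteration count, and the fallback to Encode::utf8->can('encode') for Encode versions that may not provide encode_xs under that name are all assumptions made for illustration:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use Encode ();

# Assumption: 80 copies of U+010D, which encode to 160 bytes of UTF-8.
my $input = "\x{010D}" x 80;

# Assumption: iteration count from the command line (the post used 4000000).
my $count = @ARGV ? $ARGV[0] : 100_000;

# Assumption: fall back to the 'encode' method if 'encode_xs' is not defined.
my $encode_xs = Encode::utf8->can('encode_xs') || Encode::utf8->can('encode');

timethese($count, {
    'encode_xs strict' => sub {
        my $out = $encode_xs->({ strict_utf8 => 1 }, $input);
    },
    'encode_xs lax' => sub {
        my $out = $encode_xs->({ strict_utf8 => 0 }, $input);
    },
    'utf8::encode' => sub {
        my $out = $input;    # copy first: utf8::encode() works in place
        utf8::encode($out);
    },
});
```

The copy in the utf8::encode case adds a small cost to that entry, so its absolute numbers are slightly pessimistic.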
Here are my results for a 160-byte input string:
Encode::utf8::encode_xs({strict_utf8 => 1}, ...): 8 wallclock secs ( 8.56 usr + 0.00 sys = 8.56 CPU) @ 467289.72/s (n=4000000)
Encode::utf8::encode_xs({strict_utf8 => 0}, ...): 1 wallclock secs ( 1.66 usr + 0.00 sys = 1.66 CPU) @ 2409638.55/s (n=4000000)
utf8::encode: 1 wallclock secs ( 0.39 usr + 0.00 sys = 0.39 CPU) @ 10256410.26/s (n=4000000)
I found two bottlenecks (the slow sv_catpv* and utf8n_to_uvuni functions)
and made some optimizations, which make the strict encoder about 2.6
times faster. The final results are:
Encode::utf8::encode_xs({strict_utf8 => 1}, ...): 2 wallclock secs ( 3.27 usr + 0.00 sys = 3.27 CPU) @ 1223241.59/s (n=4000000)
Encode::utf8::encode_xs({strict_utf8 => 0}, ...): 1 wallclock secs ( 1.68 usr + 0.00 sys = 1.68 CPU) @ 2380952.38/s (n=4000000)
utf8::encode: 1 wallclock secs ( 0.40 usr + 0.00 sys = 0.40 CPU) @ 10000000.00/s (n=4000000)
The patches are on GitHub in this pull request:
https://github.com/dankogai/p5-encode/pull/56
I would appreciate it if somebody could review my patches and tell me
whether this is the right way to do these optimizations.