perl-unicode

Re: Encode UTF-8 optimizations

2016-08-22 15:48:08
On Monday 22 August 2016 21:43:59 Karl Williamson wrote:
On 08/22/2016 07:05 AM, pali(_at_)cpan(_dot_)org wrote:
On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
On 08/21/2016 02:34 AM, pali(_at_)cpan(_dot_)org wrote:
On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
Top posting.

Attached is my alternative patch.  It effectively uses a different
algorithm to avoid decoding the input into code points, and to copy
all spans of valid input at once, instead of character at a time.

And it uses only currently available functions.

And that's the problem. As I already wrote in a previous email, a call to
a function in a shared library cannot be optimized as heavily as an
inlined function and causes a slowdown. You are calling
is_utf8_string_loc() for non-strict mode, which is not inlined, so
encode/decode in non-strict mode will be slower...

And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR(),
which calls _is_utf8_char_slow(), which in turn calls utf8n_to_uvchr();
those cannot be inlined either...

Therefore I think this is not a good approach...


Then you should run your benchmarks to find out the performance.

You are right, benchmarks are needed to show the final results.

On valid input, is_utf8_string_loc() is called once per string.  The
function call overhead and non-inlining should not be noticeable.

Ah right, I misread it as being called once per valid sequence, not once
for the whole string. You are right.

It is called once per valid sequence.  See below.


On valid input, is_utf8_char_slow() is never called.  The parts that are
used can be inlined.

Yes, but this function is there to be called primarily on unknown input,
which does not have to be valid. If I know that the input is valid, then
utf8::encode/decode is enough :-)

What process_utf8() does is to copy the alleged UTF-8 input to the 
output, verifying along the way that it actually is legal UTF-8 (with 2 
levels of strictness, depending on the input parameter), and taking some 
actions (exactly what depends on other input parameters) if and when it 
finds invalid UTF-8.

The way it works after my patch is like an instruction pipeline.  You 
start it up, and it stays in the pipeline as long as the next character 
in the input is legal or it reaches the end.  When it finds illegal 
input, it drops out of the pipeline, handles that, and starts up the 
pipeline to process any remaining input.  If the entire input string is 
valid, a single instance of the pipeline is all that gets invoked.
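
In rough outline the loop has this shape (just a sketch of the idea, not 
the actual Encode.xs code; copy_utf8_spans, dst, and the one-byte resync 
are placeholders):

    #include "EXTERN.h"
    #include "perl.h"

    /* Sketch: copy alleged UTF-8 from s..e into dst, validating as we go.
     * Each valid span is appended with a single sv_catpvn(); an invalid
     * byte drops out of the fast path, is handled, and the loop restarts. */
    static void
    copy_utf8_spans(pTHX_ SV *dst, const U8 *s, const U8 *e)
    {
        while (s < e) {
            const U8 *failed_at;

            if (is_utf8_string_loc(s, e - s, &failed_at)) {
                sv_catpvn(dst, (const char *) s, e - s);     /* rest is valid */
                return;
            }

            sv_catpvn(dst, (const char *) s, failed_at - s); /* valid prefix */

            /* the real code croaks, warns, or substitutes here, depending
             * on the CHECK/fallback flags; a replacement char is typical */
            sv_catpvn(dst, "\xEF\xBF\xBD", 3);               /* U+FFFD */
            s = failed_at + 1;          /* simplistic resync: skip one byte */
        }
    }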

Yes, I figured out how it works.

The use-case I envision is that the input is supposed to be valid UTF-8, 
and the purpose of process_utf8() is to verify that that is in fact 
true, and to take specified actions when it isn't.

Right!

Under that use-case, 
taking longer to deal with invalid input is not a problem.  If that is 
not your use-case, please explain what yours is.

Basically, Encode::decode("UTF-8", $input) is used for converting an 
"untrusted" input sequence (e.g. from the network or a local file) to a 
perl Unicode scalar. And if the input contains something invalid, then 
Encode::decode does whatever is needed to return a valid Unicode string 
(= replaces invalid subsequences with the Unicode replacement character). 
So Encode::decode("UTF-8", $input) is there for processing any input 
sequence, not only valid ones, but also broken or totally invalid ones.

And I think you misunderstand when is_utf8_char_slow() is called.  It is 
called only when the next byte in the input indicates that the only 
legal UTF-8 that might follow would be for a code point that is at least 
U+200000, almost twice as high as the highest legal Unicode code point. 
It is a Perl extension to handle such code points, unlike other 
languages.  But the Perl core is not optimized for them, nor will it be. 
  My point is that is_utf8_char_slow() will only be called in very 
specialized cases, and we need not make those cases have as good a 
performance as normal ones.

In strict mode, there is absolutely no need to call is_utf8_char_slow(), 
because in strict mode such a sequence must always be invalid (it is above 
the last valid Unicode character). This is what I was trying to say.

And currently is_strict_utf8_string_loc() first calls isUTF8_CHAR() 
(which could call is_utf8_char_slow()) and only after that checks for 
UTF8_IS_SUPER().

So maybe it would make sense to provide a "strict" version of the 
isUTF8_CHAR() macro, as such a strict version does not have to call 
is_utf8_char_slow() at all.
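
To make it concrete, I imagine something like this (only a sketch for 
ASCII platforms; the function name is made up, and the remaining 
strict-only checks are just indicated in a comment):

    #include "EXTERN.h"
    #include "perl.h"

    /* A strict character check need never reach the slow path: on ASCII
     * platforms any start byte above 0xF4 already encodes something past
     * U+10FFFF, so it can be rejected before isUTF8_CHAR() is consulted. */
    static STRLEN
    strict_utf8_char_len(const U8 *s, const U8 *e)
    {
        STRLEN len;

        if (*s > 0xF4)             /* > U+10FFFF: always invalid when strict */
            return 0;

        len = isUTF8_CHAR(s, e);   /* only the inlined fast path can be hit */
        if (!len)
            return 0;

        /* the UTF8_IS_SUPER() test that is_strict_utf8_string_loc() already
         * does, plus surrogate/noncharacter rejects, would follow here */
        return len;
    }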

On invalid input, performance should be a minor consideration.

See below...

See above. :)


The inner loop is much tighter in both functions; likely it can be held in
the cache.  The algorithm avoids a bunch of work compared to the previous
one.

Right, for valid input the algorithm is really faster. Whether it is
because of fewer cache misses... maybe... I can play with perf or another
tool to see what the bottleneck is now.

I doubt that it will be slower than that.  The only way to know in any
performance situation is to actually test.  And know that things will be
different depending on the underlying hardware, so only large differences
are really significant.

So, here are my test results. You can say that they are subjective, but
I would be happy if somebody provided better input for the performance
tests.

Abbreviations:
strict = Encode::encode/decode "UTF-8"
lax = Encode::encode/decode "utf8"
int = utf8::encode/decode
orig = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
your = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
my = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d

Test cases:
all = join "", map { chr } 0 .. 0x10FFFF
short = "žluťoučký kůň pěl ďábelské ódy " x 45
long = $short x 1000
invalid-short = "\xA0" x 1000
invalid-long = "\xA0" x 1000000

Encoding was called on a string with the Encode::_utf8_on() flag set.


Rates:

encode:
                   all       short     long  invalid-short invalid-long
orig - strict      41/s    124533/s    132/s     115197/s        172/s
your - strict     176/s    411523/s    427/s      54813/s         66/s
my   - strict      80/s    172712/s    186/s     113787/s        138/s

orig - lax       1010/s   3225806/s   6250/s     546800/s       5151/s
your - lax        952/s   3225806/s   5882/s     519325/s       4919/s
my   - lax       1060/s   3125000/s   6250/s     645119/s       5009/s

orig - int    8154604/s  10000000/s    infty    9787566/s    9748151/s
your - int    9135243/s  11111111/s    infty    8922821/s    9737657/s
my   - int    9779395/s  10000000/s    infty    9822046/s    8949861/s


decode:
                   all       short     long  invalid-short invalid-long
orig - strict      39/s    119048/s    131/s     108574/s        171/s
your - strict     173/s    353357/s    442/s      42440/s         55/s
my   - strict      69/s    166667/s    182/s     117291/s        135/s

orig - lax         39/s    123609/s    137/s     127302/s        172/s
your - lax        230/s    393701/s    495/s      37346/s         65/s
my   - lax         79/s    158983/s    180/s     121456/s        138/s

orig - int        274/s    546448/s    565/s    8219513/s      12357/s
your - int        273/s    540541/s    562/s    7226066/s      12948/s
my   - int        274/s    543478/s    562/s    8502902/s      12421/s


int is there just as a sanity check for the tests, as the
utf8::encode/decode functions were not changed.

The results: your patch is faster for valid sequences (as you wrote
above), but slower for invalid ones (in some cases radically so).

So I would propose two optimizations:

1) Replace the isUTF8_CHAR macro in is_strict_utf8_string_loc() with a
   new one which does not call utf8n_to_uvchr. That call is not needed,
   as in that case the sequence is already invalid.

And an optimizing compiler should figure that out and omit the call, so 
we shouldn't have to manually.  In the next few months I will be working 
on some fixes to the :utf8 layer.  That could lead to a core function 
similar to this, but without relying on the compiler to figure things out.

Great! But the :utf8 layer cannot be used for reading "untrusted" 
arbitrary files...

2) Try to make an inlined version of the is_utf8_string_loc() function.
   Maybe merge it with is_strict_utf8_string_loc()? That should boost the
   non-strict decoder for invalid sequences (now it is slower than the
   strict one).

I'll look at which functions in utf8.c should be inlined.  This is a 
candidate.  But I doubt that that is why this is slower in this case. 
Read blead's perlhack.pod for performance testing tool hints.

Your comments caused me to realize that the call to utf8n_to_uvchr() 
when the input is syntactically valid, but is an illegal code point 
(like a surrogate, etc) could be replaced by the faster 
valid_utf8_to_uvchr(), which has not been in the public API.  I have 
pushed to

http://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-encode

a version of blead which has this function made public, and inlined, in 
case you want to test with it.
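
In outline the swap is just this (a sketch only; the surrounding names are 
placeholders, not the code on that branch, and 0 stands in for failure):

    #include "EXTERN.h"
    #include "perl.h"

    /* Once isUTF8_CHAR() has vouched for the bytes, the code point can be
     * read back without re-checking for malformations. */
    static UV
    strict_cp_or_zero(const U8 *s, const U8 *e)
    {
        STRLEN retlen;
        UV uv;

        if (!isUTF8_CHAR(s, e))
            return 0;                              /* malformed */

        uv = valid_utf8_to_uvchr(s, &retlen);      /* was utf8n_to_uvchr() */

        if (uv > PERL_UNICODE_MAX || UNICODE_IS_SURROGATE(uv))
            return 0;            /* well-formed, but rejected when strict */

        return uv;
    }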

There are some things in the error handling code that could be improved 
upon.  For example, memcpy() of a single character is slower than just 
copying a byte at a time, as the function call overhead dwarfs the savings.

This one memcpy is OK. It is called on a constant string, and I checked 
that gcc 4.6 with -O2 on x86_64 already replaces that call with inline 
instructions. No memcpy function call is emitted.

But maybe it could be rewritten... I have no idea whether other compilers 
have optimizations for strlen/strcpy/memcpy on constant strings when the 
length is known.
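
For illustration, the pattern is basically this (FBCHAR_UTF8 re-defined 
here with Encode's U+FFFD fallback bytes; d is the output pointer):

    #include <string.h>

    #define FBCHAR_UTF8 "\xEF\xBF\xBD"    /* U+FFFD REPLACEMENT CHARACTER */

    /* Both the source and the length are compile-time constants, so gcc
     * -O2 folds the strlen() and expands the memcpy() into a few plain
     * stores; no call to memcpy() appears in the object code. */
    static unsigned char *
    emit_replacement(unsigned char *d)
    {
        memcpy(d, FBCHAR_UTF8, strlen(FBCHAR_UTF8));
        return d + strlen(FBCHAR_UTF8);
    }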


And maybe it would make sense to make all the needed functions part of
the public API.

Yes


Are you going to prepare a pull request for the Encode module?

I will, after everything is settled.  This may include changes to 
Devel::PPPort to make sure this works on all perls that Encode supports.

However, I think it would be good to get the extra tests of malformed 
inputs into the distribution.  (I haven't looked at your PR yet for 
flaws.  I'll do that, and hopefully will find none, so will recommend 
your PR be pulled.)

I added some tests for overlong sequences, but only for ASCII platforms; 
tests for EBCDIC are missing (sorry, I do not have access to any EBCDIC 
platform for testing).

Anyway, how does it behave on EBCDIC platforms? And maybe another
question: what should Encode::encode('UTF-8', $str) do on EBCDIC? Encode
$str to UTF-8 or to UTF-EBCDIC?

It works fine on EBCDIC platforms.  There are other bugs in Encode on 
EBCDIC that I plan on investigating as time permits.  Doing this has 
fixed some of these for free.  The uvuni() functions should in almost 
all instances be uvchr(), and my patch does that.

Now I'm wondering whether the FBCHAR_UTF8 define also works on EBCDIC... 
I think it should be different for UTF-EBCDIC.

On EBCDIC platforms, UTF-8 is defined to be UTF-EBCDIC (or vice versa if 
you prefer), so $str will effectively be in the version of UTF-EBCDIC 
valid for the platform it is running on (there are differences depending 
on the platform's underlying code page).

So that means that on EBCDIC platforms you cannot process a file which is 
encoded in UTF-8? Since Encode::decode("UTF-8", $str) expects $str to be 
in UTF-EBCDIC and not in UTF-8 (as I understood it).